PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit
A continual-learning framework and benchmark suite for adapting 2D human pose estimators across domain shifts and evolving skeletons.
PoseAdapt is publicly available as an open-source toolkit. The paper reports experiments from the wacv-2026-camera-ready branch of the repository.
- Tracks: 2 (domain-incremental and class-incremental)
- Backbone: RTMPose-t (~3M parameters)
- Budget: 1k images / 10 epochs per experience
- Skeleton growth: 17 → 142 keypoints
PoseAdapt formalizes continual pose adaptation under fixed-capacity, no-replay constraints and evaluates it across domain and skeleton shifts.
Abstract
Human pose estimators are typically retrained from scratch or naively fine-tuned whenever keypoint sets, sensing modalities, or deployment domains change—an inefficient, compute-intensive practice that rarely matches field constraints. We present PoseAdapt, an open-source framework and benchmark suite for continual pose model adaptation. PoseAdapt defines domain-incremental and class-incremental tracks that simulate realistic changes in density, lighting, and sensing modality, as well as skeleton growth. The toolkit supports two workflows: (i) Strategy Benchmarking, which lets researchers implement continual learning (CL) methods as plugins and evaluate them under standardized protocols; and (ii) Model Adaptation, which allows practitioners to adapt strong pretrained models to new tasks with minimal supervision. We evaluate representative regularization-based methods in single-step and sequential settings. Benchmarks enforce a fixed lightweight backbone, no access to past data, and tight per-step budgets. This isolates adaptation strategy effects, highlighting the difficulty of maintaining accuracy under strict resource limits. PoseAdapt connects modern CL techniques with practical pose estimation needs, enabling adaptable models that improve over time without repeated full retraining.
Highlights
Standardizes domain-incremental and class-incremental pose adaptation with fixed backbones, no past-data replay, and shared retention metrics.
Provides a plug-in framework on top of MMPose with explicit initialization, adaptation, and finalization stages plus head expansion for new keypoints.
Reference baselines show LFL is the most reliable under photometric shifts, while RGB-to-depth adaptation remains an open challenge.
Video Presentation
Slides
Overview
PoseAdapt treats changing pose-estimation requirements as a continual learning problem. Instead of retraining a model from scratch for every new domain or naively fine-tuning it until it forgets earlier capabilities, PoseAdapt exposes a controlled adaptation loop with fixed compute, fixed model capacity, and no replay of past data.
The paper targets a practical gap in human pose estimation systems. Real deployments rarely remain stationary: illumination changes, crowding increases, sensing modality can shift, and application-specific skeletons may grow over time. Standard static training pipelines do not provide a principled way to adapt under these conditions without either resource waste or catastrophic forgetting.
PoseAdapt addresses this with two complementary views of adaptation. The first is a benchmarking workflow for researchers, where continual learning strategies can be compared fairly under shared protocols. The second is a model-adaptation workflow for practitioners, where a strong pretrained estimator can be specialized to new data without discarding previous competence.
This project page summarizes the framework, the benchmark design, and the main findings reported in the paper, focusing on what makes PoseAdapt useful as both an engineering toolkit and a research testbed for sustainable pose estimation.
What PoseAdapt standardizes
- Fixed lightweight backbones for fair comparison.
- No access to old training data during later experiences.
- Tight per-experience budgets to reflect deployment constraints.
- Shared continual-learning metrics for retention and forgetting.
- Support for both domain shifts and skeleton growth.
Framework
1. Initialization
Each new experience starts by preparing the model for adaptation. PoseAdapt can create a frozen teacher snapshot for regularization-based methods such as LFL and LwF, or preserve parameter anchors and importance statistics for EWC. In class-incremental settings, the prediction head is expanded to support newly introduced keypoints while keeping existing channels intact.
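The head-expansion step can be made concrete with a minimal PyTorch sketch. This is not the MMPose implementation (an RTMPose head has more structure than a single convolution); it only illustrates the principle of keeping existing keypoint channels intact while growing the output space. The function name and the 1×1-conv head are assumptions for illustration.

```python
import torch
import torch.nn as nn

def expand_keypoint_head(head: nn.Conv2d, new_total_kpts: int) -> nn.Conv2d:
    """Grow a heatmap head to new_total_kpts output channels, copying the
    existing filters so old-keypoint predictions are unchanged at expansion."""
    old_kpts = head.out_channels
    assert new_total_kpts > old_kpts, "can only grow the keypoint set"
    new_head = nn.Conv2d(
        head.in_channels, new_total_kpts,
        kernel_size=head.kernel_size, stride=head.stride,
        padding=head.padding, bias=head.bias is not None,
    )
    with torch.no_grad():
        new_head.weight[:old_kpts].copy_(head.weight)
        if head.bias is not None:
            new_head.bias[:old_kpts].copy_(head.bias)
    return new_head

# E1 -> E2 in PoseAdapt-BodyParts: body (17 kpts) -> body + feet (23 kpts)
old = nn.Conv2d(256, 17, kernel_size=1)
new = expand_keypoint_head(old, 23)
```

Only the new channels start from random initialization, so the regularizer from the teacher snapshot still has stable old-keypoint targets to anchor against.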
2. Adaptation
The current experience is optimized with the task loss plus a strategy-dependent regularizer. The paper evaluates four reference strategies: naive fine-tuning (FT), Elastic Weight Consolidation (EWC), Less-Forgetful Learning (LFL), and Learning without Forgetting (LwF). Because all strategies share the same backbone and head, observed differences reflect continual-learning behavior rather than architectural changes.
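The three regularizers can be sketched in a few lines of PyTorch. These are textbook forms of LFL, LwF, and EWC, not the paper's exact implementation: the LwF sketch is the classification-style softened-softmax distillation (a heatmap head would distill per-keypoint spatial distributions instead), and the function names and `lam` weights are placeholders.

```python
import torch
import torch.nn.functional as F

def lfl_regularizer(student_feats, teacher_feats, lam=1.0):
    # LFL: penalize drift of backbone features away from the frozen teacher
    return lam * F.mse_loss(student_feats, teacher_feats)

def lwf_regularizer(student_logits, teacher_logits, T=2.0, lam=1.0):
    # LwF: distill the teacher's softened predictions on current-task data
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return lam * F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def ewc_regularizer(model, anchors, fisher, lam=1.0):
    # EWC: quadratic penalty on parameter drift, weighted by Fisher importance
    pen = sum((fisher[n] * (p - anchors[n]) ** 2).sum()
              for n, p in model.named_parameters() if n in fisher)
    return lam * pen
```

In each case the total training objective is simply `task_loss + regularizer`; naive fine-tuning (FT) corresponds to dropping the regularizer entirely.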
3. Finalization
After each experience, PoseAdapt stores only the compact state needed for the next step: updated teacher snapshots for distillation-based methods or Fisher-style importance estimates for EWC. Past training images are not stored or replayed, which keeps the protocol aligned with memory and privacy constraints.
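The "Fisher-style importance estimates" kept for EWC are typically a diagonal approximation: the mean squared gradient of the task loss over a few batches. The sketch below shows this common recipe on a toy model; the loader, loss, and model here are hypothetical stand-ins, not PoseAdapt internals.

```python
import torch

def diagonal_fisher(model, batches, loss_fn):
    """Estimate diagonal Fisher importance as the mean squared gradient
    of the task loss over a list of (inputs, targets) batches."""
    fisher = {n: torch.zeros_like(p)
              for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for inputs, targets in batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    for n in fisher:
        fisher[n] /= max(len(batches), 1)
    return fisher

# toy usage: the anchors + fisher dicts are all EWC needs for the next step
net = torch.nn.Linear(4, 2)
data = [(torch.randn(8, 4), torch.randn(8, 2))]
fisher = diagonal_fisher(net, data, torch.nn.functional.mse_loss)
anchors = {n: p.detach().clone() for n, p in net.named_parameters()}
```

Storing only `anchors` and `fisher` (plus a teacher snapshot for the distillation methods) is what keeps the per-experience state compact and replay-free.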
4. Protocol and metrics
All domain-incremental experiments use the same RTMPose-t reference model with about 3M parameters, ground-truth detection boxes, and a strict budget of at most 1,000 labeled images and 10 epochs per experience. Reporting combines standard AP with two continual-learning metrics: retained accuracy (RA) for final multi-experience retention and average forgetting (AF).
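Retention metrics of this kind are usually computed from a lower-triangular evaluation matrix. The sketch below uses the common continual-learning definitions (final-row average for retention, best-minus-final for forgetting); the paper's exact RA/AF formulas may differ in detail, and the numbers in the example are made up.

```python
import numpy as np

def retention_metrics(acc):
    """acc[i][j]: AP on experience j's test set after adapting through
    experience i (entries with j > i are unused)."""
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0]
    ra = acc[-1].mean()                          # RA: mean final AP over experiences
    af = np.mean([acc[:-1, j].max() - acc[-1, j] # AF: mean drop from each earlier
                  for j in range(n - 1)])        # experience's best past AP
    return ra, af

# hypothetical 3-experience run (e.g. O5 -> O10 -> O20)
ra, af = retention_metrics([[70.1,  0.0,  0.0],
                            [60.0, 55.0,  0.0],
                            [50.0, 45.0, 40.0]])
```

Reading the example: RA averages the last row (final performance everywhere), while AF measures how far each earlier experience fell from its own peak.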
Why the framework matters
PoseAdapt is deliberately conservative in what it allows to change. The backbone remains fixed, replay is disallowed, and the budget is small. This makes the benchmark difficult, but it also makes it informative: if a method improves under these constraints, the gain is much more likely to reflect better adaptation strategy rather than hidden increases in capacity or access to old data.
Benchmarks
Domain-incremental track
| Benchmark | Experiences | How the shift is created |
|---|---|---|
| Density | O5, O10, O20 | COCO images are filtered by crowd level and combined with fixed-budget cutout occlusion affecting roughly 5%, 10%, and 20% of the image. |
| Lighting | WL, LL, VLL, ELL | Low-light images are selected by brightness scoring, then progressively darker variants are synthesized through controlled photometric degradation. |
| Modality | RGB, Gray, Depth | Grayscale is obtained by desaturation with light perturbations; Depth is produced using MiDaS relative-depth predictions tiled to three channels. |
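The two non-RGB modality inputs can be sketched as simple NumPy transforms. These are illustrative approximations only: the paper's exact desaturation perturbations are not specified here, and a real pipeline would obtain the depth map from MiDaS rather than the placeholder array below.

```python
import numpy as np

def to_gray3(rgb):
    """Desaturate an HxWx3 uint8 image and tile the result back to three
    channels, so the pose model's input shape is unchanged."""
    g = rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    return np.repeat(g[..., None], 3, axis=-1).astype(np.uint8)

def depth_to_3ch(depth):
    """Min-max normalize a relative-depth map (e.g. a MiDaS prediction)
    and tile it to three channels."""
    d = depth.astype(np.float32)
    span = float(d.max() - d.min())
    d = (d - d.min()) / (span if span > 0 else 1.0)
    return np.repeat((d * 255.0)[..., None], 3, axis=-1).astype(np.uint8)
```

Tiling to three channels is what lets the same RGB-pretrained backbone consume grayscale and depth inputs without any architectural change.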
Class-incremental track: PoseAdapt-BodyParts
The class-incremental benchmark isolates skeleton growth without any domain shift by using COCO-based training images throughout. The model must gradually extend its output space while retaining earlier keypoints.
| Experience | Keypoint set | Total |
|---|---|---|
| E1 | Body | 17 |
| E2 | + Feet | 23 |
| E3 | + Face | 91 |
| E4 | + Hands | 133 |
| E5 | + Spine | 142 |
This benchmark is defined and supported by the framework, but explicit quantitative evaluation is intentionally left for future work in the paper so that the experimental focus stays on domain shifts.
Reference setting. All reported domain-incremental experiments start from the same off-the-shelf RTMPose-t model pretrained on COCO and AIC. The reference domain is the well-lit COCO validation distribution, where the model reaches 70.06 AP before any continual adaptation.
Main results
Top-line benchmark summary
| Benchmark | Setting | Best reported outcome |
|---|---|---|
| Density | Single-step, light occlusion | LwF: 56.88 RA with 13.16 AF |
| Density | Sequential O5 → O10 → O20 | LFL: 51.02 RA; LwF: lowest AF at 5.97 |
| Lighting | Single-step LL / VLL / ELL | LFL: 57.81 / 52.82 / 41.57 RA with the strongest stability overall |
| Lighting | Sequential WL → LL → VLL → ELL | LFL: 42.15 RA |
| Modality | Single-step Gray | LFL: 53.46 AP on Gray and 51.13 RA |
| Modality | Single-step Depth | LwF: 37.49 AP on Depth |
| Modality | Sequential RGB → Gray → Depth | EWC: best RA at 20.57, but all methods collapse substantially |
Density is the mildest shift
Under increased crowding and cutout occlusion, forgetting remains moderate. LwF is slightly stronger for the easiest density step, while LFL becomes the more reliable strategy as the shift gets harder and in the full sequential setting.
Lighting stresses stability
As illumination drops from well-lit to extremely low light, performance degrades more sharply and the stability–plasticity trade-off becomes harder to manage. Among the tested regularizers, LFL is consistently the strongest under photometric degradation.
Depth remains unsolved
The modality benchmark is the most severe. Grayscale adaptation is manageable, but RGB-to-depth transfer causes a strong collapse in retained RGB performance. The paper uses this benchmark to show that regularization alone is not enough for robust cross-sensor adaptation.
Overall takeaway. Naive fine-tuning is brittle under the paper’s 1k-image / 10-epoch budget. LFL is the most reliable choice across photometric shifts, LwF is often the most plastic on the newest target domain, and EWC can preserve earlier domains slightly better in the modality sequence. None of the tested methods resolves the RGB-to-depth gap.
Resources
Paper and code
The project is available through the PoseAdapt repository, together with the arXiv preprint and the WACV paper PDF.
User documentation
Setup and usage instructions are maintained in the repository documentation for both Strategy Benchmarking and Model Adaptation workflows.
Benchmark package
Benchmark assets and supplementary material are linked through PoseAdaptBench.
Talk and slides
The recorded presentation and the embedded slides below summarize the framework, benchmark design, and experimental takeaways in a compact format.
Implementation note
The paper explicitly notes that the reported WACV results were run from the wacv-2026-camera-ready branch. Keep that in mind when matching exact paper numbers against the evolving repository.
Intended use
PoseAdapt is best used as a controlled benchmark for continual pose adaptation research and as an engineering scaffold for adapting pretrained top-down 2D pose estimators to new domains or new skeleton definitions under limited supervision.
Limitations and future directions
- Most domain shifts are synthetic. This keeps the benchmark controllable, but it does not fully reproduce real sensor noise, motion artefacts, or ecological variability.
- The benchmark fixes the backbone to isolate adaptation strategy effects, so it does not test whether architectural changes would be more robust under severe shifts.
- The current study only covers 2D single-frame pose estimation. It does not address temporal consistency, video adaptation, or 3D pose estimation.
- Skeleton growth is supported in the framework, but the paper leaves class-incremental quantitative evaluation to future work.
- Joint domain-and-keypoint evolution in one real-world deployment stream is not yet benchmarked.
Poster
Poster overview for the PoseAdapt framework, benchmarks, and main continual-learning results.
Acknowledgement
This work was co-funded by the European Union’s Horizon Europe research and innovation programme under Grant Agreement No. 101135724 (LUMINOUS) and Grant Agreement No. 101092889 (SHARESPACE).
BibTeX
@inproceedings{khan2026poseadapt,
title={PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit},
author={Khan, Muhammad Saif Ullah and Stricker, Didier},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month={March},
year={2026},
pages={6840--6850}
}
Maintained by saifkhichi96 on GitHub.
The components of this website are distributed under their respective open-source licenses; for details, see the notice at the bottom of the page.