PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit
A continual-learning framework and benchmark suite for adapting 2D human pose estimators across domain shifts and evolving skeletons.
PoseAdapt is publicly available as an open-source toolkit. The paper reports experiments from the wacv-2026-camera-ready branch of the repository.
- Tracks: 2 (domain-incremental and class-incremental)
- Backbone: RTMPose-t (~3M parameters)
- Budget: 1k images / 10 epochs per experience
- Skeleton growth: 17 → 142 keypoints
PoseAdapt formalizes continual pose adaptation under fixed-capacity, no-replay constraints and evaluates it across domain and skeleton shifts.
Abstract
Human pose estimators are typically retrained from scratch or naively fine-tuned whenever keypoint sets, sensing modalities, or deployment domains change—an inefficient, compute-intensive practice that rarely matches field constraints. We present PoseAdapt, an open-source framework and benchmark suite for continual pose model adaptation. PoseAdapt defines domain-incremental and class-incremental tracks that simulate realistic changes in density, lighting, and sensing modality, as well as skeleton growth. The toolkit supports two workflows: (i) Strategy Benchmarking, which lets researchers implement continual learning (CL) methods as plugins and evaluate them under standardized protocols; and (ii) Model Adaptation, which allows practitioners to adapt strong pretrained models to new tasks with minimal supervision. We evaluate representative regularization-based methods in single-step and sequential settings. Benchmarks enforce a fixed lightweight backbone, no access to past data, and tight per-step budgets. This isolates adaptation strategy effects, highlighting the difficulty of maintaining accuracy under strict resource limits. PoseAdapt connects modern CL techniques with practical pose estimation needs, enabling adaptable models that improve over time without repeated full retraining.
Highlights
Standardizes domain-incremental and class-incremental pose adaptation with fixed backbones, no past-data replay, and shared retention metrics.
Provides a plug-in framework on top of MMPose with explicit initialization, adaptation, and finalization stages plus head expansion for new keypoints.
Reference baselines show LFL is the most reliable under photometric shifts, while RGB-to-depth adaptation remains an open challenge.
Video Presentation
Slides
Overview
PoseAdapt treats changing pose-estimation requirements as a continual learning problem. Instead of retraining a model from scratch for every new domain or naively fine-tuning it until it forgets earlier capabilities, PoseAdapt exposes a controlled adaptation loop with fixed compute, fixed model capacity, and no replay of past data.
The paper targets a practical gap in human pose estimation systems. Real deployments rarely remain stationary: illumination changes, crowding increases, sensing modality can shift, and application-specific skeletons may grow over time. Standard static training pipelines do not provide a principled way to adapt under these conditions without either resource waste or catastrophic forgetting.
PoseAdapt addresses this with two complementary views of adaptation. The first is a benchmarking workflow for researchers, where continual learning strategies can be compared fairly under shared protocols. The second is a model-adaptation workflow for practitioners, where a strong pretrained estimator can be specialized to new data without discarding previous competence.
This project page summarizes the framework, the benchmark design, and the main findings reported in the paper, focusing on what makes PoseAdapt useful as both an engineering toolkit and a research testbed for sustainable pose estimation.
What PoseAdapt standardizes
- Fixed lightweight backbones for fair comparison.
- No access to old training data during later experiences.
- Tight per-experience budgets to reflect deployment constraints.
- Shared continual-learning metrics for retention and forgetting.
- Support for both domain shifts and skeleton growth.
Framework
1. Initialization
Each new experience starts by preparing the model for adaptation. PoseAdapt can create a frozen teacher snapshot for regularization-based methods such as LFL and LwF, or preserve parameter anchors and importance statistics for EWC. In class-incremental settings, the prediction head is expanded to support newly introduced keypoints while keeping existing channels intact.
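The head-expansion step can be made concrete with a minimal PyTorch sketch. This is not the MMPose implementation (an RTMPose head has more structure than a single convolution); it only illustrates the principle of keeping existing keypoint channels intact while growing the output space. The function name and the 1×1-conv head are assumptions for illustration.

```python
import torch
import torch.nn as nn

def expand_keypoint_head(head: nn.Conv2d, new_total_kpts: int) -> nn.Conv2d:
    """Grow a heatmap head to new_total_kpts output channels, copying the
    existing filters so old-keypoint predictions are unchanged at expansion."""
    old_kpts = head.out_channels
    assert new_total_kpts > old_kpts, "can only grow the keypoint set"
    new_head = nn.Conv2d(
        head.in_channels, new_total_kpts,
        kernel_size=head.kernel_size, stride=head.stride,
        padding=head.padding, bias=head.bias is not None,
    )
    with torch.no_grad():
        new_head.weight[:old_kpts].copy_(head.weight)
        if head.bias is not None:
            new_head.bias[:old_kpts].copy_(head.bias)
    return new_head

# E1 -> E2 in PoseAdapt-BodyParts: body (17 kpts) -> body + feet (23 kpts)
old = nn.Conv2d(256, 17, kernel_size=1)
new = expand_keypoint_head(old, 23)
```

Only the new channels start from random initialization, so the regularizer from the teacher snapshot still has stable old-keypoint targets to anchor against.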
2. Adaptation
The current experience is optimized with the task loss plus a strategy-dependent regularizer. The paper evaluates four reference strategies: naive fine-tuning (FT), Elastic Weight Consolidation (EWC), Less-Forgetful Learning (LFL), and Learning without Forgetting (LwF). Because all strategies share the same backbone and head, observed differences reflect continual-learning behavior rather than architectural changes.
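The three regularizers can be sketched in a few lines of PyTorch. These are textbook forms of LFL, LwF, and EWC, not the paper's exact implementation: the LwF sketch is the classification-style softened-softmax distillation (a heatmap head would distill per-keypoint spatial distributions instead), and the function names and `lam` weights are placeholders.

```python
import torch
import torch.nn.functional as F

def lfl_regularizer(student_feats, teacher_feats, lam=1.0):
    # LFL: penalize drift of backbone features away from the frozen teacher
    return lam * F.mse_loss(student_feats, teacher_feats)

def lwf_regularizer(student_logits, teacher_logits, T=2.0, lam=1.0):
    # LwF: distill the teacher's softened predictions on current-task data
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return lam * F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def ewc_regularizer(model, anchors, fisher, lam=1.0):
    # EWC: quadratic penalty on parameter drift, weighted by Fisher importance
    pen = sum((fisher[n] * (p - anchors[n]) ** 2).sum()
              for n, p in model.named_parameters() if n in fisher)
    return lam * pen
```

In each case the total training objective is simply `task_loss + regularizer`; naive fine-tuning (FT) corresponds to dropping the regularizer entirely.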
3. Finalization
After each experience, PoseAdapt stores only the compact state needed for the next step: updated teacher snapshots for distillation-based methods or Fisher-style importance estimates for EWC. Past training images are not stored or replayed, which keeps the protocol aligned with memory and privacy constraints.
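The "Fisher-style importance estimates" kept for EWC are typically a diagonal approximation: the mean squared gradient of the task loss over a few batches. The sketch below shows this common recipe on a toy model; the loader, loss, and model here are hypothetical stand-ins, not PoseAdapt internals.

```python
import torch

def diagonal_fisher(model, batches, loss_fn):
    """Estimate diagonal Fisher importance as the mean squared gradient
    of the task loss over a list of (inputs, targets) batches."""
    fisher = {n: torch.zeros_like(p)
              for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for inputs, targets in batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    for n in fisher:
        fisher[n] /= max(len(batches), 1)
    return fisher

# toy usage: the anchors + fisher dicts are all EWC needs for the next step
net = torch.nn.Linear(4, 2)
data = [(torch.randn(8, 4), torch.randn(8, 2))]
fisher = diagonal_fisher(net, data, torch.nn.functional.mse_loss)
anchors = {n: p.detach().clone() for n, p in net.named_parameters()}
```

Storing only `anchors` and `fisher` (plus a teacher snapshot for the distillation methods) is what keeps the per-experience state compact and replay-free.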
4. Protocol and metrics
All domain-incremental experiments use the same RTMPose-t reference model with about 3M parameters, ground-truth detection boxes, and a strict budget of at most 1,000 labeled images and 10 epochs per experience. Reporting combines standard AP with two continual-learning metrics: retained accuracy (RA) for final multi-experience retention and average forgetting (AF).
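Retention metrics of this kind are usually computed from a lower-triangular evaluation matrix. The sketch below uses the common continual-learning definitions (final-row average for retention, best-minus-final for forgetting); the paper's exact RA/AF formulas may differ in detail, and the numbers in the example are made up.

```python
import numpy as np

def retention_metrics(acc):
    """acc[i][j]: AP on experience j's test set after adapting through
    experience i (entries with j > i are unused)."""
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0]
    ra = acc[-1].mean()                          # RA: mean final AP over experiences
    af = np.mean([acc[:-1, j].max() - acc[-1, j] # AF: mean drop from each earlier
                  for j in range(n - 1)])        # experience's best past AP
    return ra, af

# hypothetical 3-experience run (e.g. O5 -> O10 -> O20)
ra, af = retention_metrics([[70.1,  0.0,  0.0],
                            [60.0, 55.0,  0.0],
                            [50.0, 45.0, 40.0]])
```

Reading the example: RA averages the last row (final performance everywhere), while AF measures how far each earlier experience fell from its own peak.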
Why the framework matters
PoseAdapt is deliberately conservative in what it allows to change. The backbone remains fixed, replay is disallowed, and the budget is small. This makes the benchmark difficult, but it also makes it informative: if a method improves under these constraints, the gain is much more likely to reflect better adaptation strategy rather than hidden increases in capacity or access to old data.
Benchmarks
Domain-incremental track
| Benchmark | Experiences | How the shift is created |
|---|---|---|
| Density | O5, O10, O20 | COCO images are filtered by crowd level and combined with fixed-budget cutout occlusion affecting roughly 5%, 10%, and 20% of the image. |
| Lighting | WL, LL, VLL, ELL | Low-light images are selected by brightness scoring, then progressively darker variants are synthesized through controlled photometric degradation. |
| Modality | RGB, Gray, Depth | Grayscale is obtained by desaturation with light perturbations; Depth is produced using MiDaS relative-depth predictions tiled to three channels. |
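The two non-RGB modality inputs can be sketched as simple NumPy transforms. These are illustrative approximations only: the paper's exact desaturation perturbations are not specified here, and a real pipeline would obtain the depth map from MiDaS rather than the placeholder array below.

```python
import numpy as np

def to_gray3(rgb):
    """Desaturate an HxWx3 uint8 image and tile the result back to three
    channels, so the pose model's input shape is unchanged."""
    g = rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    return np.repeat(g[..., None], 3, axis=-1).astype(np.uint8)

def depth_to_3ch(depth):
    """Min-max normalize a relative-depth map (e.g. a MiDaS prediction)
    and tile it to three channels."""
    d = depth.astype(np.float32)
    span = float(d.max() - d.min())
    d = (d - d.min()) / (span if span > 0 else 1.0)
    return np.repeat((d * 255.0)[..., None], 3, axis=-1).astype(np.uint8)
```

Tiling to three channels is what lets the same RGB-pretrained backbone consume grayscale and depth inputs without any architectural change.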
Class-incremental track: PoseAdapt-BodyParts
The class-incremental benchmark isolates skeleton growth without any domain shift by using COCO-based training images throughout. The model must gradually extend its output space while retaining earlier keypoints.
| Experience | Keypoint set | Total |
|---|---|---|
| E1 | Body | 17 |
| E2 | + Feet | 23 |
| E3 | + Face | 91 |
| E4 | + Hands | 133 |
| E5 | + Spine | 142 |
This benchmark is defined and supported by the framework, but explicit quantitative evaluation is intentionally left for future work in the paper so that the experimental focus stays on domain shifts.
Reference setting. All reported domain-incremental experiments start from the same off-the-shelf RTMPose-t model pretrained on COCO and AIC. The reference domain is the well-lit COCO validation distribution, where the model reaches 70.06 AP before any continual adaptation.
Main results
Top-line benchmark summary
| Benchmark | Setting | Best reported outcome |
|---|---|---|
| Density | Single-step, light occlusion | LwF: 56.88 RA with 13.16 AF |
| Density | Sequential O5 → O10 → O20 | LFL: 51.02 RA; LwF: lowest AF at 5.97 |
| Lighting | Single-step LL / VLL / ELL | LFL: 57.81 / 52.82 / 41.57 RA with the strongest stability overall |
| Lighting | Sequential WL → LL → VLL → ELL | LFL: 42.15 RA |
| Modality | Single-step Gray | LFL: 53.46 AP on Gray and 51.13 RA |
| Modality | Single-step Depth | LwF: 37.49 AP on Depth |
| Modality | Sequential RGB → Gray → Depth | EWC: best RA at 20.57, but all methods collapse substantially |
Density is the mildest shift
Under increased crowding and cutout occlusion, forgetting remains moderate. LwF is slightly stronger for the easiest density step, while LFL becomes the more reliable strategy as the shift gets harder and in the full sequential setting.
Lighting stresses stability
As illumination drops from well-lit to extremely low light, performance degrades more sharply and the stability–plasticity trade-off becomes harder to manage. Among the tested regularizers, LFL is consistently the strongest under photometric degradation.
Depth remains unsolved
The modality benchmark is the most severe. Grayscale adaptation is manageable, but RGB-to-depth transfer causes a strong collapse in retained RGB performance. The paper uses this benchmark to show that regularization alone is not enough for robust cross-sensor adaptation.
Overall takeaway. Naive fine-tuning is brittle under the paper’s 1k-image / 10-epoch budget. LFL is the most reliable choice across photometric shifts, LwF is often the most plastic on the newest target domain, and EWC can preserve earlier domains slightly better in the modality sequence. None of the tested methods resolves the RGB-to-depth gap.
Resources
Paper and code
The project is available through the PoseAdapt repository, together with the arXiv preprint and the WACV paper PDF.
User documentation
Setup and usage instructions are maintained in the repository documentation for both Strategy Benchmarking and Model Adaptation workflows.
Benchmark package
Benchmark assets and supplementary material are linked through PoseAdaptBench.
Talk and slides
The recorded presentation and the embedded slides below summarize the framework, benchmark design, and experimental takeaways in a compact format.
Implementation note
The paper explicitly notes that the reported WACV results were run from the wacv-2026-camera-ready branch. Keep that in mind when matching exact paper numbers against the evolving repository.
Intended use
PoseAdapt is best used as a controlled benchmark for continual pose adaptation research and as an engineering scaffold for adapting pretrained top-down 2D pose estimators to new domains or new skeleton definitions under limited supervision.
Limitations and future directions
- Most domain shifts are synthetic. This keeps the benchmark controllable, but it does not fully reproduce real sensor noise, motion artefacts, or ecological variability.
- The benchmark fixes the backbone to isolate adaptation strategy effects, so it does not test whether architectural changes would be more robust under severe shifts.
- The current study only covers 2D single-frame pose estimation. It does not address temporal consistency, video adaptation, or 3D pose estimation.
- Skeleton growth is supported in the framework, but the paper leaves class-incremental quantitative evaluation to future work.
- Joint domain-and-keypoint evolution in one real-world deployment stream is not yet benchmarked.
Poster
Poster overview for the PoseAdapt framework, benchmarks, and main continual-learning results.
Acknowledgement
This work was co-funded by the European Union’s Horizon Europe research and innovation programme under Grant Agreement No. 101135724 (LUMINOUS) and Grant Agreement No. 101092889 (SHARESPACE).
BibTeX
@inproceedings{khan2026poseadapt,
title={PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit},
author={Khan, Muhammad Saif Ullah and Stricker, Didier},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month={March},
year={2026},
pages={6840--6850}
}
Maintained by saifkhichi96 on GitHub.
The components of this website are distributed under their respective open-source licenses; for details, see the notice at the bottom of the page.