This repository presents the implementation of several Pose Estimation models, a core task in Computer Vision that enables machines to infer the position and orientation of humans, animals, or objects in images and videos by identifying specific points, commonly referred to as keypoints or landmarks. These keypoints can represent joints, limbs, facial features, or other distinctive parts.
Pose estimation methods generally follow two main approaches: bottom-up and top-down.
- In the bottom-up approach, the model first detects all individual keypoints across the entire image using probabilistic maps (heatmaps) that estimate, for each pixel, the likelihood that it corresponds to a specific keypoint. Non-maximum suppression is then applied to select the most confident candidates, which in multi-instance settings must afterwards be grouped into individual instances. This approach is efficient but often less accurate; a minimal sketch of the peak-picking step follows this list.
- In the top-down approach, the model first detects bounding boxes for each instance and then predicts the keypoints within them. This method provides higher accuracy and scale invariance but is computationally expensive, especially as the number of instances increases.
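The heatmap peak-picking mentioned above can be illustrated with a short, framework-agnostic sketch. The 5×5 window and 0.3 threshold are arbitrary illustration values, not settings used in this repository:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_keypoint_candidates(heatmaps, threshold=0.3):
    """Select heatmap peaks with a simple window-based non-maximum suppression.

    heatmaps: array of shape (K, H, W), one probability map per keypoint type.
    Returns a list of (keypoint_id, y, x, score) candidates.
    """
    candidates = []
    for k, hm in enumerate(heatmaps):
        # A pixel survives only if it is the maximum of its 5x5 neighbourhood
        # and its score exceeds the confidence threshold.
        peaks = (hm == maximum_filter(hm, size=5)) & (hm > threshold)
        for y, x in zip(*np.nonzero(peaks)):
            candidates.append((k, int(y), int(x), float(hm[y, x])))
    return candidates
```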
Recent models such as YOLO11-pose combine the strengths of both approaches: by detecting instances and estimating their poses in a single unified process, they avoid the heatmap generation and manual grouping of bottom-up pipelines while retaining their efficiency, and they approach the per-instance precision of top-down methods without the cost of a separate detection stage.
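As a usage illustration (not the exact code from the notebooks), running a pretrained YOLO11-pose checkpoint with Ultralytics returns boxes and keypoints from a single forward pass; the weight file and image path below are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolo11l-pose.pt")      # pretrained pose checkpoint (placeholder name)
results = model("example.jpg")       # detection + pose estimation in one unified pass

for result in results:
    print(result.boxes.xyxy)         # one bounding box per detected instance
    print(result.keypoints.xy)       # (instances, keypoints, 2) keypoint coordinates
```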
Pose estimation has become a cornerstone in multiple domains, including:
- Healthcare and rehabilitation: motion analysis for physiotherapy, remote patient monitoring, and detection of postural anomalies.
- Sports and performance: athlete technique evaluation, automated repetition counting, and injury prevention via real-time posture correction.
- Animal research: behavioral studies, species monitoring in the wild, and welfare assessment in farms or labs.
- Human-computer interaction (HCI): gesture-based interfaces, augmented reality, and touchless controls.
- Surveillance and safety: suspicious behavior recognition and fall detection in sensitive environments such as hospitals or care facilities.
- Entertainment and media: motion capture, animation, and video games.
- Industry and robotics: human-robot collaboration, ergonomics in assembly lines, and task assistance in manufacturing.
All projects leverage transfer learning: models pretrained on large-scale datasets are fine-tuned on each task's dataset using frameworks such as TensorFlow, PyTorch, and Ultralytics.
- Basic models were fine-tuned on single-class datasets with one instance per image, following a bottom-up, heatmap-based approach.
- Advanced models (YOLO11-pose), designed for real-time applications, were trained on multi-class, multi-instance datasets (a minimal fine-tuning sketch follows this list).
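A minimal sketch of the transfer-learning setup for the YOLO11-pose projects, assuming a dataset config file named pose_dataset.yaml (placeholder); the actual hyperparameters are defined in each notebook:

```python
from ultralytics import YOLO

# Start from pretrained weights and fine-tune on a custom pose dataset.
model = YOLO("yolo11l-pose.pt")
model.train(
    data="pose_dataset.yaml",  # placeholder: dataset paths, class names, keypoint shape
    epochs=100,                # illustrative values, not the notebooks' settings
    imgsz=640,
    batch=16,
)
```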
Training is carried out in Google Colab using TPUs or GPUs, depending on project requirements.
All notebooks incorporate data augmentation to improve generalization, either manually with Albumentations or automatically (e.g., in YOLO11-pose). Additionally, callbacks and learning rate schedulers are used to prevent overfitting and enhance performance.
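For the manually augmented (non-YOLO) notebooks, a keypoint-aware Albumentations pipeline and standard Keras callbacks look roughly like the sketch below; the specific transforms, probabilities, and patience values are illustrative assumptions:

```python
import albumentations as A
import tensorflow as tf

# Geometric/photometric augmentations that also transform the keypoint coordinates.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.7),
        A.RandomBrightnessContrast(p=0.3),
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)
# usage: augmented = augment(image=image, keypoints=keypoints)

# Callbacks and learning-rate scheduling of the kind mentioned above (Keras built-ins).
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]
```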
Below are the evaluation results of the models implemented so far. When validation or test sets were not publicly available, evaluations were performed only on the accessible split.
| Dataset | Task | Model | Box mAP@50 | Box mAP@50-95 | Pose mAP@50 | Pose mAP@50-95 | Eval. Set |
|---|---|---|---|---|---|---|---|
| AP-10K | Multi-species animal pose estimation | YOLO11l-pose | 0.951 / 0.938 | 0.799 / 0.788 | 0.901 / 0.874 | 0.589 / 0.575 | Validation / Test |
| OpenThermalPose2 | Human pose estimation | YOLO11l-pose | 0.995 / 0.995 | 0.979 / 0.967 | 0.991 / 0.987 | 0.94 / 0.934 | Validation / Test |
| OneHand10K | Hand pose estimation | YOLO11s-pose | 0.995 | 0.816 | 0.954 | 0.519 | Test |
| Dataset | Task | Model | | | Eval. Set |
|---|---|---|---|---|---|
| CUB-200-2011 | Animal pose estimation | ConvNeXt-Base U-Net | 0.929 | 0.938 | Test |
| COFW | Face landmark estimation | ConvNeXt-Base U-Net | - | 0.957 | Test |
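The YOLO11-pose metrics in the first table can be reproduced with Ultralytics' built-in validator; a minimal sketch, assuming a fine-tuned checkpoint and dataset config at the placeholder paths below:

```python
from ultralytics import YOLO

model = YOLO("runs/pose/train/weights/best.pt")              # placeholder checkpoint path
metrics = model.val(data="pose_dataset.yaml", split="test")  # or split="val"

print(metrics.box.map50, metrics.box.map)    # box mAP@50 and mAP@50-95
print(metrics.pose.map50, metrics.pose.map)  # pose (keypoint) mAP@50 and mAP@50-95
```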