Haiyi Mei¹, Chi Sing Leung², Ziwei Liu⁴, Lei Yang¹,⁵, Zhongang Cai✉ ¹,⁴,⁵
³International Digital Economy Academy (IDEA), ⁴S-Lab, Nanyang Technological University, ⁵Shanghai AI Laboratory
AiOS performs human localization and SMPL-X estimation in a progressive manner. It is composed of (1) the Body Localization stage, which predicts coarse human locations; (2) the Body Refinement stage, which refines body features and produces face and hand locations; and (3) the Whole-body Refinement stage, which refines whole-body features and regresses SMPL-X parameters.
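The three stages can be read as a single progressive forward pass. Below is a minimal, hypothetical Python sketch of that control flow; the stage and function names are illustrative only and do not correspond to the actual classes in this repository.

```python
# Hypothetical sketch of AiOS's progressive pipeline (names are illustrative, not repo APIs).
def aios_forward(image_features, body_localizer, body_refiner, wholebody_refiner):
    # (1) Body Localization: predict coarse per-person locations.
    body_locations = body_localizer(image_features)

    # (2) Body Refinement: refine body features and predict face and hand locations.
    body_features, face_locations, hand_locations = body_refiner(
        image_features, body_locations)

    # (3) Whole-body Refinement: refine whole-body features and regress SMPL-X parameters.
    smplx_params = wholebody_refiner(
        image_features, body_features, face_locations, hand_locations)
    return smplx_params
```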
- download all datasets
- process all datasets into HumanData format. We provide the processed npz files, which can be downloaded from here (see the sanity-check sketch after this list).
- download SMPL-X
- download AiOS checkpoint
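Once the cache files are in place, a quick way to sanity-check them is to open one with numpy. This is only a suggested check, assuming the caches are standard .npz archives; the field names inside depend on the HumanData format, so print them rather than relying on specific keys.

```python
# Suggested sanity check for a preprocessed HumanData cache (not part of the official tooling).
import numpy as np

cache = np.load("data/preprocessed_npz/cache/coco_train_cache_080824.npz", allow_pickle=True)
print(cache.files)  # list the fields stored in the archive
for key in cache.files:
    value = cache[key]
    print(key, getattr(value, "shape", None), getattr(value, "dtype", None))
```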
The file structure should be like:
AiOS/
├── config/
└── data/
    ├── body_models/
    │   ├── smplx/
    │   │   ├── MANO_SMPLX_vertex_ids.pkl
    │   │   ├── SMPL-X__FLAME_vertex_ids.npy
    │   │   ├── SMPLX_NEUTRAL.pkl
    │   │   ├── SMPLX_to_J14.pkl
    │   │   ├── SMPLX_NEUTRAL.npz
    │   │   ├── SMPLX_MALE.npz
    │   │   └── SMPLX_FEMALE.npz
    │   └── smpl/
    │       ├── SMPL_FEMALE.pkl
    │       ├── SMPL_MALE.pkl
    │       └── SMPL_NEUTRAL.pkl
    ├── preprocessed_npz/
    │   └── cache/
    │       ├── agora_train_3840_w_occ_cache_2010.npz
    │       ├── bedlam_train_cache_080824.npz
    │       ├── ...
    │       └── coco_train_cache_080824.npz
    ├── checkpoint/
    │   ├── edpose_r50_coco.pth
    │   └── aios_checkpoint.pth
    └── datasets/
        ├── agora/
        │   └── 3840x2160/
        │       ├── train/
        │       └── test/
        ├── bedlam/
        │   ├── train_images/
        │   └── test_images/
        ├── ARCTIC/
        │   ├── s01/
        │   ├── s02/
        │   ├── ...
        │   └── s10/
        ├── EgoBody/
        │   ├── egocentric_color/
        │   └── kinect_color/
        └── UBody/
            └── images/
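Before launching training or inference, the expected layout can be verified with a short script. The helper below is only an illustrative convenience (not part of the repository); it checks that a few of the paths from the tree above exist.

```python
# Illustrative helper: verify that the expected data layout exists (paths taken from the tree above).
from pathlib import Path

REQUIRED = [
    "data/body_models/smplx/SMPLX_NEUTRAL.npz",
    "data/body_models/smpl/SMPL_NEUTRAL.pkl",
    "data/preprocessed_npz/cache",
    "data/checkpoint/aios_checkpoint.pth",
    "data/datasets",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print("  -", p)
else:
    print("Data layout looks complete.")
```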
# Create a conda virtual environment and activate it.
conda create -n aios python=3.8 -y
conda activate aios
# Install PyTorch and torchvision.
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
# Install Pytorch3D
git clone -b v0.6.1 https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
pip install -v -e .
cd ..
# Install MMCV, build from source
git clone -b v1.6.1 https://github.com/open-mmlab/mmcv.git
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
pip install -v -e .
cd ..
# Install other dependencies
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
# Build deformable detr
cd models/aios/ops
python setup.py build install
cd ../../..
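After the build steps finish, a quick import check can confirm the environment is usable. The following is a suggested sanity check, assuming the versions installed above; it is not one of the repository's scripts.

```python
# Suggested environment sanity check (not an official script).
import torch
import torchvision
import mmcv
import pytorch3d

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("mmcv:", mmcv.__version__)
print("pytorch3d:", pytorch3d.__version__)
```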
- Place the mp4 video for inference under AiOS/demo/
- Prepare the pretrained models to be used for inference under AiOS/data/checkpoint
- Inference output will be saved in AiOS/demo/{INPUT_VIDEO}_out
# CHECKPOINT: checkpoint path
# INPUT_VIDEO: input video path
# OUTPUT_DIR: output path
# NUM_PERSON: maximum number of persons. This parameter sets the expected number of persons to be detected in the input (image or video).
# The default value is 1, meaning the algorithm will try to detect at least one person. If you know the maximum number of persons
# that can appear simultaneously, set this variable to that number to optimize detection (a lower score threshold is recommended as well).
# THRESHOLD: score threshold for person detection. The default value is 0.5.
# Detections with a confidence score below this threshold are discarded.
# Adjusting this threshold helps filter out false positives or keep only high-confidence detections.
# GPU_NUM: number of GPUs to use.
sh scripts/inference.sh {CHECKPOINT} {INPUT_VIDEO} {OUTPUT_DIR} {NUM_PERSON} {THRESHOLD} {GPU_NUM}
# For running inference on short_video.mp4 with output directory demo/short_video_out
sh scripts/inference.sh data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 8

Results on the BEDLAM and AGORA benchmarks (errors in mm; FB = full body, B = body, F = face, LH/RH = left/right hand):

|          | NMVE |    | NMJE |    | MVE |    |    |       | MPJPE |    |    |       |
|----------|------|----|------|----|-----|----|----|-------|-------|----|----|-------|
| DATASETS | FB   | B  | FB   | B  | FB  | B  | F  | LH/RH | FB    | B  | F  | LH/RH |
| BEDLAM | 87.6 | 57.7 | 85.8 | 57.7 | 83.2 | 54.8 | 26.2 | 28.1/30.8 | 81.5 | 54.8 | 26.2 | 25.9/28.0 |
| AGORA-Test | 102.9 | 63.4 | 100.7 | 62.5 | 98.8 | 60.9 | 27.7 | 42.5/43.4 | 96.7 | 60.0 | 29.2 | 40.1/41.0 |
| AGORA-Val | 105.1 | 60.9 | 102.2 | 61.4 | 100.9 | 60.9 | 30.6 | 43.9/45.6 | 98.1 | 58.9 | 32.7 | 41.5/43.4 |
a. Make the test_result dir
mkdir test_result

b. AGORA Validation
Run the following command and it will generate a predictions/ folder that can be evaluated with the AGORA evaluation tool.
sh scripts/test_agora_val.sh data/checkpoint/aios_checkpoint.pth agora_val

c. AGORA Test Leaderboard
Run the following command and it will generate a predictions.zip that can be submitted to the AGORA leaderboard.
sh scripts/test_agora.sh data/checkpoint/aios_checkpoint.pth agora_test

d. BEDLAM
Run the following command and it will generate a predictions.zip that can be submitted to the BEDLAM leaderboard.
sh scripts/test_bedlam.sh data/checkpoint/aios_checkpoint.pth bedlam_test

Some of the code is based on MMHuman3D, ED-Pose, and SMPLer-X.
@InProceedings{Sun_2024_CVPR,
author = {Sun, Qingping and Wang, Yanjun and Zeng, Ailing and Yin, Wanqi and Wei, Chen and Wang, Wenjia and Mei, Haiyi and Leung, Chi-Sing and Liu, Ziwei and Yang, Lei and Cai, Zhongang},
title = {AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {1834-1843}
}
Explore More Motrix Projects
- [SMPL-X] [TPAMI'25] SMPLest-X: An extended version of SMPLer-X with stronger foundation models.
- [SMPL-X] [NeurIPS'23] SMPLer-X: Scaling up EHPS towards a family of generalist foundation models.
- [SMPL-X] [ECCV'24] WHAC: World-grounded human pose and camera estimation from monocular videos.
- [SMPL-X] [NeurIPS'23] RoboSMPLX: A framework to enhance the robustness of whole-body pose and shape estimation.
- [SMPL-X] [ICML'25] ADHMR: A framework to align diffusion-based human mesh recovery methods via direct preference optimization.
- [SMPL-X] MKA: Full-body 3D mesh reconstruction from single- or multi-view RGB videos.
- [SMPL] [ICCV'23] Zolly: 3D human mesh reconstruction from perspective-distorted images.
- [SMPL] [IJCV'26] PointHPS: 3D HPS from point clouds captured in real-world settings.
- [SMPL] [NeurIPS'22] HMR-Benchmarks: A comprehensive benchmark of HPS datasets, backbones, and training strategies.
- [SMPL-X] [ICLR'26] ViMoGen: A comprehensive framework that transfers knowledge from ViGen to MoGen across data, modeling, and evaluation.
- [SMPL-X] [ECCV'24] LMM: Large Motion Model for Unified Multi-Modal Motion Generation.
- [SMPL-X] [NeurIPS'23] FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing.
- [SMPL] InfiniteDance: A large-scale 3D dance dataset and an MLLM-based music-to-dance model designed for robust in-the-wild generalization.
- [SMPL] [NeurIPS'23] InsActor: Generating physics-based human motions from language and waypoint conditions via diffusion policies.
- [SMPL] [ICCV'23] ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model.
- [SMPL] [TPAMI'24] MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model.