The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction
ViDiHand satisfies the target properties of occlusion robustness, accuracy, and temporal smoothness for 4D hand recovery.
The VACE branch is finetuned with hand-overlay rendering while the base DiT remains frozen, yielding a hand-aware video diffusion model.
A lightweight dual-branch decoder extracts MANO pose, 2D joints, and translation from a single intermediate VACE feature.
At inference, the same feature is decoded in a single VACE pass.
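To make the decoder's outputs concrete, here is a minimal sketch of how two branch outputs could be unpacked into the three quantities named above. The page does not say how work is split between the two branches, so the split used here (one branch for MANO pose plus translation, one for 2D joints), the flat vector layout, and the names `HandPrediction` and `unpack_branches` are all hypothetical. The sizes follow common MANO conventions: 16 joints with 3 axis-angle values each, and 21 keypoints.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Common MANO conventions (not stated on this page):
# 16 joints (wrist + 15 finger joints) x 3 axis-angle values, 21 keypoints.
MANO_POSE_DIM = 16 * 3
NUM_KEYPOINTS = 21
TRANSL_DIM = 3


@dataclass
class HandPrediction:
    mano_pose: List[float]                # 48 axis-angle parameters
    joints_2d: List[Tuple[float, float]]  # 21 (x, y) image-plane points
    translation: List[float]              # 3D hand translation


def unpack_branches(branch_3d: Sequence[float],
                    branch_2d: Sequence[float]) -> HandPrediction:
    """Split two flat branch outputs into named fields.

    Assumed (hypothetical) layout:
      branch_3d = [pose (48 values) | translation (3 values)]
      branch_2d = [x0, y0, x1, y1, ...] for 21 keypoints
    """
    if len(branch_3d) != MANO_POSE_DIM + TRANSL_DIM:
        raise ValueError("branch_3d must hold 48 pose + 3 translation values")
    if len(branch_2d) != NUM_KEYPOINTS * 2:
        raise ValueError("branch_2d must hold 21 (x, y) pairs")

    pose = list(branch_3d[:MANO_POSE_DIM])
    translation = list(branch_3d[MANO_POSE_DIM:])
    joints = [(branch_2d[2 * i], branch_2d[2 * i + 1])
              for i in range(NUM_KEYPOINTS)]
    return HandPrediction(pose, joints, translation)
```

In a real model the two input vectors would come from learned heads applied to the same intermediate VACE feature; the sketch only illustrates the shape of what comes out.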
Comparison on three egocentric benchmarks. ARCTIC and HOT3D are in-distribution; HOI4D is held out, providing a cross-dataset comparison that is fair to all methods. ↑ means higher is better, ↓ means lower is better.
Metric groups, left to right: Detection, 3D Pose, Orient. & Position, Temporal.

| Dataset | Method | FAcc ↑ | Recall ↑ | F1 ↑ | MPJPE-p ↓ | PA-p ↓ | EPE-p ↓ | GO-p ↓ | CT-p ↓ | Jitter ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| ARCTIC | InterWild | 0.878 | 0.943 | 0.959 | 30.817 | 15.952 | 53.888 | 25.386 | 0.097 | 46.577 |
| ARCTIC | HaMeR | 0.876 | 0.943 | 0.957 | 39.976 | 29.325 | 67.816 | 24.924 | 0.094 | 18.094 |
| ARCTIC | Hamba | 0.833 | 0.912 | 0.942 | 40.965 | 31.300 | 88.839 | 27.822 | 0.110 | 15.026 |
| ARCTIC | WildHands | 0.879 | 0.946 | 0.960 | 25.704 | 13.941 | 50.517 | 22.320 | 0.058 | 12.972 |
| ARCTIC | WiLoR | 0.919 | 0.951 | 0.974 | 37.173 | 26.646 | 71.216 | 17.358 | 0.075 | 23.978 |
| ARCTIC | Dyn-HaMR | 0.841 | 0.917 | 0.951 | 40.172 | 30.849 | 98.813 | 26.006 | 0.121 | 12.506 |
| ARCTIC | HaWoR | 0.700 | 0.818 | 0.895 | 55.677 | 37.182 | 160.912 | 43.320 | 0.149 | 19.735 |
| ARCTIC | OmniHands | 0.866 | 0.949 | 0.954 | 29.674 | 14.203 | 51.505 | 24.580 | 0.087 | 45.312 |
| ARCTIC | ViDiHand (Ours) | 0.997 | 0.999 | 0.999 | 21.668 | 9.821 | 12.407 | 14.642 | 0.047 | 3.183 |
| HOT3D | InterWild | 0.669 | 0.881 | 0.868 | 77.168 | 24.811 | 71.482 | 58.501 | 0.213 | 101.164 |
| HOT3D | HaMeR | 0.692 | 0.904 | 0.883 | 67.593 | 36.241 | 60.801 | 49.633 | 0.102 | 23.206 |
| HOT3D | Hamba | 0.632 | 0.829 | 0.853 | 71.054 | 43.342 | 108.502 | 56.535 | 0.128 | 18.111 |
| HOT3D | WildHands | 0.655 | 0.863 | 0.844 | 52.791 | 28.946 | 111.438 | 53.933 | 0.157 | 22.885 |
| HOT3D | WiLoR | 0.827 | 0.898 | 0.937 | 44.825 | 35.079 | 69.881 | 25.750 | 0.098 | 17.784 |
| HOT3D | Dyn-HaMR | 0.558 | 0.761 | 0.755 | 82.865 | 51.921 | 205.428 | 50.700 | 0.583 | 47.483 |
| HOT3D | HaWoR | 0.348 | 0.499 | 0.655 | 80.146 | 74.957 | 332.311 | 79.350 | 0.262 | 23.806 |
| HOT3D | OmniHands | 0.649 | 0.895 | 0.868 | 63.281 | 22.682 | 68.437 | 49.120 | 0.133 | 69.510 |
| HOT3D | ViDiHand (Ours) | 0.948 | 0.974 | 0.983 | 21.514 | 11.383 | 14.953 | 15.829 | 0.040 | 3.741 |
| HOI4D | InterWild | 0.731 | 0.922 | 0.864 | 53.072 | 22.909 | 80.549 | 41.743 | 0.228 | 98.866 |
| HOI4D | HaMeR | 0.730 | 0.923 | 0.864 | 48.875 | 33.215 | 81.717 | 33.636 | 0.187 | 20.197 |
| HOI4D | Hamba | 0.709 | 0.885 | 0.849 | 51.698 | 37.395 | 117.801 | 37.466 | 0.204 | 21.740 |
| HOI4D | WildHands | 0.730 | 0.924 | 0.864 | 45.623 | 23.601 | 82.246 | 45.654 | 0.159 | 18.615 |
| HOI4D | WiLoR | 0.962 | 0.965 | 0.972 | 41.603 | 27.767 | 43.335 | 25.603 | 0.116 | 17.735 |
| HOI4D | Dyn-HaMR | 0.749 | 0.861 | 0.844 | 52.826 | 40.172 | 151.902 | 40.306 | 0.258 | 18.146 |
| HOI4D | HaWoR | 0.868 | 0.863 | 0.918 | 58.442 | 39.665 | 144.229 | 43.209 | 0.140 | 28.098 |
| HOI4D | OmniHands | 0.655 | 0.937 | 0.834 | 44.255 | 18.689 | 70.662 | 34.392 | 0.108 | 24.212 |
| HOI4D | ViDiHand (Ours) | 0.984 | 0.991 | 0.990 | 30.090 | 13.960 | 24.460 | 23.420 | 0.117 | 4.010 |
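
The page does not define its metrics, so the following is only a sketch of two standard quantities that columns like these are usually built on: mean per-joint position error (MPJPE) and an acceleration-based smoothness measure. The exact definitions behind the "-p" variants and the Jitter column may differ (jitter is sometimes computed from higher-order differences, for example). The function names and the list-of-frames input format are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Frame = Sequence[Point]  # one 3D point per joint


def mpjpe(pred: Sequence[Frame], gt: Sequence[Frame]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints,
    averaged over all frames and joints."""
    dists = [math.dist(p, g)
             for pred_frame, gt_frame in zip(pred, gt)
             for p, g in zip(pred_frame, gt_frame)]
    if not dists:
        raise ValueError("need at least one joint in one frame")
    return sum(dists) / len(dists)


def mean_acceleration(seq: Sequence[Frame]) -> float:
    """Mean norm of the second finite difference of each joint's position
    across consecutive frames. Lower values mean smoother motion."""
    if len(seq) < 3:
        raise ValueError("need at least three frames")
    norms = []
    for prev, cur, nxt in zip(seq, seq[1:], seq[2:]):
        for a, b, c in zip(prev, cur, nxt):
            # Second difference per axis: x[t-1] - 2*x[t] + x[t+1]
            acc = [x0 - 2.0 * x1 + x2 for x0, x1, x2 in zip(a, b, c)]
            norms.append(math.hypot(*acc))
    return sum(norms) / len(norms)
```

A joint moving at constant velocity has zero second difference, so a perfectly smooth linear track scores 0 under `mean_acceleration`, while frame-to-frame flicker raises the score even when the per-frame MPJPE is small.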