Kimodo

NVIDIA's Open-Source Kinematic Motion Diffusion Model — Generate Controllable 3D Human & Robot Motion from Text

700hrs Mocap · Human + Robot · Text-to-Motion · Open Source

What Is Kimodo?

Kimodo (Kinematic Motion Diffusion) is an open-source 3D motion generation model developed by NVIDIA Research. Built on a novel two-stage transformer diffusion architecture, Kimodo generates high-quality human and robot motions from simple text descriptions or precise kinematic constraints — all in just 2 to 5 seconds on a single GPU.

Trained on over 700 hours of professional optical motion capture data from the Bones Rigplay dataset, Kimodo represents the largest-scale controllable motion diffusion model available today — roughly 25 times more training data than prior models like MDM or MotionDiffuse. Its two-stage denoiser separates root trajectory prediction from body motion generation, effectively minimizing common artifacts like floating and foot skating that plague other motion generation approaches.
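The two-stage denoiser can be pictured as two sequential sampling loops: the root denoiser settles the global trajectory first, and the body denoiser then fills in joint-level detail conditioned on it. The sketch below is purely illustrative (the function names, shapes, and step count are assumptions, and the "denoisers" are trivial placeholders rather than Kimodo's transformers):

```python
import numpy as np

rng = np.random.default_rng(0)
T_STEPS = 10          # diffusion steps (real samplers use many more)
FRAMES, JOINTS = 60, 24

def root_denoiser(x_root, t):
    # Placeholder for a text-conditioned transformer denoiser.
    return x_root * (1.0 - 1.0 / (t + 1))

def body_denoiser(x_body, root_traj, t):
    # Placeholder; conditioned on the already-denoised root trajectory.
    return x_body * (1.0 - 1.0 / (t + 1))

def sample_motion():
    # Stage 1: denoise the root trajectory (x, y, heading per frame).
    x_root = rng.standard_normal((FRAMES, 3))
    for t in reversed(range(T_STEPS)):
        x_root = root_denoiser(x_root, t)

    # Stage 2: denoise full-body joint motion given the root path.
    x_body = rng.standard_normal((FRAMES, JOINTS, 3))
    for t in reversed(range(T_STEPS)):
        x_body = body_denoiser(x_body, x_root, t)
    return x_root, x_body

root, body = sample_motion()
print(root.shape, body.shape)  # (60, 3) (60, 24, 3)
```

Separating the stages this way lets the global path be fixed before joint detail is generated, which is what suppresses trajectory-level artifacts like drift and foot skating.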

Kimodo supports three skeleton formats: NVIDIA's SOMA parametric human body, the Unitree G1 humanoid robot, and the widely-used SMPL-X model. All SOMA and G1 models are released under the NVIDIA Open Model License, making them freely available for both research and commercial applications. Whether you are building animation pipelines, training robot policies, or prototyping interactive characters, Kimodo provides production-quality motion at the speed of a text prompt.

What Kimodo Can Do

Kimodo Text-to-Motion

Generate high-quality 3D human motion from natural language prompts. Describe actions like "a person walks forward then starts jumping" and Kimodo brings it to life in seconds. Chain multiple text prompts on a timeline to author complex, multi-phase motion sequences with smooth transitions between each action.
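Chaining prompts on a timeline might look like the sketch below, where overlapping frame ranges mark where one action should blend into the next. The segment format and field names are illustrative assumptions, not Kimodo's actual interface:

```python
# Each segment pairs a text prompt with a frame range; overlapping
# ranges mark where the model should transition between actions.
# (Illustrative format only — not the real Kimodo timeline schema.)
timeline = [
    {"prompt": "a person walks forward", "start": 0,  "end": 90},
    {"prompt": "starts jumping",         "start": 75, "end": 150},
]

def transition_regions(segments):
    """Return (start, end) frame spans where consecutive prompts overlap."""
    spans = []
    for a, b in zip(segments, segments[1:]):
        if b["start"] < a["end"]:
            spans.append((b["start"], a["end"]))
    return spans

print(transition_regions(timeline))  # [(75, 90)]
```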

Human + Robot Skeletons

Kimodo supports three skeleton formats: NVIDIA's SOMA parametric human body model for production use, the Unitree G1 humanoid robot skeleton for robotics applications, and SMPL-X for full compatibility with existing motion capture and animation pipelines like AMASS.

Kimodo Kinematic Controls

Fine-grained spatial and temporal control through full-body keyframes, end-effector positions and rotations, 2D waypoints, and dense 2D ground paths. Kimodo applies all constraints directly in pose space during the diffusion denoising process, yielding precise, physically plausible results.
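Applying a constraint in pose space can be sketched as an inpainting-style projection: at each denoising step, the constrained entries of the predicted clean motion are overwritten with their targets. The code below is a minimal sketch under that assumption (the denoiser is a stand-in and re-noising between steps is omitted):

```python
import numpy as np

FRAMES, DIMS = 60, 72
rng = np.random.default_rng(0)

# A full-body keyframe pins frame 30 to a target pose (all-ones here
# purely for illustration).
keyframe_idx = 30
keyframe_pose = np.ones(DIMS)

def apply_constraints(x0_pred):
    x0_pred = x0_pred.copy()
    x0_pred[keyframe_idx] = keyframe_pose  # hard-enforce the keyframe
    return x0_pred

x = rng.standard_normal((FRAMES, DIMS))
for t in reversed(range(10)):
    x0_pred = x * 0.9               # stand-in for the denoiser's clean estimate
    x0_pred = apply_constraints(x0_pred)
    x = x0_pred                     # re-noising step omitted for brevity

print(np.allclose(x[30], 1.0))  # True: the keyframe is satisfied exactly
```

Because the projection happens inside the loop rather than as a post-process, the unconstrained frames are denoised in the context of the constraint, so the motion bends toward the keyframe instead of snapping to it.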

Why Choose Kimodo?

Unprecedented Training Scale

Kimodo is trained on 700+ hours of professional studio motion capture data — roughly 25× more than competing models like MDM, MotionDiffuse, or MoMask. This massive training scale delivers superior motion quality, greater diversity, and stronger generalization to novel and complex text prompts.

Native Controllability

Unlike latent-space approaches that require expensive test-time optimization, Kimodo operates directly in explicit pose space. Kinematic constraints including keyframes, end-effectors, waypoints, and dense paths are applied natively during each diffusion step for precise, reliable, artifact-free control.

Multi-Skeleton Support

Generate motion for digital human characters using SOMA or SMPL-X body models, and for humanoid robots using the Unitree G1 skeleton — all from the Kimodo model family. Export as NPZ, MuJoCo CSV, or AMASS format for seamless integration into animation, simulation, and robotics pipelines.
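For the NPZ path, a generated motion round-trips as a dictionary of named arrays. The array names below are illustrative assumptions (consult the released tooling for the actual export schema):

```python
import os
import tempfile

import numpy as np

frames, joints = 120, 24
motion = {
    # Hypothetical array names — not the confirmed Kimodo NPZ schema.
    "root_translation": np.zeros((frames, 3), dtype=np.float32),
    "joint_rotations": np.zeros((frames, joints, 3), dtype=np.float32),
    "fps": np.array(30),
}

path = os.path.join(tempfile.mkdtemp(), "motion.npz")
np.savez(path, **motion)

loaded = np.load(path)
print(sorted(loaded.files))  # ['fps', 'joint_rotations', 'root_translation']
print(int(loaded["fps"]))    # 30
```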

Open Source & Commercial-Friendly

SOMA and G1 model checkpoints are released under the NVIDIA Open Model License, permitting both academic research and commercial deployment. A free HuggingFace Spaces demo lets anyone try Kimodo instantly in the browser — no GPU or installation required.

How Kimodo Works

From text prompt to 3D motion in three simple steps

Step 1

Describe Your Motion

Write a natural language prompt like "a person walks forward, picks up a box, and turns around." Optionally add kinematic constraints such as keyframe poses, end-effector targets, or 2D ground paths for precise spatial control over the generated motion.
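A Step 1 input might be assembled as a prompt plus an optional bundle of constraints, sketched below. Every key and value here is a hypothetical illustration of the kinds of constraints described above, not Kimodo's actual request format:

```python
# Hypothetical generation request (illustrative keys only).
request = {
    "prompt": "a person walks forward, picks up a box, and turns around",
    "constraints": {
        "keyframes": [{"frame": 0, "pose": "rest"}],        # full-body keyframe
        "end_effectors": [{"frame": 45, "joint": "right_hand",
                           "position": [0.4, 0.0, 0.9]}],   # reach target
        "waypoints": [{"frame": 90, "xy": [2.0, 0.0]}],     # 2D ground waypoint
    },
}

def count_constraints(req):
    """Total number of individual constraints attached to a request."""
    return sum(len(v) for v in req.get("constraints", {}).values())

print(count_constraints(request))  # 3
```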

Step 2

Generate with Diffusion

Kimodo's two-stage transformer denoiser processes your input. The root denoiser predicts global trajectory first, then the body denoiser generates detailed joint motion. The full process takes just 2–5 seconds on an RTX 3090.

Step 3

Export & Integrate

Download your generated motion as NPZ for general use, MuJoCo CSV for robotics simulation in tools like ProtoMotions, or AMASS format for compatibility with existing animation and research pipelines. Use the interactive timeline UI to refine, iterate, and export multiple variations.

See Kimodo in Action

The Kimodo interactive demo provides an intuitive timeline interface for authoring complex motions with text prompts and kinematic constraints. Preview generated results in real-time 3D visualization, compare multiple samples side by side, switch between SOMA and G1 characters, and export your motions directly from the browser.

Citation

@article{Kimodo2026,
  title={Kimodo: Scaling Controllable Human Motion Generation},
  author={Rempe, Davis and Petrovich, Mathis and Yuan, Ye and Zhang, Haotian and Peng, Xue Bin and Jiang, Yifeng and Wang, Tingwu and Iqbal, Umar and Minor, David and de Ruyter, Michael and Li, Jiefeng and Tessler, Chen and Lim, Edy and Jeong, Eugene and Wu, Sam and Hassani, Ehsan and Huang, Michael and Yu, Jin-Bey and Chung, Chaeyeon and Song, Lina and Dionne, Olivier and Kautz, Jan and Yuen, Simon and Fidler, Sanja},
  journal={arXiv},
  year={2026}
}