The Large Movement Model — A Foundation Model for Human Motion

1. Introduction

Human movement is one of the richest natural signals available. It encodes health status, task intent, emotional state, motor skill, and physical capability. Yet despite decades of work in biomechanics, computer vision, and machine learning, no unified AI framework exists that treats motion as a general-purpose sequence domain — one where the same foundational model can serve rehabilitation assessment, sports analytics, robotics control, and human-computer interaction.

The Large Movement Model (LMM) addresses this gap. LMM applies the same core insight that drives large language models — that sequence-learning architectures, given consistent tokenization and sufficient data, can discover generalizable latent structure — to human pose sequences. Where language models learn grammar and semantics from words, LMM learns biomechanical structure, temporal coordination, and movement intent from the evolution of body joint positions over time.

This document describes the LMM pipeline architecture, model designs, and Phase I experimental results. The work demonstrates that a diffusion-based generative model substantially outperforms a conventional transformer baseline on motion forecasting, and that a multi-resolution attention architecture reduces error at the longest forecast horizon, establishing feasibility for a new class of embodied foundation models.

2. Data Pipeline

The LMM pipeline transforms raw video into structured, normalized, machine-learning-ready motion tokens. It is designed to ingest data from heterogeneous sources — different cameras, frame rates, resolutions, skeleton formats, and recording conditions — and produce a unified representation suitable for model training and evaluation.

2.1 Pose Estimation

Video inputs are processed through OpenPose to extract BODY_25 skeletal keypoints: 25 major body joints with per-joint confidence scores. For datasets provided as pre-extracted skeletons (COCO-17, Kinect V2, Vicon), the pipeline includes format converters that map alternative joint schemas to BODY_25 through documented correspondence tables, synthesizing derived joints (neck, mid-hip) where needed. BODY_25 was selected as the Phase I standard because it balances representational fidelity with computational efficiency at roughly 1/100th the cost of mesh-based representations like SMPL.
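The format conversion described above can be illustrated for the COCO-17 case. The following is a minimal sketch, not the pipeline's actual converter: the index orders follow the published BODY_25 and COCO keypoint lists, while the function names, the midpoint rule for the synthesized joints, and the zero-confidence fill for the foot keypoints that COCO-17 lacks are assumptions made for illustration.

```python
# BODY_25 index -> COCO-17 index for joints present in both schemas.
# The mapping rules below are illustrative assumptions, not the
# pipeline's documented correspondence tables.
BODY25_FROM_COCO = {
    0: 0,                      # nose
    2: 6, 3: 8, 4: 10,         # right shoulder, elbow, wrist
    5: 5, 6: 7, 7: 9,          # left shoulder, elbow, wrist
    9: 12, 10: 14, 11: 16,     # right hip, knee, ankle
    12: 11, 13: 13, 14: 15,    # left hip, knee, ankle
    15: 2, 16: 1,              # right eye, left eye
    17: 4, 18: 3,              # right ear, left ear
}
NECK, MID_HIP = 1, 8           # BODY_25 joints with no COCO-17 counterpart


def midpoint(a, b):
    """Average two (x, y, confidence) keypoints; keep the lower confidence."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0, min(a[2], b[2]))


def coco17_to_body25(coco):
    """Convert one frame of 17 (x, y, conf) keypoints to 25 BODY_25 keypoints."""
    if len(coco) != 17:
        raise ValueError("expected 17 COCO keypoints")
    # Joints absent from COCO-17 (the six foot keypoints) keep zero confidence.
    body = [(0.0, 0.0, 0.0)] * 25
    for body_idx, coco_idx in BODY25_FROM_COCO.items():
        body[body_idx] = tuple(coco[coco_idx])
    body[NECK] = midpoint(coco[5], coco[6])        # between the shoulders
    body[MID_HIP] = midpoint(coco[11], coco[12])   # between the hips
    return body
```

Marking unavailable joints with zero confidence lets downstream quality control treat them the same way as low-confidence detections.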

2.2 Normalization

Raw keypoint sequences vary in scale, position, and temporal resolution. The normalization stage removes these extraneous variations:

Translation invariance: Every skeleton is centered on the hip midpoint.
Scale invariance: Skeletons are scaled to a standardized torso length.
Temporal uniformity: Sequences are resampled to 15 FPS regardless of source frame rate.
Derived features: Per-joint velocity, acceleration, and jerk are computed alongside positions.
Quality control: Each clip receives a QC metadata record with per-joint confidence statistics, detected artifacts, and usability classification.
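The first four normalization steps can be sketched in a few functions. This is a dependency-free illustration under stated assumptions: torso length is taken as the neck to mid-hip distance and scaled to 1.0, resampling uses linear interpolation, and derivatives use forward differences. The source does not specify any of these choices, and all function names are hypothetical.

```python
MID_HIP, NECK = 8, 1      # BODY_25 indices
TARGET_FPS = 15.0


def normalize_frame(joints):
    """Center (x, y) joints on the mid-hip and scale neck-to-hip length to 1."""
    hx, hy = joints[MID_HIP]
    nx, ny = joints[NECK]
    torso = ((nx - hx) ** 2 + (ny - hy) ** 2) ** 0.5
    if torso == 0:
        raise ValueError("degenerate torso length")
    return [((x - hx) / torso, (y - hy) / torso) for x, y in joints]


def resample(frames, src_fps, dst_fps=TARGET_FPS):
    """Linearly interpolate a list of frames onto a uniform dst_fps time grid."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to interpolate")
    duration = (len(frames) - 1) / src_fps
    n_out = int(duration * dst_fps + 1e-9) + 1   # epsilon guards float rounding
    out = []
    for k in range(n_out):
        pos = k / dst_fps * src_fps              # fractional source frame index
        i = min(int(pos), len(frames) - 2)
        w = pos - i
        out.append([
            (a[0] * (1 - w) + b[0] * w, a[1] * (1 - w) + b[1] * w)
            for a, b in zip(frames[i], frames[i + 1])
        ])
    return out


def finite_difference(frames, fps=TARGET_FPS):
    """Forward difference per joint, in units per second.

    Applied once it yields velocity; applied again, acceleration, then jerk.
    """
    return [
        [((b[0] - a[0]) * fps, (b[1] - a[1]) * fps) for a, b in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]
```

Each application of finite_difference shortens the sequence by one frame, so a real implementation would need a padding or alignment convention that this sketch leaves out.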

2.3 Multi-Resolution Tokenization

Motion data is structured into three simultaneous token representations. All three reconstruct losslessly to the original — no information is discarded, only reorganized.

Token Level	Shape	What It Captures	Resolution
Frame	(T, 75)	Full body pose at each timestep	~67 ms
Window	(T/5, 5, 75)	Groups of 5 frames — movement phrases	~333 ms
Body-Part	(T, 5, 21)	5 anatomical regions: core, R/L arm, R/L leg	~67 ms (spatial)

The frame level provides the finest temporal grain. The window level captures meso-scale patterns (a step, an arm swing). The body-part level captures spatial coordination between anatomical regions. Together they give a model access to movement structure at multiple scales without committing to which scale matters most.
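To make the three views concrete, the sketch below builds window and body-part tokens from frame tokens and inverts the body-part view. It assumes each frame stores 25 joints as (x, y, confidence), giving 75 values. The joint-to-region partition is hypothetical: it is one assignment that covers all 25 joints with at most 7 joints per region (7 + 3 + 3 + 6 + 6 = 25), which matches the 21-feature region width in the table once shorter regions are zero-padded.

```python
# Hypothetical partition of the 25 BODY_25 joints into 5 regions (assumption).
REGIONS = [
    [0, 1, 8, 15, 16, 17, 18],   # core: nose, neck, mid-hip, eyes, ears
    [2, 3, 4],                   # right arm
    [5, 6, 7],                   # left arm
    [9, 10, 11, 22, 23, 24],     # right leg and foot
    [12, 13, 14, 19, 20, 21],    # left leg and foot
]
JOINTS_PER_REGION = 7            # 7 joints x 3 values = 21 features
WINDOW = 5


def window_tokens(frames):
    """(T, 75) -> (T/5, 5, 75); T must be a multiple of 5."""
    if len(frames) % WINDOW:
        raise ValueError("sequence length must be a multiple of 5")
    return [frames[i:i + WINDOW] for i in range(0, len(frames), WINDOW)]


def body_part_tokens(frames):
    """(T, 75) -> (T, 5, 21), zero-padding regions with fewer than 7 joints."""
    tokens = []
    for frame in frames:
        parts = []
        for joints in REGIONS:
            feats = []
            for j in joints:
                feats.extend(frame[3 * j:3 * j + 3])
            feats.extend([0.0] * (3 * JOINTS_PER_REGION - len(feats)))
            parts.append(feats)
        tokens.append(parts)
    return tokens


def frames_from_body_parts(tokens):
    """Invert body_part_tokens by scattering features back and dropping padding."""
    frames = []
    for parts in tokens:
        frame = [0.0] * 75
        for joints, feats in zip(REGIONS, parts):
            for k, j in enumerate(joints):
                frame[3 * j:3 * j + 3] = feats[3 * k:3 * k + 3]
        frames.append(frame)
    return frames
```

Because every joint appears in exactly one region and windows are contiguous slices, both views can be mapped back to the frame tokens without loss.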

2.4 Pipeline Performance

The pipeline has been validated on over 17,000 clips from five independent sources spanning multi-camera studio video, clinical rehabilitation video, depth-sensor skeleton data, and optical motion capture. Heterogeneous inputs — 60 FPS dance video, 30 FPS clinical footage, Kinect V2 depth skeletons, Vicon optical mocap — all converge to the same 15 FPS BODY_25 normalized representation with zero pipeline failures.

3. Model Architectures

Two architectures were developed and evaluated against a flat transformer baseline. Each model takes 120 frames (8 seconds) of context and produces 30 predicted frames (2 seconds).

3.1 Flat Transformer Baseline

A standard encoder-decoder transformer operating on single-resolution frame tokens. Six encoder layers, six decoder layers, sinusoidal positional encodings based on physical time (seconds, not frame index). This serves as the controlled baseline.
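Positional encodings based on physical time can be sketched by evaluating the standard sinusoidal formula at timestamps in seconds rather than at integer frame indices. The width of 256 matches the embedding dimension in the architecture summary; the base period of 10000 is carried over from the original transformer formulation and is an assumption here, as is the function name.

```python
import math


def time_positional_encoding(times_s, d_model=256, max_period_s=10000.0):
    """Sinusoidal encoding evaluated at timestamps given in seconds."""
    if d_model % 2:
        raise ValueError("d_model must be even")
    encodings = []
    for t in times_s:
        row = []
        for i in range(d_model // 2):
            freq = 1.0 / (max_period_s ** (2 * i / d_model))
            row.append(math.sin(t * freq))   # even dimension
            row.append(math.cos(t * freq))   # odd dimension
        encodings.append(row)
    return encodings


# 120 context frames at 15 FPS: frame k sits at k / 15 seconds.
context_pe = time_positional_encoding([k / 15.0 for k in range(120)])
```

With this formulation a pose observed at t = 1.0 s receives the same encoding whether it is frame 15 of a 15 FPS sequence or frame 30 of a 30 FPS sequence.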

3.2 Hierarchical Temporal Transformer (HTT)

The HTT processes all three token levels simultaneously through three parallel encoder streams connected by bidirectional cross-attention:

Frame encoder (4 layers): Fine-grained temporal dynamics, frame to frame.
Window encoder (3 layers): Meso-scale patterns via learned 1D convolution compression of 5-frame chunks. Self-attention over 24 window tokens instead of 120 frame tokens.
Body-part encoder (3 layers): Spatial coordination across 5 anatomical regions with learned part-type embeddings.

Input: 120 frames of context (8 seconds at 15 FPS)
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
         Frame         Window        Body-part
        tokens        tokens        tokens
       (120, 75)    (24, 5, 75)  (120, 5, 21)
           │             │             │
       Frame enc    Window enc   Body-part enc
      (4 layers)    (3 layers)    (3 layers)
           │             │             │
           └──────┬──────┘             │
           Cross-attn (2L)             │
           frame ↔ window              │
                  │                    │
                  └──────────┬─────────┘
                      Cross-attn (2L)
                     frame ↔ body-part
                             │
                       Gated fusion
                             │
                    Decoder (4 layers)
                             │
                    30 predicted frames

A gated linear combination fuses the three encoder outputs, and a 4-layer decoder generates the forecast autoregressively. Scheduled sampling addresses autoregressive drift by gradually transitioning the decoder from ground-truth to self-predicted inputs during training (0% → 50% over 25 epochs).
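The scheduled sampling ramp can be written as a small function of the epoch. The source gives only the endpoints (0% to 50% over 25 epochs); the linear shape of the ramp, the per-step coin flip, and the function names below are assumptions made for illustration.

```python
import random


def self_feed_probability(epoch, ramp_epochs=25, max_prob=0.5):
    """Probability of feeding the decoder its own prediction at a given epoch."""
    return max_prob * min(max(epoch, 0), ramp_epochs) / ramp_epochs


def choose_decoder_inputs(ground_truth, predictions, epoch, rng=random):
    """For each step, pick the model's prediction with the scheduled probability.

    ground_truth and predictions are equal-length sequences of frames;
    rng is anything with a random() method returning a float in [0, 1).
    """
    p = self_feed_probability(epoch)
    return [
        pred if rng.random() < p else truth
        for truth, pred in zip(ground_truth, predictions)
    ]
```

At epoch 0 the decoder sees only ground truth, which is ordinary teacher forcing; from epoch 25 onward roughly half of its inputs are its own earlier predictions.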

3.3 Motion Diffusion Model

Instead of predicting frame by frame, the diffusion model generates the full 30-frame forecast in a single pass by iteratively denoising random noise into a plausible motion trajectory, conditioned on the context.

Context encoder: 6-layer transformer encoder producing a latent representation of the input.
Denoiser: 6-layer transformer encoder, cross-attending to context, predicting clean motion from noise.
Schedule: Cosine noise schedule with 100 diffusion timesteps during training and 20-step DDIM sampling at inference.

Because diffusion generates all frames simultaneously, it does not suffer from compounding autoregressive error.
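The schedule and sampler can be sketched as follows. The cosine schedule uses the formulation of Nichol and Dhariwal [4] with offset s = 0.008, and the sampler is deterministic DDIM [10] written for a denoiser that predicts the clean signal, as the component list above describes. The evenly spaced timestep subsequence, the flat-list data layout, and the denoiser call signature are assumptions; a stand-in callable replaces the trained network.

```python
import math
import random


def cosine_alpha_bar(num_steps=100, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]


def ddim_timesteps(num_steps=100, num_inference=20):
    """Evenly spaced descending timesteps, ending at 0."""
    stride = num_steps // num_inference
    return list(range(num_steps, 0, -stride)) + [0]


def ddim_sample(denoiser, context, dim, alpha_bar, timesteps, rng=random):
    """Deterministic DDIM (eta = 0) for a denoiser that predicts clean motion.

    denoiser(x, t, context) is a hypothetical callable returning a list of
    length dim; in the real model it would be the trained network.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # start from pure noise
    for t, t_prev in zip(timesteps, timesteps[1:]):
        x0 = denoiser(x, t, context)
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        # Recover the noise implied by x and x0, then re-noise x0 to level t_prev.
        eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1 - a_t)
               for xi, x0i in zip(x, x0)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x
```

For a 30-frame forecast of 75-dimensional poses, dim would be 30 × 75 = 2250; every one of those values is updated together at each of the 20 steps, which is why no frame depends on an earlier predicted frame.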

4. Experiments and Results

4.1 Setup

All models were trained on 12,483 clips (~2.5M frames) from a multi-camera dance corpus spanning 10 genres, 30 dancers, and 9 synchronized camera angles at 60 FPS. Same 90/10 split by clip, identical hyperparameters: AdamW, lr 1e-4, batch 32, early stopping with patience 10. Task: 120 frames of context (8 seconds) → predict 30 frames (2 seconds). HTT and baseline evaluated autoregressively; diffusion evaluated as sample mean over 10 DDIM trajectories.
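Splitting by clip and cutting each clip into context and target windows might look like the sketch below. The 90/10 ratio and the 120 + 30 frame window come from the setup above; the shuffle seed, the stride of 30 frames, and the function names are hypothetical, since the source does not state how windows were sampled.

```python
import random

CONTEXT, HORIZON = 120, 30


def split_by_clip(clip_ids, train_fraction=0.9, seed=0):
    """Shuffle clip IDs once and split them, so no clip appears in both sets."""
    ids = sorted(clip_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def sliding_windows(num_frames, stride=30):
    """Yield (context_start, target_start, target_end) indices for one clip."""
    total = CONTEXT + HORIZON
    for start in range(0, num_frames - total + 1, stride):
        yield start, start + CONTEXT, start + total
```

Splitting before windowing matters because overlapping windows from the same clip share most of their frames; assigning them to different sets would leak evaluation data into training.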

4.2 Results

Model	Params	Frame 1 (0.07s)	Frame 15 (1.0s)	Frame 30 (2.0s)	Overall
Flat Baseline	11.1M	0.087	0.654	0.900	0.637
HTT	18.8M	0.126	0.646	0.785	0.658
Diffusion	11.4M	0.137	0.443	0.621	0.445

Table 1. Motion forecasting MSE (lower is better). 12,483 clips, 53,903 training windows.

4.3 Analysis

Autoregressive drift remains the central challenge. The flat baseline predicts the next frame well (0.087) but degrades to 0.900 by frame 30 — a 10× increase. Even with a large and diverse training corpus, single-resolution attention cannot maintain multi-second coherence.

Multi-resolution attention helps most at the longest horizons. The HTT reduces frame-30 error by 13% relative to the baseline (0.900 → 0.785). The advantage is concentrated at the 1.5–2 second horizon where maintaining coherence across body regions matters most. At shorter horizons the simpler baseline matches or beats it: the two are nearly tied at frame 15 (0.654 vs. 0.646), the baseline is better at frame 1 (0.087 vs. 0.126), and its overall MSE is slightly lower (0.637 vs. 0.658). This suggests that with sufficient training data, architectural complexity matters less for immediate-next-frame prediction.

Diffusion substantially outperforms both deterministic models. Overall MSE is 30% lower than the baseline, and frame-30 error (0.621) is 31% lower than the baseline’s (0.900) and 21% lower than the HTT’s (0.785). Its frame-1 error (0.137) is the highest of the three, so the advantage comes entirely from longer horizons. Because the diffusion model generates all 30 frames simultaneously rather than sequentially, it does not suffer from compounding error, the dominant failure mode of autoregressive architectures.
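Relative differences between models can be recomputed directly from Table 1. The short script below is only an arithmetic aid; the dictionary layout and function names are illustrative, and the values are copied from the table.

```python
# MSE values copied from Table 1.
TABLE_1 = {
    "Flat Baseline": {"frame_1": 0.087, "frame_15": 0.654, "frame_30": 0.900, "overall": 0.637},
    "HTT":           {"frame_1": 0.126, "frame_15": 0.646, "frame_30": 0.785, "overall": 0.658},
    "Diffusion":     {"frame_1": 0.137, "frame_15": 0.443, "frame_30": 0.621, "overall": 0.445},
}


def relative_reduction(reference, value):
    """Fractional reduction of value relative to reference (positive is better)."""
    return (reference - value) / reference


def drift_ratio(model):
    """Frame-30 error divided by frame-1 error for one model."""
    return TABLE_1[model]["frame_30"] / TABLE_1[model]["frame_1"]


comparisons = [
    ("HTT vs. baseline, frame 30", "Flat Baseline", "HTT", "frame_30"),
    ("Diffusion vs. baseline, overall", "Flat Baseline", "Diffusion", "overall"),
    ("Diffusion vs. baseline, frame 30", "Flat Baseline", "Diffusion", "frame_30"),
    ("Diffusion vs. HTT, frame 30", "HTT", "Diffusion", "frame_30"),
]
for label, ref, model, column in comparisons:
    change = relative_reduction(TABLE_1[ref][column], TABLE_1[model][column])
    print(f"{label}: {change:.1%}")
for model in TABLE_1:
    print(f"{model} drift (frame 30 / frame 1): {drift_ratio(model):.1f}x")
```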

A note on comparison fairness. The HTT’s MSE comes from a single deterministic rollout. The diffusion model’s MSE is computed over the mean of 10 samples, which reduces variance. The comparison is directionally valid but the magnitude of improvement should be interpreted with this methodological difference in mind.
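The variance-reduction effect can be seen in a synthetic example. The sketch below draws 10 noisy samples around a known target and compares the average MSE of individual samples with the MSE of their mean. It assumes unbiased, independent noise, which real model samples will not satisfy exactly, so it illustrates the direction of the effect rather than its size in the reported results.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def compare_single_vs_mean(target, noise_std=0.5, num_samples=10, seed=0):
    """Return (average MSE of individual samples, MSE of the sample mean)."""
    rng = random.Random(seed)
    samples = [[t + rng.gauss(0.0, noise_std) for t in target]
               for _ in range(num_samples)]
    mean_pred = [sum(col) / num_samples for col in zip(*samples)]
    single = sum(mse(s, target) for s in samples) / num_samples
    return single, mse(mean_pred, target)


single_mse, mean_mse = compare_single_vs_mean([0.0] * 500)
print(f"average single-sample MSE: {single_mse:.4f}")
print(f"MSE of 10-sample mean:     {mean_mse:.4f}")
```

Because squared error is convex, the MSE of the averaged prediction can never exceed the average MSE of the individual samples. A like-for-like comparison would therefore also report the diffusion model's single-sample MSE.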

The two architectures are complementary. The HTT produces a single deterministic trajectory, frame by frame — suitable for real-time applications where latency matters. The diffusion model produces a distribution over possible futures in batch — suitable for offline analysis, uncertainty quantification, and any setting where prediction quality outweighs latency.

5. What the Model Learns

A model that predicts human motion accurately has learned something about how the body works. Several observations from the Phase I experiments:

Temporal coordination across timescales. The HTT’s window encoder captures meso-scale patterns the frame encoder misses: stride cadence, arc-and-return of reaching, weight transfer timing. Cross-attention constrains fine-grained predictions within these broader patterns — which is what reduces drift.

Spatial coordination across body regions. The body-part encoder learns how anatomical regions move relative to each other: contralateral arm-leg coordination in gait, trunk stabilization during limb movement, bilateral symmetry. These relationships emerge from the data and are precisely the kind of structure needed to detect compensatory movement patterns.

Biomechanical plausibility. Predicted sequences generally respect physical constraints: joints stay within anatomically plausible ranges, limb lengths remain approximately constant, center-of-mass trajectories stay continuous. These properties emerge from the data distribution without explicit enforcement through loss terms.
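One of these properties, limb-length constancy, is straightforward to quantify. The sketch below computes the coefficient of variation of several bone lengths across a predicted sequence. The bone list and function name are illustrative, not part of the reported evaluation. Because Phase I poses are 2D, projected bone lengths change with out-of-plane rotation, so low values indicate approximate rather than exact constancy.

```python
import statistics

# A few BODY_25 bones as (joint, joint) index pairs.
BONES = {
    "right_upper_arm": (2, 3), "right_forearm": (3, 4),
    "left_upper_arm": (5, 6), "left_forearm": (6, 7),
    "right_thigh": (9, 10), "right_shin": (10, 11),
    "left_thigh": (12, 13), "left_shin": (13, 14),
}


def bone_length_variation(frames):
    """Coefficient of variation of each bone length across frames.

    frames is a list of poses, each a list of 25 (x, y) joint positions.
    Values near 0 mean the bone keeps a nearly constant projected length.
    """
    result = {}
    for name, (a, b) in BONES.items():
        lengths = [
            ((f[a][0] - f[b][0]) ** 2 + (f[a][1] - f[b][1]) ** 2) ** 0.5
            for f in frames
        ]
        result[name] = statistics.pstdev(lengths) / statistics.mean(lengths)
    return result
```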

6. Cross-Domain Transfer

The pipeline and model architectures are domain-agnostic by design. The BODY_25 skeleton, normalization procedure, and multi-resolution tokenization make no assumptions about what kind of movement is being analyzed. But does the model actually transfer across domains, or does it just memorize the training distribution?

6.1 Held-Out Generalization Test

To answer this, we evaluated trained models against a held-out corpus from a fundamentally different capture modality: optical motion capture data from physical therapy exercises (Vicon system, 10 exercises, 10 subjects). The model never saw this data during training, and the input characteristics are visually distinct — no camera perspective, no detection jitter, no confidence variation.

Key finding: motion dynamics transfer across domains. Per-frame velocity MSE — which measures whether the model predicts the direction and speed of joint movement correctly — was essentially identical across all tested models (0.011–0.014), regardless of whether they were trained on dance video, clinical rehabilitation footage, or a multi-source combination. Absolute position errors varied by domain (as expected, since coordinate systems and projection geometry differ), but the underlying motion structure transferred.
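The distinction between position error and velocity error can be shown with a small sketch. Velocity is taken here as a forward difference in units per second at 15 FPS; the source does not state its exact convention, so the scaling and the function names are assumptions.

```python
def velocities(frames, fps=15.0):
    """Forward-difference velocity for a sequence of flat pose vectors."""
    return [[(b - a) * fps for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]


def position_mse(pred, target):
    """Mean squared error over every coordinate of every frame."""
    n = sum(len(f) for f in target)
    return sum((p - t) ** 2
               for fp, ft in zip(pred, target)
               for p, t in zip(fp, ft)) / n


def velocity_mse(pred, target, fps=15.0):
    """MSE between finite-difference velocities of prediction and target."""
    return position_mse(velocities(pred, fps), velocities(target, fps))
```

A prediction that is displaced by a constant offset, as can happen when coordinate systems differ between domains, has a large position error but zero velocity error, which is why the two metrics can diverge across capture modalities.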

6.2 Scheduled Sampling as a Transfer Mechanism

An unexpected finding: scheduled sampling — originally introduced to reduce autoregressive drift during training — turns out to be a powerful cross-domain transfer mechanism. Models trained with scheduled sampling showed 54% lower teacher-forced error on held-out data compared to identical models trained without it. The interpretation: scheduled sampling forces the model to predict from imperfect inputs during training, and cross-domain inputs are inherently imperfect (the noise profile, scale, and confidence patterns differ from training). Models that learn to handle imperfect inputs generalize better to new domains.

6.3 Target Domains

Rehabilitation and clinical assessment. Objective quantification of movement quality, recovery tracking, compensatory pattern detection.
Sports performance. Technique assessment, fatigue detection, injury risk estimation, training optimization.
Robotics and embodied AI. Human motion priors for motion planning, human-robot interaction, intent prediction.
Ergonomics and occupational safety. Movement pattern monitoring, workstation assessment, physical demand quantification.
Behavioral analytics. Gait recognition, anomalous movement detection, intent inference from body language.

7. Current Limitations

2D projection. Phase I uses 2D pose estimation. Depth-dependent movements are geometrically compressed. 3D pose estimation or multi-view triangulation would improve coverage.
Training data diversity. The primary training corpus covers 10 dance genres — fast, diverse, full-body movement but not specifically clinical. Domain-specific fine-tuning on rehabilitation exercises is a Phase II objective.
Clinical alignment not yet validated. Prediction accuracy by engineering metrics is strong; correlation with expert-assigned movement quality ratings has not yet been formally evaluated.
Autoregressive drift persists. The HTT reduces drift and scheduled sampling improves robustness, but frame-30 error is still ~6–10× frame-1 error for deterministic models. The diffusion model largely avoids this but requires batch-mode inference.
Evaluation methodology. HTT (single rollout) and diffusion (sample mean of 10) are not evaluated on identical terms. The diffusion model’s advantage includes a variance-reduction benefit from multi-sample averaging.

8. Relation to Prior Work

The individual components of the LMM architecture draw on established techniques: multi-resolution transformers, cross-attention fusion, gated combination layers, scheduled sampling (Bengio et al., 2015), and body-part spatial decomposition. LMM’s contribution is not architectural novelty but the synthesis of these into a coherent framework for motion as a sequence domain, with three distinguishing design decisions:

Standardized cross-source tokenization enabling cross-dataset training that prior work has not attempted at this scope.
Multi-resolution attention matched to the physics of motion — frame, window, and body-part levels correspond to the temporal and spatial scales at which movement is biomechanically organized.
Complementary autoregressive and generative architectures providing a controlled comparison that illuminates the strengths and limitations of each approach.

9. Architecture Summary

Component	HTT	Diffusion
Parameters	18.8M	11.4M
Encoder	3-stream: frame (4L), window (3L), body-part (3L)	Context encoder (6L)
Fusion	Bidirectional cross-attn (4L) + gated combination	Cross-attn in denoiser
Decoder / Generator	4-layer autoregressive decoder	6-layer denoiser, DDIM sampling
Embedding dim	256, 8 heads, FFN 1024	256, 8 heads, FFN 1024
Input	120 frames (8s) — 3 token views	120 frames (8s) — frame tokens
Output	30 frames (2s), sequential	30 frames (2s), single pass
Training time	~4 hours (29 epochs)	~0.5 hours (47 epochs)
Inference	Deterministic, real-time capable	Stochastic, batch-mode

10. Conclusion

The Large Movement Model demonstrates that human motion can be treated as a structured sequence domain analogous to text, and that multi-resolution attention architectures and diffusion-based generative models both learn meaningful motion structure from video-derived pose data. Trained on 12,483 clips from diverse full-body movement, the diffusion model reduces overall prediction error by 30% compared to a standard transformer baseline, with a 31% improvement at the clinically relevant 2-second horizon.

Cross-domain evaluation on held-out data from a different capture modality confirms that the learned motion dynamics transfer — velocity prediction quality is essentially invariant to training corpus, even when the model has never seen the target domain. These results establish feasibility for a new class of motion foundation models: domain-agnostic systems trained on diverse human movement data that serve as the perceptual and predictive backbone for applications across rehabilitation, sports, robotics, and beyond.

References

[1] Cao, Z. et al. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CVPR 2017.
[2] Martinez, J. et al. On Human Motion Prediction using Recurrent Neural Networks. CVPR 2017.
[3] Ho, J. et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
[4] Nichol, A. & Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. ICML 2021.
[5] Shazeer, N. et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.
[6] Vaswani, A. et al. Attention Is All You Need. NeurIPS 2017.
[7] Bengio, S. et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. NeurIPS 2015.
[8] Ghorbani, S. et al. MoVi: A Large Multi-Purpose Human Motion and Video Dataset. PLoS ONE 2021.
[9] Mahmood, N. et al. AMASS: Archive of Motion Capture as Surface Shapes. ICCV 2019.
[10] Song, J. et al. Denoising Diffusion Implicit Models. ICLR 2021.