Technical Overview · April 2026

The Large Movement Model

A foundation model for human motion — treating pose sequences as a structured sequence domain to learn biomechanical structure, temporal coordination, and movement intent.

Aegis Station Infrastructure LLC  ·  aegisstation.com

1. Introduction

Human movement is one of the richest natural signals available. It encodes health status, task intent, emotional state, motor skill, and physical capability. Yet despite decades of work in biomechanics, computer vision, and machine learning, no unified AI framework exists that treats motion as a general-purpose sequence domain — one where the same foundational model can serve rehabilitation assessment, sports analytics, robotics control, and human-computer interaction.

The Large Movement Model (LMM) addresses this gap. LMM applies the same core insight that drives large language models — that sequence-learning architectures, given consistent tokenization and sufficient data, can discover generalizable latent structure — to human pose sequences. Where language models learn grammar and semantics from words, LMM learns biomechanical structure, temporal coordination, and movement intent from the evolution of body joint positions over time.

This document describes the LMM pipeline architecture, model designs, and Phase I experimental results. The work demonstrates that a multi-resolution attention architecture and a diffusion-based generative model both substantially outperform conventional baselines on motion forecasting tasks, establishing feasibility for a new class of embodied foundation models.


2. Data Pipeline

The LMM pipeline transforms raw video into structured, normalized, machine-learning-ready motion tokens. It is designed to ingest data from heterogeneous sources — different cameras, frame rates, resolutions, skeleton formats, and recording conditions — and produce a unified representation suitable for model training and evaluation.

2.1 Pose Estimation

Video inputs are processed through OpenPose to extract BODY_25 skeletal keypoints: 25 major body joints with per-joint confidence scores. For datasets already provided in skeleton format (Kinect, Vicon), the pipeline includes format converters that map alternative joint schemas to BODY_25 through documented correspondence tables. BODY_25 was selected as the Phase I standard because it balances representational fidelity with computational efficiency at roughly 1/100th the cost of mesh-based representations like SMPL.
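As a sketch of such a format converter (the joint correspondences below are illustrative placeholders, not the pipeline's documented tables):

```python
import numpy as np

# Hypothetical correspondence table: source (Kinect-style) joint index -> BODY_25 index.
# The real pipeline uses documented tables; these entries are illustrative only.
KINECT_TO_BODY25 = {
    3: 0,   # head           -> nose (approximate)
    2: 1,   # spine-shoulder -> neck
    8: 2,   # right shoulder
    9: 3,   # right elbow
    10: 4,  # right wrist
}

def convert_to_body25(src_frames: np.ndarray) -> np.ndarray:
    """Map (T, 25, 3) source keypoints into the BODY_25 layout.

    Unmapped BODY_25 joints are left at zero with zero confidence,
    the same way a missing detection would be represented downstream.
    """
    T = src_frames.shape[0]
    body25 = np.zeros((T, 25, 3), dtype=np.float32)
    for src, dst in KINECT_TO_BODY25.items():
        body25[:, dst, :] = src_frames[:, src, :]
    return body25
```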

2.2 Normalization

Raw keypoint sequences vary in scale, position, and temporal resolution. The normalization stage removes these extraneous variations: sequences are root-centered to remove absolute position, rescaled to a common body size, and resampled to a common frame rate (15 FPS, i.e. 66 ms per frame).
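A minimal sketch of this stage, assuming a mid-hip root joint, a neck-to-hip scale estimate, and linear resampling (plausible choices, not the pipeline's exact ones):

```python
import numpy as np

def normalize_sequence(keypoints: np.ndarray, src_fps: float, target_fps: float = 15.0) -> np.ndarray:
    """Normalize a (T, 25, 3) keypoint sequence (x, y, confidence).

    Three steps, matching the variations the stage removes:
    position (root-centering), scale, and temporal resolution.
    The mid-hip root and torso-based scale are illustrative assumptions.
    """
    xy, conf = keypoints[..., :2], keypoints[..., 2:]

    # 1. Root-center: subtract the mid-hip position per frame.
    root = xy[:, 8:9, :]            # BODY_25 index 8 = mid-hip (assumed root)
    xy = xy - root

    # 2. Scale: divide by a per-sequence body-size estimate
    #    (median neck-to-mid-hip distance; neck = index 1).
    torso = np.linalg.norm(xy[:, 1, :], axis=-1)
    scale = np.median(torso) + 1e-8
    xy = xy / scale

    # 3. Resample to the target frame rate by linear interpolation.
    T = xy.shape[0]
    t_src = np.arange(T) / src_fps
    t_dst = np.arange(0.0, t_src[-1], 1.0 / target_fps)
    out = np.stack([
        np.stack([np.interp(t_dst, t_src, xy[:, j, d]) for d in range(2)], axis=-1)
        for j in range(xy.shape[1])
    ], axis=1)
    conf_rs = np.stack(
        [np.interp(t_dst, t_src, conf[:, j, 0]) for j in range(conf.shape[1])], axis=1
    )[..., None]
    return np.concatenate([out, conf_rs], axis=-1)
```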

2.3 Multi-Resolution Tokenization

Motion data is structured into three simultaneous token representations. All three reconstruct losslessly to the original — no information is discarded, only reorganized.

Token Level   Shape          What It Captures                                 Resolution
Frame         (T, 75)        Full body pose at each timestep                  66 ms
Window        (T/5, 5, 75)   Groups of 5 frames — movement phrases            330 ms
Body-Part     (T, 5, 21)     5 anatomical regions: core, R/L arm, R/L leg     66 ms (spatial)

The frame level provides the finest temporal grain. The window level captures meso-scale patterns (a step, an arm swing). The body-part level captures spatial coordination between anatomical regions. Together they give a model access to movement structure at multiple scales without committing to which scale matters most.
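The three views can be sketched as reshapes and slices of the same array. The body-part index groups below are a hypothetical 5 × 7-joint partition (with shared context joints so each region yields 21 values); the pipeline's documented grouping may differ:

```python
import numpy as np

# Hypothetical grouping: 5 regions of 7 BODY_25 joints each (7 x 3 = 21 features).
# Context joints (neck, mid-hip) are shared across regions; indices are illustrative.
BODY_PARTS = [
    [0, 1, 8, 15, 16, 17, 18],    # core: nose, neck, mid-hip, eyes, ears
    [2, 3, 4, 1, 0, 8, 9],        # right arm (+ shared context joints)
    [5, 6, 7, 1, 0, 8, 12],       # left arm
    [9, 10, 11, 22, 23, 24, 8],   # right leg (+ foot keypoints)
    [12, 13, 14, 19, 20, 21, 8],  # left leg
]

def tokenize(frames: np.ndarray):
    """Build the three views of a (T, 25, 3) sequence.

    T is truncated to a multiple of 5 so the window view reshapes cleanly.
    """
    T = frames.shape[0] - frames.shape[0] % 5
    frames = frames[:T]
    frame_tok = frames.reshape(T, 75)              # (T, 75)
    window_tok = frame_tok.reshape(T // 5, 5, 75)  # (T/5, 5, 75)
    part_tok = np.stack(
        [frames[:, idx, :].reshape(T, 21) for idx in BODY_PARTS], axis=1
    )                                              # (T, 5, 21)
    return frame_tok, window_tok, part_tok
```

Because the window view is a pure reshape of the frame view, and the part view covers every joint, all three reconstruct the original sequence.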

2.4 Pipeline Performance

The pipeline has been validated on over 3,900 clips from three independent sources spanning laboratory motion capture, clinical rehabilitation video, and depth-sensor skeleton data. All clips processed without failure.


3. Model Architectures

Three models were implemented and evaluated: a flat transformer baseline and two novel architectures. Each takes 120 frames (8 seconds) of context and produces 30 predicted frames (2 seconds).

3.1 Flat Transformer Baseline

A standard encoder-decoder transformer operating on single-resolution frame tokens. Six encoder layers, six decoder layers, sinusoidal positional encodings based on physical time (seconds, not frame index). This serves as the controlled baseline.
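The time-based positional encoding can be sketched as follows; this is the standard transformer form with the usual 10000 base, assumed rather than taken from the LMM code:

```python
import numpy as np

def time_positional_encoding(times_s: np.ndarray, d_model: int = 256) -> np.ndarray:
    """Sinusoidal encodings over physical time in seconds.

    Standard Vaswani et al. form, except the position argument is wall-clock
    time rather than frame index, so sequences recorded at different frame
    rates share one temporal coordinate system.
    """
    dims = np.arange(d_model // 2)
    freqs = 1.0 / (10000.0 ** (2 * dims / d_model))  # (d_model/2,)
    angles = times_s[:, None] * freqs[None, :]       # (T, d_model/2)
    pe = np.empty((times_s.shape[0], d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```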

3.2 Hierarchical Temporal Transformer (HTT)

The HTT processes all three token levels simultaneously through three parallel encoder streams connected by bidirectional cross-attention:

    Input: 120 frames of context (8 seconds at 15 FPS)
                        │
            ┌───────────┼────────────┐
            ▼           ▼            ▼
         Frame        Window     Body-part
         tokens       tokens       tokens
       (120, 75)   (24, 5, 75)  (120, 5, 21)
            │           │            │
        Frame enc   Window enc  Body-part enc
       (4 layers)   (3 layers)   (3 layers)
            │           │            │
            └─────┬─────┘            │
           Cross-attn (2L)           │
           frame ↔ window            │
                  │                  │
                  └────────┬─────────┘
                    Cross-attn (2L)
                   frame ↔ body-part
                           │
                     Gated fusion
                           │
                    Decoder (4 layers)
                           │
                   30 predicted frames

A gated linear combination fuses the three encoder outputs, and a 4-layer decoder generates the forecast autoregressively. Scheduled sampling addresses autoregressive drift by gradually transitioning the decoder from ground-truth to self-predicted inputs during training (0% → 50% over 25 epochs).
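Two pieces of this design can be sketched directly. The per-timestep softmax gate and the linear sampling ramp are plausible readings of the text, not confirmed implementation details:

```python
import numpy as np

def gated_fusion(frame_h: np.ndarray, window_h: np.ndarray,
                 part_h: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Gated linear combination of the three encoder outputs.

    All hidden states are (T, d), already aligned to the frame rate;
    gate_logits is (T, 3). A per-timestep softmax over the three streams
    is one plausible reading of "gated linear combination".
    """
    w = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                     # (T, 3)
    stacked = np.stack([frame_h, window_h, part_h], axis=-1)  # (T, d, 3)
    return (stacked * w[:, None, :]).sum(axis=-1)             # (T, d)

def self_feed_probability(epoch: int, ramp_epochs: int = 25, max_p: float = 0.5) -> float:
    """Scheduled-sampling ramp: probability that the decoder consumes its own
    previous prediction instead of the ground-truth frame.

    Matches the 0% -> 50% over 25 epochs schedule described above; the
    linear shape of the ramp is an assumption.
    """
    return min(max_p, max_p * epoch / ramp_epochs)
```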

3.3 Motion Diffusion Model

Instead of predicting frame by frame, the diffusion model generates the full 30-frame forecast in a single pass by iteratively denoising random noise into a plausible motion trajectory, conditioned on the context.

Because diffusion generates all frames simultaneously, it does not suffer from compounding autoregressive error.
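A deterministic DDIM sampling loop (Song et al., 2021, eta = 0) conditioned on the context might look like the sketch below; the denoiser interface and the linear beta schedule are assumptions, not the LMM implementation:

```python
import numpy as np

def ddim_sample(denoise_fn, context, steps: int = 50, T: int = 1000,
                shape=(30, 75), seed: int = 0) -> np.ndarray:
    """Generate one 30-frame forecast by iterative denoising.

    denoise_fn(x_t, t, context) -> predicted noise. Its architecture, and
    the linear beta schedule below, are illustrative assumptions.
    """
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)

    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                 # start from pure noise
    ts = np.linspace(T - 1, 0, steps).astype(int)  # coarse time grid

    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = denoise_fn(x, t, context)
        # Predict the clean motion, then jump deterministically to t_prev.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x
```

Every frame of the forecast is updated at every denoising step, which is why no frame's error compounds on another's.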


4. Experiments and Results

4.1 Setup

All models were trained on the same 1,044 clips (~2.2M frames), with an identical 90/10 train/validation split by subject and identical hyperparameters: AdamW optimizer, learning rate 1e-4, batch size 32, early stopping with patience 10. The task: given 120 frames of context, predict the next 30 frames. The HTT and the flat baseline were evaluated autoregressively; the diffusion model was evaluated as the sample mean over 10 DDIM trajectories.
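The two evaluation protocols can be made concrete with a short sketch; the plain per-element MSE is assumed to match Table 1's metric:

```python
import numpy as np

def forecast_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error over all frames and features (assumed metric)."""
    return float(np.mean((pred - target) ** 2))

def diffusion_eval_mse(sample_fn, target: np.ndarray, n_samples: int = 10) -> float:
    """Diffusion evaluation protocol: MSE of the mean of n_samples DDIM
    trajectories, where sample_fn() returns one (30, 75) forecast.
    """
    mean_pred = np.mean([sample_fn() for _ in range(n_samples)], axis=0)
    return forecast_mse(mean_pred, target)
```

Averaging before scoring is what makes this protocol variance-reducing, a point section 4.3 returns to when comparing the models.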

4.2 Results

Model           Params   Frame 1 (0.07s)   Frame 15 (1.0s)   Frame 30 (2.0s)   Overall
Flat Baseline   11.1M    0.040             0.722             1.239             0.755
HTT             18.8M    0.047             0.526             0.954             0.514
Diffusion       11.4M    0.068             0.247             0.323             0.243

Table 1. Motion forecasting MSE (lower is better). Primary corpus, 1,044 clips.

4.3 Analysis

Autoregressive drift is the central challenge. The flat baseline predicts the next frame reasonably (0.040) but degrades to 1.239 by frame 30 — a 31× increase. Single-resolution attention is insufficient for multi-second coherence.

Multi-resolution attention substantially reduces drift. The HTT reduces overall error by 32% and frame-30 error by 23%, with improvement concentrated at the 1–2 second horizons where maintaining coherence matters most.

Diffusion avoids drift entirely. Overall MSE is 68% lower than baseline, 53% lower than the HTT. Frame-30 error (0.323) is less than the baseline’s frame-1 error.

A note on comparison fairness. The HTT’s MSE comes from a single deterministic rollout. The diffusion model’s MSE is computed over the mean of 10 samples, which reduces variance. The comparison is directionally valid but the magnitude of improvement should be interpreted with this methodological difference in mind.

The two architectures are complementary. The HTT produces a single deterministic trajectory, frame by frame — suitable for real-time applications. The diffusion model produces a distribution over possible futures in batch — suitable for offline analysis and uncertainty quantification.


5. What the Model Learns

A model that predicts human motion accurately has learned something about how the body works. Several observations from the Phase I experiments:

Temporal coordination across timescales. The HTT’s window encoder captures meso-scale patterns the frame encoder misses: stride cadence, arc-and-return of reaching, weight transfer timing. Cross-attention constrains fine-grained predictions within these broader patterns — which is what reduces drift.

Spatial coordination across body regions. The body-part encoder learns how anatomical regions move relative to each other: contralateral arm-leg coordination in gait, trunk stabilization during limb movement, bilateral symmetry. These relationships emerge from the data and are precisely the kind of structure needed to detect compensatory movement patterns.

Biomechanical plausibility. Predicted sequences generally respect physical constraints: joints stay within anatomically plausible ranges, limb lengths remain approximately constant, center-of-mass trajectories stay continuous. These properties emerge from the data distribution without explicit enforcement through loss terms.
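One way to audit the limb-length observation post hoc is a drift check like the one below. The bone list is an illustrative subset of the BODY_25 skeleton, and this is a diagnostic, not a training loss:

```python
import numpy as np

# Illustrative subset of BODY_25 bones as (parent, child) joint-index pairs.
BONES = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7), (8, 9), (9, 10)]

def limb_length_drift(seq: np.ndarray) -> float:
    """Max relative deviation of any bone length from its per-sequence median.

    seq is (T, 25, 2) predicted joint positions. Values near zero indicate
    the prediction keeps limb lengths approximately constant over time.
    """
    worst = 0.0
    for p, c in BONES:
        lengths = np.linalg.norm(seq[:, p] - seq[:, c], axis=-1)
        med = np.median(lengths) + 1e-8
        worst = max(worst, float(np.max(np.abs(lengths - med)) / med))
    return worst
```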


6. Cross-Domain Applicability

The pipeline and model architectures are domain-agnostic by design. The BODY_25 skeleton, normalization procedure, and multi-resolution tokenization make no assumptions about what kind of movement is being analyzed. Expanding to new domains requires domain-appropriate training data, not architectural changes.


7. Current Limitations


8. Relation to Prior Work

The individual components of the LMM architecture draw on established techniques: multi-resolution transformers, cross-attention fusion, gated combination layers, scheduled sampling (Bengio et al., 2015), and body-part spatial decomposition. LMM’s contribution is not architectural novelty but the synthesis of these into a coherent framework for motion as a sequence domain, with three distinguishing design decisions:


9. Architecture Summary

Component             HTT                                                  Diffusion
Parameters            18.8M                                                11.4M
Encoder               3-stream: frame (4L), window (3L), body-part (3L)    Context encoder (6L)
Fusion                Bidirectional cross-attn (4L) + gated combination    Cross-attn in denoiser
Decoder / Generator   4-layer autoregressive decoder                       6-layer denoiser, DDIM sampling
Embedding dim         256, 8 heads, FFN 1024                               256, 8 heads, FFN 1024
Input                 120 frames (8s) — 3 token views                      120 frames (8s) — frame tokens
Output                30 frames (2s), sequential                           30 frames (2s), single pass
Training time         ~4.5 hours (33 epochs)                               ~2 hours (57 epochs)
Inference             Deterministic, real-time capable                     Stochastic, batch-mode

10. Conclusion

The Large Movement Model demonstrates that human motion can be treated as a structured sequence domain analogous to text, and that multi-resolution attention architectures and diffusion-based generative models both learn meaningful motion structure from video-derived pose data. The HTT reduces autoregressive forecasting drift by 32% over a standard transformer baseline, and the diffusion model reduces overall prediction error by 68%.

These results establish feasibility for a new class of motion foundation models — domain-agnostic systems trained on diverse human movement data that can serve as the perceptual and predictive backbone for applications across rehabilitation, sports, robotics, and beyond.


References

[1] Cao, Z. et al. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CVPR 2017.
[2] Martinez, J. et al. On Human Motion Prediction using Recurrent Neural Networks. CVPR 2017.
[3] Ho, J. et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
[4] Nichol, A. & Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. ICML 2021.
[5] Shazeer, N. et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.
[6] Vaswani, A. et al. Attention Is All You Need. NeurIPS 2017.
[7] Bengio, S. et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. NeurIPS 2015.
[8] Zhang, C. et al. MoVi: A Large Multipurpose Motion and Video Dataset. PLoS ONE 2021.
[9] Mahmood, N. et al. AMASS: Archive of Motion Capture as Surface Shapes. ICCV 2019.
[10] Song, J. et al. Denoising Diffusion Implicit Models. ICLR 2021.