Recoleta Item Note

Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

world-modelself-supervised-learningmultimodal-learningspatiotemporal-encodingearth-observation

Summary

DeepEarth proposes a self-supervised multimodal world model and introduces Earth4D, a planetary-scale 4D spatiotemporal positional encoder, to learn unified representations of Earth observation data. Its core claim is that, using only spatiotemporal coordinates and a small amount of metadata, it can outperform baseline models that use more modalities and larger pre-training datasets on ecological forecasting tasks.

Problem

Existing Earth observation/world models struggle to represent continuous 4D spatiotemporal information while simultaneously operating at global scale, long time horizons, and high spatial-temporal precision.
Multimodal Earth data (images, text, sensors, remote sensing) have complex distributions, and without strong spatiotemporal inductive bias, unified modeling and prediction are limited.
This matters because ecological forecasting and disaster risk assessment (such as wildfire-related vegetation moisture content) depend on accurately modeling “what will happen, where, and when.”

Approach

Proposes Earth4D: extending traditional 3D multi-resolution hash encoding to 4D by using four parallelizable 3D grids (xyz, xyt, yzt, xzt) to approximately model the joint spatiotemporal structure of (latitude, longitude, elevation, time).
Each grid uses a multi-resolution hash table, allowing coverage of planetary scale, across centuries under a fixed memory budget, and claims to achieve sub-meter, sub-second precision.
DeepEarth fuses Earth4D spatiotemporal embeddings with outputs from modality encoders (such as vision/language encoders) into tokens, and performs masked reconstruction-style self-supervised training in an autoencoder context window to learn joint distributions and support generative reconstruction/simulation.
To mitigate hash collisions, the authors add learned hash probing, enabling the model to learn better hash assignments from candidate indices, improving representation efficiency and downstream performance.

Results

On the Globe-LFMC 2.0 ecological forecasting benchmark, Earth4D achieves MAE 11.7pp, RMSE 18.7pp, R² 0.783 for Live Fuel Moisture Content prediction.
Compared with the baseline Galileo (pre-trained), whose inputs include remote sensing imagery + weather + terrain + (x,y,z,t) + species type and achieves MAE 12.6pp, RMSE 18.9pp, R² 0.72, Earth4D performs better using only (x,y,z,t) + species name.
The test set contains 13,297 samples; the figure reports a median absolute error of 7.1pp, and shows good tracking of seasonal variation over the 2017–2023 period.
Ablation results show that standard hash encoding without learned probing yields RMSE 26.0pp, MAE 16.6pp, R² 0.58; adding learned probing improves this to RMSE 18.7pp, MAE 11.7pp, R² 0.783, i.e. a 29.5% reduction in MAE and a 35.0% increase in R².
An extreme compressed version reduces the model from 800M parameters to 5M parameters (99.3% reduction, hash capacity 2^14), yet still achieves MAE 15.0pp / R² 0.668, with R² 14.7% higher than the 800M no-probing baseline, while delivering 4× faster training and 93% lower memory usage.
The appendix also claims that, on an RGB reconstruction task, learned probing further reduces validation loss by 18%; and attributes the performance gains to a 33% reduction in hash collisions under a 1M-point simulation.

Link

http://arxiv.org/abs/2603.07039v1

Built with Recoleta

Run your own research radar

Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.

View repo 5-minute quickstart