MetaOmics-10T

Arvid E. Gollwitzer
September 24, 2025
3 min

The Foundational Dataset to Unlock Causal Modeling of Microbial Ecosystems

Abstract

We propose MetaOmics-10T, an openly shareable, foundational dataset to unlock AI-accelerated discovery in microbial ecosystems. The dataset directly enables three high-impact AI tasks: (1) forecasting ecosystem dynamics, (2) predicting counterfactual outcomes of interventions, and (3) inverse design of microbial therapies under safety constraints. MetaOmics-10T combines 10 trillion base pairs reclaimed from public archives using a Quality-Aware Tokenization (QA-Token) framework with 100,000+ interventional trajectories generated via model-guided experimental design. The result is a first-of-its-kind, probabilistic, intervention-ready corpus that addresses the principal bottleneck for causal modeling in microbiome science and provides an empirical testbed to assess the reach and limits of causal inference at scale.

Key Results

MetaOmics-10T represents an unprecedented leap in scale, providing:

  • 10 trillion base pairs of genetic data—1,000× larger than current datasets
  • 10 million samples across diverse environments
  • 5,000+ metabolic features tracked at single-nucleotide resolution

This scale opens up transformative applications, ranging from personalized microbiome therapies designed to specification to rapid drug discovery driven by predicted microbial interactions.
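
As a rough sanity check on the headline figures under Key Results, the short Python sketch below derives two implied quantities: the average number of base pairs per sample and the size of the baseline that "1,000× larger" would correspond to. Both rest on assumptions not stated in the post: that the 10 million samples together account for the full 10 trillion base pairs, and that "larger" refers to total base pairs. The variable names are illustrative only.

```python
# Back-of-envelope arithmetic on the headline figures.
# Assumption: the 10 million samples jointly make up all 10 trillion base pairs.
# Assumption: "1,000x larger" compares total base pairs.

TOTAL_BASE_PAIRS = 10 * 10**12   # 10 trillion base pairs
NUM_SAMPLES = 10 * 10**6         # 10 million samples
SCALE_FACTOR = 1_000             # "1,000x larger than current datasets"

# Average sequence content per sample under the first assumption.
avg_bp_per_sample = TOTAL_BASE_PAIRS // NUM_SAMPLES

# Size of the comparison dataset implied by the scale factor.
implied_baseline_bp = TOTAL_BASE_PAIRS // SCALE_FACTOR

print(f"Average base pairs per sample: {avg_bp_per_sample:,}")
print(f"Implied baseline dataset size: {implied_baseline_bp:,} base pairs")
```

Under these assumptions the averages come to about one million base pairs per sample and a baseline of about ten billion base pairs; the actual per-sample distribution is not given in the post and may be far from uniform.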

Microbiome Foundation Models

Making sense of the microbial world

Anto is building multimodal foundation models for microbial communities, making the gut microbiome computable for the first time. We predict drug toxicity and efficacy across diverse populations and optimize molecules for universal efficacy — addressing the microbiome-driven causes of drug response and failures.