The shape of the GEO Index.
The GEO Index is a structured, query-ready view of the human transcriptomics corpus — metadata enriched across every series, and ARCHS4-anchored expression for every Homo sapiens study (bulk and single-cell, gene-level). Find a contrast and the expression matrix is there in the same call.
Snapshot from corpus_overview · pipeline v3.5 · April 30, 2026.
GEO has the data. What it lacks is structure — disease names live in free-text characteristics, study design is buried in curator notes, and there is no native concept of a case-control contrast.
Devano fills those gaps: every contrast typed, every disease ontology-grounded, every label graded against a four-state evidence rubric traceable to raw sample characteristics.
300+ MONDO-grounded diseases, queryable at cohort scale.
Disease names in raw GEO are free-text strings — "breast cancer" and "BC" are different strings with no shared identity, and "invasive ductal carcinoma" looks unrelated to either unless you already know it's a breast-cancer subtype. The long tail of less-common conditions stays invisible to anyone without a curated thesaurus.
Devano resolves every disease mention to MONDO, collapses lexical aliases (“BC” → “breast cancer”), and uses the ontology hierarchy — so a query for breast cancer surfaces every IDC, ILC, and DCIS study in the corpus.
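As a rough illustration of the two steps described above (alias collapse, then hierarchy roll-up), here is a minimal sketch. The alias table, the parent edges, the study names, and the function names are all hypothetical placeholders, not real MONDO identifiers and not Devano's implementation.

```python
from typing import Dict, List, Optional

# Hypothetical alias table: lower-cased free-text mention -> canonical term.
ALIASES: Dict[str, str] = {
    "bc": "breast cancer",
    "breast cancer": "breast cancer",
    "idc": "invasive ductal carcinoma",
    "invasive ductal carcinoma": "invasive ductal carcinoma",
    "ilc": "invasive lobular carcinoma",
    "dcis": "ductal carcinoma in situ",
    "glioma": "glioma",
}

# Hypothetical child -> parent edges standing in for the ontology hierarchy.
PARENT: Dict[str, str] = {
    "invasive ductal carcinoma": "breast cancer",
    "invasive lobular carcinoma": "breast cancer",
    "ductal carcinoma in situ": "breast cancer",
}


def normalize(mention: str) -> Optional[str]:
    """Collapse a free-text mention to its canonical term, if known."""
    return ALIASES.get(mention.strip().lower())


def is_a(term: Optional[str], ancestor: str) -> bool:
    """Walk parent edges upward; True if `term` is `ancestor` or a descendant."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def studies_for(query: str, study_mentions: Dict[str, str]) -> List[str]:
    """Return studies whose mention resolves to the query term or a subtype."""
    target = normalize(query)
    if target is None:
        return []
    return sorted(
        study for study, mention in study_mentions.items()
        if is_a(normalize(mention), target)
    )


# Toy corpus: study names and mentions are made up for the example.
MENTIONS = {
    "study_a": "BC",
    "study_b": "Invasive ductal carcinoma",
    "study_c": "DCIS",
    "study_d": "glioma",
}
breast_hits = studies_for("breast cancer", MENTIONS)
```

A query for the parent term picks up the alias and both subtypes, while the unrelated study is left out.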
The corpus isn't oncology-only — neuro, infectious disease, GI, and autoimmune studies make up the long tail and stay queryable as named MONDO branches, not lost in an “Other” bucket.
Sample-label hits per MONDO-grounded disease. A single GSM can appear under multiple diseases when contrast arms span comorbidities.
The curve drops sharply after the headline picks, then carries hundreds of additional diseases — depth doesn't end at the chart above.
Even past the top-ranked diseases, sample-label volume stays in the thousands — low-prevalence conditions remain cohort-grade queryable rather than anecdotal.
Most GEO samples never report the demographic fields that biomarker and generalizability work depends on from the outset.
The GEO Index surfaces sex and ancestry as filterable facets where they exist, and marks the absence everywhere else — so a cohort selection requiring sex stratification doesn't silently include samples that never reported it.
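The point about marking absence can be sketched as follows. This is an illustrative example only: the record layout, the `sex` key, and the function name are assumptions, not the GEO Index schema.

```python
from typing import Dict, List, Optional, Tuple

# Toy sample records; a missing key and an explicit None both mean "not reported".
SAMPLES: List[Dict[str, Optional[str]]] = [
    {"sample": "s1", "sex": "female"},
    {"sample": "s2", "sex": None},
    {"sample": "s3", "sex": "male"},
    {"sample": "s4"},
]


def split_by_reported(
    samples: List[Dict[str, Optional[str]]], field: str
) -> Tuple[List[str], List[str]]:
    """Separate samples that report `field` from those that never did."""
    reported: List[str] = []
    missing: List[str] = []
    for record in samples:
        sample_id = record["sample"]
        if record.get(field) is not None:
            reported.append(sample_id)
        else:
            missing.append(sample_id)
    return reported, missing


reported, missing = split_by_reported(SAMPLES, "sex")
```

Keeping the `missing` list explicit, rather than dropping those samples quietly or defaulting them to a value, is what lets a sex-stratified cohort exclude them on purpose.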
Filter the corpus by design class, contrast type, and investigation intent — structure raw GEO doesn't expose.
Search GEO for longitudinal studies as a structured DataSet Type and you get zero results; the same query as free text returns 8,067 — an uncalibrated full-text scan over summary prose. The same holds for drug perturbation, case-control, and method validation: design lives in the free-text summary, encoded as semi-structured textual descriptions that can't be queried as data.[2]
Devano runs LLM extraction over every summary and assigns three orthogonal facets — design class, investigation intent, contrast type — and the charts below are counts over those extracted facets. A signature is a grouping of related samples (e.g. all treated arms); a contrast is a meaningful juxtaposition of two signatures (e.g. treated vs. control) — the unit you'd run differential expression on.
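One way to picture the signature and contrast definitions above is as two small record types. The class and field names here are hypothetical, chosen for illustration; they are not taken from the Devano schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Signature:
    """A named grouping of related samples, e.g. all treated arms."""
    name: str
    sample_ids: Tuple[str, ...]


@dataclass(frozen=True)
class Contrast:
    """A juxtaposition of two signatures: the unit for differential expression."""
    contrast_type: str
    case: Signature
    reference: Signature

    def __post_init__(self) -> None:
        # A sample cannot sit on both sides of the same comparison.
        overlap = set(self.case.sample_ids) & set(self.reference.sample_ids)
        if overlap:
            raise ValueError(f"samples in both arms: {sorted(overlap)}")

    def design(self) -> Dict[str, str]:
        """Map each sample to its group label, as a DE design column would."""
        labels = {s: self.case.name for s in self.case.sample_ids}
        labels.update({s: self.reference.name for s in self.reference.sample_ids})
        return labels


treated = Signature("treated", ("s1", "s2"))
control = Signature("control", ("s3", "s4"))
contrast = Contrast("treatment", treated, control)
```

Calling `contrast.design()` yields the per-sample group labels a differential-expression tool would take as input.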
The corpus is overwhelmingly perturbation biology: ~70% of structurally enriched studies apply a treatment, knockout, exposure, or other intervention rather than passively profiling disease.
Devano determines intent from the full study summary, not just title keywords. The disease-focused class dominates because the corpus is heavily clinical-translational.
Most contrasts compare a drug, perturbation, or exposure against a baseline arm. Color: clinical (treatment, case-control), biology (temporal, spatial), exposure (dose-response).
Every label, audit-traceable.
Anyone can publish clean-looking labels; few can show you why each one was assigned. A 2023 assessment of 1,233 in vivo toxicology datasets in GEO found none met the MIATE minimum reporting standard.[4]
Every Devano label keeps its full provenance chain — raw characteristics blob, extracted tags, signature reasoning paragraph, and the ontology-mapped call — graded direct, inferred, plausible, or tenuous. The 99.9% direct-or-inferred figure below is auditable per-sample, not just asserted.
99.9% of labels are direct or inferred evidence.
The rubric is a near-binary quality gate. The pipeline suppresses low-confidence labels by design — <0.1% of all labels are tagged plausible or tenuous.
1.60M direct · 627.9k inferred · 1,259 plausible · 334 tenuous
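The headline share can be checked from the four counts above. This is a back-of-the-envelope sketch: the direct and inferred counts are the rounded display values (1.60M and 627.9k), so the result is approximate.

```python
# Counts as displayed; "direct" and "inferred" are rounded in the source.
counts = {
    "direct": 1_600_000,
    "inferred": 627_900,
    "plausible": 1_259,
    "tenuous": 334,
}

total = sum(counts.values())
high_confidence = counts["direct"] + counts["inferred"]
low_confidence = counts["plausible"] + counts["tenuous"]

high_share = high_confidence / total
low_share = low_confidence / total

print(f"direct or inferred: {high_share:.1%}")
print(f"plausible or tenuous: {low_share:.3%}")
```

With these rounded inputs the direct-or-inferred share comes out at 99.9% to one decimal place, and the plausible-or-tenuous share stays below 0.1%, consistent with the figures quoted above.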
Raw characteristics: cell line: H446; cell type: SCLC; xenograft: yes; treatment: JIB-04; time: 3x per week
Extracted tags: cell_line: "H446"; disease_text: "SCLC"; model: "xenograft"; treatment: "JIB-04"; timing: "3x/week"
Signature reasoning: "Across this signature the cell_line and model tags are constant (H446 SCLC xenograft) while treatment varies between vehicle and JIB-04 at matched timepoints, a clean drug-response arm…"
Ontology-mapped call: material: h446 SCLC xenograft; biosample: BTO_0002206; disease: MONDO_0008433; treatment_type: drug; treatment_name: jib-04; evidence: direct
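The four-step chain in this example could be held in a single record with an evidence gate on top. The sketch below is illustrative only: the class names, field names, and the numeric ordering of the grades are assumptions, with the values copied from the example above.

```python
from dataclasses import dataclass, replace
from enum import IntEnum
from typing import Dict


class Evidence(IntEnum):
    # Assumed ordering, weakest to strongest, following the four-state rubric.
    TENUOUS = 0
    PLAUSIBLE = 1
    INFERRED = 2
    DIRECT = 3


@dataclass
class LabelProvenance:
    """One label plus every step that led to it (hypothetical layout)."""
    raw_characteristics: str
    extracted_tags: Dict[str, str]
    reasoning: str
    mapped_call: Dict[str, str]
    evidence: Evidence

    def passes_gate(self, minimum: Evidence = Evidence.INFERRED) -> bool:
        """True when the label is graded at or above the minimum evidence level."""
        return self.evidence >= minimum


record = LabelProvenance(
    raw_characteristics=(
        "cell line: H446 cell type: SCLC xenograft: yes "
        "treatment: JIB-04 time: 3x per week"
    ),
    extracted_tags={
        "cell_line": "H446",
        "disease_text": "SCLC",
        "model": "xenograft",
        "treatment": "JIB-04",
        "timing": "3x/week",
    },
    reasoning="cell_line and model constant; treatment varies vs. vehicle",
    mapped_call={
        "biosample": "BTO_0002206",
        "disease": "MONDO_0008433",
        "treatment_type": "drug",
        "treatment_name": "jib-04",
    },
    evidence=Evidence.DIRECT,
)

# The same label with a weaker grade would be held back by the gate.
weak = replace(record, evidence=Evidence.TENUOUS)
```

Because the raw text travels with the mapped call, an auditor can compare the two directly instead of trusting the grade.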
Every drug, gene, and exposure tagged at signature level.
The corpus is a perturbation atlas, not a disease atlas.
In raw GEO, the treatment field is free text: the same compound appears under a dozen names across cohorts, and dose and timepoint are not structured anywhere. Normalize those tags and the queries change shape. Signature reversal, the technique behind the Connectivity Map's 1.3M-profile compendium[5], looks for perturbations whose expression signature opposes a disease signature, and with normalized tags it can be run across cohorts to surface candidate inhibitors.
Devano extracts perturbation mode, treatment name, dose, and timepoint per contrast arm — the facets the charts below are built on.
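To make the signature-reversal idea concrete, here is a deliberately simplified sketch: a negated cosine similarity over log fold changes for the genes two signatures share. This is not the Connectivity Map's connectivity score and not Devano's method; the gene names, values, and function name are invented for illustration.

```python
import math
from typing import Dict


def reversal_score(disease_lfc: Dict[str, float], drug_lfc: Dict[str, float]) -> float:
    """Negated cosine similarity over shared genes.

    +1 means the drug signature is the exact opposite of the disease
    signature; -1 means it mimics the disease signature.
    """
    genes = sorted(set(disease_lfc) & set(drug_lfc))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    x = [disease_lfc[g] for g in genes]
    y = [drug_lfc[g] for g in genes]
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("a signature is all zeros on the shared genes")
    return -dot / norm


# Toy log fold changes (case vs. reference) for three placeholder genes.
disease = {"gene_a": 2.0, "gene_b": -1.0, "gene_c": 0.5}
reverser = {"gene_a": -2.0, "gene_b": 1.0, "gene_c": -0.5}
mimic = {"gene_a": 2.0, "gene_b": -1.0, "gene_c": 0.5}

reverser_score = reversal_score(disease, reverser)
mimic_score = reversal_score(disease, mimic)
```

Ranking every drug contrast in a corpus by a score like this only works when treatment name, dose, and timepoint are normalized per arm, which is why the extraction step matters.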
Studies with an active perturbation arm. Multi-value: one study can carry several tags.
Tissues and assays, unified.
Looking for "blood-derived" or "stem cell origin" studies in GEO? These concepts aren't easy to find when wading through the raw data. Devano normalizes tissue names against BTO, CL, and UBERON and rolls them up to families, so the distributions are ontology-resolved counts rather than string matches.
Epigenomic co-assays still surface via text-matching today, with a first-class assay_type bucket scheduled for pipeline v3.6.
Tissue counts are over validated GEO sample submissions, not over the ARCHS4-quantified subset — some rows (e.g. cultured mouse embryonic stem cells) appear here as metadata-only entries.
Both bulk and scRNA-seq are gene-level — single-cell samples are pseudobulk-equivalent in this corpus; cell-level resolution (UMAPs, cluster labels) lives in supplementary files we don't ingest.
Single-cell atlases (CellxGene, HCA, Tabula Sapiens) are the right home for cross-tissue cell-type reference work. The GEO Index is the complement — 20 years of human bulk and gene-level scRNA-seq across clinical-trial cohorts, perturbation screens, and toxicology that sc-atlases don't ingest and that legacy GEO can't query.
Built to scan. Built to drill.
Most data products force agents into one resolution — a 5-line summary or a 50-page dump. The Devano MCP surface lets an agent orient in a couple hundred tokens, search in fifty tokens per result, open any study in a few hundred more, and trace any label back to the raw submission text. Breadth and provenance from the same handful of calls.
One MCP. Four resolutions. Orient → search → open → audit.
Four storage layers feed every answer — raw GEO submissions, AI-extracted structure, validated ontology-mapped calls, and pre-computed indices. Progressive disclosure by design: an agent pulls only the resolution it needs.
Headline counts emit from corpus_overview against the three-layer DuckDB build at pipeline geo-index-v3.5, snapshot 2026-04-30.
Numbers refresh after each pipeline rebuild — typically monthly.
Series are classified by dominant organism; Homo sapiens (Hs) series may contain mixed-organism samples for xenograft and comparative designs.
[1] The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository. Genome Biology, 2025. doi:10.1186/s13059-025-03725-0
[2] Hu & Wang. Mining data and metadata from the gene expression omnibus. Biophysical Reviews 10(6), 2018. doi:10.1007/s12551-018-0490-8
[3] Lim, Tesar, Belmadani, et al. Curation of over 10,000 transcriptomic studies to enable data reuse. Database, 2021. doi:10.1093/database/baab006 · Gemma curation; sex-label discrepancies and sample-field completeness statistics.
[4] Nault R, Cave MC, Ludewig G, Moseley HNB, Pennell KG, Zacharewski T. A Case for Accelerating Standards to Achieve the FAIR Principles of Environmental Health Research Experimental Data. Environmental Health Perspectives 131(6):065001, 2023. doi:10.1289/EHP11484 · Assessment of 1,233 in vivo GEO toxicology datasets against the MIATE/invivo minimum-reporting standard; reference repository at github.com/zacharewskilab/MIATE.
[5] Subramanian, Narayan, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171(6):1437–1452, 2017. doi:10.1016/j.cell.2017.10.049