CORPUS OVERVIEW

The shape of the GEO Index.

The GEO Index is a structured, query-ready view of the human transcriptomics corpus — metadata enriched across every series, and ARCHS4-anchored expression for every Homo sapiens study (bulk and single-cell, gene-level). Find a contrast and the expression matrix is there in the same call.

Studies: 42,559 (36,797 validated · 525 quarantined · ~5,237 pending re-ingest)
Samples (GSM): 1.24M (validated, from 1.45M raw GEO samples)
Sample-labels: 2.23M (structured per-arm labels · +53% amplification over GSMs)
Structured contrasts: 120,589 (~3.2 contrasts per study · the unit raw GEO doesn't have)

Snapshot from corpus_overview · pipeline v3.5 · April 30, 2026.

THE PROBLEM

GEO has the data. What it lacks is structure — disease names live in free-text characteristics, study design is buried in curator notes, and there is no native concept of a case-control contrast.

2025 GENOME BIOLOGY AUDIT · 61,312 GEO SERIES
11.5% share required metadata in full
37.9% share less than 40% of phenotypic fields[1]

Devano fills those gaps: every contrast typed, every disease ontology-grounded, every label graded against a four-state evidence rubric traceable to raw sample characteristics.

02 · Diseases & therapeutic areas

300+ MONDO-grounded diseases, queryable at cohort scale.

Disease names in raw GEO are free-text strings — "breast cancer" and "BC" are different strings with no shared identity, and "invasive ductal carcinoma" looks unrelated to either unless you already know it's a breast-cancer subtype. The long tail of less-common conditions stays invisible to anyone without a curated thesaurus.

WHAT WE DO

Devano resolves every disease mention to MONDO, collapses lexical aliases (“BC” → “breast cancer”), and uses the ontology hierarchy — so a query for breast cancer surfaces every IDC, ILC, and DCIS study in the corpus.
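
A minimal sketch of the two steps described above, alias collapse followed by hierarchy expansion. The parent map, alias table, and study IDs are toy stand-ins invented for illustration; they are not real MONDO identifiers and not Devano's implementation.

```python
from collections import defaultdict

# Toy stand-in for an ontology hierarchy: child term -> parent term.
# Term names are illustrative labels, not real MONDO identifiers.
PARENT = {
    "invasive ductal carcinoma": "breast cancer",
    "invasive lobular carcinoma": "breast cancer",
    "ductal carcinoma in situ": "breast cancer",
    "breast cancer": "cancer",
}

# Lexical aliases collapsed to a canonical term before the hierarchy lookup.
ALIASES = {
    "bc": "breast cancer",
    "idc": "invasive ductal carcinoma",
    "ilc": "invasive lobular carcinoma",
    "dcis": "ductal carcinoma in situ",
}


def canonical(mention: str) -> str:
    """Normalize case and whitespace, then collapse known aliases."""
    key = mention.strip().lower()
    return ALIASES.get(key, key)


def descendants(term: str) -> set[str]:
    """Return the term plus every term below it in the toy hierarchy."""
    children = defaultdict(set)
    for child, parent in PARENT.items():
        children[parent].add(child)
    found, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def query(studies: dict[str, str], disease: str) -> list[str]:
    """Study IDs whose disease mention resolves to the queried term or a subtype."""
    wanted = descendants(canonical(disease))
    return sorted(sid for sid, mention in studies.items() if canonical(mention) in wanted)


# Hypothetical study IDs mapped to raw free-text disease mentions.
studies = {"S1": "IDC", "S2": "BC", "S3": "DCIS", "S4": "melanoma"}
print(query(studies, "breast cancer"))
```

Querying the parent term returns the subtype studies as well, while querying a subtype such as "IDC" returns only that branch.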

The corpus isn't oncology-only — neuro, infectious disease, GI, and autoimmune studies make up the long tail and stay queryable as named MONDO branches, not lost in an “Other” bucket.

Top diseases by sample-label volume · source: validated_sample_labels (deduped, neuro breadth reserved)
Breast cancer · 28,348
Colorectal cancer · 19,832
Type 2 diabetes · 18,326
Glioblastoma · 14,968
Prostate cancer · 14,036
Acute myeloid leukemia · 13,950
Melanoma · 10,629
Non-small cell lung cancer · 9,185
Hepatocellular carcinoma · 8,963
Neuroblastoma · 8,734
Pancreatic ductal adenocarcinoma · 6,496
Severe dengue · 6,262
COVID-19 · 6,249
Crohn's disease · 5,713
Psoriasis · 5,695
Legend: Oncology · Metabolic · Neuro · Autoimmune · GI · Infectious · Other

Sample-label hits per MONDO-grounded disease. A single GSM can appear under multiple diseases when contrast arms span comorbidities.

LONG-TAIL COVERAGE
300+ distinct MONDO diseases with ≥10 studies

The curve drops sharply after the headline picks, then carries hundreds of additional diseases — depth doesn't end at the chart above.

Even past the top-ranked diseases, sample-label volume stays in the thousands — low-prevalence conditions remain cohort-grade queryable rather than anecdotal.

SAMPLE-LEVEL METADATA GAP

Most GEO samples never report the demographic fields that biomarker and generalizability work depends on.

13.8% of human samples report sex
4.2% report ancestry[3]

The GEO Index surfaces sex and ancestry as filterable facets where they exist, and marks the absence everywhere else — so a cohort selection requiring sex stratification doesn't silently include samples that never reported it.
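
The behavior described above can be sketched as a stratification that gives missing values their own bucket instead of dropping or defaulting them. The record shape, field name, and sample IDs here are assumptions made for illustration, not the GEO Index schema.

```python
# Hypothetical sample records; the "sex" key and its values are assumed names.
samples = [
    {"gsm": "A", "sex": "female"},
    {"gsm": "B", "sex": None},   # field present but never filled in
    {"gsm": "C", "sex": "male"},
    {"gsm": "D"},                # field absent entirely
]


def stratify_by_sex(records):
    """Group sample IDs by reported sex, keeping unreported samples visible."""
    strata = {"female": [], "male": [], "not_reported": []}
    for record in records:
        sex = record.get("sex")
        key = sex if sex in ("female", "male") else "not_reported"
        strata[key].append(record["gsm"])
    return strata


def select_with_sex(records):
    """Cohort for sex-stratified analysis: only samples that reported sex."""
    strata = stratify_by_sex(records)
    return strata["female"] + strata["male"]


print(stratify_by_sex(samples))
print(select_with_sex(samples))
```

Keeping an explicit "not_reported" stratum makes the size of the gap measurable rather than hiding it inside a default value.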

03 · Study design

Filter the corpus by design class, contrast type, and investigation intent — structure raw GEO doesn't expose.

Search GEO for longitudinal studies as a structured DataSet Type and you get zero results; the same query as free text returns 8,067 — an uncalibrated full-text scan over summary prose. The same holds for drug perturbation, case-control, and method validation: design lives in the free-text summary, encoded as semi-structured textual descriptions that can't be queried as data.[2]

Devano runs LLM extraction over every summary and assigns three orthogonal facets — design class, investigation intent, contrast type — and the charts below are counts over those extracted facets. A signature is a grouping of related samples (e.g. all treated arms); a contrast is a meaningful juxtaposition of two signatures (e.g. treated vs. control) — the unit you'd run differential expression on.
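
To make the signature and contrast definitions concrete, here is one way the two units could be modeled. The class names, field names, and sample IDs are hypothetical and chosen for illustration; the actual GEO Index schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A grouping of related samples, e.g. all treated arms."""
    name: str
    samples: tuple[str, ...]


@dataclass(frozen=True)
class Contrast:
    """A juxtaposition of two signatures, the unit for differential expression."""
    contrast_type: str
    test: Signature
    reference: Signature

    def design(self) -> dict[str, str]:
        """Map each sample to its arm, the grouping a DE tool would consume."""
        overlap = set(self.test.samples) & set(self.reference.samples)
        if overlap:
            raise ValueError(f"samples in both arms: {sorted(overlap)}")
        labels = {gsm: self.reference.name for gsm in self.reference.samples}
        labels.update({gsm: self.test.name for gsm in self.test.samples})
        return labels


# Hypothetical sample IDs, for illustration only.
treated = Signature("treated", ("GSM_t1", "GSM_t2"))
control = Signature("control", ("GSM_c1", "GSM_c2"))
contrast = Contrast("treatment_response", test=treated, reference=control)
print(contrast.design())
```

Rejecting overlapping arms is a design choice of this sketch: a sample that sits on both sides of a comparison would make the contrast meaningless.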

Experimental design class · 36,797 enriched studies
Interventional perturbation · 26,294 · 71.5%
Comparative disease · 6,227 · 16.9%
Observational atlas · 2,381 · 6.5%
Longitudinal process · 1,199 · 3.3%
Methods benchmark · 687 · 1.9%

The corpus is overwhelmingly perturbation biology. About 71.5% of structurally enriched studies apply a treatment, knockout, exposure, or other intervention rather than passively profiling a disease.

Investigation intent · 36,797 studies · summary-read
Disease-focused · 22,078 · 60.0%
Reference building · 11,039 · 30.0%
Method validation · 3,680 · 10.0%

Devano reads the study summary to determine intent, not just title keywords — disease-focused dominates because the corpus is heavily clinical-translational.

Contrast type · 120,589 contrasts
Treatment response · 71,283 · 59.1%
Other · 17,472 · 14.5%
Case-control · 16,562 · 13.7%
Temporal · 12,550 · 10.4%
Spatial · 1,979 · 1.6%
Dose-response · 743 · 0.6%

Most contrasts compare a drug, perturbation, or exposure against a baseline arm. Color: clinical (treatment, case-control), biology (temporal, spatial), exposure (dose-response).

04 · Evidence quality & provenance

Every label, audit-traceable.

Anyone can publish clean-looking labels; few can show you why each one was assigned. A 2023 assessment of 1,233 in vivo toxicology datasets in GEO found none met the MIATE minimum reporting standard.[4]

Every Devano label keeps its full provenance chain — raw characteristics blob, extracted tags, signature reasoning paragraph, and the ontology-mapped call — graded direct, inferred, plausible, or tenuous. The 99.9% direct-or-inferred figure below is auditable per-sample, not just asserted.
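
One possible shape for such a provenance chain, sketched as a record with the four stages and the four-state grade. The class and field names are assumptions for illustration; the example values are taken from the GSM7423891 trace shown on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    DIRECT = "direct"
    INFERRED = "inferred"
    PLAUSIBLE = "plausible"
    TENUOUS = "tenuous"


@dataclass
class ProvenanceRecord:
    """Hypothetical container for one label and the chain that produced it."""
    gsm: str
    raw_characteristics: dict  # stage 1: text as submitted
    tags: dict                 # stage 2: extracted tag values
    reasoning: str             # stage 3: signature reasoning paragraph
    label: dict                # stage 4: ontology-mapped call
    evidence: Evidence

    def passes_quality_floor(self) -> bool:
        """True when the label is graded direct or inferred."""
        return self.evidence in (Evidence.DIRECT, Evidence.INFERRED)

    def trace(self) -> list:
        """The chain in audit order, from raw text to final label."""
        return [
            ("raw", self.raw_characteristics),
            ("tagged", self.tags),
            ("reasoned", self.reasoning),
            ("extracted", self.label),
        ]


record = ProvenanceRecord(
    gsm="GSM7423891",
    raw_characteristics={"cell line": "H446", "treatment": "JIB-04"},
    tags={"cell_line": "H446", "treatment": "JIB-04"},
    reasoning="treatment varies between vehicle and JIB-04",
    label={"treatment_type": "drug", "treatment_name": "jib-04"},
    evidence=Evidence.DIRECT,
)
print([stage for stage, _ in record.trace()], record.passes_quality_floor())
```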

QUALITY FLOOR
99.9%

of labels are direct or inferred evidence.

The rubric is a near-binary quality gate. The pipeline suppresses low-confidence labels by design — <0.1% of all labels are tagged plausible or tenuous.

direct · 1,603,673 · 71.8%
inferred · 627,927 · 28.1%
plausible · 1,259 · 0.1%
tenuous · 334 · 0.0%

1.60M direct · 627.9k inferred · 1,259 plausible · 334 tenuous
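
The percentages follow directly from the four label counts. A short check, using only the counts reported above (the variable names are illustrative):

```python
# Label counts per evidence grade, as reported on this page.
counts = {"direct": 1_603_673, "inferred": 627_927, "plausible": 1_259, "tenuous": 334}

total = sum(counts.values())

# Share of each grade, in percent, rounded to one decimal place.
shares = {grade: round(100 * n / total, 1) for grade, n in counts.items()}

# Quality floor: direct plus inferred as a share of all labels.
floor = round(100 * (counts["direct"] + counts["inferred"]) / total, 1)

# Residual: plausible plus tenuous, unrounded, to compare against the <0.1% claim.
residual = 100 * (counts["plausible"] + counts["tenuous"]) / total

print(total, shares, floor, residual < 0.1)
```

The four counts sum to 2,233,193 labels, consistent with the 2.23M sample-label headline.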

Provenance trace · GSM7423891 · JIB-04 H446 xenograft, day 21 · pipeline geo-index-v3.5
01 · RAW · characteristics_ch1
cell line: H446
cell type: SCLC
xenograft: yes
treatment: JIB-04
time: 3x per week
02 · TAGGED · tag_values
cell_line: "H446"
disease_text: "SCLC"
model: "xenograft"
treatment: "JIB-04"
timing: "3x/week"
03 · REASONED · signature reasoning

"Across this signature the cell_line and model tags are constant (H446 SCLC xenograft) while treatment varies between vehicle and JIB-04 at matched timepoints — a clean drug-response arm…"

04 · EXTRACTED · label
material: h446 SCLC xenograft
biosample: BTO_0002206
disease: MONDO_0008433
treatment_type: drug
treatment_name: jib-04
evidence: direct
Every sample card in the GEO Index opens this view. Click any sample → see the raw text on the left, the ontology-mapped label on the right, the reasoning in between.

05 · Perturbations & treatments

Every drug, gene, and exposure tagged at signature level.

The corpus is a perturbation atlas, not a disease atlas.

In raw GEO, the treatment field is free text: the same compound appears under a dozen names across cohorts, and dose and timepoint are not structured anywhere. Normalize those tags and the queries change shape: signature reversal, the technique behind the Connectivity Map's 1.3M-profile compendium[5], can then rank candidate compounds whose expression signatures oppose a disease signature.

Devano extracts perturbation mode, treatment name, dose, and timepoint per contrast arm — the facets the charts below are built on.
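
A toy illustration of signature reversal over normalized perturbation signatures. This sketch swaps in plain cosine similarity over shared genes as the score; it is not the Connectivity Map's scoring method and not Devano's. Gene names, compound names, and fold-change values are invented.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity over the genes present in both signatures."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = math.sqrt(sum(a[g] ** 2 for g in shared))
    norm_b = math.sqrt(sum(b[g] ** 2 for g in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_reversers(disease: dict[str, float], perturbations: dict[str, dict[str, float]]):
    """Sort perturbations from most opposing (score near -1) to most similar."""
    scores = {name: cosine(disease, sig) for name, sig in perturbations.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Invented log fold changes: disease vs. healthy, and treated vs. vehicle.
disease = {"G1": 2.0, "G2": -1.0, "G3": 1.5}
perturbations = {
    "compound_A": {"G1": -2.0, "G2": 1.0, "G3": -1.5},  # mirrors the disease signature
    "compound_B": {"G1": 1.0, "G2": -0.5, "G3": 1.0},   # mimics the disease signature
    "compound_C": {"G1": 0.5, "G2": 1.0, "G3": -0.5},   # partially opposes it
}

for name, score in rank_reversers(disease, perturbations):
    print(f"{name}: {score:+.3f}")
```

The ranking only makes sense when the same compound carries the same name in every study, which is why name normalization comes first.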

Perturbation mode · study tags · non-exclusive
Genetic · 14,820
Chemical · 10,996
Biological · 5,646
Environmental · 2,267
None · 1,245

Studies with an active perturbation arm. Multi-value: one study can carry several tags.

Dose × time sophistication · explicit dose-response contrasts
743 dose-response contrasts across 312 studies
ENTRECTINIB · NGN2 neurons · concentration sweep · 10 pts
BAY 2506856 · 3 doses × 4 timepoints · 12 pts
PETROLEUM · 3 doses × 9 cell types · 27 pts
HYPOXIA · O₂ concentration gradient · 6 pts
RADIATION · cumulative Gy · 5 pts
treatment_type vocabulary · 6 categories · signature-level
Drug · 31,418 sigs · Decitabine, Paclitaxel, Cisplatin, Palbociclib, Vorinostat, Azacytidine, Imatinib
Gene perturbation · 14,820 sigs · MECOM KO, EZH2 KO, NUP98-HOXA9 OE, CHD8 shRNA, CRISPR sgEED
Biologic · 5,646 sigs · Mepolizumab, Anti-PD-1, Dupilumab, Secukinumab, Anti-CD19 CAR-T, IL-33
Chemical exposure · 2,267 sigs · Hypoxia, Smoking, Radiation, Heat shock, Particulate matter, Altitude
Vehicle · 3,180 sigs · DMSO, PBS, Placebo, Scrambled siRNA, Empty vector
None · 1,245 sigs · Wildtype, Pre-treatment, Untreated baseline, Healthy donor
DEEPLY REPLICATED PERTURBATIONS
MECOM · 5+ studies · HSC/AML
Mepolizumab · 4+ RCTs · asthma
Decitabine · paired DNMT-inhibitor · AML
Anti-PD-1 · pre/post biopsies · BCC/SCC
NUP98 · 4+ leukemia studies

06 · Tissues & technology

Tissues and assays, unified.

Looking for "blood-derived" or "stem cell origin" studies in GEO? These concepts are hard to find in the raw data, where tissue names are unnormalized strings. Devano normalizes tissue names against BTO, CL, and UBERON and rolls them up to families, so the distributions are ontology-resolved counts rather than string matches.

Epigenomic co-assays still surface via text-matching today, with a first-class assay_type bucket scheduled for pipeline v3.6.

Top tissues / biosamples · validated sample-labels · top 12
Whole blood · 31,060
PBMC · 15,623
Peripheral blood mononuclear cells (from blood) · 14,596
Cultured mouse embryonic stem cells · 13,986
Human pancreatic islet cells · 12,556
Synthesized DNA · 11,311
Peripheral blood mononuclear cells · 10,421
MCF7 breast cancer cell line · 9,228
Lymphoblastoid cell line · 8,049
Single LCL-derived iPSCs (Yoruba) · 7,584
CD8+ T cells from human skin · 7,312
CD4+ T cells from PBMC · 6,415

Tissue counts are over validated GEO sample submissions, not over the ARCHS4-quantified subset — some rows (e.g. cultured mouse embryonic stem cells) appear here as metadata-only entries.

BROADER FILTER COVERAGE · samples matching the tissue family
Blood (all) · 364,836
T cell · 240,626
Stem cell · 162,159
Breast · 110,014
Bone marrow · 69,950
Lung · 63,624
Brain · 57,305
Skin · 60,215
Liver · 39,987
iPSC · 39,941
Assay mix · 36,797 studies
Bulk RNA-seq · 26,494 · 72.0%
Single-cell RNA-seq · 8,095 · 22.0%
Other · 2,208 · 6.0%

Both bulk and scRNA-seq are gene-level — single-cell samples are pseudobulk-equivalent in this corpus; cell-level resolution (UMAPs, cluster labels) lives in supplementary files we don't ingest.
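
For readers unfamiliar with the term, a pseudobulk profile collapses per-cell counts into one gene-level vector per sample. The sketch below sums raw counts, a common convention; the page does not state how the corpus aggregates, so the summation and all names here are assumptions.

```python
from collections import Counter


def pseudobulk(cells):
    """Sum per-cell gene counts into one gene-level profile per sample.

    cells: iterable of (sample_id, {gene: count}) pairs, one pair per cell.
    """
    totals = {}
    for sample_id, counts in cells:
        # Counter.update adds counts gene by gene instead of overwriting them.
        totals.setdefault(sample_id, Counter()).update(counts)
    return totals


# Invented cells: two from sample S1, one from sample S2.
cells = [
    ("S1", {"GENE_A": 3, "GENE_B": 1}),
    ("S1", {"GENE_A": 2, "GENE_C": 4}),
    ("S2", {"GENE_B": 5}),
]

profiles = pseudobulk(cells)
print({sample: dict(profile) for sample, profile in profiles.items()})
```

After aggregation the cell dimension is gone, which is why cluster labels and UMAP coordinates cannot be recovered from a gene-level matrix.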

Beyond transcriptomics · epigenomic co-assays · text-matched
30,592 studies mention ChIP-seq
30,463 studies mention ATAC-seq
8,914 studies mention WGBS / methylation

ROADMAP · These don't yet have a dedicated assay_type bucket; they're currently surfaced via free-text and relevance_tags. First-class epigenomics support lands in pipeline v3.6.

WHERE THIS FITS

Single-cell atlases (CellxGene, HCA, Tabula Sapiens) are the right home for cross-tissue cell-type reference work. The GEO Index is the complement — 20 years of human bulk and gene-level scRNA-seq across clinical-trial cohorts, perturbation screens, and toxicology that sc-atlases don't ingest and that legacy GEO can't query.

07 · Built for agents

Built to scan. Built to drill.

Most data products force agents into one resolution — a 5-line summary or a 50-page dump. The Devano MCP surface lets an agent orient in a couple hundred tokens, search in fifty tokens per result, open any study in a few hundred more, and trace any label back to the raw submission text. Breadth and provenance from the same handful of calls.

Claude · Devano MCP · geo-index

One MCP. Four resolutions. Orient → search → open → audit.

01 · Orient · corpus_overview · ~200 tokens · What's in here?
02 · Search · search_studies · ~50 / result · Find what fits.
03 · Open · study_detail · ~500+ / record · Inspect a study.
04 · Audit · sample_evidence · raw → label trace · Show me why.

Four storage layers feed every answer — raw GEO submissions, AI-extracted structure, validated ontology-mapped calls, and pre-computed indices. Progressive disclosure by design: an agent pulls only the resolution it needs.
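
A sketch of the orient → search → open → audit flow against an in-memory stand-in. Only the four tool names come from this page; the arguments, return shapes, and demo identifiers are assumptions, and a real agent would issue these as MCP tool calls rather than Python method calls.

```python
class StubGeoIndex:
    """In-memory stand-in; real arguments and payloads may differ."""

    def __init__(self):
        self.calls = []  # records which resolutions were actually pulled
        self._studies = {
            "GSE_demo": {"title": "drug X in cell line Y", "samples": ["GSM_demo_1"]},
        }
        self._evidence = {
            "GSM_demo_1": {"raw": "treatment: drug X", "label": "drug", "evidence": "direct"},
        }

    def corpus_overview(self):
        self.calls.append("corpus_overview")
        return {"studies": len(self._studies)}

    def search_studies(self, query):
        self.calls.append("search_studies")
        needle = query.lower()
        return [sid for sid, study in self._studies.items() if needle in study["title"].lower()]

    def study_detail(self, study_id):
        self.calls.append("study_detail")
        return self._studies[study_id]

    def sample_evidence(self, gsm):
        self.calls.append("sample_evidence")
        return self._evidence[gsm]


def drill(client, query):
    """Walk the four resolutions, stopping early when the search finds nothing."""
    client.corpus_overview()                                # 01 orient
    hits = client.search_studies(query)                     # 02 search
    if not hits:
        return None
    detail = client.study_detail(hits[0])                   # 03 open
    return client.sample_evidence(detail["samples"][0])     # 04 audit


client = StubGeoIndex()
print(drill(client, "drug x"), client.calls)
```

The early return is the point of progressive disclosure: a query with no hits costs two cheap calls and never touches the larger records.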

METHODOLOGY

Headline counts are emitted by corpus_overview against the three-layer DuckDB build at pipeline geo-index-v3.5, snapshot 2026-04-30. Numbers refresh after each pipeline rebuild, typically monthly. Series are classified by dominant organism; Homo sapiens (Hs) series may contain mixed-organism samples for xenograft and comparative designs.

SOURCES
  1. The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository. Genome Biology, 2025. doi:10.1186/s13059-025-03725-0
  2. Hu & Wang. Mining data and metadata from the gene expression omnibus. Biophysical Reviews 10(6), 2018. doi:10.1007/s12551-018-0490-8
  3. Lim, Tesar, Belmadani, et al. Curation of over 10,000 transcriptomic studies to enable data reuse. Database, 2021. doi:10.1093/database/baab006 · Gemma curation; sex-label discrepancies and sample-field completeness statistics.
  4. Nault R, Cave MC, Ludewig G, Moseley HNB, Pennell KG, Zacharewski T. A Case for Accelerating Standards to Achieve the FAIR Principles of Environmental Health Research Experimental Data. Environmental Health Perspectives 131(6):065001, 2023. doi:10.1289/EHP11484 · Assessment of 1,233 in vivo GEO toxicology datasets against the MIATE/invivo minimum-reporting standard; reference repository at github.com/zacharewskilab/MIATE.
  5. Subramanian, Narayan, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171(6):1437–1452, 2017. doi:10.1016/j.cell.2017.10.049