CORPUS OVERVIEW

The shape of the GEO Index.

The GEO Index is a structured, query-ready view of the human transcriptomics corpus — metadata enriched across every series, and ARCHS4-anchored expression for every Homo sapiens study (bulk and single-cell, gene-level). Find a contrast and the expression matrix is there in the same call.

Studies

42,559

36,797 validated · 525 quarantined · ~5,237 pending re-ingest

Samples (GSM)

1.24M

validated, from 1.45M raw GEO samples

Sample-labels

2.23M

structured per-arm labels · +53% amplification over GSMs

Structured contrasts

120,589

~3.3 contrasts per validated study · the unit raw GEO doesn't have

Snapshot from corpus_overview · pipeline v3.5 · April 30, 2026.

THE PROBLEM

GEO has the data. What it lacks is structure — disease names live in free-text characteristics, study design is buried in curator notes, and there is no native concept of a case-control contrast.

2025 GENOME BIOLOGY AUDIT · 61,312 GEO SERIES

11.5% share required metadata in full

37.9% share less than 40% of phenotypic fields[1]

Devano fills those gaps: every contrast typed, every disease ontology-grounded, every label graded against a four-state evidence rubric traceable to raw sample characteristics.

02 · Diseases & therapeutic areas

300+ MONDO-grounded diseases, queryable at cohort scale.

Disease names in raw GEO are free-text strings — "breast cancer" and "BC" are different strings with no shared identity, and "invasive ductal carcinoma" looks unrelated to either unless you already know it's a breast-cancer subtype. The long tail of less-common conditions stays invisible to anyone without a curated thesaurus.

WHAT WE DO

Devano resolves every disease mention to MONDO, collapses lexical aliases (“BC” → “breast cancer”), and uses the ontology hierarchy — so a query for breast cancer surfaces every IDC, ILC, and DCIS study in the corpus.
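
The alias collapse and hierarchy expansion described above can be sketched in a few lines. This is an illustration only, not Devano's implementation: the term IDs are placeholders rather than real MONDO identifiers, and the alias table, the parent map, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Toy hierarchy: child term -> parent term.
# These IDs are placeholders, NOT real MONDO identifiers.
PARENT = {
    "TOY:idc": "TOY:breast_cancer",
    "TOY:ilc": "TOY:breast_cancer",
    "TOY:dcis": "TOY:breast_cancer",
    "TOY:breast_cancer": "TOY:cancer",
}

# Lexical aliases collapse to one canonical term (keys are lower-cased).
ALIASES = {
    "bc": "TOY:breast_cancer",
    "breast cancer": "TOY:breast_cancer",
    "invasive ductal carcinoma": "TOY:idc",
    "idc": "TOY:idc",
    "ilc": "TOY:ilc",
    "dcis": "TOY:dcis",
}


def resolve(mention):
    """Map a free-text disease mention to a canonical term, or None."""
    return ALIASES.get(mention.strip().lower())


def descendants(term):
    """Return the term itself plus every term below it in the hierarchy."""
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def search(studies, query):
    """Return IDs of studies whose disease term falls under the query term."""
    root = resolve(query)
    if root is None:
        return []
    wanted = descendants(root)
    return sorted(sid for sid, mention in studies.items() if resolve(mention) in wanted)


# Hypothetical study records: study ID -> raw disease string.
STUDIES = {
    "study_a": "BC",
    "study_b": "Invasive ductal carcinoma",
    "study_c": "DCIS",
    "study_d": "psoriasis",
}

hits = search(STUDIES, "breast cancer")
```

A query for the parent term returns the alias match and both subtype studies, while the unrelated study is left out.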

The corpus isn't oncology-only — neuro, infectious disease, GI, and autoimmune studies make up the long tail and stay queryable as named MONDO branches, not lost in an “Other” bucket.

Top diseases by sample-label volume
source · validated_sample_labels (deduped, neuro breadth reserved)

Breast cancer · 28,348
Colorectal cancer · 19,832
Type 2 diabetes · 18,326
Glioblastoma · 14,968
Prostate cancer · 14,036
Acute myeloid leukemia · 13,950
Melanoma · 10,629
Non-small cell lung cancer · 9,185
Hepatocellular carcinoma · 8,963
Neuroblastoma · 8,734
Pancreatic ductal adenocarcinoma · 6,496
Severe dengue · 6,262
COVID-19 · 6,249
Crohn's disease · 5,713
Psoriasis · 5,695

Color key: Oncology · Metabolic · Neuro · Autoimmune · Infectious · Other

Sample-label hits per MONDO-grounded disease. A single GSM can appear under multiple diseases when contrast arms span comorbidities.

LONG-TAIL COVERAGE

300+ distinct MONDO diseases with ≥10 studies

The curve drops sharply after the headline picks, then carries hundreds of additional diseases — depth doesn't end at the chart above.

Even past the top-ranked diseases, sample-label volume stays in the thousands — low-prevalence conditions remain cohort-grade queryable rather than anecdotal.

SAMPLE-LEVEL METADATA GAP

Most GEO samples never report the demographic fields that biomarker and generalizability work depends on first.

13.8% of human samples report sex

4.2% report ancestry[3]

The GEO Index surfaces sex and ancestry as filterable facets where they exist, and marks the absence everywhere else — so a cohort selection requiring sex stratification doesn't silently include samples that never reported it.
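
A minimal sketch of what "marking the absence" means for cohort selection: a missing value is kept as an explicit `None`, and a filter that requires a field drops those samples visibly instead of passing them through. The `Sample` class, its field names, and the accessions below are hypothetical and are not the GEO Index schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    gsm: str
    sex: Optional[str] = None       # None means the submitter never reported it
    ancestry: Optional[str] = None  # None means the submitter never reported it


def select_stratifiable(samples, require=("sex",)):
    """Split samples into those reporting every required field and those that do not."""
    kept, dropped = [], []
    for sample in samples:
        if all(getattr(sample, name) is not None for name in require):
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


# Placeholder accessions for illustration only.
cohort = [
    Sample("GSM_A", sex="female"),
    Sample("GSM_B"),
    Sample("GSM_C", sex="male", ancestry="reported"),
]

kept, dropped = select_stratifiable(cohort, require=("sex",))
```

Returning the dropped samples alongside the kept ones makes the size of the gap visible to whoever builds the cohort.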

03 · Study design

Filter the corpus by design class, contrast type, and investigation intent — structure raw GEO doesn't expose.

Search GEO for longitudinal studies as a structured DataSet Type and you get zero results; the same query as free text returns 8,067 — an uncalibrated full-text scan over summary prose. The same holds for drug perturbation, case-control, and method validation: design lives in the free-text summary, encoded as semi-structured textual descriptions that can't be queried as data.[2]

Devano runs LLM extraction over every summary and assigns three orthogonal facets — design class, investigation intent, contrast type — and the charts below are counts over those extracted facets. A signature is a grouping of related samples (e.g. all treated arms); a contrast is a meaningful juxtaposition of two signatures (e.g. treated vs. control) — the unit you'd run differential expression on.
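
The signature and contrast definitions above map naturally onto two small record types. The sketch below is one possible shape, with class names, field names, and sample IDs chosen for illustration; it is not the GEO Index data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A grouping of related samples, e.g. all treated arms."""
    name: str
    samples: tuple  # sample accessions belonging to this arm


@dataclass(frozen=True)
class Contrast:
    """A juxtaposition of two signatures: the unit for differential expression."""
    test: Signature
    reference: Signature
    contrast_type: str  # e.g. "treatment_response" (label spelling is an assumption)

    def __post_init__(self):
        # A sample cannot sit on both sides of the same comparison.
        overlap = set(self.test.samples) & set(self.reference.samples)
        if overlap:
            raise ValueError(f"samples appear in both arms: {sorted(overlap)}")


# Placeholder sample IDs for illustration only.
treated = Signature("treated", ("s1", "s2"))
control = Signature("control", ("s3", "s4"))
contrast = Contrast(test=treated, reference=control, contrast_type="treatment_response")
```

Rejecting overlapping arms at construction time is a design choice of this sketch; the source does not say how such cases are handled.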

Experimental design class · 36,797 enriched studies

Interventional perturbation · 26,294 · 71.5%
Comparative disease · 6,227 · 16.9%
Observational atlas · 2,381 · 6.5%
Longitudinal process · 1,199 · 3.3%
Methods benchmark · 687 · 1.9%

The corpus is overwhelmingly perturbation biology. 71.5% of structurally enriched studies apply a treatment, knockout, exposure, or other intervention rather than passively profiling disease.

Investigation intent · 36,797 studies · summary-read

Disease-focused · 22,078 · 60.0%
Reference building · 11,039 · 30.0%
Method validation · 3,680 · 10.0%

Devano reads the study summary to determine intent, not just title keywords — disease-focused dominates because the corpus is heavily clinical-translational.

Contrast type · 120,589 contrasts

Treatment response · 71,283 · 59.1%
Other · 17,472 · 14.5%
Case-control · 16,562 · 13.7%
Temporal · 12,550 · 10.4%
Spatial · 1,979 · 1.6%
Dose-response · 743 · 0.6%

Most contrasts compare a drug, perturbation, or exposure against a baseline arm. Color: clinical (treatment, case-control), biology (temporal, spatial), exposure (dose-response).

04 · Evidence quality & provenance

Every label, audit-traceable.

Anyone can publish clean-looking labels; few can show you why each one was assigned. A 2023 assessment of 1,233 in vivo toxicology datasets in GEO found none met the MIATE minimum reporting standard.[4]

Every Devano label keeps its full provenance chain — raw characteristics blob, extracted tags, signature reasoning paragraph, and the ontology-mapped call — graded direct, inferred, plausible, or tenuous. The 99.9% direct-or-inferred figure below is auditable per-sample, not just asserted.

QUALITY FLOOR

99.9%

of labels are direct or inferred evidence.

The rubric is a near-binary quality gate. The pipeline suppresses low-confidence labels by design — <0.1% of all labels are tagged plausible or tenuous.

direct · 1,603,673 · 71.8%

inferred · 627,927 · 28.1%

plausible · 1,259 · 0.1%

tenuous · 334 · 0.0%

1.60M direct · 627.9k inferred · 1,259 plausible · 334 tenuous
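
The quality-floor figure follows directly from the four counts. The sketch below recomputes it and shows one way a near-binary gate could be expressed; the grade ordering and the function names are assumptions for illustration, not the pipeline's actual logic.

```python
# Counts copied from the evidence breakdown above.
EVIDENCE_COUNTS = {
    "direct": 1_603_673,
    "inferred": 627_927,
    "plausible": 1_259,
    "tenuous": 334,
}

# Assumption: grades run from strongest to weakest in the order listed.
GRADE_RANK = {"direct": 0, "inferred": 1, "plausible": 2, "tenuous": 3}


def passes_gate(grade, weakest_allowed="inferred"):
    """True when a label's grade is at least as strong as the cutoff."""
    return GRADE_RANK[grade] <= GRADE_RANK[weakest_allowed]


def share_passing(counts, weakest_allowed="inferred"):
    """Percentage of labels whose grade passes the gate."""
    total = sum(counts.values())
    kept = sum(n for grade, n in counts.items() if passes_gate(grade, weakest_allowed))
    return 100 * kept / total


total_labels = sum(EVIDENCE_COUNTS.values())
floor = share_passing(EVIDENCE_COUNTS)
```

With these counts the total is 2,233,193 labels, and the direct-or-inferred share rounds to 99.9%.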

Provenance trace · GSM7423891: JIB-04 H446 xenograft, day 21
pipeline · geo-index-v3.5

01 · RAW · characteristics_ch1

cell line: H446
cell type: SCLC
xenograft: yes
treatment: JIB-04
time: 3x per week

02 · TAGGED · tag_values

cell_line: "H446"
disease_text: "SCLC"
model: "xenograft"
treatment: "JIB-04"
timing: "3x/week"

03 · REASONED · signature reasoning

"Across this signature the cell_line and model tags are constant (H446 SCLC xenograft) while treatment varies between vehicle and JIB-04 at matched timepoints — a clean drug-response arm…"

04 · EXTRACTED · label

material: h446 SCLC xenograft
biosample: BTO_0002206
disease: MONDO_0008433
treatment_type: drug
treatment_name: jib-04
evidence: direct

Every sample card in the GEO Index opens this view. Click any sample → see the raw text on the left, the ontology-mapped label on the right, the reasoning in between.
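
The trace above can be held as one nested record and walked from any label field back to the raw text. The values below are copied from the example trace; the dictionary layout, the `SUPPORT` link table, and the `explain` helper are hypothetical and only illustrate the idea of a per-field audit path.

```python
TRACE = {
    "gsm": "GSM7423891",
    "raw": {  # 01 · RAW
        "cell line": "H446",
        "cell type": "SCLC",
        "xenograft": "yes",
        "treatment": "JIB-04",
        "time": "3x per week",
    },
    "tagged": {  # 02 · TAGGED
        "cell_line": "H446",
        "disease_text": "SCLC",
        "model": "xenograft",
        "treatment": "JIB-04",
        "timing": "3x/week",
    },
    "label": {  # 04 · EXTRACTED
        "material": "h446 SCLC xenograft",
        "biosample": "BTO_0002206",
        "disease": "MONDO_0008433",
        "treatment_type": "drug",
        "treatment_name": "jib-04",
        "evidence": "direct",
    },
}

# Hypothetical links: label field -> (supporting tag key, raw characteristics key).
SUPPORT = {
    "treatment_name": ("treatment", "treatment"),
    "disease": ("disease_text", "cell type"),
}


def explain(trace, label_field):
    """Return the label value together with the tag and raw text behind it."""
    tag_key, raw_key = SUPPORT[label_field]
    return {
        "label": trace["label"][label_field],
        "tag": trace["tagged"][tag_key],
        "raw": trace["raw"][raw_key],
        "evidence": trace["label"]["evidence"],
    }


disease_path = explain(TRACE, "disease")
```

Here `disease_path` links the ontology call back to the `SCLC` string in the submitter's own characteristics.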

05 · Perturbations & treatments

Every drug, gene, and exposure tagged at signature level.

The corpus is a perturbation atlas, not a disease atlas.

In raw GEO, the treatment field is free text — the same compound appears under a dozen names across cohorts, and dose and timepoint live nowhere structured at all. Normalize those tags and the queries change shape: signature-reversal — the technique behind the Connectivity Map's 1.3M-profile compendium[5] — can surface novel inhibitors directly.
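
To make the signature-reversal idea concrete: a perturbation is a candidate reverser when its expression changes run opposite to the disease signature. The sketch below scores that with a plain Pearson correlation, which is a simplified stand-in and not the Connectivity Map's own connectivity score. Gene names, compound names, and values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_reversers(disease_sig, drug_sigs):
    """Rank perturbations from most reversing (most negative score) upward."""
    genes = sorted(disease_sig)
    disease = [disease_sig[g] for g in genes]
    scores = {
        name: pearson(disease, [sig[g] for g in genes])
        for name, sig in drug_sigs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Toy log fold-changes; every name and number here is made up.
disease_sig = {"gene_a": 2.0, "gene_b": -1.5, "gene_c": 0.5, "gene_d": -1.0}
drug_sigs = {
    "compound_x": {"gene_a": -1.0, "gene_b": 0.75, "gene_c": -0.25, "gene_d": 0.5},
    "compound_y": {"gene_a": 1.0, "gene_b": -0.5, "gene_c": 0.2, "gene_d": -0.8},
}

ranked = rank_reversers(disease_sig, drug_sigs)
```

The compound whose changes mirror the disease signature ranks first; the one that mimics the disease ranks last. A query like this only works once treatment names are normalized, since the same compound must resolve to one key.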

Devano extracts perturbation mode, treatment name, dose, and timepoint per contrast arm — the facets the charts below are built on.

Perturbation mode · study tags · non-exclusive

Genetic · 14,820
Chemical · 10,996
Biological · 5,646
Environmental · 2,267
None · 1,245

Studies with an active perturbation arm. Multi-value: one study can carry several tags.

Dose × time sophistication · explicit dose-response contrasts

743 dose-response contrasts across 312 studies

ENTRECTINIB · NGN2 neurons · concentration sweep · 10 pts
BAY 2506856 · 3 doses × 4 timepoints · 12 pts
PETROLEUM · 3 doses × 9 cell types · 27 pts
HYPOXIA · O₂ concentration gradient · 6 pts
RADIATION · cumulative Gy · 5 pts

treatment_type vocabulary · 6 categories · signature-level

Drug · 31,418 sigs
Decitabine, Paclitaxel, Cisplatin, Palbociclib, Vorinostat, Azacytidine, Imatinib

Gene perturbation · 14,820 sigs
MECOM KO, EZH2 KO, NUP98-HOXA9 OE, CHD8 shRNA, CRISPR sgEED

Biologic · 5,646 sigs
Mepolizumab, Anti-PD-1, Dupilumab, Secukinumab, Anti-CD19 CAR-T, IL-33

Chemical exposure · 2,267 sigs
Hypoxia, Smoking, Radiation, Heat shock, Particulate matter, Altitude

Vehicle · 3,180 sigs
DMSO, PBS, Placebo, Scrambled siRNA, Empty vector

None · 1,245 sigs
Wildtype, Pre-treatment, Untreated baseline, Healthy donor

DEEPLY REPLICATED PERTURBATIONS

MECOM · 5+ studies · HSC/AML
Mepolizumab · 4+ RCTs · asthma
Decitabine · paired DNMT-inhibitor · AML
Anti-PD-1 · pre/post biopsies · BCC/SCC
NUP98 · 4+ leukemia studies

06 · Tissues & technology

Tissues and assays, unified.

Looking for "blood-derived" or "stem cell origin" studies in GEO? These concepts aren't easy to find when wading through the raw data. Devano normalizes tissue names against BTO, CL, and UBERON and rolls them up to families, so the distributions are ontology-resolved counts rather than string matches.
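
A rough sketch of the roll-up step: each normalized term is walked up to its ancestors, and a sample counts once toward every family it reaches. The parent links and family names below are illustrative assumptions and do not reproduce actual BTO, CL, or UBERON relations.

```python
from collections import Counter

# Illustrative parent links (term -> parents); NOT real ontology edges.
PARENTS = {
    "CD4+ T cells from PBMC": ["T cell", "PBMC"],
    "CD8+ T cells from human skin": ["T cell", "Skin"],
    "PBMC": ["Blood"],
    "Whole blood": ["Blood"],
}

# Terms treated as reportable families in this sketch.
FAMILIES = {"Blood", "T cell", "Skin"}


def families_of(term):
    """Collect every family reachable from a term by following parent links."""
    seen, hits, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in FAMILIES:
            hits.add(current)
        stack.extend(PARENTS.get(current, []))
    return hits


def family_counts(sample_terms):
    """Count samples per family; one sample may count toward several families."""
    counts = Counter()
    for term in sample_terms:
        counts.update(families_of(term))
    return counts


counts = family_counts([
    "Whole blood",
    "PBMC",
    "CD4+ T cells from PBMC",
    "CD8+ T cells from human skin",
])
```

Because a term can have several parents, family totals overlap by design: the PBMC-derived T-cell sample counts under both "Blood" and "T cell".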

Epigenomic co-assays still surface via text-matching today, with a first-class assay_type bucket scheduled for pipeline v3.6.

Top tissues / biosamples · validated sample-labels · top 12

Whole blood · 31,060
PBMC · 15,623
Peripheral blood mononuclear cells (from blood) · 14,596
Cultured mouse embryonic stem cells · 13,986
Human pancreatic islet cells · 12,556
Synthesized DNA · 11,311
Peripheral blood mononuclear cells · 10,421
MCF7 breast cancer cell line · 9,228
Lymphoblastoid cell line · 8,049
Single LCL-derived iPSCs (Yoruba) · 7,584
CD8+ T cells from human skin · 7,312
CD4+ T cells from PBMC · 6,415

Tissue counts are over validated GEO sample submissions, not over the ARCHS4-quantified subset — some rows (e.g. cultured mouse embryonic stem cells) appear here as metadata-only entries.

BROADER FILTER COVERAGE — samples matching the tissue family

Blood (all) · 364,836
T cell · 240,626
Stem cell · 162,159
Breast · 110,014
Bone marrow · 69,950
Lung · 63,624
Brain · 57,305
Skin · 60,215
Liver · 39,987
iPSC · 39,941

Assay mix · 36,797 studies

Bulk RNA-seq · 26,494 · 72.0%
Single-cell RNA-seq · 8,095 · 22.0%
Other · 2,208 · 6.0%

Both bulk and scRNA-seq are gene-level — single-cell samples are pseudobulk-equivalent in this corpus; cell-level resolution (UMAPs, cluster labels) lives in supplementary files we don't ingest.
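
For readers unfamiliar with the term, pseudobulk means collapsing per-cell counts into one gene-level profile per sample. The source does not say how the corpus values are produced, so the summation below is an assumption based on the common convention, and all cell, sample, and gene names are placeholders.

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_to_sample):
    """Sum per-cell gene counts into one gene-level profile per sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell, genes in cell_counts.items():
        sample = cell_to_sample[cell]
        for gene, count in genes.items():
            totals[sample][gene] += count
    # Convert nested defaultdicts to plain dicts for a clean return value.
    return {sample: dict(genes) for sample, genes in totals.items()}


# Placeholder cells, samples, and genes for illustration only.
cell_counts = {
    "cell1": {"GENE1": 2, "GENE2": 0},
    "cell2": {"GENE1": 3, "GENE2": 1},
    "cell3": {"GENE1": 1},
}
cell_to_sample = {"cell1": "sampleA", "cell2": "sampleA", "cell3": "sampleB"}

profiles = pseudobulk(cell_counts, cell_to_sample)
```

The result has the same shape as a bulk sample-by-gene matrix, which is why cell-level structure such as cluster labels cannot be recovered from it.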

Beyond transcriptomics · epigenomic co-assays · text-matched

30,592 studies mention ChIP-seq
30,463 studies mention ATAC-seq
8,914 studies mention WGBS / methylation

ROADMAP

These don't yet have a dedicated assay_type bucket; they're currently surfaced via free-text and relevance_tags. First-class epigenomics support lands in pipeline v3.6.

WHERE THIS FITS

Single-cell atlases (CellxGene, HCA, Tabula Sapiens) are the right home for cross-tissue cell-type reference work. The GEO Index is the complement — 20 years of human bulk and gene-level scRNA-seq across clinical-trial cohorts, perturbation screens, and toxicology that sc-atlases don't ingest and that legacy GEO can't query.

07 · Built for agents

Built to scan. Built to drill.

Most data products force agents into one resolution — a 5-line summary or a 50-page dump. The Devano MCP surface lets an agent orient in a couple hundred tokens, search in fifty tokens per result, open any study in a few hundred more, and trace any label back to the raw submission text. Breadth and provenance from the same handful of calls.

Claude · Devano MCP

geo-index

One MCP. Four resolutions. Orient → search → open → audit.

01 · Orient

corpus_overview

~200 tokens

What's in here?

02 · Search

search_studies

~50 / result

Find what fits.

03 · Open

study_detail

~500+ / record

Inspect a study.

04 · Audit

sample_evidence

raw → label trace

Show me why.

Four storage layers feed every answer — raw GEO submissions, AI-extracted structure, validated ontology-mapped calls, and pre-computed indices. Progressive disclosure by design: an agent pulls only the resolution it needs.
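
The per-call token figures quoted above allow a back-of-envelope budget for an agent session. The sketch below treats them as fixed costs, which is a simplification: the study_detail figure is stated as "~500+", so the result is a lower-bound estimate, and no cost is given for sample_evidence, so it is left out. The function and constant names are made up for the example.

```python
# Approximate per-call costs quoted on this page, in tokens.
COST = {
    "corpus_overview": 200,  # ~200 tokens, called once to orient
    "search_result": 50,     # ~50 tokens per search_studies result
    "study_detail": 500,     # ~500+ tokens per record, so this is a floor
}


def session_budget(n_results, n_opened):
    """Rough lower bound on tokens for orient + search + open."""
    return (
        COST["corpus_overview"]
        + COST["search_result"] * n_results
        + COST["study_detail"] * n_opened
    )


# Example: scan 20 search results, then open 3 studies in full.
estimate = session_budget(n_results=20, n_opened=3)
```

For that example the estimate comes to 2,700 tokens, most of it spent on the three opened studies, which is the point of progressive disclosure: breadth is cheap and depth is paid for only where it is needed.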

METHODOLOGY

Headline counts are emitted by corpus_overview against the three-layer DuckDB build at pipeline geo-index-v3.5, snapshot 2026-04-30. Numbers refresh after each pipeline rebuild, typically monthly. Series are classified by dominant organism; Homo sapiens (Hs) series may contain mixed-organism samples for xenograft and comparative designs.

SOURCES

[1] The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository. Genome Biology, 2025. doi:10.1186/s13059-025-03725-0
[2] Hu & Wang. Mining data and metadata from the gene expression omnibus. Biophysical Reviews 10(6), 2018. doi:10.1007/s12551-018-0490-8
[3] Lim, Tesar, Belmadani, et al. Curation of over 10,000 transcriptomic studies to enable data reuse. Database, 2021. doi:10.1093/database/baab006 · Gemma curation; sex-label discrepancies and sample-field completeness statistics.
[4] Nault R, Cave MC, Ludewig G, Moseley HNB, Pennell KG, Zacharewski T. A Case for Accelerating Standards to Achieve the FAIR Principles of Environmental Health Research Experimental Data. Environmental Health Perspectives 131(6):065001, 2023. doi:10.1289/EHP11484 · Assessment of 1,233 in vivo GEO toxicology datasets against the MIATE/invivo minimum-reporting standard; reference repository at github.com/zacharewskilab/MIATE.
[5] Subramanian, Narayan, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171(6):1437–1452, 2017. doi:10.1016/j.cell.2017.10.049