gnomAD GraphQL API — findings, limitations, and design note¶
Investigation of whether/how to use the public gnomAD API as an alternative
population-frequency masking source (vs downloading per-chromosome VCFs).
Status: implemented — core/popfreq_api.GnomadApiSource, selected via
--pop-source api. The VCF source (core/popfreq.GnomadStore) and the API
source are symmetric across --pop-data {genome,exome,both}.
All facts below were verified against the live endpoint and the gnomAD docs on 2026-06-12.
Summary / recommendation¶
Use the public hosted endpoint https://gnomad.broadinstitute.org/api
(GraphQL over HTTP POST) as an optional masking source, selected with
--pop-source api. Keep the local-VCF source as the default for reproducible /
bulk / offline (HPC) runs. The API is ideal for small cohorts with no local
data (e.g. our 20-variant TP53 test); it is not suitable for large cohorts
because of a hard rate limit (below).
No new runtime dependency: use the standard-library urllib.request.
requests is unnecessary; fastapi is irrelevant (it builds servers — we are a
client).
Endpoint and protocol¶
- URL:
https://gnomad.broadinstitute.org/api - Method: HTTP
POST,Content-Type: application/json, body{"query": "...", "variables": {...}} - GraphQL errors return HTTP 200 with an
errorsarray in the body — must be checked explicitly, not just by status code. - Server-side caching present (
x-cached: HIT/MISSresponse header).
Datasets and genome builds (verified)¶
DatasetId enum includes: gnomad_r4, gnomad_r4_non_ukb, gnomad_r3 (+subsets),
gnomad_r2_1, gnomad_r2_1_controls/_non_neuro/_non_cancer/_non_topmed, exac.
Mapping for vflank:
vflank --genome-build |
reference_genome |
dataset |
|---|---|---|
hg19 (GRCh37) |
GRCh37 |
gnomad_r2_1 |
hg38 (GRCh38) |
GRCh38 |
gnomad_r4 |
reference_genome is the ReferenceGenomeId enum (GRCh37 / GRCh38). The
chrom argument takes the bare name ("17") for both builds — matches our
canonical internal form.
Query shape (region) and coordinates¶
{
region(chrom: "17", start: 7579400, stop: 7579540, reference_genome: GRCh37) {
variants(dataset: gnomad_r2_1) {
pos ref alt
exome { af populations { id ac an } }
genome { af populations { id ac an } }
}
}
}
region(start, stop)is 1-based, inclusive; returnedposis 1-based. Our masking interface uses 0-based half-open[s0, e0)→ querystart = s0 + 1,stop = e0; returnedposvalues are used directly (they're whatmask_sequenceexpects).
Schema facts that matter (introspected)¶
Variant:pos, ref, alt, exome, genome, joint, rsids, flags, ….exome/genomeareVariantSequencingTypeData: hasaf(Float),ac,an,populations,faf95,fafmax.afIS available at this level.VariantPopulation:id, ac, an, homozygote_count, …— noaffield; the per-population AF must be computed asac / an(guardan > 0).joint(VariantJointSequencingTypeData) exists only in v4 (combined exome+genome); it hasac/an/populations/fafmaxbut no directaf(computeac/an). For v2.1.1jointis null.- v4 uses
fafmaxfor the filtering-AF max; there is nogrpmaxfield (an early guess that the API corrected).
AF rule for masking (matches our VCF behaviour of max(AF, AF_grpmax))¶
For each variant, SNPs only (len(ref) == 1 and len(alt) == 1):
af_overall = max of (genome.af, exome.af) ignoring None
af_grpmax = max over (genome.populations + exome.populations) of ac/an, an>0
af = max(af_overall, af_grpmax)
mask pos if af >= af_threshold
exome.af = 0.668, population-max = 0.738 — correctly flagged as common.
Limitations (the important part)¶
- Rate limit: 10 requests / IP / 60 s (gnomAD policy; README confirms IP-level
rate limiting in
rate-limiting.ts). Exceeding it returns errors / blocks the IP. This is the dominant constraint. - Per-region variant cap. A region with too many variants returns
"This region has too many variants to display. Select a smaller region."Irrelevant for our ±200 bp windows; relevant if anyone tried whole-gene queries. - Not for bulk. gnomAD explicitly directs large/programmatic workloads to the VCF/Hail-table downloads or the gnomAD Toolbox, not the API.
- Reproducibility. Results are not pinned — gnomAD updates the served data;
network is required (no offline/HPC compute nodes). For a clinical ddPCR context,
record the
datasetid in the run report for provenance. - Schema is "subject to change" (README) — pin the query and add a schema-shape check / clear error if fields go missing.
- Coverage gaps. In coding regions,
genome.afis oftenNone(exome-only coverage); themax-ignoring-Nonerule handles this, but it means genome-only masking would miss exonic SNPs — use both exome and genome.
Rate-limit math (why source choice matters)¶
10 req/min, and our interface calls get_positions twice per variant (left + right
flank). Mitigations bring this down:
- One query per variant window: pad each region query to cover both flanks of
a variant ([start-flank-1, end+flank]) and cache as intervals, so the sibling
flank call is a cache hit → ~1 request/variant.
- Region cache: identical/overlapping variants (e.g. our 20 identical TP53
rows) collapse to a single request.
- Even so: ~10 distinct variant windows/min. ≤~50 distinct variants is comfortable;
hundreds means minutes of throttled waiting → prefer local VCF.
--pop-data {genome,exome,both} across both sources¶
--pop-data selects which gnomAD dataset(s) to consult; the masking rule is the
same regardless of source.
- API source: free — one region query returns both
genome{}andexome{};both=max(genome.af, exome.af, per-pop ac/an). - VCF source: maps to which file set is opened in
--pop-vcf-dir: genome→gnomad.genomes.*(the current behaviour / what the original tool used)exome→gnomad.exomes.*both→ open both per chromosome, mask the union of positions (a position is masked if AF ≥ threshold in either cohort).
gnomAD ships per-chromosome files for both genomes and exomes, both builds (verified): e.g. GRCh37 chr17 genomes 14.5 GB, exomes 3.9 GB (a 63 GB combined exome file also exists but we use per-chrom). So the VCF source can support all three — it just needs the corresponding files present.
Asymmetry to respect: the API gets all three for free; the VCF source needs the
files downloaded. If --pop-data exome/both is requested on the VCF source
without the required files, raise a clear error — never silently fall back to
genome-only. Today the package ships only genome patterns; adding exome
patterns (per-chrom, both builds) + union logic is required to enable
exome/both on the VCF source.
both combine = max AF across cohorts (conservative, primer-safe). The
statistically pooled combine is gnomAD's joint dataset, which exists only for
v4 (GRCh38) — a future v4 refinement; ship max-based both first.
Filename patterns needed (split genome/exome):
hg19 genome: gnomad.genomes.r2.1.1.sites.{chrom}.vcf.bgz
hg19 exome: gnomad.exomes.r2.1.1.sites.{chrom}.vcf.bgz
hg38 genome: gnomad.genomes.v4.1.sites.chr{chrom}.vcf.bgz
hg38 exome: gnomad.exomes.v4.1.sites.chr{chrom}.vcf.bgz
Proposed design (implementation pending)¶
- New
core/popfreq_api.py::GnomadApiSourceimplementing the same duck-typed interface asGnomadStore:get_positions(bare, start0, end0, af_threshold) -> list[int]. Drops in behindReferenceFlankSourcewith no core changes. - CLI:
--pop-source {vcf,api}(defaultvcf).apiignores--pop-vcf-dir. - Build→(reference_genome, dataset) mapping as above;
--gnomad-datasetoverride for subsets (e.g.gnomad_r2_1_non_cancer). - HTTP via
urllib.request; 30 s timeout; client-side throttle ≤10/60 s (token bucket); exponential backoff + retry on transient/429; raisePopFreqErrorwith a clear message on hard failure (no silent empty mask). - Region-interval cache (pad to cover both flanks) to minimise requests.
- Record dataset id + "source=api" in the run report for provenance.
- Up-front guard: if the distinct-variant count is large, warn and suggest
--pop-source vcf.
Test plan¶
- Unit (no network):
build_to_dataset(genome_build)mapping.- Response→positions parser: given saved JSON fixtures (real API responses for a
common-SNP region and a rare-only region), assert correct SNP/AF filtering,
af = max(overall, ac/an grpmax), indels excluded,NoneAF handled. - Coordinate conversion (
[s0,e0)→start=s0+1, stop=e0). - Throttle/backoff logic with a fake clock and a fake transport (no real HTTP).
- Error mapping: GraphQL
errorsarray and HTTP failure →PopFreqError. - Fixtures: commit 2–3 small real JSON responses under
tests/fixtures/gnomad_api/(captured once; <5 KB each) so tests are deterministic and offline. - Integration: the existing
ReferenceFlankSourcemasking test, but with a stub source object exposingget_positions— confirms the API source is interchangeable withGnomadStore(no real network in CI). - Optional live smoke test: one test hitting the real API, marked
@pytest.mark.liveand skipped by default (run manually); keeps CI offline and polite to the shared resource.
Docs plan¶
docs/DEVELOPER.md: add--pop-source apito the CLI section and a "new frequency source" note (it already documents the duck-typed interface).docs/ARCHITECTURE.md: add the API source to the masking modes / popfreq seam.ddpcr-conventionsskill: add the AF rule (max of overall + per-pop ac/an) and the API-vs-VCF tradeoff so future work reasons about it correctly.README.md: one line under masking — "no-download option via the gnomAD API for small cohorts."- User-facing guidance everywhere: API = small/no-download; VCF = bulk/reproducible/HPC.
Resolved decisions¶
- Default genome build =
hg19(GRCh37 / gnomAD v2.1.1). gnomAD v4 is GRCh38-only — there is no v4 GRCh37. This deployment's data isassembly19(hg19 ctDNA) and the existing tool is hg19, so hg19 is the right default. CLI default (small run,list-vcf) changed hg38 → hg19; hg38 users pass-g hg38and the build-mismatch guard catches a wrong choice. - Default population data =
genome, exposed as--pop-data {genome,exome,both}. Flanks routinely fall in non-coding regions where exomes have no data; genomes cover the whole genome uniformly; and this matches the existing genomes-VCF behaviour. Verified on TP53 P72R (GRCh37): genome af=0.621 (AN 31k), exome af=0.668 (AN 251k) — both agree it's common. Exome's larger N only helps for rare coding variants near the threshold;both(union/max) is the safest near exons. - AF within the chosen dataset:
af = max(seq.af, max(ac/an over populations)), SNPs only — mirrors the VCF source's max(AF, AF_grpmax). - Default
--pop-source = vcf(api opt-in). Both sources default to genome for parity. - Fixed safe throttle (≤10 req/60 s + backoff); no
--api-rateknob initially.