API Reference¶
vflank can be used as a library. The pieces compose directly — see the Developer Guide for runnable examples. Documentation is generated from the source docstrings.
Core¶
vflank.core.variant¶
The Variant value object and per-variant validation.
Decoupling the pipeline from pandas rows (the original script iterated
df.iterrows() directly) makes the core logic testable with plain objects.
Variant dataclass ¶
A single small variant in canonical internal form.
Coordinates are 1-based, fully-closed [start, end] (MAF convention).
chrom is the bare chromosome; raw_chrom preserves the original
display notation for headers/messages only.
Source code in src/vflank/core/variant.py
validate_allele ¶
True if an allele is a valid base string (or the '-'/'.' placeholders).
validate_coordinates ¶
Return an error message if coordinates are invalid, else None.
Source code in src/vflank/core/variant.py
vflank.core.chrom¶
Chromosome notation detection and normalisation.
All functions here are pure (no I/O) so they are trivially unit-testable. The
canonical internal form is the bare chromosome string (e.g. "7", "X",
"MT"); callers convert to chr-prefixed form for a given FASTA/VCF at the
last moment via chrom_for_contigs.
normalise_chrom ¶
Convert a raw chromosome value to a canonical bare chrom string.
Handles NaN/None/empty, any case of a chr prefix, bare integers, numeric
sex/mito encodings (23->X, 24->Y, 25/26->MT), and mito aliases (M->MT).
Returns (bare_chrom, error). If error is not None the value could not
be normalised and bare_chrom is None (the variant should be skipped).
Source code in src/vflank/core/chrom.py
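The normalisation rules listed above can be sketched as follows. This is an illustrative re-implementation written from the docstring, not the code in src/vflank/core/chrom.py: the name normalise_chrom_sketch, the error strings, and the handling of whole-number floats such as 7.0 are assumptions.

```python
import math
import re
from typing import Optional, Tuple

# Numeric sex/mito encodings described in the docstring: 23->X, 24->Y, 25/26->MT.
_NUMERIC_ALIASES = {"23": "X", "24": "Y", "25": "MT", "26": "MT"}


def normalise_chrom_sketch(raw: object) -> Tuple[Optional[str], Optional[str]]:
    """Return (bare_chrom, error); exactly one of the two is None."""
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None, "missing chromosome"  # hypothetical error text
    # Assumption: whole-number floats (e.g. 7.0 from a DataFrame) mean "7".
    if isinstance(raw, float) and raw.is_integer():
        raw = int(raw)
    text = str(raw).strip()
    # Strip a chr prefix in any case (chr7, CHR7, Chr7), then upper-case.
    text = re.sub(r"^chr", "", text, flags=re.IGNORECASE).upper()
    if not text:
        return None, "empty chromosome"  # hypothetical error text
    text = _NUMERIC_ALIASES.get(text, text)
    if text == "M":  # mito alias
        text = "MT"
    return text, None
```

Callers would skip the variant whenever the second element is not None, as the docstring describes.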
contigs_have_chr ¶
Return True if any contig in an iterable is chr-prefixed.
Probes chr1/1 first, then falls back through chr2-chr5.
Defaults to True (assume prefixed) when undeterminable.
Source code in src/vflank/core/chrom.py
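A minimal sketch of the probing order described above, assuming that each fallback step checks both the chr-prefixed and the bare form of the same chromosome (the docstring only spells this out for chr1/1). The function name is hypothetical.

```python
from typing import Iterable


def contigs_have_chr_sketch(contigs: Iterable[str]) -> bool:
    """Guess whether a contig list is chr-prefixed by probing chromosomes 1-5."""
    names = set(contigs)
    for n in range(1, 6):
        if f"chr{n}" in names:
            return True
        if str(n) in names:
            return False
    # Nothing recognisable: default to True (assume prefixed), as documented.
    return True
```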
chrom_for_contigs ¶
detect_series_chr_style ¶
Inspect an iterable of raw chromosome values for reporting.
Returns True (chr-prefixed), False (bare), or None (unknown/mixed/empty).
Source code in src/vflank/core/chrom.py
vflank.core.flanks¶
Flank extraction and masking.
A FlankSource is the strategy seam for where each flank base comes
from. The reference-backed source implemented here covers modes A (reference)
and B (reference + population mask). Modes C and D (BAM consensus) will add a
ConsensusFlankSource implementing the same protocol.
FlankResult dataclass ¶
Left/right flanks, raw and masked, for one variant.
Source code in src/vflank/core/flanks.py
FlankSource ¶
ReferenceFlankSource ¶
Flanks pulled from the reference FASTA, optionally masking common SNPs.
MAF coordinates are 1-based fully-closed [start, end]; pysam uses 0-based
half-open [start, end). With flank as the requested flank length, the left
flank is the flank bases ending just before the variant and the right flank
is the flank bases starting just after it. The variant interval itself is
excluded from both flanks.
Source code in src/vflank/core/flanks.py
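The coordinate conversion described above can be sketched as a small pure function. This is an assumption-laden illustration, not the library's code: flank_windows is a hypothetical name, and special cases such as MAF insertion coordinates are not modelled.

```python
from typing import Tuple

Window = Tuple[int, int]  # 0-based half-open [start, end)


def flank_windows(start: int, end: int, flank: int) -> Tuple[Window, Window]:
    """Map a 1-based closed variant interval to left/right flank windows."""
    variant_start0 = start - 1  # 1-based closed start -> 0-based start
    # Left flank ends where the variant begins; clamp at the contig start.
    left = (max(0, variant_start0 - flank), variant_start0)
    # A 1-based closed end equals the 0-based half-open end, so the right
    # flank starts at `end` and the variant itself is excluded.
    right = (end, end + flank)
    return left, right


# Toy check on a 10-base contig; the variant covers 1-based positions 5..6.
contig = "ACGTACGTAC"
(left_s, left_e), (right_s, right_e) = flank_windows(5, 6, flank=3)
left_flank = contig[left_s:left_e]
right_flank = contig[right_s:right_e]
```

Slicing a Python string uses the same 0-based half-open convention as pysam, which is why the windows can be applied directly.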
mask_sequence ¶
Replace bases at the given 1-based genomic positions with 'N'.
seq[0] corresponds to genomic position region_start_0based + 1.
Source code in src/vflank/core/flanks.py
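The index arithmetic behind the masking can be illustrated as follows. The function name and argument order are assumptions; only the rule that seq[0] corresponds to genomic position region_start_0based + 1 comes from the docstring.

```python
from typing import Iterable


def mask_sequence_sketch(
    seq: str, region_start_0based: int, positions: Iterable[int]
) -> str:
    """Replace bases at the given 1-based genomic positions with 'N'."""
    bases = list(seq)
    for pos in positions:
        # 1-based genomic position -> index into seq.
        idx = pos - 1 - region_start_0based
        if 0 <= idx < len(bases):  # ignore positions outside this window
            bases[idx] = "N"
    return "".join(bases)
```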
vflank.core.popfreq¶
Population allele-frequency masking source (gnomAD), local-VCF backend.
Masking can draw on gnomAD genome and/or exome data (--pop-data):
flanks often fall in non-coding regions where only genomes have data, while
exomes add power in coding regions. both masks the union (a position is
masked if it is a common SNP in either cohort).
The hot line-parsing kernel (parse_common_snp_positions) is a pure
function over an iterable of raw VCF lines, so it is unit-testable without pysam
and is the natural seam to later swap for a Rust/noodles implementation.
GnomadStore ¶
Lazy, per-(kind, chromosome) tabix cache over a directory of gnomAD VCFs.
Honours --pop-data (genome / exome / both). For both, queried
positions are the union across the two cohorts. Tabix handles are opened on
first use; each file's chr-notation is detected on open so queries use the
file's own contig names.
Source code in src/vflank/core/popfreq.py
preflight ¶
Fail fast if a requested data kind is wholly absent for these chroms.
This is what prevents a silent genome-only fallback when --pop-data
exome/both is requested without the exome files present. Per-chrom
gaps (some chromosomes missing) remain warnings at query time.
Source code in src/vflank/core/popfreq.py
get_positions ¶
1-based positions of common SNPs in [start, end) — union over kinds.
Source code in src/vflank/core/popfreq.py
kinds_for ¶
Map a --pop-data value to the data kinds it consults.
genome/exome -> that one; both -> both. Raises on anything else.
Source code in src/vflank/core/popfreq.py
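The mapping is small enough to sketch directly. The function name, the tuple return type, and the use of ValueError are assumptions; the docstring only says that anything else raises.

```python
from typing import Tuple


def kinds_for_sketch(pop_data: str) -> Tuple[str, ...]:
    """Map a --pop-data value to the data kinds it consults."""
    mapping = {
        "genome": ("genome",),
        "exome": ("exome",),
        "both": ("genome", "exome"),
    }
    try:
        return mapping[pop_data]
    except KeyError:
        # Assumption: the real exception type may differ.
        raise ValueError(f"unknown --pop-data value: {pop_data!r}") from None
```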
resolve_vcf_for_chrom ¶
Return the first existing gnomAD file for this chrom/build/kind, else None.
Source code in src/vflank/core/popfreq.py
example_filename ¶
A representative expected filename (chr1), for error/help messages.
Source code in src/vflank/core/popfreq.py
build_chrom_vcf_map ¶
Resolve the VCF (of one kind) for each chromosome (for coverage reports).
Source code in src/vflank/core/popfreq.py
parse_common_snp_positions ¶
Return 1-based positions of SNPs whose max AF/AF_grpmax >= threshold.
Only single-base substitutions are considered (REF and every ALT length 1).
rows is an iterable of raw tab-delimited VCF data lines. Works for both
genome and exome gnomAD VCFs (identical INFO AF fields).
Source code in src/vflank/core/popfreq.py
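A minimal sketch of such a line-parsing kernel, written from the docstring rather than from the source. The function name, the treatment of non-numeric AF values, and the skipping of short or comment lines are assumptions.

```python
from typing import Iterable, List

_AF_KEYS = ("AF", "AF_grpmax")  # INFO keys named in the docstring


def parse_common_snp_positions_sketch(
    rows: Iterable[str], threshold: float
) -> List[int]:
    """Return 1-based positions of SNPs whose max AF/AF_grpmax >= threshold."""
    positions: List[int] = []
    for line in rows:
        if not line or line.startswith("#"):
            continue  # assumption: tolerate stray header/blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        pos, ref, alts, info = fields[1], fields[3], fields[4].split(","), fields[7]
        # SNPs only: REF and every ALT must be a single base.
        if len(ref) != 1 or any(len(alt) != 1 for alt in alts):
            continue
        max_af = 0.0
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            if not sep or key not in _AF_KEYS:
                continue
            for token in value.split(","):  # one value per ALT allele
                try:
                    max_af = max(max_af, float(token))
                except ValueError:
                    pass  # '.' or other non-numeric placeholder
        if max_af >= threshold:
            positions.append(int(pos))
    return positions
```

Because the function only consumes an iterable of strings, a test can pass a plain list of lines instead of opening a tabix file.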
vflank.core.popfreq_api¶
Population allele-frequency masking source via the gnomAD GraphQL API.
An alternative to vflank.core.popfreq.GnomadStore that needs no local
VCF download: it queries https://gnomad.broadinstitute.org/api per flank region.
Exposes the same duck-typed get_positions interface, so it drops in behind
ReferenceFlankSource unchanged.
Trade-offs (see docs/research/gnomad-api.md): no download is needed and both builds are available, but the API is rate-limited to ~10 requests/IP/60s and results are not reproducible. It is best for small cohorts; prefer the VCF source for bulk/HPC/reproducible runs.
The parsing kernel (parse_api_variants) is pure and unit-testable; HTTP
and timing are injected so tests run offline.
GnomadApiSource ¶
Masking source backed by the public gnomAD GraphQL API.
Region responses are cached (so the two flank queries of identical variants
reuse one request), requests are throttled to respect the rate limit, and
transient failures are retried with backoff before raising PopFreqError.
Source code in src/vflank/core/popfreq_api.py
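The combination of caching, throttling, and retry with injected timing can be sketched generically. Everything here is hypothetical (the class name, its parameters, retrying on OSError, and re-raising the last error instead of wrapping it in PopFreqError); it only illustrates why injecting the clock and sleep functions lets tests run offline and instantly.

```python
import time
from typing import Callable, Dict, Optional


class ThrottledCachedFetcher:
    """Cache results, space out calls, and retry transient failures."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_interval: float,
        max_retries: int = 3,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._max_retries = max_retries
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_call: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Sleep only for the part of the interval that has not yet elapsed.
        if self._last_call is not None:
            remaining = self._min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()

    def get(self, key: str) -> str:
        if key in self._cache:
            return self._cache[key]  # cached: no request, no throttling
        attempt = 0
        while True:
            self._wait_for_slot()
            try:
                value = self._fetch(key)
            except OSError:
                if attempt == self._max_retries:
                    raise
                # Exponential backoff before the next attempt.
                self._sleep(self._min_interval * 2 ** attempt)
                attempt += 1
                continue
            self._cache[key] = value
            return value
```

A min_interval of about 6 seconds would correspond to roughly 10 requests per 60 seconds; that figure is derived here from the stated limit, not taken from the source.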
get_positions ¶
1-based positions of common SNPs in [start, end) for this chrom.
Source code in src/vflank/core/popfreq_api.py
build_query ¶
GraphQL query for SNP AFs in a region. start/stop are 1-based inclusive.
Source code in src/vflank/core/popfreq_api.py
parse_api_variants ¶
1-based positions of SNPs whose max AF (over kinds) >= threshold.
Pure: variants is the GraphQL region.variants list. SNPs only.
Source code in src/vflank/core/popfreq_api.py
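A sketch of a pure parsing kernel over an already-decoded variants list. The dictionary keys used here (pos, ref, alt, and a per-kind block carrying af) are assumptions about the response shape chosen for illustration, as are the function name and the sorted-list return value; check them against the query built by build_query.

```python
from typing import Any, Dict, Iterable, List, Sequence


def parse_api_variants_sketch(
    variants: Iterable[Dict[str, Any]],
    threshold: float,
    kinds: Sequence[str] = ("genome", "exome"),
) -> List[int]:
    """Return sorted 1-based positions of SNPs whose max AF >= threshold."""
    positions = set()
    for variant in variants:
        ref = variant.get("ref") or ""
        alt = variant.get("alt") or ""
        if len(ref) != 1 or len(alt) != 1:
            continue  # SNPs only
        afs = []
        for kind in kinds:
            block = variant.get(kind) or {}  # a cohort may be null/absent
            af = block.get("af")
            if af is not None:
                afs.append(af)
        if afs and max(afs) >= threshold:
            positions.add(variant["pos"])
    return sorted(positions)
```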
vflank.core.reference_api¶
Reference-sequence source backed by the UCSC REST API.
An alternative to vflank.io.reference.ReferenceFasta that needs no
local FASTA download: it fetches each flank window from
https://api.genome.ucsc.edu/getData/sequence. Exposes the surface the CLIs
use on a reference (fetch / check_build / has_chr / close), so it
drops in behind vflank.core.flanks.ReferenceFlankSource unchanged.
Why UCSC (see docs/research/genome-api.md): its genome ids are literally
hg19 / hg38 (our --genome-build values) and its coordinates are
0-based half-open, identical to pysam, so the flank math in flanks.py needs
no translation. Trade-offs: no download is needed and both builds are available,
but there is a ~1 req/s courtesy rate limit and a network connection is required.
It is best for hosted single-variant or other small-scale use; prefer a local
FASTA for bulk/HPC/offline runs.
The parsing kernel (parse_sequence_response) and URL builder
(build_url) are pure and unit-testable; HTTP and timing are injected so
tests run offline.
ReferenceApiSource ¶
Reference-sequence source backed by the UCSC getData/sequence API.
Window responses are cached, requests are throttled to respect UCSC's
courtesy limit (~1 req/s), and transient failures are retried with backoff
before raising ReferenceError. Drop-in for ReferenceFasta across
the surface the CLIs use.
Source code in src/vflank/core/reference_api.py
fetch ¶
Reference bases for [start, end) (0-based half-open), bare chromosome.
Source code in src/vflank/core/reference_api.py
check_build ¶
No local sequence to fingerprint; the API serves the requested build.
Returns None (no mismatch warning is possible). We trust declared
and pass it through as the UCSC genome; a wrong build still surfaces
downstream as a UCSC error rather than silent wrong sequence.
Source code in src/vflank/core/reference_api.py
ucsc_contig ¶
Bare chromosome -> UCSC contig name.
UCSC hg19/hg38 use chr-prefixed names, and the mitochondrion is chrM
(not chrMT). We cannot cheaply probe the contig list over the API, so we
apply UCSC's known convention rather than guessing.
Source code in src/vflank/core/reference_api.py
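The naming convention described above amounts to a one-line mapping. A minimal sketch, with a hypothetical function name:

```python
def ucsc_contig_sketch(bare_chrom: str) -> str:
    """Map a canonical bare chromosome to its UCSC hg19/hg38 contig name."""
    if bare_chrom == "MT":
        return "chrM"  # UCSC names the mitochondrion chrM, not chrMT
    return f"chr{bare_chrom}"
```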
build_url ¶
UCSC getData/sequence URL. UCSC coords are 0-based half-open — same as ours.
Source code in src/vflank/core/reference_api.py
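A sketch of a pure URL builder for this endpoint. The genome, chrom, start, and end query parameters and the semicolon separators follow the examples in UCSC's public REST API documentation; the function name and argument order are assumptions, not the signature of build_url.

```python
from urllib.parse import quote

UCSC_SEQUENCE_ENDPOINT = "https://api.genome.ucsc.edu/getData/sequence"


def build_url_sketch(genome: str, contig: str, start: int, end: int) -> str:
    """Build a getData/sequence URL for a 0-based half-open [start, end) window."""
    params = (
        ("genome", genome),
        ("chrom", contig),
        ("start", str(start)),
        ("end", str(end)),
    )
    # Percent-encode values defensively; typical inputs need no escaping.
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in params)
    return f"{UCSC_SEQUENCE_ENDPOINT}?{query}"
```

Since the coordinates are already 0-based half-open, the window computed for pysam can be passed through unchanged.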
parse_sequence_response ¶
Extract the sequence from a UCSC getData/sequence JSON body.
Pure. Raises ReferenceError on an API error payload or a missing/
non-string dna field. A shorter-than-requested sequence is not an
error here: like a pysam fetch off a contig end, truncation is a real
condition the caller reports (see the flank-truncation check in the CLIs).
Source code in src/vflank/core/reference_api.py
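The validation rules above can be sketched as a pure function over the response body. The exception class SequenceResponseError stands in for the library's ReferenceError, and both the string-body input and the error key checked in the payload are assumptions.

```python
import json


class SequenceResponseError(Exception):
    """Hypothetical stand-in for the library's ReferenceError."""


def parse_sequence_response_sketch(body: str) -> str:
    """Extract the dna string from a getData/sequence JSON body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise SequenceResponseError("response is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise SequenceResponseError("response is not a JSON object")
    if "error" in payload:  # assumption: error payloads carry an "error" key
        raise SequenceResponseError(f"API error: {payload['error']}")
    dna = payload.get("dna")
    if not isinstance(dna, str):
        raise SequenceResponseError("missing or non-string dna field")
    # A short sequence is returned as-is; truncation is the caller's concern.
    return dna
```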
vflank.core.fusion¶
Fusion / SV junction construction from breakpoint pairs.
Builds the chimeric junction sequence for a fusion so a ddPCR probe can span it.
Strand convention (matches iCallSV dellyVcf2Tab.py): 0 = plus/reference,
1 = minus/complement. The fused product reads 5'->3' as partner1 +
partner2 with no separator; partner 1 ends at the junction, partner 2
starts at it. See docs/research/sv-vcf-input.md for the full derivation
(validated against the unambiguous deletion case).
Breakpoint dataclass ¶
Fusion dataclass ¶
build_junction ¶
Construct the fusion junction sequence (partner1 + partner2).
flank is the bases taken from each partner, so the junction is up to
2*flank bp (shorter if a partner runs off a contig end). The probe is
designed to span junction_index. masked_sequence is the gnomAD-masked
junction, or — when bam_source is given — the per-sample patient consensus
(built in genomic space before reverse-complement).
Source code in src/vflank/core/fusion.py
vflank.core.skips¶
Categorise per-variant skip reasons for a compact run summary.
The reasons are free-text (they come from several validation points), so this buckets them by keyword into a small, stable set of categories. Pure and testable.
categorize_skip ¶
Map a free-text skip reason to a stable category label.
I/O¶
vflank.io.maf¶
MAF loading, column remapping/validation, and row -> Variant parsing.
MafColumns dataclass ¶
User-overridable mapping from MAF column names to canonical names.
Source code in src/vflank/io/maf.py
read_maf ¶
Read a MAF into a DataFrame (tab-separated, '#'-comment aware).
path is a filesystem path or an open text/binary buffer.
Source code in src/vflank/io/maf.py
load_maf ¶
Read a MAF, remap required columns to canonical names, and validate.
Returns the DataFrame with canonical required-column names guaranteed present, and optional metadata columns filled with defaults if absent.
Source code in src/vflank/io/maf.py
parse_variant_row ¶
Convert a MAF row to a Variant, or return a skip reason.
Returns (variant, None) on success or (None, reason) on a bad row.
Source code in src/vflank/io/maf.py
vflank.io.reference¶
Reference FASTA access with chr-notation detection and build fingerprinting.
The build-mismatch guard addresses the scariest silent failure in this domain: running hg19 coordinates against an hg38 FASTA (or vice versa) returns the wrong sequence with no error. We fingerprint by the length of chromosome 1.
ReferenceFasta ¶
Thin wrapper over pysam.FastaFile keyed by bare chromosome.
Source code in src/vflank/io/reference.py
contig ¶
Resolve a bare chromosome to this FASTA's actual contig name.
Auto-detection of chr prefixing is best-effort (it probes chr1-5).
If the detected form is absent but the other form is present — which can
happen with unusual or single-contig references — fall back to it rather
than letting pysam raise a confusing KeyError for every variant.
Source code in src/vflank/io/reference.py
detect_build ¶
Infer 'hg19'/'hg38' from chr1 length, or None if undeterminable.
Source code in src/vflank/io/reference.py
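Fingerprinting by chromosome 1 length can be sketched with the published chr1 lengths of the two assemblies (249,250,621 bp for GRCh37/hg19 and 248,956,422 bp for GRCh38/hg38). The function names, the integer input, and the warning text are assumptions; the real methods operate on a ReferenceFasta.

```python
from typing import Optional

# Published chr1 lengths for the two supported builds.
CHR1_LENGTHS = {249_250_621: "hg19", 248_956_422: "hg38"}


def detect_build_sketch(chr1_length: Optional[int]) -> Optional[str]:
    """Infer 'hg19'/'hg38' from the chr1 length, or None if undeterminable."""
    if chr1_length is None:
        return None
    return CHR1_LENGTHS.get(chr1_length)


def check_build_sketch(declared: str, chr1_length: Optional[int]) -> Optional[str]:
    """Return a warning string on a declared/detected mismatch, else None."""
    detected = detect_build_sketch(chr1_length)
    if detected is not None and detected != declared:
        return f"declared build {declared} but FASTA looks like {detected}"
    return None
```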
check_build ¶
Return a warning string if the declared build disagrees with the FASTA.
Source code in src/vflank/io/reference.py
vflank.io.fasta¶
FASTA record formatting and writing for the small-variant path.
safe_header ¶
record_id ¶
The shared record key: [{SAMPLE}__]{GENE}__{HGVSp}__{HGVSc}__{CHROM}_{POS}_{REF}_{ALT}.
Used for both FASTA headers and Primer3 SEQUENCE_ID so records cross-reference.
Source code in src/vflank/io/fasta.py
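The key layout can be illustrated with plain string formatting. This is a sketch with a hypothetical name and signature; any character sanitising the library applies (for example through safe_header) is not modelled.

```python
from typing import Optional


def record_id_sketch(
    gene: str,
    hgvsp: str,
    hgvsc: str,
    chrom: str,
    pos: int,
    ref: str,
    alt: str,
    sample: Optional[str] = None,
) -> str:
    """Build [{SAMPLE}__]{GENE}__{HGVSp}__{HGVSc}__{CHROM}_{POS}_{REF}_{ALT}."""
    key = f"{gene}__{hgvsp}__{hgvsc}__{chrom}_{pos}_{ref}_{alt}"
    # The sample prefix is optional, matching the bracketed part of the pattern.
    return f"{sample}__{key}" if sample else key
```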
format_records ¶
Two FASTA records per variant: raw and masked.
Keyed on the variant identity (CHR_POS_REF_ALT). When sample is given
(BAM-consensus mode, where the sequence is patient-specific), the sample is
prefixed so per-(variant, sample) records stay distinct:
[{SAMPLE}__]{GENE}__{HGVSp}__{HGVSc}__{CHROM}_{POS}_{REF}_{ALT} {left}[REF/ALT]{right}
Masked__[{SAMPLE}__]{GENE}__{HGVSp}__{HGVSc}__{CHROM}_{POS}_{REF}_{ALT} {masked_left}[REF/ALT]{masked_right}
Source code in src/vflank/io/fasta.py
vflank.io.breakpoints¶
Read SV breakpoints from the simple iCallSV / iAnnotateSV TSV.
Columns are matched by header name, not position (SvColumns), so a file
works regardless of column order or extra columns, as long as the named columns
are present. Mirrors io/maf.MafColumns.
SvColumns dataclass ¶
Logical field -> header column name (all overridable).
Source code in src/vflank/io/breakpoints.py
read_sv_table ¶
Read the TSV into a DataFrame (tab-separated, '#'-comment aware).
path is a filesystem path or an open text/binary buffer.
Source code in src/vflank/io/breakpoints.py
load_sv_table ¶
Read and validate that the required columns exist (by name).
Source code in src/vflank/io/breakpoints.py
parse_fusion_row ¶
Convert one row to a Fusion, or return a skip reason.
Source code in src/vflank/io/breakpoints.py
vflank.io.report¶
Write a machine-readable TSV run report alongside the FASTA output.
Aggregate stats and the skip breakdown go in #-comment header lines; the
per-variant table follows as proper TSV. Columns are taken from the row keys
(insertion order), so callers control the columns per run mode.
write_report ¶
Write the run report TSV. Raises OSError on write failure (never silent).
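The report layout described above (comment header lines followed by a TSV table whose columns come from the row keys) can be sketched with the standard csv module. The function name, the "# key: value" header format, and taking the union of keys across rows are assumptions.

```python
import csv
import io
from typing import Any, Dict, List, TextIO


def write_report_sketch(
    handle: TextIO, stats: Dict[str, Any], rows: List[Dict[str, Any]]
) -> None:
    """Write '#'-comment stats lines, then the per-variant rows as TSV."""
    for key, value in stats.items():
        handle.write(f"# {key}: {value}\n")
    if not rows:
        return
    # Columns in first-seen insertion order across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(
        handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells


# Usage with an in-memory buffer; a real run would pass an open file, so any
# OSError from the write propagates to the caller.
buffer = io.StringIO()
write_report_sketch(
    buffer,
    {"total": 2, "skipped": 1},
    [
        {"id": "v1", "status": "ok"},
        {"id": "v2", "status": "skipped", "reason": "bad allele"},
    ],
)
report_text = buffer.getvalue()
```

Readers that honour '#' comments, such as the MAF and SV table loaders described earlier, can then parse the table part without special handling.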