vflank — Architecture & Roadmap¶

What vflank is (and is not)¶

vflank is the variant-aware, optionally patient-specific, masked-flank front-end of a ddPCR assay-design pipeline. It extracts the sequence flanking each variant and masks positions that would compromise a primer/probe, then emits clean targets in designer-native formats.

It is not a primer designer. Design is delegated downstream:

Path	Designer	Why
Small variants (SNP/indel)	Olivar (SADDLE, risk-array)	Built for variant-aware multiplex/tiled amplicons.
Fusion junction	Primer3	Single junction-spanning hybridisation probe.

Olivar is GPL-3.0 and pulls in BLAST/MAFFT/NumPy<2, so it is invoked out-of-process (CLI / container), keeping vflank permissively licensed and the dependency graph clean. Orchestration across samples/tools is a Nextflow layer that wraps the stable CLIs in per-process containers — added last, never a prerequisite for the standalone CLI.

flowchart LR
    REF[("ref FASTA")] --> VF
    POP[("gnomAD VCFs")] --> VF
    BAM[("sample BAM
optional")] --> VF
    subgraph VF["vflank front-end · flank + mask"]
        SV["small variants
FASTA + SNP CSV"]
        FU["fusion junction"]
    end
    SV --> OL["Olivar"] --> DD["ddPCR assays"]
    FU --> P3["Primer3"] --> DD

Use mouse to pan and zoom

Flank source strategy (modes)¶

A FlankSource decides where each flank base comes from:

Mode	Inputs	Source	Masking
A Reference	MAF + FASTA	reference	none
B Reference + pop-mask	+ gnomAD	reference	common SNPs → N
C Consensus	+ BAM	patient consensus, ref fallback	het / low-cov → N/IUPAC
D Consensus + pop-mask	all	patient consensus	gnomAD ∪ observed-het

The mode is selected implicitly by which inputs are present:

flowchart TD
    V["flank window"] --> Q1{"--bam given?"}
    Q1 -->|no| Q2{"gnomAD given?"}
    Q1 -->|yes| C["patient consensus
het / low-cov → N / IUPAC"] --> Q3{"gnomAD given?"}
    Q2 -->|no| A["Mode A — reference, no mask"]
    Q2 -->|yes| B["Mode B — reference + SNP mask"]
    Q3 -->|no| MC["Mode C — consensus only"]
    Q3 -->|yes| MD["Mode D — consensus + SNP mask"]

Use mouse to pan and zoom

Mode C/D is the differentiator: patient consensus catches private/rare variants gnomAD never sees — the ones that silently break a primer for one patient. Consensus is delegated to samtools consensus (in-process via pysam.samtools), kept reference-length, with a pure low-coverage overlay and insertion-site flagging on top. See research/bam-consensus.md.

All four modes are implemented (ReferenceFlankSource for A/B, ConsensusFlankSource for C/D). The population-mask backend is pluggable behind a duck-typed get_positions interface, with two interchangeable implementations selected by --pop-source: GnomadStore (local VCFs) and GnomadApiSource (gnomAD GraphQL API, no download, rate-limited). Both honour --pop-data {genome,exome,both} (union for both). See research/gnomad-api.md.

The reference source is likewise pluggable, selected by --ref-source: ReferenceFasta (local indexed FASTA, default) and ReferenceApiSource (UCSC getData/sequence API, no download — for hosted/small runs where shipping a FASTA is impractical). Both expose the fetch(bare, s0, e0) window contract ReferenceFlankSource depends on. See research/genome-api.md.

Module map¶

src/vflank/
├── core/
│   ├── chrom.py       notation detect/normalise (pure)
│   ├── variant.py     Variant dataclass + validation (pure)
│   ├── flanks.py      FlankSource protocol, ReferenceFlankSource, mask_sequence
│   ├── popfreq.py     gnomAD VCF resolve + parse_common_snp_positions (pure) + GnomadStore
│   ├── popfreq_api.py gnomAD GraphQL API source (GnomadApiSource) + pure parser
│   ├── reference_api.py reference sequence via UCSC API (ReferenceApiSource) + pure parser
│   ├── consensus.py   BAM patient consensus (modes C/D): samtools engine, pure
│   │                  low-cov overlay, insertion-site flagging
│   ├── fusion.py      Breakpoint/Fusion model, reverse-complement junction builder
│   └── skips.py       categorised skip/he reporting helpers (pure)
├── io/
│   ├── maf.py         load/remap/validate (path or buffer), row → Variant
│   ├── reference.py   ReferenceFasta + genome-build guard
│   ├── fasta.py       header sanitise + record format/write
│   ├── breakpoints.py SV/fusion breakpoint-TSV reader (path or buffer)
│   ├── emit_primer3.py Primer3 Boulder-IO writer (EmitRecord -> SEQUENCE_* tags)
│   └── report.py      TSV run-report writer (flexible columns)
├── sources.py        reference/gnomAD source factories from config (validate + build)
├── pipeline.py       use case: iter_small/iter_fusion + collect + run_small/run_fusion
├── cli/
│   ├── app.py         root Typer, global -v/-q/--debug, version
│   ├── small.py       run · inspect · list-vcf (+ masking, BAM, coverage flags)
│   ├── fusion.py      run (+ masking, BAM flags)
│   ├── _bam.py        --bam/--bam-map resolver + ConsensusPolicy builder
│   └── _ui.py         parameter-echo panel
├── logging.py         Rich console + logger
└── errors.py          VflankError hierarchy

The hot kernels (parse_common_snp_positions, mask_sequence, the consensus overlay) are pure functions over plain iterables — the natural seam to later accelerate with Rust (rust-htslib for consensus parity with samtools; noodles for pure-Rust gnomAD scanning). Lock correctness in Python first; port the proven bottleneck only.

The layering enforces a one-way dependency rule — core/ stays pure so its kernels remain unit-testable without pysam and portable to Rust:

flowchart LR
    cli["cli/
Typer apps"] -->|uses| io["io/
file access · pysam · pandas"]
    cli -->|uses| core["core/
pure domain kernels"]
    io -->|uses| core
    core -->|depends on| std(["stdlib only"])

Use mouse to pan and zoom

Origin¶

vflank began as two scripts: get_flanking_sequence.py (small variants, fully ported in M2 with the output format preserved) and design_fusion_primers.py + config_ES_CTDNA_03.cfg (a legacy Python-2 fusion script that didn't run and embedded a junction-corrupting "-"). Both have been ported / re-implemented and removed; the conventions recovered from them (the corrected fusion-junction model, iCallSV strand mapping) are captured in research/sv-vcf-input.md.

Roadmap¶

Shipped through v0.2.0 (on PyPI + GHCR):

M	Deliverable	State
M1	Scaffold (src-layout, pyproject/Hatch, Typer root, logging)	✅ done
M2	Small-variant port (behaviour-preserving, testable)	✅ done
M3	Tests + provenance (unit + integration, build guard)	✅ done
M4	Fusion rewrite (Python 3, breakpoint-TSV input)	✅ done
M4.5	BAM consensus (modes C/D) + insertion flagging	✅ done (0.2.0)
M5	MkDocs (Material + mkdocstrings, versioned via mike)	✅ done
M6	Release (PyPI via OIDC; GHCR image)	✅ done (0.1.0, 0.2.0)

Next, in priority order (each has a design note in research/):

Next	Deliverable	Design note
M4.5-emit	`--emit olivar,primer3` designer-native output formats	emit-formats.md
M-vcf	VCF input — small-variant VCF ✅ done; Delly `CT`/`BND` SV VCF next	sv-vcf-input.md
M-indel	Indel-aware (length-changing) consensus — option 3	indel-aware-consensus.md
M7	Nextflow / nf-core pipeline (containerised)	—
M-rust	Port the proven hot kernels to Rust (optional)	—

Top risks¶

Reference build mismatch → silent wrong sequence — guarded in reference.py.
pysam wheel fragility — CI matrix Linux+macOS; document WSL/conda.
Indel coordinate edge cases — insertion flagging shipped; length-changing consensus is deferred to M-indel with its own test matrix.
Olivar input-schema drift — pin a version and assert columns in the emitter (M4.5-emit); the dependency itself stays out-of-process (GPL isolation).