Small Variants¶
vflank small extracts flanking sequence around SNPs/indels from a MAF or
VCF/BCF and writes a FASTA suitable for ddPCR assay design.
Input¶
A tab-separated MAF (TCGA/MSK style) or a VCF/BCF. The two are
auto-detected by file extension, so vflank small run takes either.
MAF. Required columns: Chromosome, Start_Position, End_Position,
Reference_Allele, Tumor_Seq_Allele2. Optional metadata used in headers:
Hugo_Symbol, HGVSp_Short, HGVSc. Column names can be remapped with
--chrom-col, --start-col, … if your file differs.
VCF. Accepted extensions: .vcf, .vcf.gz, or .bcf (auto-detected). Read sites-only: sample
genotypes are ignored. Each record's anchor-base REF/ALT is normalised to
the MAF [Start, End] convention; multi-allelic records expand to one
record per ALT; symbolic / SV / BND alleles (<DEL>, A[2:123[, *)
are skipped (this is the small-variant path — SV-VCF is a separate feature).
Gene and HGVS are pulled best-effort from a VEP CSQ or SnpEff ANN INFO
field when present, and left blank otherwise. The column-remap flags don't
apply to VCF. No index is needed (the file is read sequentially).
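The anchor-base normalisation described above can be sketched in Python. This is a minimal illustration, not vflank's actual code: the function names are hypothetical, and the coordinate rules follow the common MAF convention (an insertion spans the two bases flanking it and has "-" as its reference allele; a deletion spans the deleted bases).

```python
def is_symbolic(alt):
    """True for alleles the small-variant path skips: <DEL>, breakends, '*'."""
    return alt == "*" or alt.startswith("<") or "[" in alt or "]" in alt


def vcf_to_maf(pos, ref, alt):
    """Convert one VCF (POS, REF, ALT) to MAF-style (start, end, ref, alt).

    Returns None for symbolic alleles. Assumption: only the shared leading
    (anchor) bases are trimmed.
    """
    if is_symbolic(alt):
        return None
    start = pos
    # Drop the anchor base(s) shared by REF and ALT.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    if not ref and not alt:
        return None  # REF == ALT, nothing to report
    if not ref:
        # Insertion: MAF start/end are the bases on either side of it.
        return (start - 1, start, "-", alt)
    if not alt:
        # Deletion: MAF span covers exactly the deleted bases.
        return (start, start + len(ref) - 1, ref, "-")
    return (start, start + len(ref) - 1, ref, alt)


def expand_record(pos, ref, alts):
    """Expand a multi-allelic record (comma-separated ALT) to one row per ALT."""
    rows = []
    for alt in alts.split(","):
        row = vcf_to_maf(pos, ref, alt)
        if row is not None:
            rows.append(row)
    return rows
```

For instance, a record at position 100 with REF ACG and ALT A,ACGT,<DEL> would yield a deletion row and an insertion row, with the symbolic allele dropped.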
Chromosome notation (chr7 vs 7) is auto-detected and normalised, including
numeric (23→X) and float (17.0→17, common when a column has blanks) forms.
vflank small inspect variants.maf # preview columns + flag missing fields
vflank small inspect variants.vcf.gz # preview the normalised small variants
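The chromosome normalisation described above (chr7 vs 7, 23→X, 17.0→17) could look roughly like the following Python sketch. The function name and the use_chr_prefix parameter are hypothetical; vflank auto-detects the target notation, whereas here the caller picks it.

```python
def normalise_chrom(value, use_chr_prefix=False):
    """Normalise a chromosome label to one notation.

    Handles only the forms named in the text: a 'chr' prefix, the numeric
    alias 23 for X, and float forms such as 17.0.
    """
    text = str(value).strip()
    if text.lower().startswith("chr"):
        text = text[3:]
    # Float form, typical when a numeric column contains blanks.
    if text.endswith(".0") and text[:-2].isdigit():
        text = text[:-2]
    if text == "23":
        text = "X"
    return "chr" + text if use_chr_prefix else text
```

With this sketch, normalise_chrom(17.0) and normalise_chrom("chr17") both give "17", and normalise_chrom(23, use_chr_prefix=True) gives "chrX".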
Run¶
vflank small run variants.maf \
  --ref-genome GRCh37.fasta \
  --genome-build hg19 \
  --pop-vcf-dir gnomad_v2.1.1/ \
  --flank 200 \
  --output flanking_sequences.fasta \
  --report run_report.tsv
- --genome-build guards against an hg19-vs-hg38 mix-up: vflank checks the FASTA's chr1 length against this build and warns on mismatch.
- --pop-vcf-dir enables optional SNP masking from local gnomAD VCFs. Swap for --pop-source api to use the gnomAD GraphQL API instead (no download).
- --flank sets the number of bases taken from each side of the variant (so each record is up to 2 × flank bp, shorter at a contig end).
- --report writes an optional machine-readable run summary: per-variant masked/corrected counts, skips grouped by reason, and the full parameter set.
The same command takes a VCF: only the input path changes, for example
variants.vcf.gz in place of variants.maf (sites-only; --samples doesn't
apply and is ignored with a warning).
No local FASTA? Use --ref-source api
--ref-source api fetches each flank window from the UCSC API instead of a
local FASTA, so --ref-genome becomes optional — handy for one-off or hosted
runs with no reference on disk. It is throttled (~1 request/second), so it
suits small inputs; use a local FASTA for bulk. Note the chr1-length build
guard applies only to a local FASTA; with the API the requested
--genome-build is trusted (a wrong build surfaces as a UCSC error, not
silent wrong sequence).
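As a rough illustration of what fetching a flank window over the network involves, the Python sketch below builds a request against the UCSC REST API's getData/sequence endpoint and spaces calls about one second apart. It is an assumption that vflank uses this exact endpoint; the helper names are hypothetical. The UCSC API takes a 0-based start and an exclusive end, so a 1-based inclusive window is shifted by one at the start, and it expects chr-prefixed chromosome names.

```python
import json
import time
import urllib.request

UCSC_SEQUENCE = "https://api.genome.ucsc.edu/getData/sequence"


def ucsc_sequence_url(build, chrom, start, end):
    """URL for a 1-based inclusive [start, end] window on e.g. hg19/chr7."""
    return f"{UCSC_SEQUENCE}?genome={build};chrom={chrom};start={start - 1};end={end}"


class Throttle:
    """Enforce a minimum interval between calls (about 1 request/second)."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = float("-inf")

    def wait(self):
        delay = self._last + self.min_interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()


def fetch_window(build, chrom, start, end, throttle):
    """Fetch one window; the JSON response carries the bases under 'dna'."""
    throttle.wait()
    url = ucsc_sequence_url(build, chrom, start, end)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["dna"].upper()
```

Because every variant costs one throttled request, a few hundred variants already take minutes, which is why a local FASTA is the better fit for bulk runs.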
Output¶
Two records per unique variant — raw and Masked__ — with the variant shown
literally as [REF/ALT] between the flanks. The header is keyed on the variant
identity, not the sample.
Reading a record¶
A masked flank is just the reference sequence with the variant shown literally
and common SNPs swapped for N. Highlighting only the parts vflank touches (the
rest is untouched, designable reference):
AGCGATCGATCGTACGT[T/C]ACGTGCANTCGATCGTAGC
…where [T/C] is the variant of interest and N marks a masked common SNP (gnomAD AF ≥ threshold) that a primer/probe must avoid.
Before → after masking: the Masked__ record differs from the raw record only
where a common SNP is replaced by N (one such position in the example above).
Anatomy of a record, left to right:
- left flank (5′): --flank bases of reference ending just before the variant.
- [REF/ALT]: the variant itself, written literally; excluded from both flanks.
- right flank (3′): --flank bases starting just after the variant, masked N wherever a common SNP would sit under a primer/probe.
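The record anatomy above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the flank sequences and allele frequencies in the usage note are made up for illustration, and the 0.01 threshold is an example value rather than vflank's default.

```python
def common_snp_offsets(snps, threshold):
    """Offsets (0-based, within a flank) whose allele frequency >= threshold.

    `snps` is an iterable of (offset, allele_frequency) pairs.
    """
    return {offset for offset, af in snps if af >= threshold}


def mask_flank(flank_seq, offsets):
    """Replace the bases at the given offsets with N."""
    bases = list(flank_seq)
    for i in offsets:
        if 0 <= i < len(bases):
            bases[i] = "N"
    return "".join(bases)


def render_record(left, ref, alt, right):
    """Write the variant literally as [REF/ALT] between the two flanks."""
    return f"{left}[{ref}/{alt}]{right}"
```

With a made-up right flank ACGTGCATTC and two population SNPs at offsets 7 (AF 0.12) and 2 (AF 0.0004), only offset 7 passes a 0.01 threshold, so the raw record AGCGATCG[T/C]ACGTGCATTC becomes AGCGATCG[T/C]ACGTGCANTC in its masked form.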
Deduplication¶
The same variant seen across multiple samples collapses to one record
(flank + mask are sample-independent for reference/population masking). Use
--no-dedup to emit one record per row instead. The run summary reports how
many duplicates were collapsed.
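Deduplication on variant identity can be sketched as follows. The function name is hypothetical, and keying on the five required MAF columns is an assumption about what "variant identity" means here.

```python
KEY_COLUMNS = (
    "Chromosome",
    "Start_Position",
    "End_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def dedupe_variants(rows):
    """Collapse rows sharing a variant key; keep the first occurrence.

    Returns (unique_rows, number_of_duplicates_collapsed).
    """
    seen = {}
    collapsed = 0
    for row in rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        if key in seen:
            collapsed += 1
        else:
            seen[key] = row
    return list(seen.values()), collapsed
```

The same variant reported for two samples therefore produces one record and a collapsed count of 1, while skipping this step corresponds to --no-dedup.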
Emit for Primer3¶
Add --emit-primer3 primers.txt to also write a Primer3
Boulder-IO input file, with one record per variant, ready to hand to a designer.
Each record carries:
- SEQUENCE_TEMPLATE: the best-known sequence (the masked/consensus call, falling back to the reference base where the call is N).
- SEQUENCE_TARGET: the variant span, so the assay covers it.
- SEQUENCE_EXCLUDED_REGION: the masked positions (common SNPs, patient het/low-cov/insertion sites). This is a hard "no oligo here" constraint, stronger than a degenerate N, which Primer3 may still design over.
SEQUENCE_ID matches the FASTA header key, so the two outputs cross-reference.
The same flag works on vflank fusion run (one record per junction, targeted to
span the breakpoint). Olivar emit is planned.
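For orientation, a Boulder-IO record with these three tags can be assembled as in the Python sketch below. It is illustrative only: the helper names are hypothetical, positions are 0-based template indices (Primer3's default PRIMER_FIRST_BASE_INDEX), and it assumes a non-empty reference allele, so the insertion case is left out. Each Boulder-IO record ends with a line containing a single "=".

```python
def to_intervals(positions):
    """Group sorted positions into (start, length) runs of adjacent bases."""
    runs = []
    for p in sorted(set(positions)):
        if runs and p == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return [(start, length) for start, length in runs]


def boulder_record(seq_id, left, ref, right, masked_positions):
    """Build one Boulder-IO record for an SNV or deletion.

    `left` and `right` are unmasked flanks; `masked_positions` are 0-based
    indices into the template that no oligo may overlap.
    """
    if not ref or ref == "-":
        raise ValueError("sketch handles non-empty REF alleles only")
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={left}{ref}{right}",
        f"SEQUENCE_TARGET={len(left)},{len(ref)}",
    ]
    intervals = to_intervals(masked_positions)
    if intervals:
        regions = " ".join(f"{start},{length}" for start, length in intervals)
        lines.append(f"SEQUENCE_EXCLUDED_REGION={regions}")
    lines.append("=")
    return "\n".join(lines) + "\n"
```

Passing the masked sites as excluded regions, rather than writing N into the template, is what turns them into a hard constraint for the designer.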
Safety nets¶
- Genome-build guard: if the FASTA's chr1 length disagrees with --genome-build, vflank warns (catches hg19-vs-hg38 mix-ups).
- Flank truncation: flanks that run off a contig end are emitted but reported, never silently shortened.
- Skip summary: invalid/incomplete rows (e.g. a missing Chromosome) are skipped and grouped by reason, with examples and a full list in --report.
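The genome-build guard in the list above amounts to comparing one number. A minimal Python sketch, assuming the chr1 length is read from a samtools-style .fai index next to the FASTA (vflank may obtain it differently) and using hypothetical function names; the two lengths are the published chr1 sizes for GRCh37/hg19 and GRCh38/hg38.

```python
CHR1_LENGTHS = {
    "hg19": 249_250_621,  # GRCh37
    "hg38": 248_956_422,  # GRCh38
}


def chr1_length_from_fai(fai_path):
    """Read chr1's length from a .fai index (tab-separated: name, length, ...)."""
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0] in ("chr1", "1"):
                return int(fields[1])
    return None


def build_warning(chr1_length, build):
    """Return a warning string on mismatch, or None when nothing can be said."""
    expected = CHR1_LENGTHS.get(build)
    if expected is None or chr1_length is None:
        return None
    if chr1_length != expected:
        return (
            f"chr1 is {chr1_length} bp but {build} expects {expected} bp; "
            "check for an hg19-vs-hg38 mix-up"
        )
    return None
```

Returning a message instead of raising mirrors the documented behaviour, where a mismatch produces a warning rather than stopping the run.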
Sample filtering¶
--samples "P-001,P-002" # comma-separated barcodes
--samples-file ids.txt # one ID per line (# comments allowed)
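The two ways of naming samples reduce to the same list of IDs. A small Python sketch, with hypothetical function names; it assumes that comments are whole lines starting with # and that rows carry the standard MAF Tumor_Sample_Barcode column.

```python
def parse_samples_arg(value):
    """Split a comma-separated --samples value into clean IDs."""
    return [item.strip() for item in value.split(",") if item.strip()]


def read_samples_file(path):
    """Read one ID per line, skipping blank lines and '#' comment lines."""
    ids = []
    with open(path) as handle:
        for line in handle:
            text = line.strip()
            if not text or text.startswith("#"):
                continue
            ids.append(text)
    return ids


def filter_rows(rows, sample_ids, column="Tumor_Sample_Barcode"):
    """Keep only rows whose sample barcode is in the requested set."""
    wanted = set(sample_ids)
    return [row for row in rows if row.get(column) in wanted]
```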
See SNP Masking for --pop-source / --pop-data.