API Reference

vflank can be used as a library. The pieces compose directly; see the Developer Guide for runnable examples. This reference is generated from the source docstrings.

Core

`vflank.core.variant`

The `Variant` value object and per-variant validation.

Decoupling the pipeline from pandas rows (the original script iterated df.iterrows() directly) makes the core logic testable with plain objects.

Variant `dataclass`

A single small variant in canonical internal form.

Coordinates are 1-based, fully-closed `[start, end]` (MAF convention). `chrom` is the bare chromosome; `raw_chrom` preserves the original display notation for headers/messages only.

Source code in src/vflank/core/variant.py

@dataclass(slots=True)
class Variant:
    """A single small variant in canonical internal form.

    Coordinates are 1-based, fully-closed ``[start, end]`` (MAF convention).
    ``chrom`` is the *bare* chromosome; ``raw_chrom`` preserves the original
    display notation for headers/messages only.
    """

    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    gene: str = "UNKNOWN"
    protein: str = ""
    cdna: str = ""
    sample: str = "SAMPLE"
    raw_chrom: str = ""

    def __post_init__(self) -> None:
        if not self.raw_chrom:
            self.raw_chrom = self.chrom

validate_allele

validate_allele(allele)

True if an allele is empty, a '-'/'.' placeholder, or a string made up only of valid bases.

Source code in src/vflank/core/variant.py

def validate_allele(allele: str) -> bool:
    """True if an allele is empty, a '-'/'.' placeholder, or a string of valid bases."""
    if not allele or allele in (".", "-"):
        return True
    return all(b in VALID_BASES for b in allele)

validate_coordinates

validate_coordinates(start, end)

Return an error message if coordinates are invalid, else None.

Source code in src/vflank/core/variant.py

def validate_coordinates(start: int, end: int) -> str | None:
    """Return an error message if coordinates are invalid, else None."""
    if start < 1:
        return f"start={start} is < 1"
    if end < start:
        return f"end={end} < start={start}"
    return None
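
The two validators can be combined into a single per-variant check. The following is a minimal sketch, not vflank's own pipeline code: `check_variant` is a hypothetical helper, and `VALID_BASES` is not shown in the listing above, so the set used here is an assumption.

```python
from __future__ import annotations

# Assumption: VALID_BASES is defined elsewhere in vflank; this set is a stand-in.
VALID_BASES = frozenset("ACGTN")


def validate_allele(allele: str) -> bool:
    """Copy of the function listed above."""
    if not allele or allele in (".", "-"):
        return True
    return all(b in VALID_BASES for b in allele)


def validate_coordinates(start: int, end: int) -> str | None:
    """Copy of the function listed above."""
    if start < 1:
        return f"start={start} is < 1"
    if end < start:
        return f"end={end} < start={start}"
    return None


def check_variant(start: int, end: int, ref: str, alt: str) -> list[str]:
    """Hypothetical helper: collect every problem found for one variant."""
    problems = []
    coord_error = validate_coordinates(start, end)
    if coord_error is not None:
        problems.append(coord_error)
    for label, allele in (("ref", ref), ("alt", alt)):
        if not validate_allele(allele):
            problems.append(f"{label}={allele!r} contains non-base characters")
    return problems


print(check_variant(10, 12, "ACG", "-"))  # clean deletion: no problems
print(check_variant(10, 5, "A", "Z"))     # bad interval and bad allele
```

Note the asymmetry between the two return conventions: `validate_allele` returns a boolean, while `validate_coordinates` returns a message on failure and None on success.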

`vflank.core.chrom`

Chromosome notation detection and normalisation.

All functions here are pure (no I/O), so they are easy to unit-test. The canonical internal form is the bare chromosome string (e.g. "7", "X", "MT"); callers convert to chr-prefixed form for a given FASTA/VCF at the last moment via `chrom_for_contigs`.

normalise_chrom

normalise_chrom(raw_chrom)

Convert a raw chromosome value to a canonical bare chrom string.

Handles NaN/None/empty, any case of a `chr` prefix, bare integers, numeric sex/mito encodings (23->X, 24->Y, 25/26->MT), and mito aliases (M->MT).

Returns `(bare_chrom, error)`. If `error` is not None, the value could not be normalised and `bare_chrom` is None (the variant should be skipped).

Source code in src/vflank/core/chrom.py

def normalise_chrom(raw_chrom: object) -> tuple[str | None, str | None]:
    """Convert a raw chromosome value to a canonical bare chrom string.

    Handles NaN/None/empty, any case of a ``chr`` prefix, bare integers, numeric
    sex/mito encodings (23->X, 24->Y, 25/26->MT), and mito aliases (M->MT).

    Returns ``(bare_chrom, error)``. If ``error`` is not None the value could not
    be normalised and ``bare_chrom`` is None (the variant should be skipped).
    """
    if raw_chrom is None:
        return None, "chromosome is None"
    if isinstance(raw_chrom, float) and math.isnan(raw_chrom):
        return None, "chromosome is NaN (missing value)"

    # Pandas types any chromosome column containing a NaN as float, turning
    # "17" into 17.0 (and numpy.float64 subclasses float). Recover the integer
    # form so numeric chromosomes normalise instead of being rejected as "17.0".
    if isinstance(raw_chrom, float):
        raw_chrom = int(raw_chrom)

    chrom = str(raw_chrom).strip()
    if not chrom or chrom.lower() in ("nan", "none", ".", ""):
        return None, f"chromosome is empty or missing (got {raw_chrom!r})"

    # Same artifact surviving as a string, e.g. "17.0" read from a text cell.
    if chrom.endswith(".0") and chrom[:-2].isdigit():
        chrom = chrom[:-2]

    upper = chrom.upper()

    # Mitochondrial aliases first (before chr-stripping, to catch 'chrM').
    if upper in _MITO_ALIASES:
        return _MITO_ALIASES[upper], None

    # Case-insensitive chr-prefix stripping; preserve case of the remainder.
    bare = chrom[3:] if upper.startswith("CHR") else chrom

    # Numeric alternative encodings (23->X, 24->Y, 25/26->MT).
    bare = _NUMERIC_CHROM_MAP.get(bare, bare)

    # Uppercase for X/Y/MT consistency.
    if bare.upper() in ("X", "Y", "MT"):
        bare = bare.upper()

    if bare not in VALID_CHROMS:
        return None, (
            f"unrecognised chromosome value {raw_chrom!r} "
            f"(normalised to {bare!r}, not in {sorted(VALID_CHROMS)})"
        )
    return bare, None
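
A caller typically unpacks the `(bare_chrom, error)` pair and skips the row when `error` is set. The sketch below shows that pattern with a condensed copy of the function. The three lookup tables are not shown in the listing above, so their contents here are assumptions inferred from the docstring, and the error message for unrecognised values is shortened.

```python
from __future__ import annotations

import math

# Assumptions: stand-ins for tables defined elsewhere in vflank.core.chrom.
VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}
_MITO_ALIASES = {"M": "MT", "MT": "MT", "CHRM": "MT", "CHRMT": "MT"}
_NUMERIC_CHROM_MAP = {"23": "X", "24": "Y", "25": "MT", "26": "MT"}


def normalise_chrom(raw_chrom: object) -> tuple[str | None, str | None]:
    """Condensed copy of the function listed above (comments trimmed)."""
    if raw_chrom is None:
        return None, "chromosome is None"
    if isinstance(raw_chrom, float) and math.isnan(raw_chrom):
        return None, "chromosome is NaN (missing value)"
    if isinstance(raw_chrom, float):
        raw_chrom = int(raw_chrom)
    chrom = str(raw_chrom).strip()
    if not chrom or chrom.lower() in ("nan", "none", ".", ""):
        return None, f"chromosome is empty or missing (got {raw_chrom!r})"
    if chrom.endswith(".0") and chrom[:-2].isdigit():
        chrom = chrom[:-2]
    upper = chrom.upper()
    if upper in _MITO_ALIASES:
        return _MITO_ALIASES[upper], None
    bare = chrom[3:] if upper.startswith("CHR") else chrom
    bare = _NUMERIC_CHROM_MAP.get(bare, bare)
    if bare.upper() in ("X", "Y", "MT"):
        bare = bare.upper()
    if bare not in VALID_CHROMS:
        return None, f"unrecognised chromosome value {raw_chrom!r}"
    return bare, None


# Caller pattern: keep values that normalise, record and skip the rest.
raw_values = ["chr7", "CHRx", 17.0, "23", "chrM", float("nan"), "chrUn_KI270742v1"]
kept, skipped = [], []
for raw in raw_values:
    bare, error = normalise_chrom(raw)
    if error is not None:
        skipped.append((raw, error))
        continue
    kept.append(bare)

print(kept)
print([message for _, message in skipped])
```

Returning the error as a value rather than raising lets a batch run continue past bad rows and report them together at the end.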

contigs_have_chr

contigs_have_chr(contigs)

Return True if the contigs in an iterable use chr-prefixed names.

Probes chr1/1 first, then falls back through chr2/2 to chr5/5. Defaults to True (assume prefixed) when none of those contigs is present.

Source code in src/vflank/core/chrom.py

def contigs_have_chr(contigs) -> bool:
    """Return True if the contigs in an iterable use ``chr``-prefixed names.

    Probes ``chr1``/``1`` first, then falls back through ``chr2``/``2`` to
    ``chr5``/``5``. Defaults to True (assume prefixed) when none of those
    contigs is present.
    """
    refs = set(contigs)
    if "chr1" in refs:
        return True
    if "1" in refs:
        return False
    for i in range(2, 6):
        if f"chr{i}" in refs:
            return True
        if str(i) in refs:
            return False
    return True

chrom_for_contigs

chrom_for_contigs(bare, has_chr)

Convert a bare chromosome to the notation used by a FASTA/VCF.

Source code in src/vflank/core/chrom.py

def chrom_for_contigs(bare: str, has_chr: bool) -> str:
    """Convert a bare chromosome to the notation used by a FASTA/VCF."""
    return f"chr{bare}" if has_chr else bare
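
The two helpers above are meant to be used together: detect the notation of a file once, then convert each bare chromosome just before querying. A short sketch follows, with invented contig lists standing in for whatever the FASTA or VCF reports (for instance, the `references` attribute of a pysam `FastaFile`).

```python
def contigs_have_chr(contigs) -> bool:
    """Copy of the function listed above."""
    refs = set(contigs)
    if "chr1" in refs:
        return True
    if "1" in refs:
        return False
    for i in range(2, 6):
        if f"chr{i}" in refs:
            return True
        if str(i) in refs:
            return False
    return True


def chrom_for_contigs(bare: str, has_chr: bool) -> str:
    """Copy of the function listed above."""
    return f"chr{bare}" if has_chr else bare


# Invented contig lists for the two common naming styles.
prefixed_contigs = ["chr1", "chr2", "chrX", "chrM"]
bare_contigs = ["1", "2", "X", "MT"]

for contigs in (prefixed_contigs, bare_contigs):
    has_chr = contigs_have_chr(contigs)
    print(has_chr, chrom_for_contigs("7", has_chr))

# None of the probed names (1-5, chr1-chr5) is present here, so the default
# of True applies even though these contigs are bare.
print(contigs_have_chr(["X", "Y", "MT"]))
```

The last call shows the limit of the probe: a file containing only sex or mitochondrial contigs falls through to the prefixed default.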

detect_series_chr_style

detect_series_chr_style(values)

Inspect an iterable of raw chromosome values for reporting.

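Returns True (chr-prefixed) or False (bare) according to the first recognisable value, or None if no value is recognisable (unknown or empty input). Mixed notation is not detected: scanning stops at the first recognisable value.

Source code in src/vflank/core/chrom.py

def detect_series_chr_style(values) -> bool | None:
    """Inspect an iterable of raw chromosome values for reporting.

    Returns True (chr-prefixed) or False (bare) according to the first
    recognisable value, or None if no value is recognisable. Mixed notation
    is not detected: scanning stops at the first recognisable value.
    """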
    for val in values:
        if val is None:
            continue
        if isinstance(val, float) and math.isnan(val):
            continue
        v = str(val).strip()
        if not v:
            continue
        if v.lower().startswith("chr"):
            return True
        if v in VALID_CHROMS or v in _NUMERIC_CHROM_MAP:
            return False
    return None

`vflank.core.flanks`

Flank extraction and masking.

A `FlankSource` is the strategy seam for where each flank base comes from. The reference-backed source implemented here covers modes A (reference) and B (reference + population mask). Modes C and D (BAM consensus) will add a ConsensusFlankSource implementing the same protocol.

FlankResult `dataclass`

Left/right flanks, raw and masked, for one variant.

Source code in src/vflank/core/flanks.py

@dataclass(slots=True)
class FlankResult:
    """Left/right flanks, raw and masked, for one variant."""

    left: str
    right: str
    masked_left: str
    masked_right: str
    covered: int | None = None   # flank positions at/above min_depth (BAM consensus)
    total: int | None = None     # total flank positions (BAM consensus)
    inserted: int | None = None  # flank positions masked for a patient insertion

    @property
    def n_masked(self) -> int:
        return self.masked_left.count("N") + self.masked_right.count("N")

    @property
    def n_corrected(self) -> int:
        """Flank positions where the masked seq is a real base differing from raw."""
        return sum(
            m != r and m != "N"
            for raw, msk in ((self.left, self.masked_left), (self.right, self.masked_right))
            for r, m in zip(raw, msk, strict=False)
        )

    def upper(self) -> FlankResult:
        """Return an uppercased copy (presentation convenience for callers)."""
        return FlankResult(
            self.left.upper(),
            self.right.upper(),
            self.masked_left.upper(),
            self.masked_right.upper(),
            self.covered,
            self.total,
            self.inserted,
        )

n_corrected `property`

n_corrected

Flank positions where the masked seq is a real base differing from raw.

upper

upper()

Return an uppercased copy (presentation convenience for callers).

Source code in src/vflank/core/flanks.py

def upper(self) -> FlankResult:
    """Return an uppercased copy (presentation convenience for callers)."""
    return FlankResult(
        self.left.upper(),
        self.right.upper(),
        self.masked_left.upper(),
        self.masked_right.upper(),
        self.covered,
        self.total,
        self.inserted,
    )

FlankSource

Bases: Protocol

Strategy for producing flanks around a variant.

Source code in src/vflank/core/flanks.py

class FlankSource(Protocol):
    """Strategy for producing flanks around a variant."""

    def fetch(self, variant: Variant) -> FlankResult: ...
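
Because `FlankSource` is a `Protocol`, any object with a matching `fetch` method satisfies it without inheriting from anything. The sketch below illustrates this with a hypothetical in-memory source, `DictFlankSource`, which is not part of vflank. `Variant` and `FlankResult` are reduced to minimal stand-ins so the snippet runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Variant:  # minimal stand-in: only the fields this sketch reads
    chrom: str
    start: int
    end: int


@dataclass
class FlankResult:  # minimal stand-in
    left: str
    right: str
    masked_left: str
    masked_right: str


class FlankSource(Protocol):
    def fetch(self, variant: Variant) -> FlankResult: ...


class DictFlankSource:
    """Hypothetical source backed by {chrom: sequence}, useful in tests."""

    def __init__(self, sequences: dict[str, str], flank: int = 3):
        self.sequences = sequences
        self.flank = flank

    def fetch(self, variant: Variant) -> FlankResult:
        seq = self.sequences[variant.chrom]
        # 1-based closed [start, end] -> 0-based string slices.
        left = seq[max(0, variant.start - 1 - self.flank) : variant.start - 1]
        right = seq[variant.end : variant.end + self.flank]
        return FlankResult(left, right, left, right)


def describe(source: FlankSource, variant: Variant) -> str:
    """Accepts any object with a matching fetch(); no inheritance needed."""
    flanks = source.fetch(variant)
    return f"{flanks.left}[{variant.start}-{variant.end}]{flanks.right}"


source = DictFlankSource({"7": "AACCGGTTAACC"})
print(describe(source, Variant("7", 5, 6)))  # ACC[5-6]TTA
```

A stand-in like this keeps tests of downstream code independent of a FASTA file on disk.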

ReferenceFlankSource

Flanks pulled from the reference FASTA, optionally masking common SNPs.

MAF coordinates are 1-based fully-closed `[start, end]`; pysam uses 0-based half-open `[start, end)`. The left flank is the `flank` bases ending just before the variant; the right flank is the `flank` bases starting just after it. The variant interval itself is excluded from both flanks.

Source code in src/vflank/core/flanks.py

class ReferenceFlankSource:
    """Flanks pulled from the reference FASTA, optionally masking common SNPs.

    MAF coordinates are 1-based fully-closed ``[start, end]``; pysam uses 0-based
    half-open ``[start, end)``. The left flank is the ``flank`` bases ending just
    before the variant; the right flank is the ``flank`` bases starting just
    after it. The variant interval itself is excluded from both flanks.
    """

    def __init__(self, reference, gnomad=None, *, flank: int = 200, af_threshold: float = 0.001):
        self.reference = reference
        self.gnomad = gnomad
        self.flank = flank
        self.af_threshold = af_threshold

    def fetch(self, variant: Variant) -> FlankResult:
        left_start_0 = max(0, variant.start - self.flank - 1)
        left_end_0 = variant.start - 1
        right_start_0 = variant.end
        right_end_0 = variant.end + self.flank

        left = self.reference.fetch(variant.chrom, left_start_0, left_end_0)
        right = self.reference.fetch(variant.chrom, right_start_0, right_end_0)

        if self.gnomad is None:
            return FlankResult(left, right, left, right)

        left_snps = self.gnomad.get_positions(
            variant.chrom, left_start_0, left_end_0, self.af_threshold
        )
        right_snps = self.gnomad.get_positions(
            variant.chrom, right_start_0, right_end_0, self.af_threshold
        )
        return FlankResult(
            left,
            right,
            mask_sequence(left, left_start_0, left_snps),
            mask_sequence(right, right_start_0, right_snps),
        )
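
The coordinate conversion in `fetch` is easy to get off by one, so a worked example may help. `flank_windows` below is a hypothetical helper that isolates the same arithmetic; it is not a vflank function.

```python
def flank_windows(start: int, end: int, flank: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Hypothetical helper mirroring the arithmetic in ReferenceFlankSource.fetch.

    Takes a 1-based closed [start, end] and returns the left and right
    windows as 0-based half-open (start, end) pairs.
    """
    left = (max(0, start - flank - 1), start - 1)
    right = (end, end + flank)
    return left, right


# SNV at 1-based position 1000 with 200-base flanks.
left, right = flank_windows(1000, 1000, 200)
# left  = (799, 999)   -> 1-based positions 800..999   (200 bases)
# right = (1000, 1200) -> 1-based positions 1001..1200 (200 bases)
print(left, right)

# Near the contig start the left window is clamped at 0 and comes back short.
short_left, _ = flank_windows(50, 52, 200)
print(short_left, short_left[1] - short_left[0])  # (0, 49) 49
```

Position 1000 itself (0-based index 999) falls in neither window, which matches the statement that the variant interval is excluded. The listing does not clamp the right window; how the reference object behaves past the end of a contig is not shown here.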

mask_sequence

mask_sequence(seq, region_start_0based, positions_1based)

Replace bases at the given 1-based genomic positions with 'N'.

`seq[0]` corresponds to genomic position `region_start_0based + 1`.

Source code in src/vflank/core/flanks.py

def mask_sequence(seq: str, region_start_0based: int, positions_1based: list[int]) -> str:
    """Replace bases at the given 1-based genomic positions with 'N'.

    ``seq[0]`` corresponds to genomic position ``region_start_0based + 1``.
    """
    if not positions_1based:
        return seq
    chars = list(seq)
    for pos in positions_1based:
        idx = pos - region_start_0based - 1
        if 0 <= idx < len(chars):
            chars[idx] = "N"
    return "".join(chars)
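
A quick worked example of the index mapping, using a copy of the function and an invented eight-base window:

```python
def mask_sequence(seq: str, region_start_0based: int, positions_1based: list[int]) -> str:
    """Copy of the function listed above."""
    if not positions_1based:
        return seq
    chars = list(seq)
    for pos in positions_1based:
        idx = pos - region_start_0based - 1
        if 0 <= idx < len(chars):
            chars[idx] = "N"
    return "".join(chars)


# A window fetched as 0-based half-open [100, 108) covers 1-based 101..108.
window = "ACGTACGT"
# 101 -> index 0, 104 -> index 3, 250 is outside the window and is ignored.
masked = mask_sequence(window, 100, [101, 104, 250])
print(masked)  # NCGNACGT
```

Out-of-window positions are skipped silently, so the caller can pass the positions for a wider query region without filtering them first.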

`vflank.core.popfreq`

Population allele-frequency masking source (gnomAD), local-VCF backend.

Masking can draw on gnomAD genome and/or exome data (`--pop-data`): flanks often fall in non-coding regions where only genomes have data, while exomes add power in coding regions. The `both` setting masks the union: a position is masked if it is a common SNP in either cohort.
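
The union rule can be sketched with plain sets. `positions_to_mask` is a hypothetical helper written for this illustration, and the positions are invented; only the three option values (genome, exome, both) come from this page.

```python
def positions_to_mask(genome: set[int], exome: set[int], pop_data: str) -> set[int]:
    """Hypothetical helper: pick the position set implied by a pop-data value."""
    if pop_data == "genome":
        return set(genome)
    if pop_data == "exome":
        return set(exome)
    if pop_data == "both":
        # Union: common in either cohort is enough to mask the position.
        return genome | exome
    raise ValueError(f"unknown pop_data value: {pop_data!r}")


genome_hits = {1005, 1042, 1150}  # invented common-SNP positions
exome_hits = {1042, 1077}

print(sorted(positions_to_mask(genome_hits, exome_hits, "both")))
# [1005, 1042, 1077, 1150]
```

A position seen in both cohorts (1042 here) is masked once; the union never masks fewer positions than either cohort alone.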

The hot line-parsing kernel (`parse_common_snp_positions`) is a pure function over an iterable of raw VCF lines, so it is unit-testable without pysam and is the natural seam to later swap for a Rust/noodles implementation.
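
The real signature and rules of `parse_common_snp_positions` are not shown on this page, so the following is only an illustrative stand-in for what a pure kernel over raw VCF lines can look like. It assumes one ALT allele per record, an `AF=` key in the INFO column, and an inclusive threshold; all three are assumptions.

```python
from __future__ import annotations

from collections.abc import Iterable


def sketch_common_snp_positions(lines: Iterable[str], af_threshold: float) -> list[int]:
    """Illustrative stand-in, NOT vflank's parse_common_snp_positions."""
    positions = []
    for line in lines:
        if not line or line.startswith("#"):
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record
        pos, ref, alt, info = fields[1], fields[3], fields[4], fields[7]
        if len(ref) != 1 or alt not in {"A", "C", "G", "T"}:
            continue  # keep single-base substitutions only
        for entry in info.split(";"):
            if entry.startswith("AF="):
                try:
                    af = float(entry[3:])
                except ValueError:
                    break  # unparsable frequency: skip the record
                if af >= af_threshold:
                    positions.append(int(pos))
                break
    return positions


# Invented records: a common SNP, a rare SNP, and a common deletion.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr7\t1005\t.\tA\tG\t.\tPASS\tAC=12;AF=0.02",
    "chr7\t1042\t.\tC\tT\t.\tPASS\tAC=1;AF=0.00001",
    "chr7\t1077\t.\tAT\tA\t.\tPASS\tAC=50;AF=0.3",
]
print(sketch_common_snp_positions(vcf_lines, 0.001))  # [1005]
```

Since the input is just an iterable of strings, a test can pass a list literal like the one above, with no tabix index or VCF file involved.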

GnomadStore

Lazy, per-(kind, chromosome) tabix cache over a directory of gnomAD VCFs.

Honours `--pop-data` (genome / exome / both). For `both`, queried positions are the union across the two cohorts. Tabix handles are opened on first use; each file's chr-notation is detected on open so queries use the file's own contig names.
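
The lazy per-(kind, chromosome) caching described above can be sketched as follows. This is not the `GnomadStore` implementation: `LazyHandleCache` and `fake_opener` are hypothetical names, and the opener is injected so the example runs without any VCF files. A real opener would return a tabix handle instead of a string.

```python
from __future__ import annotations

from collections.abc import Callable


class LazyHandleCache:
    """Hypothetical sketch of a lazy per-(kind, chromosome) handle cache."""

    def __init__(self, opener: Callable[[str, str], object]):
        self._opener = opener  # called at most once per (kind, chrom)
        self._handles: dict[tuple[str, str], object] = {}

    def get(self, kind: str, chrom: str) -> object:
        key = (kind, chrom)
        if key not in self._handles:  # open on first use only
            self._handles[key] = self._opener(kind, chrom)
        return self._handles[key]


opened = []


def fake_opener(kind: str, chrom: str) -> str:
    """Stand-in for opening a file; records each call so reuse is visible."""
    opened.append((kind, chrom))
    return f"<handle {kind}:{chrom}>"


cache = LazyHandleCache(fake_opener)
first = cache.get("genome", "7")
second = cache.get("genome", "7")  # served from the cache, no second open
cache.get("exome", "7")

print(opened)  # [('genome', '7'), ('exome', '7')]
print(first is second)  # True
```

Opening on demand means a run that touches only a few chromosomes never pays for the rest, and the per-file notation check mentioned above would naturally live inside the opener.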