Input files

Sample table for BAM/CRAM mode

--samples points to a tab-delimited table with three required fields: sample ID, population, and alignment path.

The table accepts an optional header:

sample_id  population  alignment
HG00096    GBR         bams/HG00096.bam
NA18486    YRI         bams/NA18486.bam

Accepted header aliases:

  • Sample ID: sample_id, sample, or id

  • Population: population or pop

  • Alignment: alignment, bam_or_cram, bam, or cram

Without a recognized header, columns are read in order: sample_id, population, alignment.
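
The header handling above can be sketched as follows. This is a minimal illustration, not wisp's implementation: the names ALIASES, canonical, and read_sample_table are hypothetical, and the sketch assumes a first row counts as a "recognized header" only when all three required fields are matched by an exact alias.

```python
# Hypothetical sketch; names are illustrative and not part of wisp.
ALIASES = {
    "sample_id": {"sample_id", "sample", "id"},
    "population": {"population", "pop"},
    "alignment": {"alignment", "bam_or_cram", "bam", "cram"},
}
DEFAULT_ORDER = ["sample_id", "population", "alignment"]


def canonical(name):
    """Map a header cell to its canonical field name, or None if unrecognized."""
    for field, names in ALIASES.items():
        if name.strip() in names:
            return field
    return None


def read_sample_table(lines):
    """Parse tab-delimited rows into dicts keyed by canonical field name."""
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    if not rows:
        return []
    header = [canonical(cell) for cell in rows[0]]
    # Assumption: the first row is a header only if all three fields are found.
    if set(DEFAULT_ORDER) <= set(header):
        order, body = header, rows[1:]
    else:
        order, body = DEFAULT_ORDER, rows
    return [
        {field: cell for field, cell in zip(order, row) if field is not None}
        for row in body
    ]
```

With this approach, a table headed "sample, pop, bam" and a headerless table with the same data rows yield identical records.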

Alignment paths may be absolute or relative. For relative paths, wisp first checks the path as written, then the path relative to the sample table's parent directory.
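
A rough sketch of that lookup order is shown below. The function name resolve_alignment is hypothetical, and the sketch assumes the first candidate is interpreted relative to the current working directory and that an unresolved path is reported as None; wisp's actual error handling may differ.

```python
from pathlib import Path


def resolve_alignment(raw, table_path):
    """Return the first existing candidate for an alignment path, else None."""
    path = Path(raw)
    if path.is_absolute():
        return path if path.exists() else None
    # Candidate 1: the path as written (relative to the working directory).
    # Candidate 2: the path relative to the sample table's parent directory.
    for candidate in (path, Path(table_path).parent / path):
        if candidate.exists():
            return candidate
    return None
```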

When read group sample names are present in a BAM/CRAM header, each @RG SM value must match the sample_id for that row.

Popfile for all-sites VCF mode

--popfile points to a tab-delimited table with sample ID and population columns:

sample_id  population
HG00096    GBR
NA18486    YRI

The sample ID and population header aliases listed for the sample table apply here as well. Without a recognized header, columns are read as sample_id then population.

Sample and population labels

Sample IDs and population names must be non-empty and must not contain whitespace or path separators. Sample IDs must be unique.
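
These rules could be checked along the following lines. The function validate_labels is a hypothetical helper, and the sketch assumes that "path separators" means the forward slash and backslash characters.

```python
def validate_labels(rows):
    """Check (sample_id, population) pairs; raise ValueError on a bad label."""
    seen = set()
    for sample_id, population in rows:
        for kind, label in (("sample ID", sample_id), ("population", population)):
            if not label:
                raise ValueError(f"empty {kind}")
            # Assumption: path separators are "/" and "\".
            if any(ch.isspace() for ch in label) or "/" in label or "\\" in label:
                raise ValueError(f"invalid {kind}: {label!r}")
        if sample_id in seen:
            raise ValueError(f"duplicate sample ID: {sample_id!r}")
        seen.add(sample_id)
```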

All-sites VCF

--all-sites-vcf must point to a VCF with a #CHROM header and sample columns. Every sample ID in --popfile must appear in the VCF. VCF samples absent from --popfile produce a warning and are ignored.

For each record, wisp reads the DP field from the sample's FORMAT value. Missing DP values do not pass. Non-integer DP values are rejected. A sample passes when all of the following hold: DP is at least --min-dp; DP is no greater than --max-dp, if --max-dp is supplied; and the genotype is not missing, if GT is present in FORMAT. Records may carry any FILTER value; the file is assumed to have been filtered as desired. Duplicate CHROM:POS records are merged with OR semantics per sample, so a sample passes at a site if it passes in any of the duplicate records. Duplicates must be contiguous, as in a coordinate-sorted VCF.
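
The per-sample test and the duplicate merge can be sketched as below. Both function names are hypothetical. The sketch assumes that a genotype with any "." allele counts as missing and that "rejected" means raising an error; neither detail is confirmed by this document.

```python
from itertools import groupby


def sample_passes(format_keys, sample_values, min_dp, max_dp=None):
    """Apply the DP and GT rules to one sample column of one record."""
    fields = dict(zip(format_keys.split(":"), sample_values.split(":")))
    dp = fields.get("DP")
    if dp is None or dp in (".", ""):
        return False  # missing DP does not pass
    depth = int(dp)  # non-integer DP raises ValueError (treated as rejection)
    if depth < min_dp:
        return False
    if max_dp is not None and depth > max_dp:
        return False
    if "GT" in fields:
        alleles = fields["GT"].replace("|", "/").split("/")
        # Assumption: any "." allele makes the genotype missing.
        if any(allele == "." for allele in alleles):
            return False
    return True


def merge_duplicates(records):
    """OR-merge per-sample pass flags across contiguous CHROM:POS duplicates.

    records: iterable of (chrom, pos, [bool per sample]) in file order.
    """
    for (chrom, pos), group in groupby(records, key=lambda r: (r[0], r[1])):
        merged = None
        for _, _, flags in group:
            if merged is None:
                merged = list(flags)
            else:
                merged = [a or b for a, b in zip(merged, flags)]
        yield chrom, pos, merged
```

Because itertools.groupby only groups adjacent items, non-contiguous duplicates would not be merged here, which mirrors the contiguity requirement.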

Variants-only VCF for BAM/CRAM mode

--variants-vcf is optional in BAM/CRAM mode. It should point to a coordinate-sorted variants-only VCF for the same samples and reference coordinate system as the alignments. When sample columns are present, every sample ID in --samples must appear in the VCF; extra VCF samples are ignored for threshold estimation.

wisp uses this VCF in two ways.

Threshold estimation

If --min-dp or --max-dp is omitted, wisp estimates it from positive per-sample FORMAT/DP values at variant records. --min-dp uses the smallest observed positive DP and --max-dp uses the largest observed positive DP. If --min-mapq is omitted, wisp estimates it from the smallest INFO/MQ value, rounded down to an integer.

Any threshold supplied manually on the command line takes precedence over the VCF-derived estimate. If --min-dp is omitted and the VCF has no usable FORMAT/DP values, the run is rejected.
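
As an illustration of the estimation and precedence rules, consider the sketch below. The function estimate_thresholds is hypothetical and takes DP and MQ values that have already been extracted from the VCF. It assumes that a missing --max-dp or --min-mapq simply stays unset when no values are available, which this document does not specify.

```python
import math


def estimate_thresholds(dp_values, mq_values, min_dp=None, max_dp=None, min_mapq=None):
    """Fill in omitted thresholds from observed FORMAT/DP and INFO/MQ values."""
    positive = [dp for dp in dp_values if dp > 0]
    if min_dp is None:
        if not positive:
            raise ValueError("no usable FORMAT/DP values to estimate --min-dp")
        min_dp = min(positive)
    # Assumption: max_dp and min_mapq stay None when nothing can be estimated.
    if max_dp is None and positive:
        max_dp = max(positive)
    if min_mapq is None and mq_values:
        min_mapq = math.floor(min(mq_values))
    # Values passed in by the caller are never overwritten.
    return min_dp, max_dp, min_mapq
```

For example, DP values 0, 8, 35, 12 and MQ values 59.7, 42.3 would give a minimum depth of 8, a maximum depth of 35, and a minimum mapping quality of 42.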

Variant exclusions

The same VCF is scanned for variant classes that should not contribute callable single-base denominators: indels, symbolic structural variants, breakends, and multi-nucleotide polymorphisms. SNP-only records are retained.

Exclusion spans are emitted in BED coordinates. For symbolic structural variants, INFO/END is preferred when present; otherwise SVLEN is used when available. For ordinary sequence alleles, the reference allele length is used. The resulting intervals are sorted, merged, and subtracted from every sample pass BED before population counts are built.
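
The span logic can be sketched roughly as follows. Both functions are hypothetical and simplified. The sketch assumes the BED start is POS minus 1, that an INFO/END value is used directly as the half-open end, that the SVLEN span is its absolute value measured from the start, and that symbolic or breakend records with neither field fall back to the reference allele length. It does not handle special alleles such as "*".

```python
def exclusion_span(pos, ref, alts, info):
    """Return a 0-based half-open (start, end) span, or None for SNP-only records.

    pos is the 1-based VCF POS; info maps INFO keys to string values.
    """
    start = pos - 1
    symbolic = any(a.startswith("<") or "[" in a or "]" in a for a in alts)
    if symbolic:
        if "END" in info:
            return start, int(info["END"])  # END preferred when present
        if "SVLEN" in info:
            return start, start + abs(int(info["SVLEN"]))
        return start, start + len(ref)  # assumption: fall back to REF length
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return None  # SNP-only record is retained
    return start, start + len(ref)  # indel or MNP: reference allele length


def merge_spans(spans):
    """Sort (chrom, start, end) spans and merge overlapping or adjacent ones."""
    merged = []
    for chrom, start, end in sorted(spans):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(span) for span in merged]
```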

Mask BED

--mask accepts a BED file with at least three columns. Blank lines, comment lines, and a header line reading chrom start end are ignored. Coordinates must be non-negative, 0-based, half-open intervals where start < end.
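
A minimal sketch of those parsing rules is given below. The function parse_mask is hypothetical, and the sketch assumes that comment lines begin with "#", that columns are whitespace-separated, and that the header is matched exactly as lowercase chrom, start, end.

```python
def parse_mask(lines):
    """Parse BED lines into (chrom, start, end) tuples, validating coordinates."""
    intervals = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Assumption: comments are lines starting with "#".
        if not text or text.startswith("#"):
            continue
        cols = text.split()
        if cols[:3] == ["chrom", "start", "end"]:
            continue  # header line is skipped
        if len(cols) < 3:
            raise ValueError(f"line {number}: expected at least three columns")
        start, end = int(cols[1]), int(cols[2])
        if start < 0 or start >= end:
            raise ValueError(f"line {number}: need 0 <= start < end")
        intervals.append((cols[0], start, end))
    return intervals
```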

In BAM/CRAM mode, the mask is sorted and merged with bedtools before use. In VCF mode, overlapping intervals are merged internally before bases are tested for containment.