About

wisp creates population count masks for population genomic workflows. Where a conventional depth mask gives a single cohort-wide pass/fail per site, wisp reports how many samples in each population clear the depth threshold.

Why population count masks?

Downstream analyses often need to know whether enough individuals in each population have usable depth at a site or region. A population count mask makes that explicit:

#chrom  start  end  GBR  YRI
chr20   9999999  10001042  7  4
chr20   10001042 10001881 10  6

Each population column is a count of samples passing the configured depth threshold over that interval. Adjacent bases with the same counts are collapsed into a single BED interval.

Input modes

BAM/CRAM mode

In alignment mode, wisp runs mosdepth once per sample using a two-bin quantization: below threshold and at-or-above threshold. It extracts the passing intervals, optionally clips them to a mask BED, intersects all sample pass BEDs with bedtools multiinter, and assembles them into a population count mask.

An optional variants-only VCF can modify this alignment workflow. wisp can estimate omitted depth and mapping-quality thresholds from the VCF, and it subtracts indel, structural-variant, breakend, and multi-nucleotide polymorphism spans from every sample pass BED. Because the final BED is sparse, these excluded spans are represented as absent intervals, meaning zero passing samples in every population.

All-sites VCF mode

In VCF mode, wisp reads FORMAT/DP values directly from an all-sites VCF. A sample passes a base when its DP value is greater than or equal to --min-dp and, if --max-dp is supplied, less than or equal to --max-dp. When FORMAT/GT is present, the genotype must also be non-missing. Duplicate records at the same CHROM:POS are merged with OR semantics per sample: if any duplicate passes the depth and genotype checks, the sample passes that base. Duplicate records must be contiguous, as in a coordinate-sorted VCF.

Sparse output

wisp omits intervals where all population counts are zero. Consumers should treat missing intervals as zero passing samples per population, not as unknown or skipped.