Replacer Module

Genotype replacement with sample IDs

This module dynamically searches for a “GT” column in the header line of the input TSV and applies configurable genotype replacement logic.

  1. Find the GT column in the TSV header.

  2. For each variant row:

      - Parse the genotype subfields (split by cfg[“extract_fields_separator”], usually “:”).

      - Skip non-variant genotypes (“0/0” or “./.”).

      - If a genotype_replacement_map is provided (e.g. {r"[2-9]": "1"}), apply regex-based replacements to transform the allele strings.

  3. If cfg[“append_extra_sample_fields”] is True, the user-supplied fields in cfg[“extra_sample_fields”] are appended to each genotype. For instance, if extra fields are “GEN[*].DP” and “GEN[*].AD”, and the delimiter is “:”, then an output entry might look like:

    sampleName(0/1:100:52,48)

  4. The sample names come from cfg[“sample_list”] (comma-separated). We pair each subfield in GT with the corresponding sample. We do not list or output samples if the genotype is “0/0” or “./.”.

  5. The per-sample entries for each variant are joined with cfg[“separator”] (commonly “;”), and the resulting string is placed back into the “GT” column.
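A minimal sketch of steps 2 to 5 for a single GT cell, leaving out the extra sample fields. The function name format_gt_cell and its parameters are hypothetical and not part of variantcentrifuge; the sampleName(genotype) form used when no extra fields are appended is an assumption inferred from the example above.

```python
import re
from typing import Dict, List

# Genotypes that are treated as non-variant and never reported.
NON_VARIANT = {"0/0", "./."}


def format_gt_cell(
    gt_cell: str,
    samples: List[str],
    replacement_map: Dict[str, str],
    field_sep: str = ":",   # stands in for cfg["extract_fields_separator"]
    joiner: str = ";",      # stands in for cfg["separator"]
) -> str:
    """Hypothetical helper: turn one GT cell into 'sample(genotype)' entries."""
    entries = []
    # Pair each genotype subfield with the sample at the same position.
    for sample, genotype in zip(samples, gt_cell.split(field_sep)):
        if genotype in NON_VARIANT:
            continue  # non-variant samples are not listed at all
        # Apply every regex replacement, e.g. collapse alleles 2-9 to 1.
        for pattern, replacement in replacement_map.items():
            genotype = re.sub(pattern, replacement, genotype)
        entries.append(f"{sample}({genotype})")
    return joiner.join(entries)
```

With samples S1 to S4, the cell 0/1:0/0:./.:2/1 and the map {r"[2-9]": "1"}, this sketch would report only S1 and S4, with the second rewritten to 1/1.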

Header Normalization

SnpSift or other tools may remove “GEN[*].” or “ANN[*].” prefixes in headers. To handle this, we normalize both the final TSV header columns and the user-supplied extra fields so they match.

Example

If the user passed --append-extra-sample-fields GEN[*].DP GEN[*].AD and the TSV header has columns “DP” and “AD” (prefix removed by SnpSift), the code still finds them, because the requested field names are normalized to “DP” and “AD” internally.
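The normalization described here can be sketched as a prefix strip followed by a name-to-index lookup. normalize_field and build_header_index are hypothetical names, and the sketch assumes that only the GEN[*]. and ANN[*]. prefixes mentioned above need to be removed; the module may handle more cases.

```python
import re
from typing import Dict, List

# Matches a leading "GEN[*]." or "ANN[*]." prefix (assumed prefix set).
_PREFIX = re.compile(r"^(?:GEN|ANN)\[\*\]\.")


def normalize_field(name: str) -> str:
    """Strip the SnpSift-style prefix so 'GEN[*].DP' and 'DP' compare equal."""
    return _PREFIX.sub("", name.strip())


def build_header_index(header_cols: List[str]) -> Dict[str, int]:
    """Map each normalized column name to its position in the header."""
    return {normalize_field(col): i for i, col in enumerate(header_cols)}


header = ["CHROM", "POS", "GT", "DP", "AD"]
index = build_header_index(header)
requested = ["GEN[*].DP", "GEN[*].AD"]
# Look up the requested fields by their normalized names.
positions = [index[normalize_field(field)] for field in requested]
```

Because both sides go through the same function, the lookup works whether or not the upstream tool kept the prefix in the header.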

variantcentrifuge.replacer.replace_genotypes(lines, cfg)

Replace and optionally append extra sample fields to genotypes in a TSV file.

Requirements:

  • A “GT” column must exist in the header.

  • cfg[“sample_list”] is a comma-separated list of sample names matching the subfields in “GT”.

  • If cfg[“append_extra_sample_fields”] is True and cfg[“extra_sample_fields”] is a list of columns, we will look them up by matching normalized column names in the header.

Steps:

  1. Identify the index of the “GT” column in the TSV header.

  2. Build a map from normalized column names to the actual index in the header.

  3. For each variant line:

      - Split the “GT” column subfields (one subfield per sample).

      - Optionally apply genotype replacements from cfg[“genotype_replacement_map”].

      - Filter out genotypes that are “0/0” or “./.”.

      - If extra fields exist, append them inside the parentheses after each genotype, e.g. “sampleName(0/1:100:52,48)”.

      - Rejoin the entries with cfg[“separator”] (often “;”) for final placement in the “GT” column.

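Parameters:

  • lines (Iterator[str]): the lines of the input TSV, header line first.

  • cfg (Dict[str, Any]): the configuration dictionary holding the keys described above.

Returns:

Iterator of updated lines (strings).

Return type:

Iterator[str]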
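The steps listed for replace_genotypes can be sketched end to end as a generator. This is a hypothetical stand-in named replace_genotypes_sketch, not the library's implementation. It assumes that each extra column holds one value per sample, separated by cfg["extract_fields_separator"], that the extra columns are left in place, and that missing cfg keys fall back to ":" and ";".

```python
import re
from typing import Any, Dict, Iterable, Iterator

_PREFIX = re.compile(r"^(?:GEN|ANN)\[\*\]\.")  # assumed prefix set


def replace_genotypes_sketch(lines: Iterable[str], cfg: Dict[str, Any]) -> Iterator[str]:
    """Illustrative stand-in for the documented steps (not the real function)."""
    samples = [s.strip() for s in cfg["sample_list"].split(",")]
    field_sep = cfg.get("extract_fields_separator", ":")
    joiner = cfg.get("separator", ";")
    repl_map = cfg.get("genotype_replacement_map") or {}
    extras = cfg.get("extra_sample_fields", []) if cfg.get("append_extra_sample_fields") else []

    rows = iter(lines)
    header = next(rows).rstrip("\n")
    # Steps 1 and 2: normalized name -> column index, then locate GT and extras.
    index = {_PREFIX.sub("", col): i for i, col in enumerate(header.split("\t"))}
    gt_idx = index["GT"]
    extra_idx = [index[_PREFIX.sub("", field)] for field in extras]
    yield header

    # Step 3: rewrite the GT cell of every variant line.
    for line in rows:
        cells = line.rstrip("\n").split("\t")
        genotypes = cells[gt_idx].split(field_sep)
        extra_values = [cells[i].split(field_sep) for i in extra_idx]
        entries = []
        for n, (sample, gt) in enumerate(zip(samples, genotypes)):
            for pattern, replacement in repl_map.items():
                gt = re.sub(pattern, replacement, gt)
            if gt in ("0/0", "./."):
                continue  # non-variant genotypes are filtered out
            parts = [gt] + [values[n] for values in extra_values]
            entries.append(f"{sample}({field_sep.join(parts)})")
        cells[gt_idx] = joiner.join(entries)
        yield "\t".join(cells)
```

Writing it as a generator keeps memory use flat for large TSV files, since each line is transformed and yielded before the next one is read.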