Replacer Module¶

Genotype replacement with sample IDs

Genotype replacement module.

This module dynamically searches for a “GT” column in the header line of the input TSV and applies configurable genotype replacement logic.

Find the GT column in the TSV header.
For each variant row:
  - Parse the genotype subfields (split by cfg["extract_fields_separator"], usually “:”).
  - Skip non-variant genotypes (“0/0” or “./.”).
  - If a genotype_replacement_map is provided (e.g. {r"[2-9]": "1"}), apply regex-based replacements to transform the allele strings.
If cfg["append_extra_sample_fields"] is True, the user-supplied fields in cfg["extra_sample_fields"] are appended to each genotype. For instance, if the extra fields are “GEN[*].DP” and “GEN[*].AD” and the delimiter is “:”, an output entry might look like:

    sampleName(0/1:100:52,48)

The sample names come from cfg["sample_list"] (comma-separated). Each subfield in GT is paired with the corresponding sample, in order. Samples whose genotype is “0/0” or “./.” are not listed in the output.
The final joined string for each variant is placed back into the “GT” column, with the per-sample entries separated by cfg["separator"] (commonly “;”).
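The per-sample handling described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names `apply_replacements` and `format_entry` are hypothetical, and the `sample(genotype)` form for entries without extra fields is an assumption extrapolated from the `sampleName(0/1:100:52,48)` example.

```python
import re

# Genotypes treated as non-variant and therefore left out of the output.
NON_VARIANT = {"0/0", "./."}


def apply_replacements(genotype, replacement_map):
    """Apply each regex -> replacement pair in turn to a genotype string."""
    for pattern, replacement in replacement_map.items():
        genotype = re.sub(pattern, replacement, genotype)
    return genotype


def format_entry(sample, genotype, replacement_map, extras=(), delimiter=":"):
    """Return 'sample(genotype[:extras])', or None for a non-variant genotype.

    Hypothetical helper: 'extras' holds the already-extracted extra field
    values for this sample (for example DP and AD).
    """
    if genotype in NON_VARIANT:
        return None
    genotype = apply_replacements(genotype, replacement_map)
    fields = [genotype, *extras]
    return f"{sample}({delimiter.join(fields)})"


if __name__ == "__main__":
    rmap = {r"[2-9]": "1"}
    print(format_entry("S1", "0/2", rmap))
    print(format_entry("S3", "0/1", rmap, extras=["100", "52,48"]))
```

With the map `{r"[2-9]": "1"}`, any alternate-allele index from 2 to 9 collapses to 1, so multi-allelic calls such as `0/2` are reported as `0/1`.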

Header Normalization¶

SnpSift or other tools may remove “GEN[*].” or “ANN[*].” prefixes in headers. To handle this, we normalize both the final TSV header columns and the user-supplied extra fields so they match.

Example

If the user passes --append-extra-sample-fields GEN[*].DP GEN[*].AD and the TSV header has columns “DP” and “AD” (prefix removed by SnpSift), the code still finds them, because both the requested field names and the header columns are normalized to “DP” and “AD” internally.
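One way to picture this normalization is the sketch below. It assumes that only the two prefixes named above (“GEN[*].” and “ANN[*].”) need stripping; the function names `normalize_field` and `build_column_index` are hypothetical and are not taken from the module.

```python
import re

# Matches a leading "GEN[*]." or "ANN[*]." prefix (assumption: only these two).
_PREFIX = re.compile(r"^(?:GEN|ANN)\[\*\]\.")


def normalize_field(name):
    """Strip a leading GEN[*]. or ANN[*]. prefix from a field or column name."""
    return _PREFIX.sub("", name.strip())


def build_column_index(header_columns):
    """Map each normalized column name to its position in the header."""
    return {normalize_field(col): i for i, col in enumerate(header_columns)}


if __name__ == "__main__":
    header = ["CHROM", "POS", "GT", "DP", "AD"]
    index = build_column_index(header)
    for requested in ["GEN[*].DP", "GEN[*].AD"]:
        print(requested, index[normalize_field(requested)])
```

Because both sides go through the same function, a lookup succeeds whether or not the upstream tool kept the prefix in the header.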

variantcentrifuge.replacer.replace_genotypes(lines, cfg)¶

Replace and optionally append extra sample fields to genotypes in a TSV file.

Requirements:¶

A “GT” column must exist in the header.
cfg["sample_list"] is a comma-separated list of sample names matching the subfields in “GT”.
If cfg["append_extra_sample_fields"] is True and cfg["extra_sample_fields"] is a list of columns, they are looked up by matching normalized column names in the header.

Steps:¶

Identify the index of the “GT” column in the TSV header.
Build a map from normalized column names to the actual index in the header.
For each variant line:
  - Split the “GT” column into subfields (one subfield per sample).
  - Optionally apply genotype replacements from cfg["genotype_replacement_map"].
  - Filter out genotypes that are “0/0” or “./.”.
  - If extra fields exist, append them next to each genotype inside the parentheses, e.g. “sampleName(0/1:100:52,48)”.
  - Rejoin the entries with cfg["separator"] (often “;”) for final placement in the “GT” column.

Parameters:
lines (Iterator[str]): Lines of the input TSV, starting with the header line.
cfg (Dict[str, Any]): Configuration dictionary holding the keys described above.
Returns: Iterator of updated lines (strings).
Return type: Iterator[str]
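To make the line-level flow concrete, here is a simplified, self-contained sketch of the steps listed above. It is not the library's implementation: `replace_genotypes_sketch` is a hypothetical stand-in, it leaves out the extra-sample-field handling, and the defaults for the separators are assumptions based on the values this page calls usual.

```python
import re

NON_VARIANT = {"0/0", "./."}


def replace_genotypes_sketch(lines, cfg):
    """Yield TSV lines with the GT column rewritten as 'sample(genotype)' entries."""
    samples = cfg["sample_list"].split(",")
    field_sep = cfg.get("extract_fields_separator", ":")  # assumed default
    out_sep = cfg.get("separator", ";")  # assumed default
    repl_map = cfg.get("genotype_replacement_map", {})

    gt_idx = None
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if gt_idx is None:
            # First line is the header: locate GT (ValueError if it is missing)
            # and pass the header through unchanged.
            gt_idx = cols.index("GT")
            yield "\t".join(cols)
            continue

        entries = []
        # Pair each GT subfield with the sample at the same position.
        for sample, genotype in zip(samples, cols[gt_idx].split(field_sep)):
            for pattern, replacement in repl_map.items():
                genotype = re.sub(pattern, replacement, genotype)
            if genotype in NON_VARIANT:
                continue
            entries.append(f"{sample}({genotype})")

        cols[gt_idx] = out_sep.join(entries)
        yield "\t".join(cols)


if __name__ == "__main__":
    cfg = {
        "sample_list": "S1,S2,S3",
        "extract_fields_separator": ":",
        "separator": ";",
        "genotype_replacement_map": {r"[2-9]": "1"},
    }
    tsv = ["CHROM\tPOS\tGT", "chr1\t100\t0/1:0/0:1/2"]
    for out_line in replace_genotypes_sketch(tsv, cfg):
        print(out_line)
```

Writing the function as a generator mirrors the documented Iterator[str] return type, so large TSV files can be processed one line at a time.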