Replacer Module
Genotype replacement with sample IDs
Genotype replacement module.
This module dynamically searches for a “GT” column in the header line of the input TSV and applies configurable genotype replacement logic.
Find the GT column in the TSV header.
For each variant row:
- Parse the genotype subfields (split by cfg["extract_fields_separator"], usually “:”).
- Skip non-variant genotypes (“0/0” or “./.”).
- If a genotype_replacement_map is provided (e.g. {r"[2-9]": "1"}), apply regex-based replacements to transform the allele strings.
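The per-row genotype handling described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper names `replace_genotype` and `variant_genotypes` are hypothetical, and the sketch assumes non-variant genotypes are skipped before the replacement map is applied.

```python
import re

# Genotypes treated as non-variant, as listed in the text above.
NON_VARIANT = {"0/0", "./."}


def replace_genotype(genotype, replacement_map):
    """Apply every regex -> replacement pair to a single genotype string."""
    for pattern, replacement in replacement_map.items():
        genotype = re.sub(pattern, replacement, genotype)
    return genotype


def variant_genotypes(gt_value, separator=":", replacement_map=None):
    """Split a GT cell into per-sample genotypes and keep only variant calls.

    Returns (sample_position, genotype) pairs so the caller can still look up
    the matching sample name after non-variant entries have been dropped.
    """
    kept = []
    for position, genotype in enumerate(gt_value.split(separator)):
        genotype = genotype.strip()
        if genotype in NON_VARIANT:
            continue
        if replacement_map:
            genotype = replace_genotype(genotype, replacement_map)
        kept.append((position, genotype))
    return kept


result = variant_genotypes("0/2:0/0:./.:1/3", ":", {r"[2-9]": "1"})
print(result)  # [(0, '0/1'), (3, '1/1')]
```

With the example map {r"[2-9]": "1"}, any alternate allele index from 2 to 9 collapses to 1, so multi-allelic calls such as “0/2” are reported as “0/1”.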
If cfg["append_extra_sample_fields"] is True, the user-supplied fields in cfg["extra_sample_fields"] are appended to each genotype. For instance, if extra fields are “GEN[*].DP” and “GEN[*].AD”, and the delimiter is “:”, then an output entry might look like:
sampleName(0/1:100:52,48)
The sample names come from cfg["sample_list"] (comma-separated). We pair each subfield in GT with the corresponding sample. We do not list or output samples if the genotype is “0/0” or “./.”.
The final joined string for each variant is placed back into the “GT” column, with the per-sample entries separated by cfg["separator"] (commonly “;”).
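As a rough sketch of how the sample pairing and joining could look, the hypothetical helper `format_sample_entries` below takes already-split values. It assumes each extra field holds one value per sample in the same order as the sample list, and that an entry without extra fields is written as sampleName(genotype); neither detail is confirmed by the text above.

```python
def format_sample_entries(samples, genotypes, extra_values=None,
                          extra_delimiter=":", separator=";"):
    """Build the joined GT string for one variant.

    samples      -- sample names, in the same order as the GT subfields
    genotypes    -- one genotype string per sample
    extra_values -- optional list of per-field lists, each with one value per sample
    """
    entries = []
    for position, (sample, genotype) in enumerate(zip(samples, genotypes)):
        # Samples without a variant call are not listed at all.
        if genotype in ("0/0", "./."):
            continue
        parts = [genotype]
        for values in extra_values or []:
            parts.append(values[position])
        entries.append(f"{sample}({extra_delimiter.join(parts)})")
    return separator.join(entries)


samples = "S1,S2,S3".split(",")
genotypes = ["0/1", "0/0", "1/1"]
depth = ["100", "35", "20"]               # e.g. values of a DP column
allele_depth = ["52,48", "35,0", "0,20"]  # e.g. values of an AD column

joined = format_sample_entries(samples, genotypes, [depth, allele_depth])
print(joined)  # S1(0/1:100:52,48);S3(1/1:20:0,20)
```

S2 is dropped because its genotype is “0/0”, and the remaining entries keep their own DP and AD values because the lookup uses the original sample position.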
Header Normalization
SnpSift or other tools may remove “GEN[*].” or “ANN[*].” prefixes in headers. To handle this, we normalize both the final TSV header columns and the user-supplied extra fields so they match.
Example
If the user passed --append-extra-sample-fields GEN[*].DP GEN[*].AD and the TSV header has columns “DP”, “AD” (prefix removed by SnpSift), the code will still find them by normalizing them to “DP”, “AD” internally.
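One way to picture this normalization is the sketch below. The function names `normalize_column` and `build_column_index` are hypothetical, and the sketch only strips the literal “GEN[*].” and “ANN[*].” prefixes named above; the real module may handle more cases.

```python
import re

# Matches a leading "GEN[*]." or "ANN[*]." prefix only.
_PREFIX = re.compile(r"^(?:GEN|ANN)\[\*\]\.")


def normalize_column(name):
    """Strip a leading GEN[*]. or ANN[*]. prefix from a column or field name."""
    return _PREFIX.sub("", name.strip())


def build_column_index(header_columns):
    """Map each normalized header name to its position in the header."""
    return {normalize_column(col): i for i, col in enumerate(header_columns)}


header = ["CHROM", "POS", "GT", "DP", "AD"]  # prefixes already removed upstream
requested = ["GEN[*].DP", "GEN[*].AD"]       # as passed by the user

column_index = build_column_index(header)
positions = [column_index[normalize_column(field)] for field in requested]
print(positions)  # [3, 4]
```

Because both sides go through the same function, the lookup also works when the header still carries the prefixes.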
- variantcentrifuge.replacer.replace_genotypes(lines, cfg)
Replace and optionally append extra sample fields to genotypes in a TSV file.
Requirements:
A “GT” column must exist in the header.
cfg["sample_list"] is a comma-separated list of sample names matching the subfields in “GT”.
If cfg["append_extra_sample_fields"] is True and cfg["extra_sample_fields"] is a list of columns, we will look them up by matching normalized column names in the header.
Steps:
Identify the index of the “GT” column in the TSV header.
Build a map from normalized column names to the actual index in the header.
For each variant line:
- Split the “GT” column subfields (one subfield per sample).
- Optionally apply genotype replacements from cfg["genotype_replacement_map"].
- Filter out genotypes that are “0/0” or “./.”.
- If extra fields exist, append them inside the parentheses next to each genotype, e.g. “sampleName(0/1:100:52,48)”.
- Rejoin them with cfg["separator"] (often “;”) for final placement in the “GT” column.
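Put together, the steps above might look like the following end-to-end sketch. It is an illustration under stated assumptions, not the actual implementation of replace_genotypes: the name `replace_genotypes_sketch` is hypothetical, the function is written as a generator, the header line is passed through unchanged, the extra columns are left in place, and extra values are joined with a hard-coded “:”.

```python
import re


def _normalize(name):
    """Strip a leading GEN[*]. or ANN[*]. prefix."""
    return re.sub(r"^(?:GEN|ANN)\[\*\]\.", "", name.strip())


def replace_genotypes_sketch(lines, cfg):
    """Yield TSV lines with the GT column rewritten as sample(genotype...) entries."""
    samples = [s.strip() for s in cfg["sample_list"].split(",")]
    field_sep = cfg.get("extract_fields_separator", ":")
    out_sep = cfg.get("separator", ";")
    replacements = cfg.get("genotype_replacement_map") or {}
    extra_fields = []
    if cfg.get("append_extra_sample_fields"):
        extra_fields = cfg.get("extra_sample_fields") or []

    gt_idx = None
    extra_idx = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if gt_idx is None:
            # First line is the header: locate GT and the extra columns.
            index = {_normalize(c): i for i, c in enumerate(cols)}
            gt_idx = index["GT"]  # raises KeyError if there is no GT column
            extra_idx = [index[_normalize(f)] for f in extra_fields]
            yield "\t".join(cols)
            continue

        genotypes = cols[gt_idx].split(field_sep)
        extras = [cols[i].split(field_sep) for i in extra_idx]
        entries = []
        for n, (sample, gt) in enumerate(zip(samples, genotypes)):
            for pattern, repl in replacements.items():
                gt = re.sub(pattern, repl, gt)
            if gt in ("0/0", "./."):
                continue
            parts = [gt] + [values[n] for values in extras]
            entries.append(f"{sample}({':'.join(parts)})")
        cols[gt_idx] = out_sep.join(entries)
        yield "\t".join(cols)


tsv = [
    "CHROM\tPOS\tGT\tDP\tAD",
    "chr1\t100\t0/1:0/0:1/2\t100:35:20\t52,48:35,0:0,20",
]
cfg = {
    "sample_list": "S1,S2,S3",
    "extract_fields_separator": ":",
    "separator": ";",
    "genotype_replacement_map": {r"[2-9]": "1"},
    "append_extra_sample_fields": True,
    "extra_sample_fields": ["GEN[*].DP", "GEN[*].AD"],
}
output = list(replace_genotypes_sketch(tsv, cfg))
print(output[1].split("\t")[2])  # S1(0/1:100:52,48);S3(1/1:20:0,20)
```

Here the replacement runs before the non-variant filter, following the order of the steps listed above.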