Filters Module
The filters module provides multiple variant filtering approaches:
bcftools pre-filtering - Fast filtering during variant extraction
SnpSift filtering - Feature-rich filtering on VCF data
TSV filtering - Late-stage filtering on extracted data
DataFrame filtering - Final filtering using pandas query syntax
Filtering module.
This module defines functions to extract variants from a VCF file using bcftools and to apply filters via SnpSift. Each VCF-based function returns the path to the file containing its output.
Added:
- A function to filter final TSV rows by genotype (het, hom, comp_het), optionally applying different rules per gene based on a separate mapping file.
- A reason label (e.g., "(comphet)") appended to each sample genotype that passed the filter because of 'comp_het', 'het', or 'hom'.
- Preservation of extra fields (e.g., DP, AD) in parentheses. If a sample's entry is "325879(0/1:53,55:108)", "0/1" is parsed as the main genotype and the rest of the substring (":53,55:108") is left intact. If this sample passes 'het', the final output contains "325879(0/1:53,55:108)(het)".
- TSV-based filtering: filter_tsv_with_expression() applies filters to TSV data after the scoring and annotation steps are complete, which allows filtering on computed columns.
- variantcentrifuge.filters.apply_bcftools_prefilter(input_vcf, output_vcf, filter_expression, cfg)
Apply a bcftools filter expression to a VCF file.
This function pre-filters VCF files before more resource-intensive operations such as SnpSift filtering. It applies the filter expression through the -i option of bcftools view.
- Parameters:
input_vcf (str) – Path to the input VCF file
output_vcf (str) – Path to the output filtered VCF file
filter_expression (str) – bcftools filter expression. Note that bcftools uses a different syntax than SnpSift. Examples:
'FILTER="PASS"' - only PASS variants
'INFO/AC<10' - allele count less than 10
'FILTER="PASS" && INFO/AC<10' - combined filters
'INFO/AF<0.01 || INFO/AC<5' - AF less than 1% OR AC less than 5
cfg (dict) – Configuration dictionary that may include:
"threads": Number of threads to use with bcftools (default = 1)
- Returns:
Path to the output VCF file
- Return type:
str
- Raises:
RuntimeError – If the bcftools command fails
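As a rough illustration of the call described above, the following sketch assembles a bcftools view command line from the documented parameters. The helper name build_prefilter_command is hypothetical, and the exact arguments the module passes to bcftools are an assumption; only the use of bcftools view -i and the "threads" key come from the description.

```python
def build_prefilter_command(input_vcf, output_vcf, filter_expression, cfg):
    """Return a bcftools argv list (hypothetical helper, not the module's own code)."""
    threads = str(cfg.get("threads", 1))  # documented default is 1
    return [
        "bcftools", "view",
        "--threads", threads,
        "-i", filter_expression,  # keep only records matching the expression
        "-Oz",                    # assumed: write bgzip-compressed VCF
        "-o", output_vcf,
        input_vcf,
    ]


cmd = build_prefilter_command(
    "in.vcf.gz", "out.vcf.gz", 'FILTER="PASS" && INFO/AC<10', {"threads": 4}
)
```

Passing the expression as a single list element avoids shell quoting problems with the embedded double quotes.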
- variantcentrifuge.filters.extract_variants(vcf_file, bed_file, cfg, output_file)
Extract variants from a VCF using bcftools and a BED file.
Write output to the specified compressed VCF (‘.vcf.gz’). bcftools is invoked with the ‘-W’ option, which writes the index file automatically.
- Parameters:
vcf_file (str) – Path to the input VCF file.
bed_file (str) – Path to the BED file containing genomic regions of interest.
cfg (dict) –
Configuration dictionary that may include paths and parameters for tools. Expected keys include:
"threads": Number of threads to use with bcftools (default = 1).
"bcftools_prefilter": Optional bcftools filter expression to apply during extraction.
output_file (str) – Path to the final compressed output VCF file (‘.vcf.gz’).
- Returns:
Path to the compressed VCF file (.vcf.gz) containing extracted variants.
- Return type:
str
- Raises:
RuntimeError – If the extraction command fails.
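The extraction step can be pictured as a single bcftools view call that restricts output to the BED regions and optionally applies the pre-filter. This is a minimal sketch under assumptions: build_extract_command is a hypothetical name, and the argument order and the use of -R for the BED file are guesses; the -W option and the two cfg keys are taken from the description above.

```python
def build_extract_command(vcf_file, bed_file, cfg, output_file):
    """Assemble a bcftools view argv list (illustrative helper only)."""
    cmd = ["bcftools", "view", "--threads", str(cfg.get("threads", 1))]
    cmd += ["-R", bed_file]  # assumed: restrict to regions listed in the BED file
    prefilter = cfg.get("bcftools_prefilter")
    if prefilter:
        cmd += ["-i", prefilter]  # optional pre-filter applied during extraction
    # -W asks bcftools to write the index next to the compressed output.
    cmd += ["-W", "-Oz", "-o", output_file, vcf_file]
    return cmd


plain = build_extract_command("in.vcf.gz", "genes.bed", {}, "out.vcf.gz")
filtered = build_extract_command(
    "in.vcf.gz", "genes.bed", {"threads": 2, "bcftools_prefilter": "INFO/AC<10"}, "out.vcf.gz"
)
```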
- variantcentrifuge.filters.apply_snpsift_filter(variant_file, filter_string, cfg, output_file)
Apply a SnpSift filter to a variant file, then compress and index the output.
Because our run_command function does not support shell pipelines, we split it into three steps:
Write SnpSift filter output to a temporary uncompressed file (.vcf).
Compress it with bgzip -@ <threads> to produce the final .vcf.gz.
Index the resulting .vcf.gz with bcftools index.
- Parameters:
variant_file (str) – Path to the compressed VCF file with extracted variants.
filter_string (str) – SnpSift filter expression to apply.
cfg (dict) –
Configuration dictionary that may include paths and parameters for tools. Expected keys include:
"threads": Number of threads to use with bgzip and bcftools index (default = 1).
output_file (str) – Path to the compressed VCF file (.vcf.gz) containing filtered variants.
- Returns:
Path to the compressed VCF file (.vcf.gz) containing filtered variants.
- Return type:
str
- Raises:
RuntimeError – If the filter command fails.
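The three steps listed in the description can be laid out as a plan of separate commands, each with an optional file that receives its standard output. This is only a sketch: plan_snpsift_steps is a hypothetical helper, it assumes a SnpSift wrapper executable is on the PATH, and the temporary-file naming is invented for illustration.

```python
def plan_snpsift_steps(variant_file, filter_string, cfg, output_file):
    """Return [(argv, stdout_path), ...]; stdout_path is None when stdout is unused."""
    threads = str(cfg.get("threads", 1))
    # Hypothetical temp-file naming: drop the ".gz" suffix from the output name.
    tmp_vcf = output_file[:-3] if output_file.endswith(".gz") else output_file + ".tmp"
    return [
        # 1. SnpSift writes the filtered, uncompressed VCF to stdout.
        (["SnpSift", "filter", filter_string, variant_file], tmp_vcf),
        # 2. bgzip -c compresses the temporary file to stdout.
        (["bgzip", "-@", threads, "-c", tmp_vcf], output_file),
        # 3. bcftools index creates the index for the final file.
        (["bcftools", "index", "--threads", threads, output_file], None),
    ]


steps = plan_snpsift_steps("extracted.vcf.gz", "(QUAL >= 30)", {"threads": 2}, "filtered.vcf.gz")
```

Each pair could then be executed in order, redirecting stdout to the given path, which sidesteps the need for a shell pipeline.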
- variantcentrifuge.filters.filter_final_tsv_by_genotype(input_tsv, output_tsv, global_genotypes=None, gene_genotype_file=None, gene_column_name='GENE', gt_column_name='GT')
Filter the final TSV rows by genotype.
This can be done globally (using a single set of requested genotypes like {“het”}, {“hom”}, or {“comp_het”}) or on a per-gene basis if gene_genotype_file is provided. The gene_genotype_file must contain at least two columns:
GENE
GENOTYPES (one or more of het, hom, comp_het, comma-separated)
- The logic is:
‘het’ => keep samples with genotype 0/1 or 1/0
‘hom’ => keep samples with genotype 1/1
‘comp_het’ => keep samples that have at least two distinct variants in the same gene with genotype 0/1 or 1/0
If a gene is defined in the gene_genotype_file, then the union of those genotype rules is applied. If a gene is not defined in that file (or if none is provided), global_genotypes is used.
The resulting TSV keeps only lines that have at least one sample fulfilling the chosen genotype filters. If no samples remain on a line, that line is discarded.
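The rule lookup described above (gene-specific rules when present, global rules otherwise) might be sketched as follows. Both helper names are hypothetical, and the sketch assumes a whitespace-delimited mapping file whose first row is a header naming the GENE and GENOTYPES columns; the real file format may differ.

```python
def load_gene_rules(lines):
    """Parse mapping-file lines into {gene: set of genotype rules}."""
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]
    gene_idx = header.index("GENE")
    rules_idx = header.index("GENOTYPES")
    gene_rules = {}
    for row in rows[1:]:
        rules = {r.strip() for r in row[rules_idx].split(",") if r.strip()}
        # Take the union if a gene appears on more than one row.
        gene_rules.setdefault(row[gene_idx], set()).update(rules)
    return gene_rules


def rules_for_gene(gene, gene_rules, global_genotypes):
    """Use the gene-specific rules if defined, else fall back to the global set."""
    if gene in gene_rules:
        return gene_rules[gene]
    return set(global_genotypes or ())


gene_rules = load_gene_rules(["GENE\tGENOTYPES", "BRCA2\thet,comp_het", "TP53\thom"])
```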
Additionally, if a sample passes because of ‘het’ or ‘hom’ or ‘comp_het’, we append a reason marker. For example:
325879(0/1:53,55:108) => 325879(0/1:53,55:108)(het) or 325879(0/1:53,55:108)(het,comphet)
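To make the reason-marker behaviour concrete, here is a minimal sketch that parses one sample entry, checks the 'het' and 'hom' rules, and appends the marker while leaving the extra fields untouched. The regular expression and the annotate_entry helper are assumptions for illustration, not the module's implementation; 'comp_het' is omitted because it needs information from other rows of the same gene.

```python
import re

# sample_id(genotype[:extra fields]); the extra fields (e.g. AD, DP) are kept verbatim.
ENTRY_RE = re.compile(r"^(?P<sample>[^()]+)\((?P<gt>[^:()]+)(?P<extra>(?::[^()]*)?)\)$")


def annotate_entry(entry, requested):
    """Return the entry with a reason marker, or None if no requested rule matches."""
    match = ENTRY_RE.match(entry)
    if match is None:
        return None
    gt = match.group("gt")  # the "main" genotype, e.g. "0/1"
    reasons = []
    if "het" in requested and gt in ("0/1", "1/0"):
        reasons.append("het")
    if "hom" in requested and gt == "1/1":
        reasons.append("hom")
    if not reasons:
        return None
    return f"{entry}({','.join(reasons)})"


annotated = annotate_entry("325879(0/1:53,55:108)", {"het"})
```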
- Parameters:
input_tsv (str) – Path to the input TSV, e.g. the final genotype_replaced TSV.
output_tsv (str) – Path to the filtered output TSV.
global_genotypes (set of str, optional) – A set of genotype filters to apply globally if no gene-specific rule is found. E.g. {“het”}, {“hom”}, or {“comp_het”}, or any combination thereof.
gene_genotype_file (str, optional) – Path to a file with columns ‘GENE’ and ‘GENOTYPES’. Each row can specify one or more genotype filters for a given gene, e.g. “BRCA2 het,comp_het”.
gene_column_name (str) – The column name in the TSV that specifies the gene.
gt_column_name (str) – The column name in the TSV that specifies the genotype(s) for each sample.
- Returns:
A new TSV is written to output_tsv.
- Return type:
None
Notes
For ‘comp_het’, we group rows by gene, identify samples that appear at least twice (with 0/1 or 1/0), then keep those rows for those samples only. We also annotate each genotype with the reason(s) it passed (het, hom, comphet).
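The grouping described in this note can be sketched with a dictionary keyed by gene and sample. The input shape (tuples of gene, variant identifier, and a sample-to-genotype mapping) and the function name comp_het_samples are assumptions chosen to keep the example small; the real function works on TSV rows.

```python
from collections import defaultdict

HET_GENOTYPES = {"0/1", "1/0"}


def comp_het_samples(rows):
    """rows: iterable of (gene, variant_id, {sample: genotype}).

    Returns {gene: samples that are heterozygous at two or more distinct variants}.
    """
    het_variants = defaultdict(set)  # (gene, sample) -> variant ids where sample is het
    for gene, variant_id, genotypes in rows:
        for sample, gt in genotypes.items():
            if gt in HET_GENOTYPES:
                het_variants[(gene, sample)].add(variant_id)
    result = defaultdict(set)
    for (gene, sample), variants in het_variants.items():
        if len(variants) >= 2:
            result[gene].add(sample)
    return dict(result)


rows = [
    ("BRCA2", "v1", {"S1": "0/1", "S2": "0/1"}),
    ("BRCA2", "v2", {"S1": "1/0", "S2": "1/1"}),
    ("TP53", "v3", {"S1": "0/1"}),
]
passing = comp_het_samples(rows)
```

In this toy input only S1 is heterozygous at two distinct BRCA2 variants, so only S1 qualifies for that gene.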
Examples
>>> filter_final_tsv_by_genotype(
...     "input.genotype_replaced.tsv",
...     "output.genotype_filtered.tsv",
...     global_genotypes={"comp_het"},
... )
- variantcentrifuge.filters.filter_tsv_with_expression(input_tsv, output_tsv, filter_expression, pandas_query=True)
Filter a TSV file using a filter expression.
This function enables filtering on any column in the TSV, including computed columns like scores, inheritance patterns, and custom annotations that are added during the analysis pipeline.
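As an illustration of filtering on computed columns with pandas query syntax, the sketch below loads a small in-memory TSV and applies a query string. The column names and data are invented for the example, and this shows only the general pandas technique, not the function's internal code.

```python
import io

import pandas as pd

# Invented example data with a computed "Score" column.
tsv_text = (
    "GENE\tScore\tImpact\n"
    "BRCA2\t0.9\tHIGH\n"
    "TP53\t0.3\tHIGH\n"
    "CFTR\t0.8\tMODERATE\n"
)

df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
# In DataFrame.query, & binds like "and", so no extra parentheses are needed here.
filtered = df.query("Score > 0.5 & Impact == 'HIGH'")
filtered_tsv = filtered.to_csv(sep="\t", index=False)
```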
- Parameters:
input_tsv (str) – Path to the input TSV file
output_tsv (str) – Path to the output filtered TSV file
filter_expression (str) – Filter expression. If pandas_query is True, this should be a pandas query string (e.g., "Score > 0.5 & Impact == 'HIGH'"). If False, it should be a SnpSift-style expression that will be translated.
pandas_query (bool) – If True, use pandas query syntax. If False, translate from SnpSift syntax.
- Return type: