Filters Module

The filters module provides multiple variant filtering approaches:

bcftools pre-filtering - Fast filtering during variant extraction
SnpSift filtering - Feature-rich filtering on VCF data
TSV filtering - Late-stage filtering on extracted data
DataFrame filtering - Final filtering using pandas query syntax

Filtering module.

This module defines functions to extract variants from a VCF file using bcftools and apply filters via SnpSift. Each function returns a filename containing the output.

Added: A function to filter final TSV rows by genotype (het, hom, comp_het), optionally applying different rules per gene based on a separate mapping file.

Enhanced to append a reason label (e.g., “(comphet)”) to each sample genotype if it passed the filter because of ‘comp_het’ or ‘het’ or ‘hom’.

New changes for preserving extra fields (e.g., DP, AD) in parentheses: If a sample’s entry is “325879(0/1:53,55:108)”, we parse “0/1” as the “main” genotype but leave the rest of the substring (“:53,55:108”) intact. If this sample passes ‘het’, we produce “325879(0/1:53,55:108)(het)” in the final output.

TSV-based filtering: Added filter_tsv_with_expression() to apply filters on TSV data after the scoring and annotation steps are complete, allowing filtering on computed columns.

variantcentrifuge.filters.apply_bcftools_prefilter(input_vcf, output_vcf, filter_expression, cfg)

Apply a bcftools filter expression to a VCF file.

This function pre-filters VCF files before more resource-intensive operations such as SnpSift filtering. It applies the filter expression through the -i option of bcftools view.

Parameters:

input_vcf (str) – Path to the input VCF file
output_vcf (str) – Path to the output filtered VCF file
filter_expression (str) –
bcftools filter expression. Note: bcftools uses different syntax than SnpSift. Examples:
- 'FILTER="PASS"' - only PASS variants
- 'INFO/AC<10' - allele count less than 10
- 'FILTER="PASS" && INFO/AC<10' - combined filters
- 'INFO/AF<0.01 || INFO/AC<5' - AF less than 1% OR AC less than 5
cfg (dict) –
Configuration dictionary that may include:
- "threads": Number of threads to use with bcftools (default = 1)

Returns:

Path to the output VCF file

Return type:

str

Raises:

RuntimeError – If the bcftools command fails
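
As a rough illustration of the call described above, the following sketch assembles a bcftools view command line from the same four parameters. The helper name build_prefilter_command is hypothetical and is not part of this module; the real function may construct and run its command differently. Only standard bcftools options (-i, -Oz, -o, --threads) are used.

```python
def build_prefilter_command(input_vcf, output_vcf, filter_expression, cfg):
    """Assemble a bcftools argv list (hypothetical helper, not the module's code)."""
    threads = str(cfg.get("threads", 1))  # default of 1 mirrors the documented default
    return [
        "bcftools", "view",
        "--threads", threads,
        "-i", filter_expression,  # include only records matching the expression
        "-Oz",                    # write bgzip-compressed VCF
        "-o", output_vcf,
        input_vcf,
    ]


cmd = build_prefilter_command(
    "in.vcf.gz", "out.vcf.gz", 'FILTER="PASS" && INFO/AC<10', {"threads": 4}
)
print(" ".join(cmd))
```

Passing the expression as a single list element, rather than through a shell string, avoids having to escape the double quotes and the && operator.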

variantcentrifuge.filters.extract_variants(vcf_file, bed_file, cfg, output_file)

Extract variants from a VCF using bcftools and a BED file.

Write output to the specified compressed VCF (‘.vcf.gz’). bcftools is invoked with the ‘-W’ option, which writes the index file automatically.

Parameters:

vcf_file (str) – Path to the input VCF file.
bed_file (str) – Path to the BED file containing genomic regions of interest.
cfg (dict) –
Configuration dictionary that may include paths and parameters for tools. Expected keys include:
- "threads": Number of threads to use with bcftools (default = 1).
- "bcftools_prefilter": Optional bcftools filter expression to apply during extraction.
output_file (str) – Path to the final compressed output VCF file (‘.vcf.gz’).

Returns:

Path to the compressed VCF file (.vcf.gz) containing extracted variants.

Return type:

str

Raises:

RuntimeError – If the extraction command fails.
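
A minimal sketch of how such an extraction command could be put together is shown here. The helper build_extract_command is hypothetical, and the use of -R to pass the BED file is an assumption (the description above only says that a BED file and the -W option are used). The optional "bcftools_prefilter" key is added as an -i expression when present.

```python
def build_extract_command(vcf_file, bed_file, cfg, output_file):
    """Assemble a bcftools view argv list (hypothetical helper)."""
    cmd = ["bcftools", "view", "--threads", str(cfg.get("threads", 1))]
    cmd += ["-R", bed_file]  # assumption: regions are passed with -R
    prefilter = cfg.get("bcftools_prefilter")
    if prefilter:
        cmd += ["-i", prefilter]  # optional pre-filter applied during extraction
    # -W writes the index alongside the compressed output
    cmd += ["-W", "-Oz", "-o", output_file, vcf_file]
    return cmd


plain = build_extract_command("in.vcf.gz", "genes.bed", {}, "out.vcf.gz")
filtered = build_extract_command(
    "in.vcf.gz", "genes.bed",
    {"threads": 2, "bcftools_prefilter": "INFO/AC<10"},
    "out.vcf.gz",
)
```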

variantcentrifuge.filters.apply_snpsift_filter(variant_file, filter_string, cfg, output_file)

Apply a SnpSift filter to a variant file, then compress and index the output.

Because our run_command function does not support shell pipelines, the work is split into three steps:

Write SnpSift filter output to a temporary uncompressed file (.vcf).

Compress it with bgzip -@ <threads> to produce the final .vcf.gz.

Index the resulting .vcf.gz with bcftools index.

Parameters:

variant_file (str) – Path to the compressed VCF file with extracted variants.
filter_string (str) – SnpSift filter expression to apply.
cfg (dict) –
Configuration dictionary that may include paths and parameters for tools. Expected keys include:
- "threads": Number of threads to use with bgzip and bcftools index (default = 1).
output_file (str) – Path to the compressed VCF file (.vcf.gz) containing filtered variants.

Returns:

Path to the compressed VCF file (.vcf.gz) containing filtered variants.

Return type:

str

Raises:

RuntimeError – If the filter command fails.
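
The steps listed above can be sketched as separate command lines. This is an illustration only: build_snpsift_steps is a hypothetical helper, the executable name SnpSift assumes a wrapper script is on the PATH, and deriving the temporary file name by dropping the .gz suffix is an assumption made so that bgzip produces output_file directly.

```python
def build_snpsift_steps(variant_file, filter_string, cfg, output_file):
    """Return (tmp_vcf, commands) for filter, compress, index (hypothetical helper)."""
    if not output_file.endswith(".vcf.gz"):
        raise ValueError("output_file is expected to end with .vcf.gz")
    threads = str(cfg.get("threads", 1))
    tmp_vcf = output_file[: -len(".gz")]  # e.g. filtered.vcf.gz -> filtered.vcf
    commands = [
        # Step 1: SnpSift writes to stdout; the caller must capture it into tmp_vcf
        ["SnpSift", "filter", filter_string, variant_file],
        # Step 2: bgzip replaces tmp_vcf with tmp_vcf + ".gz", i.e. output_file
        ["bgzip", "-@", threads, tmp_vcf],
        # Step 3: index the compressed result
        ["bcftools", "index", "--threads", threads, output_file],
    ]
    return tmp_vcf, commands


tmp_vcf, steps = build_snpsift_steps(
    "extracted.vcf.gz", "(QUAL >= 30)", {"threads": 2}, "filtered.vcf.gz"
)
```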

variantcentrifuge.filters.filter_final_tsv_by_genotype(input_tsv, output_tsv, global_genotypes=None, gene_genotype_file=None, gene_column_name='GENE', gt_column_name='GT')

Filter the final TSV rows by genotype.

This can be done globally (using a single set of requested genotypes like {“het”}, {“hom”}, or {“comp_het”}) or on a per-gene basis if gene_genotype_file is provided. The gene_genotype_file must contain at least two columns:

GENE

GENOTYPES (one or more of het, hom, comp_het, comma-separated)

The logic is:

‘het’ => keep samples with genotype 0/1 or 1/0
‘hom’ => keep samples with genotype 1/1
‘comp_het’ => keep samples that have at least two distinct variants in the same gene with genotype 0/1 or 1/0

If a gene is defined in the gene_genotype_file, then the union of those genotype rules is applied. If a gene is not defined in that file (or if none is provided), global_genotypes is used.

The resulting TSV keeps only lines that have at least one sample fulfilling the chosen genotype filters. If no samples remain on a line, that line is discarded.
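
The rule lookup described above (gene-specific rules first, global_genotypes as the fallback) could look roughly like the sketch below. The helper names are hypothetical, and a tab-separated mapping file with a header row is an assumption; the actual file format accepted by the module may differ.

```python
import csv
import io


def load_gene_rules(handle):
    """Read GENE/GENOTYPES rows into {gene: set_of_rules} (tab-separated is assumed)."""
    rules = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        wanted = {g.strip() for g in row["GENOTYPES"].split(",") if g.strip()}
        # Several rows for one gene are merged into the union of their rules
        rules.setdefault(row["GENE"], set()).update(wanted)
    return rules


def rules_for_gene(gene, gene_rules, global_genotypes):
    """Use the gene-specific rules if defined, otherwise the global set."""
    if gene in gene_rules:
        return gene_rules[gene]
    return set(global_genotypes or ())


mapping = io.StringIO("GENE\tGENOTYPES\nBRCA2\thet,comp_het\n")
gene_rules = load_gene_rules(mapping)
```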

Additionally, if a sample passes because of ‘het’ or ‘hom’ or ‘comp_het’, we append a reason marker. For example:

325879(0/1:53,55:108) => 325879(0/1:53,55:108)(het) or 325879(0/1:53,55:108)(het,comphet)
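
To make the marker format concrete, here is a small sketch that extracts the main genotype from an entry and appends the reason label while leaving the extra fields untouched. The regular expression and the function names are assumptions made for illustration, not the module's actual parsing code.

```python
import re

# sample(genotype[:extra fields]); extra fields such as AD and DP are kept verbatim
ENTRY_RE = re.compile(r"^(?P<sample>[^()]+)\((?P<gt>[^:()]+)(?P<extra>:[^()]*)?\)$")


def main_genotype(entry):
    """Return the genotype before the first colon, or None if the entry is malformed."""
    match = ENTRY_RE.match(entry)
    return match.group("gt") if match else None


def classify(genotype):
    """Map a genotype string to 'het', 'hom', or None, following the rules above."""
    if genotype in ("0/1", "1/0"):
        return "het"
    if genotype == "1/1":
        return "hom"
    return None


def annotate(entry, reasons):
    """Append '(reason1,reason2)' to the entry; return it unchanged if no reasons."""
    return f"{entry}({','.join(reasons)})" if reasons else entry


entry = "325879(0/1:53,55:108)"
labelled = annotate(entry, [classify(main_genotype(entry))])
```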

Parameters:

input_tsv (str) – Path to the input TSV, e.g. the final genotype_replaced TSV.
output_tsv (str) – Path to the filtered output TSV.
global_genotypes (set of str, optional) – A set of genotype filters to apply globally if no gene-specific rule is found. E.g. {“het”}, {“hom”}, or {“comp_het”}, or any combination thereof.
gene_genotype_file (str, optional) – Path to a file with columns ‘GENE’ and ‘GENOTYPES’. Each row can specify one or more genotype filters for a given gene, e.g. “BRCA2 het,comp_het”.
gene_column_name (str) – The column name in the TSV that specifies the gene.
gt_column_name (str) – The column name in the TSV that specifies the genotype(s) for each sample.

Returns:

A new TSV is written to output_tsv.

Return type:

None

Notes

For ‘comp_het’, we group rows by gene, identify samples that appear at least twice (with 0/1 or 1/0), then keep those rows for those samples only. We also annotate each genotype with the reason(s) it passed (het, hom, comphet).
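
The grouping step in this note can be sketched as follows, assuming each row has already been parsed into a gene, a variant identifier, and a sample-to-genotype mapping. That input shape, the function name, and the variant identifiers in the example are all hypothetical; the module itself works on TSV rows.

```python
from collections import defaultdict

HET_GENOTYPES = {"0/1", "1/0"}


def comp_het_samples(rows):
    """Return {gene: samples with at least two distinct het variants in that gene}."""
    het_variants = defaultdict(set)  # (gene, sample) -> distinct het variant ids
    for gene, variant_id, genotypes in rows:
        for sample, genotype in genotypes.items():
            if genotype in HET_GENOTYPES:
                het_variants[(gene, sample)].add(variant_id)

    qualifying = defaultdict(set)
    for (gene, sample), variant_ids in het_variants.items():
        if len(variant_ids) >= 2:
            qualifying[gene].add(sample)
    return dict(qualifying)


rows = [
    ("BRCA2", "var1", {"S1": "0/1", "S2": "0/1"}),
    ("BRCA2", "var2", {"S1": "1/0", "S2": "0/0"}),
    ("TP53", "var3", {"S1": "0/1"}),
]
result = comp_het_samples(rows)
```

In this example only sample S1 carries two heterozygous variants in BRCA2, so only the S1 entries on the two BRCA2 rows would be kept and labelled.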

Examples

>>> filter_final_tsv_by_genotype(
...     "input.genotype_replaced.tsv",
...     "output.genotype_filtered.tsv",
...     global_genotypes={"comp_het"}
... )

variantcentrifuge.filters.filter_tsv_with_expression(input_tsv, output_tsv, filter_expression, pandas_query=True)

Filter a TSV file using a filter expression.

This function enables filtering on any column in the TSV, including computed columns like scores, inheritance patterns, and custom annotations that are added during the analysis pipeline.

Parameters:

input_tsv (str) – Path to the input TSV file
output_tsv (str) – Path to the output filtered TSV file
filter_expression (str) – Filter expression. If pandas_query is True, this should be a pandas query string (e.g., "Score > 0.5 & Impact == 'HIGH'"). If False, it should be a SnpSift-style expression, which is translated to pandas query syntax before it is applied.
pandas_query (bool) – If True, use pandas query syntax. If False, translate from SnpSift syntax.

Return type:

None
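
For the pandas_query=True case, the behaviour can be approximated with the sketch below, which works on in-memory TSV text so that it runs without any files. The helper filter_tsv_text and the column names are illustrative assumptions; the SnpSift translation path is not shown.

```python
import io

import pandas as pd


def filter_tsv_text(tsv_text, filter_expression):
    """Apply a pandas query string to TSV text and return the filtered TSV text."""
    df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
    kept = df.query(filter_expression)
    return kept.to_csv(sep="\t", index=False)


tsv = (
    "GENE\tScore\tImpact\n"
    "BRCA2\t0.9\tHIGH\n"
    "TP53\t0.2\tHIGH\n"
    "TTN\t0.8\tLOW\n"
)
out = filter_tsv_text(tsv, "Score > 0.5 & Impact == 'HIGH'")
```

Inside a query string, pandas gives & the precedence of and, so the comparisons do not need extra parentheses.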

variantcentrifuge.filters.filter_dataframe_with_query(df, filter_expression)

Filter a pandas DataFrame using a query expression.

Parameters:

df (DataFrame) – The input DataFrame.
filter_expression (str) – The query string to apply.

Returns:

The filtered DataFrame.

Return type:

DataFrame
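
As an illustration of the kind of expression this function accepts, the sketch below calls DataFrame.query directly on a small frame. The column names are hypothetical; backticks are the pandas way to reference a column whose name contains a space.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "GENE": ["BRCA2", "TP53", "TTN"],
        "Score": [0.9, 0.2, 0.8],
        "Inheritance Pattern": ["AD", "AR", "AD"],
    }
)

# query() returns a new DataFrame and leaves df unchanged
filtered = df.query("Score > 0.5 and `Inheritance Pattern` == 'AD'")
```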