Analyze Variants Module
The analyze_variants module provides variant-level analysis and gene burden testing.
Variant analysis module for gene burden and other statistics.
Refactored to be more modular:
- Basic and comprehensive stats computations moved to stats.py.
- Gene burden analysis moved to gene_burden.py.
- Helper functions moved to helpers.py.
This module now:
- Reads a TSV of variants and their annotations (including a GT column listing sample genotypes).
- Classifies samples into case/control sets based on user input (sample lists or phenotype terms).
- If no case/control criteria are provided, defaults to making all samples controls.
- Invokes external modules (stats.py, gene_burden.py) for computations.
- Orchestrates flow and writes or yields output lines.
The refactoring maintains the previous functionality, command-line interface, and output format.
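The case/control classification described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `classify_samples` and the two fallback rules for a one-sided specification are assumptions. Only the default of treating every sample as a control when no criteria are given comes from the description. Phenotype-based classification is left out.

```python
from typing import List, Set, Tuple


def classify_samples(
    all_samples: List[str],
    case_samples: List[str],
    control_samples: List[str],
) -> Tuple[Set[str], Set[str]]:
    """Split samples into (cases, controls) from explicit sample lists.

    Hypothetical helper for illustration only.
    """
    known = set(all_samples)
    # Ignore names that do not occur in the full sample list.
    cases = set(case_samples) & known
    controls = set(control_samples) & known

    if not case_samples and not control_samples:
        # Documented default: no criteria, so every sample is a control.
        return set(), known
    if case_samples and not control_samples:
        # Assumption: samples not listed as cases become controls.
        controls = known - cases
    elif control_samples and not case_samples:
        # Assumption: samples not listed as controls become cases.
        cases = known - controls
    return cases, controls


if __name__ == "__main__":
    print(classify_samples(["S1", "S2", "S3"], ["S1"], []))
```

Returning sets keeps membership checks cheap when each genotype entry is later assigned to one of the two groups.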
- variantcentrifuge.analyze_variants.analyze_variants(lines, cfg)
Analyze variants and optionally perform gene burden analysis.
Steps
1. Parse input TSV into a DataFrame.
2. Retrieve the full sample list from cfg["sample_list"].
3. Determine case/control sets based on cfg (samples or phenotypes).
4. Compute per-variant case/control allele counts.
5. Compute basic and optionally comprehensive gene-level stats.
6. If requested, perform gene burden (Fisher’s exact test + correction).
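To make the final step concrete, here is a standard-library sketch of a two-sided Fisher's exact test on a 2x2 table, followed by the two correction methods named in the configuration. It only illustrates the statistics involved; the implementation in gene_burden.py may differ (for example by relying on a statistics library). The function names and the layout of the 2x2 table are assumptions.

```python
from math import comb
from typing import List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Assumed layout: row 1 = cases, row 2 = controls,
    column 1 = carrying a variant, column 2 = not carrying one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / denom
        for x in range(lo, hi + 1)
    }
    observed = probs[a]
    # Sum the tables that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(pr for pr in probs.values() if pr <= observed * (1 + 1e-7))
    return min(1.0, p)


def bonferroni(pvalues: List[float]) -> List[float]:
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg FDR adjustment, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down to keep the result monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [fisher_exact_two_sided(3, 1, 1, 3), fisher_exact_two_sided(8, 2, 1, 9)]
    print(raw, benjamini_hochberg(raw))
```

In a per-gene burden test, one such table would be built for each gene and the resulting p-values corrected together, since the correction depends on the total number of genes tested.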
- Parameters:
  - lines (Iterator[str]) – Input lines representing a TSV with variant data.
  - cfg (Dict[str, Any]) – Configuration dictionary. Keys include:
    - sample_list (str): comma-separated full sample list from VCF
    - case_samples, control_samples (List[str]): optional lists of samples
    - case_phenotypes, control_phenotypes (List[str]): optional lists of phenotypes
    - perform_gene_burden (bool): Whether to perform gene burden analysis
    - gene_burden_mode (str): "samples" or "alleles"
    - correction_method (str): "fdr" or "bonferroni"
    - no_stats (bool): Skip stats if True
    - stats_output_file (str): Path to stats output
    - gene_burden_output_file (str, optional): Path to gene burden output
    - xlsx (bool): If True, might append to Excel after analysis
- Yields:
str – Processed lines of output TSV or gene-level burden results.
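As an illustration of the configuration described above, the following sketch builds a cfg dictionary from the documented keys and checks it before use. All values (sample names, file paths) are made up, and `validate_cfg` is a hypothetical helper that is not part of the package; it only encodes the allowed values listed in the key descriptions.

```python
from typing import Any, Dict, List

# Example configuration using the documented keys; values are placeholders.
cfg: Dict[str, Any] = {
    "sample_list": "S1,S2,S3,S4",
    "case_samples": ["S1", "S2"],
    "control_samples": ["S3", "S4"],
    "case_phenotypes": [],
    "control_phenotypes": [],
    "perform_gene_burden": True,
    "gene_burden_mode": "samples",
    "correction_method": "fdr",
    "no_stats": False,
    "stats_output_file": "stats.tsv",
    "gene_burden_output_file": "gene_burden.tsv",
    "xlsx": False,
}


def validate_cfg(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in cfg (empty if none).

    Hypothetical helper for illustration only.
    """
    problems: List[str] = []
    mode = cfg.get("gene_burden_mode")
    if mode is not None and mode not in {"samples", "alleles"}:
        problems.append(f"unknown gene_burden_mode: {mode!r}")
    method = cfg.get("correction_method")
    if method is not None and method not in {"fdr", "bonferroni"}:
        problems.append(f"unknown correction_method: {method!r}")
    # sample_list is documented as a comma-separated string.
    known = set(cfg.get("sample_list", "").split(","))
    for key in ("case_samples", "control_samples"):
        unknown = [s for s in cfg.get(key, []) if s not in known]
        if unknown:
            problems.append(f"{key} not in sample_list: {unknown}")
    return problems


if __name__ == "__main__":
    print(validate_cfg(cfg))
```

With the package installed, a dictionary like this would be passed as `analyze_variants(lines, cfg)`, and the yielded strings consumed one at a time, for example by writing each to an output file.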