Analyze Variants Module

The analyze_variants module provides variant-level analysis and gene burden testing.

Variant analysis module for gene burden and other statistics.

Refactored to be more modular:

  - Basic and comprehensive stats computations moved to stats.py.

  - Gene burden analysis moved to gene_burden.py.

  - Helper functions moved to helpers.py.

This module now:

  - Reads a TSV of variants and their annotations (including a GT column listing sample genotypes).

  - Classifies samples into case/control sets based on user input (sample lists or phenotype terms).

  - If no case/control criteria are provided, defaults to making all samples controls.

  - Invokes external modules (stats.py, gene_burden.py) for computations.

  - Orchestrates flow and writes or yields output lines.
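The case/control classification described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name assign_case_control is hypothetical, only explicit sample lists are handled (phenotype terms are left out), and the rule that unlisted samples fall into the complementary group is an assumption. Only the "no criteria means all controls" default comes from the description.

```python
from typing import Iterable, Optional, Set, Tuple


def assign_case_control(
    all_samples: Iterable[str],
    case_samples: Optional[Iterable[str]] = None,
    control_samples: Optional[Iterable[str]] = None,
) -> Tuple[Set[str], Set[str]]:
    """Split samples into (cases, controls) from explicit sample lists.

    Hypothetical helper for illustration only.
    """
    all_set = set(all_samples)
    # Ignore any listed sample that is not in the full sample list.
    cases = set(case_samples or []) & all_set
    controls = set(control_samples or []) & all_set

    if not cases and not controls:
        # No criteria supplied: every sample becomes a control.
        return set(), all_set
    if cases and not controls:
        # Assumption: samples not listed as cases are treated as controls.
        controls = all_set - cases
    elif controls and not cases:
        # Assumption: samples not listed as controls are treated as cases.
        cases = all_set - controls
    return cases, controls


samples = "S1,S2,S3,S4".split(",")
default_cases, default_controls = assign_case_control(samples)
listed_cases, listed_controls = assign_case_control(samples, case_samples=["S1", "S2"])
```

With no criteria, default_cases is empty and default_controls holds all four samples; with only case samples listed, the remaining samples end up in listed_controls under the stated assumption.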

The refactoring preserves the previous functionality, command-line interface, and output format.

variantcentrifuge.analyze_variants.analyze_variants(lines, cfg)

Analyze variants and optionally perform gene burden analysis.

Return type:

Iterator[str]

Steps

  1. Parse input TSV into a DataFrame.

  2. Retrieve the full sample list from cfg["sample_list"].

  3. Determine case/control sets based on cfg (samples or phenotypes).

  4. Compute per-variant case/control allele counts.

  5. Compute basic and optionally comprehensive gene-level stats.

  6. If requested, perform gene burden analysis (Fisher's exact test followed by multiple-testing correction).
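Step 6 can be illustrated with a standard-library sketch of a two-sided Fisher's exact test and two common corrections. Everything here is an assumption made for illustration: the function names, the 2x2 table layout (carriers and non-carriers in cases and controls), the example counts, and the mapping of "fdr" to the Benjamini-Hochberg procedure. The actual logic lives in gene_burden.py and may differ.

```python
from math import comb
from typing import List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(pvals: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene counts:
# (case carriers, case non-carriers, control carriers, control non-carriers)
gene_tables = {"GENE_A": (3, 1, 1, 3), "GENE_B": (5, 0, 0, 5)}
raw_p = {gene: fisher_exact_two_sided(*t) for gene, t in gene_tables.items()}
fdr_p = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
```

In practice a library routine such as SciPy's fisher_exact would normally replace the hand-written test; the pure-Python version is used here only to keep the sketch self-contained.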

type lines:

Iterator[str]

param lines:

Input lines representing a TSV with variant data.

type cfg:

Dict[str, Any]

param cfg:

Configuration dictionary. Keys include:

  - sample_list (str): comma-separated full sample list from VCF

  - case_samples, control_samples (List[str]): optional lists of samples

  - case_phenotypes, control_phenotypes (List[str]): optional lists of phenotypes

  - perform_gene_burden (bool): Whether to perform gene burden analysis

  - gene_burden_mode (str): "samples" or "alleles"

  - correction_method (str): "fdr" or "bonferroni"

  - no_stats (bool): Skip stats if True

  - stats_output_file (str): Path to stats output

  - gene_burden_output_file (str, optional): Path to gene burden output

  - xlsx (bool): If True, might append to Excel after analysis
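A configuration dictionary using the keys listed for cfg might look like the sketch below. The sample names and file paths are placeholder values, not defaults taken from the library, and the commented-out call assumes that variantcentrifuge is installed and that lines is an iterator over the input TSV.

```python
from typing import Any, Dict

# Placeholder values for illustration; only the key names come from the documentation.
cfg: Dict[str, Any] = {
    "sample_list": "S1,S2,S3,S4",
    "case_samples": ["S1", "S2"],
    "control_samples": ["S3", "S4"],
    "case_phenotypes": [],
    "control_phenotypes": [],
    "perform_gene_burden": True,
    "gene_burden_mode": "samples",      # or "alleles"
    "correction_method": "fdr",         # or "bonferroni"
    "no_stats": False,
    "stats_output_file": "stats.tsv",
    "gene_burden_output_file": "gene_burden.tsv",
    "xlsx": False,
}

# The full sample list is a comma-separated string, so it splits into names directly.
all_samples = cfg["sample_list"].split(",")

# Intended usage (not executed here):
# from variantcentrifuge.analyze_variants import analyze_variants
# for line in analyze_variants(lines, cfg):
#     print(line)
```

Because the function returns an iterator of strings, output lines can be consumed one at a time, for example by printing them or writing them to a file.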

Yields:

str – Processed lines of output TSV or gene-level burden results.