Gene Burden Module

Statistical methods for gene burden analysis

Gene burden analysis module.

Implements a collapsing burden test (CMC/CAST) for rare variant association.

Two modes are supported:

  • “samples” (default): Counts unique carrier samples per gene (binary collapse). Each sample is counted once regardless of how many qualifying variants it carries. This is the CMC/CAST collapsing test (Li & Leal 2008, Morgenthaler & Thilly 2007).

  • “alleles”: For each sample, takes the maximum allele dosage (0/1/2) across all qualifying variant sites in the gene, then sums across samples. This preserves the diploid constraint (total <= 2*N) required for Fisher’s exact test.
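The difference between the two modes can be illustrated with a minimal sketch. The helper name `collapse_gene` and the dict-of-lists input are hypothetical and chosen for illustration only; the module itself works on DataFrames and may differ in detail.

```python
def collapse_gene(genotypes, mode="samples"):
    """Collapse per-variant dosages into one burden count for a gene.

    genotypes: dict mapping sample ID -> list of allele dosages (0/1/2),
    one entry per qualifying variant site in the gene (hypothetical layout).
    """
    if mode not in ("samples", "alleles"):
        raise ValueError(f"unknown mode: {mode!r}")
    total = 0
    for dosages in genotypes.values():
        # Take the per-sample maximum so a sample with several
        # qualifying variants is never counted more than once.
        max_dosage = max(dosages, default=0)
        if mode == "samples":
            total += 1 if max_dosage > 0 else 0  # binary carrier status
        else:
            total += max_dosage  # at most 2 per sample
    return total


cases = {"S1": [1, 1], "S2": [0, 2], "S3": [0, 0]}
print(collapse_gene(cases, "samples"))  # 2 carriers (S1, S2)
print(collapse_gene(cases, "alleles"))  # 1 + 2 + 0 = 3, within 2*N = 6
```

Summing the raw dosages instead would give 4 for these three samples, because S1 would be counted once per variant site. The per-sample maximum is what avoids this double-counting.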

Statistical testing uses Fisher’s exact test on a 2x2 contingency table with Benjamini-Hochberg FDR or Bonferroni correction for multiple testing.
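As a rough illustration of the testing step, the sketch below computes a two-sided Fisher’s exact p-value from the hypergeometric distribution and applies either correction. It is a standard-library sketch, not the module’s code. The function names are hypothetical, and the table layout (rows are cases and controls, columns are with and without a qualifying variant) is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def adjust_pvalues(pvals, method="fdr"):
    """Benjamini-Hochberg ("fdr") or Bonferroni adjusted p-values."""
    m = len(pvals)
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvals]
    if method != "fdr":
        raise ValueError(f"unknown method: {method!r}")
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 3 of 4 cases and 1 of 4 controls carry a qualifying variant.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p, 4))  # 0.4857
```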

References

  • Li B, Leal SM. Am J Hum Genet. 2008;83(3):311-321 (CMC method)

  • Morgenthaler S, Thilly WG. Mutat Res. 2007;615(1-2):28-56 (CAST)

variantcentrifuge.gene_burden.perform_gene_burden_analysis(df, cfg, case_samples=None, control_samples=None, vcf_samples=None)

Perform gene burden analysis using a collapsing test with Fisher’s exact test.

When case_samples and control_samples are provided, uses proper per-sample collapsing (CMC/CAST method) to avoid double-counting samples with variants at multiple sites in the same gene.

Three aggregation strategies (selected automatically by priority):

  1. Column-based (fastest): Uses per-sample GT columns (GEN_0__GT, etc.) when vcf_samples is provided and columns exist in the DataFrame.

  2. Packed GT string: Parses “Sample1(0/1);Sample2(1/1)” format.

  3. Legacy: Sums pre-computed per-variant counts (backward compatibility).
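To make the packed GT format concrete, here is a small sketch of how a string such as “Sample1(0/1);Sample2(1/1)” could be turned into per-sample allele dosages. The function `parse_packed_gt` is hypothetical. The sketch assumes the parentheses hold only a bare genotype, with `/` or `|` as the allele separator and `.` for a missing call; the module’s own parser may accept more.

```python
import re

# "<sample>(<genotype>)", e.g. "Sample1(0/1)"
_GT_ENTRY = re.compile(r"^(?P<sample>.+)\((?P<gt>[^()]*)\)$")


def parse_packed_gt(packed):
    """Return {sample ID: allele dosage (0/1/2)} for one variant row."""
    dosages = {}
    for entry in packed.split(";"):
        match = _GT_ENTRY.match(entry.strip())
        if match is None:
            continue  # skip empty or malformed entries
        alleles = re.split(r"[/|]", match.group("gt"))
        # Count non-reference alleles; missing calls contribute 0.
        dosage = sum(1 for a in alleles if a not in ("0", ".", ""))
        dosages[match.group("sample")] = min(dosage, 2)
    return dosages


print(parse_packed_gt("Sample1(0/1);Sample2(1/1)"))
# {'Sample1': 1, 'Sample2': 2}
```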

Parameters:
  • df (pd.DataFrame) – DataFrame with variant data. Must include “GENE” column.

  • cfg (dict) – Configuration dictionary with keys:

      • “gene_burden_mode”: “samples” (carrier collapse) or “alleles” (max dosage)

      • “correction_method”: “fdr” or “bonferroni”

      • “confidence_interval_method”: str (optional, default “normal_approx”)

      • “confidence_interval_alpha”: float (optional, default 0.05)

  • case_samples (set of str, optional) – Case sample IDs. When provided with control_samples, enables proper per-sample collapsing.

  • control_samples (set of str, optional) – Control sample IDs.

  • vcf_samples (list of str, optional) – Ordered VCF sample names. When provided with per-sample GT columns, enables fast column-based aggregation.

Returns:

Gene-level burden results with p-values, odds ratios, and CIs.

Return type:

pd.DataFrame
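The odds ratios and confidence intervals in the result can be pictured with the sketch below. It assumes that “normal_approx” refers to the usual normal approximation on the log odds ratio (Woolf interval), which the documentation above does not confirm. The function `odds_ratio_ci` and the 0.5 continuity correction for zero cells are illustrative choices, not the module’s behaviour.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Odds ratio and (1 - alpha) CI for the table [[a, b], [c, d]].

    Rows are assumed to be cases and controls, columns carriers and
    non-carriers.
    """
    if 0 in (a, b, c, d):
        # Assumption of this sketch: add 0.5 to every cell so that
        # the ratio and its standard error stay finite.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# 12 of 100 cases and 5 of 100 controls carry a qualifying variant.
or_, lower, upper = odds_ratio_ci(12, 88, 5, 95)
print(f"OR={or_:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A smaller `alpha` widens the interval, matching the role of “confidence_interval_alpha” in cfg.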