Gene Burden Module
Statistical methods for gene burden analysis
Gene burden analysis module.
Implements a collapsing burden test (CMC/CAST) for rare variant association.
Two modes are supported:
- “samples” (default): Counts unique carrier samples per gene (binary collapse). Each sample is counted once regardless of how many qualifying variants it carries. This is the CMC/CAST collapsing test (Li & Leal 2008, Morgenthaler & Thilly 2007).
- “alleles”: For each sample, takes the maximum allele dosage (0/1/2) across all qualifying variant sites in the gene, then sums across samples. This preserves the diploid constraint (total <= 2*N) required for Fisher’s exact test.
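The difference between the two modes can be illustrated with a minimal sketch. The helper `collapse_gene` and its input layout (a dict mapping each sample to its per-site allele dosages) are hypothetical and chosen for illustration; they are not part of the variantcentrifuge API.

```python
def collapse_gene(dosages_by_sample, mode="samples"):
    """Collapse per-site allele dosages into one gene-level count.

    dosages_by_sample: dict of sample ID -> list of dosages (0/1/2),
    one entry per qualifying variant site in the gene (assumed layout).
    """
    if mode not in ("samples", "alleles"):
        raise ValueError(f"unknown mode: {mode!r}")
    total = 0
    for dosages in dosages_by_sample.values():
        # Maximum dosage across all qualifying sites for this sample.
        max_dosage = max(dosages, default=0)
        if mode == "samples":
            # Binary collapse: a carrier counts once, however many sites.
            total += 1 if max_dosage > 0 else 0
        else:
            # Max-dosage collapse: each sample contributes at most 2.
            total += max_dosage
    return total


gene_dosages = {
    "S1": [1, 1],  # heterozygous at two sites: one carrier, max dosage 1
    "S2": [0, 2],  # homozygous alternate at one site: max dosage 2
    "S3": [0, 0],  # non-carrier
}
carriers = collapse_gene(gene_dosages, mode="samples")
alleles = collapse_gene(gene_dosages, mode="alleles")
```

Here S1 is counted once in “samples” mode even though it carries variants at two sites, and the “alleles” total can never exceed twice the number of samples.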
Statistical testing uses Fisher’s exact test on a 2x2 contingency table with Benjamini-Hochberg FDR or Bonferroni correction for multiple testing.
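As a rough illustration of the testing step, the sketch below computes a two-sided Fisher’s exact p-value from the hypergeometric distribution and applies Benjamini-Hochberg or Bonferroni adjustment, using only the standard library. The function names, the table layout (cases/controls by carriers/non-carriers), and the counts are assumptions made for this example; the module’s own implementation may differ.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at most as likely as the observed one;
    # the small relative tolerance absorbs floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(pvalues):
    """Bonferroni adjusted p-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical per-gene counts in "samples" mode:
# (case carriers, case non-carriers, control carriers, control non-carriers)
tables = {
    "GENE_A": (8, 92, 2, 198),
    "GENE_B": (3, 97, 6, 194),
}
raw_p = [fisher_exact_two_sided(*t) for t in tables.values()]
fdr_p = benjamini_hochberg(raw_p)
```

Enumerating the hypergeometric support directly is simple but slow for very large cohorts; a library routine such as SciPy’s `fisher_exact` would be the usual choice in practice.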
References
Li B, Leal SM. Am J Hum Genet. 2008;83(3):311-321 (CMC method)
Morgenthaler S, Thilly WG. Mutat Res. 2007;615(1-2):28-56 (CAST)
- variantcentrifuge.gene_burden.perform_gene_burden_analysis(df, cfg, case_samples=None, control_samples=None, vcf_samples=None)
Perform gene burden analysis using a collapsing test with Fisher’s exact test.
When case_samples and control_samples are provided, uses proper per-sample collapsing (CMC/CAST method) to avoid double-counting samples with variants at multiple sites in the same gene.
Three aggregation strategies (selected automatically by priority):
1. Column-based (fastest): Uses per-sample GT columns (GEN_0__GT, etc.) when vcf_samples is provided and the columns exist in the DataFrame.
2. Packed GT string: Parses the “Sample1(0/1);Sample2(1/1)” format.
3. Legacy: Sums pre-computed per-variant counts (backward compatibility).
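To make the packed GT string format concrete, the following sketch parses a string such as “Sample1(0/1);Sample2(1/1)” into per-sample allele dosages. The function `parse_packed_gt` is a hypothetical helper written for this illustration, and it assumes each entry holds only a sample name followed by a diploid genotype in parentheses; the library’s actual parser may handle additional fields.

```python
import re

# One entry looks like "Sample1(0/1)": a sample name, then a genotype in parentheses.
_GT_ENTRY = re.compile(r"^(?P<sample>.+)\((?P<gt>[^()]*)\)$")


def parse_packed_gt(packed):
    """Return a dict of sample ID -> alternate allele dosage (0, 1, or 2)."""
    dosages = {}
    for entry in packed.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        match = _GT_ENTRY.match(entry)
        if match is None:
            # Skip entries that do not follow the assumed format.
            continue
        # Accept both unphased ("/") and phased ("|") separators.
        alleles = re.split(r"[/|]", match.group("gt"))
        # Count non-reference, non-missing alleles, capped at 2.
        dosage = sum(1 for allele in alleles if allele not in ("0", ".", ""))
        dosages[match.group("sample")] = min(dosage, 2)
    return dosages


parsed = parse_packed_gt("Sample1(0/1);Sample2(1/1)")
```

Missing genotypes such as `./.` are treated as dosage 0 here, which is a simplifying assumption of this sketch.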
- Parameters:
df (pd.DataFrame) – DataFrame with variant data. Must include “GENE” column.
cfg (dict) – Configuration dictionary with keys:
- “gene_burden_mode”: “samples” (carrier collapse) or “alleles” (max dosage)
- “correction_method”: “fdr” or “bonferroni”
- “confidence_interval_method”: str (optional, default “normal_approx”)
- “confidence_interval_alpha”: float (optional, default 0.05)
case_samples (set of str, optional) – Case sample IDs. When provided with control_samples, enables proper per-sample collapsing.
control_samples (set of str, optional) – Control sample IDs.
vcf_samples (list of str, optional) – Ordered VCF sample names. When provided with per-sample GT columns, enables fast column-based aggregation.
- Returns:
Gene-level burden results with p-values, odds ratios, and CIs.
- Return type:
pd.DataFrame
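The cfg keys documented under Parameters can be assembled into a dictionary as sketched below. The `check_burden_cfg` helper is hypothetical and only mirrors the allowed values listed in this section; it is not part of the library, and the case and control sample IDs are placeholders.

```python
# Configuration using only the keys documented for perform_gene_burden_analysis.
cfg = {
    "gene_burden_mode": "samples",                  # or "alleles"
    "correction_method": "fdr",                     # or "bonferroni"
    "confidence_interval_method": "normal_approx",  # optional, documented default
    "confidence_interval_alpha": 0.05,              # optional, documented default
}

# Placeholder sample sets; real IDs must match the sample names in the data.
case_samples = {"Case1", "Case2"}
control_samples = {"Ctrl1", "Ctrl2", "Ctrl3"}


def check_burden_cfg(cfg):
    """Hypothetical sanity check based on the values listed in this section."""
    if cfg["gene_burden_mode"] not in ("samples", "alleles"):
        raise ValueError("gene_burden_mode must be 'samples' or 'alleles'")
    if cfg["correction_method"] not in ("fdr", "bonferroni"):
        raise ValueError("correction_method must be 'fdr' or 'bonferroni'")
    alpha = cfg.get("confidence_interval_alpha", 0.05)
    if not 0.0 < alpha < 1.0:
        raise ValueError("confidence_interval_alpha must be between 0 and 1")
    return True


cfg_ok = check_burden_cfg(cfg)
```

With a variant DataFrame `df` that includes a “GENE” column, the call would then take the form `perform_gene_burden_analysis(df, cfg, case_samples=case_samples, control_samples=control_samples)`.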