Gene Burden Module
Statistical methods for gene burden analysis
Gene burden analysis module.
Provides:
perform_gene_burden_analysis: Aggregates per-gene counts (samples or alleles), performs Fisher’s exact test, calculates confidence intervals, and applies multiple testing correction.
New Features (Issue #21):
Adds confidence intervals to the gene burden metrics (e.g., odds ratio).
Confidence interval calculation method and confidence level can be configured.
Updated for Issue #31:
Improved handling of edge cases (e.g., infinite or zero odds_ratio). The module now detects structural zeros and applies a continuity correction to zero cells so that meaningful confidence intervals can be calculated.
Uses score method as primary CI calculation (more robust for sparse data).
Returns NaN for structural zeros where OR cannot be calculated.
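The edge-case handling described above can be illustrated with a minimal sketch. It is not the module's implementation: the function name odds_ratio_ci is hypothetical, it uses a Woolf (log-normal) interval rather than the score method, it treats a "structural zero" as an empty row or column of the 2x2 table, and it adds the correction to all four cells (Haldane-Anscombe style) whenever any cell is zero. The module may differ on each of these points.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, alpha=0.05, correction=0.5):
    """Return (odds_ratio, ci_lower, ci_upper) for the table [[a, b], [c, d]].

    a, b: cases with / without a qualifying variant
    c, d: controls with / without a qualifying variant
    """
    nan = float("nan")
    # Assumption: a structural zero is an empty row or column, in which
    # case the odds ratio is undefined and NaN is returned.
    if (a + b == 0) or (c + d == 0) or (a + c == 0) or (b + d == 0):
        return nan, nan, nan
    # Assumption: when any single cell is zero, add the correction to all
    # four cells so that the odds ratio and its standard error stay finite.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    # Woolf interval: exp(log(OR) +/- z * SE), SE = sqrt(sum of 1/cell).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)
```

With a zero control cell, for example odds_ratio_ci(3, 97, 0, 100), the correction yields a finite odds ratio and interval instead of infinity. A table with no variant carriers in either group returns NaN for all three values.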
Configuration Additions:
- “confidence_interval_method”: str
Method for confidence interval calculation. Defaults to “normal_approx”.
- “confidence_interval_alpha”: float
Significance level for confidence interval calculation. Default: 0.05 for a 95% CI.
- “continuity_correction”: float
Value to add to zero cells for continuity correction. Default: 0.5.
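As an illustration of how these keys and their documented defaults might be resolved from a configuration dictionary, here is a small sketch. The helper resolve_ci_settings, the CI_DEFAULTS mapping, and the derived "confidence_level" entry are hypothetical and not part of the module.

```python
# Documented defaults for the confidence interval settings.
CI_DEFAULTS = {
    "confidence_interval_method": "normal_approx",
    "confidence_interval_alpha": 0.05,
    "continuity_correction": 0.5,
}


def resolve_ci_settings(cfg):
    """Merge user configuration with the documented defaults (hypothetical helper)."""
    settings = {key: cfg.get(key, default) for key, default in CI_DEFAULTS.items()}
    alpha = settings["confidence_interval_alpha"]
    if not 0 < alpha < 1:
        raise ValueError("confidence_interval_alpha must be between 0 and 1")
    # alpha = 0.05 corresponds to a 95% confidence interval.
    settings["confidence_level"] = 1 - alpha
    return settings


# Example: request a 99% interval and keep the other defaults.
cfg = {"confidence_interval_alpha": 0.01}
settings = resolve_ci_settings(cfg)
```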
Outputs
In addition to the existing columns (p-values, odds ratio), the result now includes:
- “or_ci_lower”: Lower bound of the odds ratio confidence interval.
- “or_ci_upper”: Upper bound of the odds ratio confidence interval.
- variantcentrifuge.gene_burden.perform_gene_burden_analysis(df, cfg)
Perform gene burden analysis for each gene.
Steps
Aggregate variant counts per gene based on the chosen mode (samples or alleles).
Perform Fisher’s exact test for each gene.
Apply multiple testing correction (FDR or Bonferroni).
Compute and add confidence intervals for the odds ratio.
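The first two steps can be sketched with the standard library alone. This is an illustration, not the module's pandas-based code. The function names are hypothetical, and so is the table layout: it assumes carrier counts are summed per gene, that the cohort sizes "proband_count" and "control_count" are identical on every row, and that the alleles mode uses twice the cohort size as its denominator.

```python
from collections import defaultdict
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x hits in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def burden_p_values(rows, mode="samples"):
    """Aggregate per-variant rows by gene and test each gene."""
    key = "variant_count" if mode == "samples" else "allele_count"
    scale = 1 if mode == "samples" else 2  # assumption: diploid allele totals
    totals = defaultdict(lambda: {"case_hits": 0, "ctrl_hits": 0})
    for row in rows:
        t = totals[row["GENE"]]
        t["case_hits"] += row[f"proband_{key}"]
        t["ctrl_hits"] += row[f"control_{key}"]
        t["cases"] = row["proband_count"]
        t["ctrls"] = row["control_count"]
    results = {}
    for gene, t in totals.items():
        a, c = t["case_hits"], t["ctrl_hits"]
        # Clamp at zero in case summed counts exceed the denominator.
        b = max(0, scale * t["cases"] - a)
        d = max(0, scale * t["ctrls"] - c)
        results[gene] = fisher_exact_two_sided(a, b, c, d)
    return results
```

The returned raw p-values would then go through the multiple testing correction and confidence interval steps.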
- param df:
DataFrame with per-variant case/control counts. Must include columns: “GENE”, “proband_count”, “control_count”, “proband_variant_count”, “control_variant_count”, “proband_allele_count”, “control_allele_count”.
- type df:
pd.DataFrame
- param cfg:
Configuration dictionary with keys:
- “gene_burden_mode”: str
“samples” or “alleles” indicating the aggregation mode.
- “correction_method”: str
“fdr” or “bonferroni” for multiple testing correction.
- “confidence_interval_method”: str (optional)
Method for CI calculation. Defaults to “normal_approx”.
- “confidence_interval_alpha”: float (optional)
Significance level for CI, defaults to 0.05 for a 95% CI.
- type cfg:
dict
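To show what the two “correction_method” values typically mean, here is a stdlib-only sketch. It assumes that “fdr” refers to the Benjamini-Hochberg procedure, which the source does not state explicitly. The function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def correct_p_values(p_values, method="fdr"):
    """Dispatch on the configured correction method."""
    if method == "bonferroni":
        return bonferroni(p_values)
    if method == "fdr":
        return benjamini_hochberg(p_values)
    raise ValueError(f"unknown correction method: {method!r}")
```

Bonferroni controls the family-wise error rate and is the more conservative choice. Benjamini-Hochberg controls the false discovery rate and usually retains more genes when many are tested.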
- returns:
A DataFrame with gene-level burden results, including p-values, odds ratios, and confidence intervals.
The output includes:
“GENE”
“proband_count”, “control_count”
Either “proband_variant_count”, “control_variant_count” or “proband_allele_count”, “control_allele_count” depending on mode
“raw_p_value”
“corrected_p_value”
“odds_ratio”
“or_ci_lower”
“or_ci_upper”
- rtype:
pd.DataFrame
Notes
If no Fisher’s exact test implementation (fisher_exact) is available, p-values default to 1.0. The confidence interval calculation now tries multiple methods and uses fallback intervals in edge cases (Issue #31).
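One way such a fallback can be written is sketched below. It assumes that "not available" means SciPy's fisher_exact cannot be imported. The wrapper name fisher_p_value is hypothetical, and the module's actual mechanism may differ.

```python
try:
    from scipy.stats import fisher_exact
except ImportError:  # assumption: SciPy is the optional dependency
    fisher_exact = None


def fisher_p_value(table):
    """Return the Fisher's exact p-value for a 2x2 table, or 1.0 as a fallback."""
    if fisher_exact is None:
        # Without a test implementation, report a non-significant result.
        return 1.0
    _, p_value = fisher_exact(table)
    return float(p_value)


p = fisher_p_value([[3, 1], [1, 3]])
```

Defaulting to 1.0 keeps the output columns populated and never flags a gene as significant when no test could be run.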