Gene Burden Module
Statistical methods for gene burden analysis
Gene burden analysis module.
Provides:
- perform_gene_burden_analysis: Aggregates per-gene counts (samples or alleles), performs Fisher’s exact test, calculates confidence intervals, and applies multiple testing correction.
New Features (Issue #21):
- Adds confidence intervals to the gene burden metrics (e.g., odds ratio).
- Confidence interval calculation method and confidence level can be configured.
Updated for Issue #31:
- Improved handling of edge cases (e.g., infinite or zero odds_ratio).
- Instead of returning NaN for confidence intervals, the module attempts a fallback method (“logit”) if the primary method fails. If the result is still invalid, it returns bounded fallback intervals to ensure meaningful output.
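The fallback chain described above can be illustrated with a minimal standalone sketch. It is not the module’s code: instead of statsmodels’ Table2x2, it assumes a log-scale normal approximation (Woolf interval), retries with 0.5 added to every cell as a stand-in for the “logit” fallback, and finally returns a hypothetical bounded interval. The name `odds_ratio_ci` and the bounds `FALLBACK_LOWER` and `FALLBACK_UPPER` are illustrative assumptions.

```python
import math
from statistics import NormalDist

# Hypothetical bounds; the module's real fallback values are not documented here.
FALLBACK_LOWER, FALLBACK_UPPER = 0.001, 1000.0


def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Return (lower, upper) for the odds ratio of the table [[a, b], [c, d]]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)

    def wald(a, b, c, d):
        # Normal approximation on the log odds ratio.
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        return math.exp(log_or - z * se), math.exp(log_or + z * se)

    # Attempt 1: raw counts. Attempt 2: add 0.5 to each cell (assumed fallback).
    attempts = ((a, b, c, d), (a + 0.5, b + 0.5, c + 0.5, d + 0.5))
    for cells in attempts:
        try:
            lower, upper = wald(*cells)
        except (ZeroDivisionError, ValueError):
            continue  # a zero cell broke this attempt; try the next one
        if math.isfinite(lower) and math.isfinite(upper) and 0 < lower < upper:
            return lower, upper

    # Last resort: a fixed interval instead of NaN.
    return FALLBACK_LOWER, FALLBACK_UPPER


print(odds_ratio_ci(10, 20, 5, 40))  # regular table
print(odds_ratio_ci(0, 20, 5, 40))   # zero cell, handled by the second attempt
```

With non-negative counts the second attempt always yields a finite interval, so the bounded fallback in this sketch only guards against malformed input.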
Configuration Additions
- “confidence_interval_method”: str
Method for confidence interval calculation. Supported value: “normal_approx”, which uses statsmodels’ Table2x2 normal approximation for the odds ratio CI. It falls back to “logit” if the normal approximation fails, and then to a bounded fallback interval.
- “confidence_interval_alpha”: float
Significance level for confidence interval calculation. Default: 0.05 for a 95% CI.
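As a small illustration, the two keys above could be set as follows. The key names come from this page; the values are example choices, and the variable names are only for demonstration.

```python
# Confidence-interval settings for the gene burden analysis.
cfg = {
    "confidence_interval_method": "normal_approx",
    "confidence_interval_alpha": 0.05,
}

# The significance level maps to the confidence level as 1 - alpha.
confidence_level = 1 - cfg["confidence_interval_alpha"]
print(f"{confidence_level:.0%} confidence interval")
```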
Outputs
In addition to existing columns (p-values, odds ratio), the result now includes:
- “or_ci_lower”: Lower bound of the odds ratio confidence interval.
- “or_ci_upper”: Upper bound of the odds ratio confidence interval.
- variantcentrifuge.gene_burden.perform_gene_burden_analysis(df, cfg)
Perform gene burden analysis for each gene.
Steps
Aggregate variant counts per gene based on the chosen mode (samples or alleles).
Perform Fisher’s exact test for each gene.
Apply multiple testing correction (FDR or Bonferroni).
Compute and add confidence intervals for the odds ratio.
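The first two steps can be sketched in plain Python. This is an assumption-laden illustration, not the module’s implementation: it assumes “samples” mode, assumes the 2x2 table is built as carriers versus non-carriers in each cohort, and uses a hand-written two-sided Fisher’s exact test. The helper names `aggregate_by_gene` and `fisher_exact_two_sided` are hypothetical.

```python
from collections import defaultdict
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-7))
    return min(p, 1.0)


def aggregate_by_gene(rows):
    """Sum per-variant carrier counts per gene (assumed 'samples' mode)."""
    genes = defaultdict(lambda: {"proband_variant_count": 0, "control_variant_count": 0})
    for row in rows:
        entry = genes[row["GENE"]]
        entry["proband_variant_count"] += row["proband_variant_count"]
        entry["control_variant_count"] += row["control_variant_count"]
        # Cohort sizes are assumed to be identical on every row.
        entry["proband_count"] = row["proband_count"]
        entry["control_count"] = row["control_count"]
    return dict(genes)


# Made-up example rows: 4 probands and 4 controls in total.
rows = [
    {"GENE": "GENE_A", "proband_count": 4, "control_count": 4,
     "proband_variant_count": 2, "control_variant_count": 1},
    {"GENE": "GENE_A", "proband_count": 4, "control_count": 4,
     "proband_variant_count": 1, "control_variant_count": 0},
    {"GENE": "GENE_B", "proband_count": 4, "control_count": 4,
     "proband_variant_count": 1, "control_variant_count": 1},
]

results = {}
for gene, counts in aggregate_by_gene(rows).items():
    a, b = counts["proband_variant_count"], counts["control_variant_count"]
    # Second table row: samples without a qualifying variant.
    results[gene] = fisher_exact_two_sided(
        a, b, counts["proband_count"] - a, counts["control_count"] - b
    )

for gene, p_value in results.items():
    print(gene, round(p_value, 4))
```

Summing carrier counts across variants can count the same sample more than once; the sketch ignores that for brevity.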
- param df:
DataFrame with per-variant case/control counts. Must include columns: “GENE”, “proband_count”, “control_count”, “proband_variant_count”, “control_variant_count”, “proband_allele_count”, “control_allele_count”.
- type df:
pd.DataFrame
- param cfg:
Configuration dictionary with keys:
- “gene_burden_mode”: str
“samples” or “alleles” indicating the aggregation mode.
- “correction_method”: str
“fdr” or “bonferroni” for multiple testing correction.
- “confidence_interval_method”: str (optional)
Method for CI calculation. Defaults to “normal_approx”.
- “confidence_interval_alpha”: float (optional)
Significance level for CI, defaults to 0.05 for a 95% CI.
- type cfg:
dict
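To show what the two “correction_method” options do to a list of raw p-values, here is a minimal pure-Python sketch. It assumes that “fdr” means the Benjamini-Hochberg procedure, which this page does not state; the function names are hypothetical and the p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]
print("bonferroni:", bonferroni(raw))
print("fdr (BH):  ", benjamini_hochberg(raw))
```

Bonferroni controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the false discovery rate and usually leaves more genes significant.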
- returns:
A DataFrame with gene-level burden results, including p-values, odds ratios, and confidence intervals.
The output includes:
“GENE”
“proband_count”, “control_count”
Either “proband_variant_count”, “control_variant_count” or “proband_allele_count”, “control_allele_count” depending on mode
“raw_p_value”
“corrected_p_value”
“odds_ratio”
“or_ci_lower”
“or_ci_upper”
- rtype:
pd.DataFrame
Notes
If fisher_exact is not available, p-values default to 1.0. In edge cases, the confidence interval calculation tries multiple methods and, if they all fail, returns fallback intervals (Issue #31).