Gene Burden Module

Statistical methods for gene burden analysis

Gene burden analysis module.

Provides:

  • perform_gene_burden_analysis: Aggregates per-gene counts (samples or alleles), performs Fisher’s exact test, calculates confidence intervals, and applies multiple testing correction.

New Features (Issue #21):

  • Adds confidence intervals to the gene burden metrics (e.g., odds ratio).

  • Confidence interval calculation method and confidence level can be configured.

Updated for Issue #31:

  • Improved handling of edge cases (e.g., infinite or zero odds_ratio). Instead of returning NaN for confidence intervals, the module attempts a fallback method (“logit”) if the primary method fails. If the result is still invalid, it returns bounded fallback intervals to ensure meaningful output.
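The fallback chain described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the names `logit_ci` and `odds_ratio_ci_with_fallback` and the bounds `(0.0, 1e6)` are hypothetical, and the primary statsmodels-based stage is left out so the block depends only on the standard library. It shows the last two stages: a logit (Woolf) interval, then fixed bounds when that interval cannot be computed.

```python
import math
from statistics import NormalDist


def logit_ci(a, b, c, d, alpha=0.05):
    """Logit (Woolf) CI for the odds ratio of the 2x2 table [[a, b], [c, d]]."""
    if min(a, b, c, d) <= 0:
        # log(OR) and its standard error are undefined with a zero cell.
        raise ValueError("logit interval is undefined when a cell is zero")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def odds_ratio_ci_with_fallback(a, b, c, d, alpha=0.05, bounds=(0.0, 1e6)):
    """Try the logit interval; return fixed bounds if it fails or is not finite.

    `bounds` is a hypothetical placeholder for the "bounded fallback interval".
    """
    try:
        lower, upper = logit_ci(a, b, c, d, alpha)
        if math.isfinite(lower) and math.isfinite(upper):
            return lower, upper
    except (ValueError, ZeroDivisionError, OverflowError):
        pass
    return bounds


# Regular table: the odds ratio (10 * 12) / (5 * 3) = 8 lies inside the interval.
regular = odds_ratio_ci_with_fallback(10, 5, 3, 12)
# Zero cell: the logit interval is undefined, so the fixed bounds are returned.
degenerate = odds_ratio_ci_with_fallback(5, 0, 3, 12)
```

Returning a bounded interval rather than NaN keeps the output columns numeric, at the cost of an interval that carries no statistical meaning beyond "not estimable".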

Configuration Additions

  • “confidence_interval_method”: str

    Method for confidence interval calculation. Supports:

      - “normal_approx”: Uses statsmodels’ Table2x2 normal approximation for the odds ratio CI. Falls back to “logit” if the normal approximation fails, and then to a bounded fallback interval.

  • “confidence_interval_alpha”: float

    Significance level for confidence interval calculation. Default: 0.05 for a 95% CI.

Outputs

In addition to existing columns (p-values, odds ratio), the result now includes:

  • “or_ci_lower”: Lower bound of the odds ratio confidence interval.

  • “or_ci_upper”: Upper bound of the odds ratio confidence interval.

variantcentrifuge.gene_burden.perform_gene_burden_analysis(df, cfg)[source]

Perform gene burden analysis for each gene.

Steps

  1. Aggregate variant counts per gene based on the chosen mode (samples or alleles).

  2. Perform Fisher’s exact test for each gene.

  3. Apply multiple testing correction (FDR or Bonferroni).

  4. Compute and add confidence intervals for the odds ratio.
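Steps 1 to 3 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the module's implementation: it handles only the “samples” mode, takes a list of dicts instead of a DataFrame, assumes cohort sizes are constant per gene and that carrier counts are summed across variants without exceeding the cohort size, and uses Benjamini-Hochberg for “fdr”. All function names are hypothetical.

```python
from collections import defaultdict
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def gene_burden_sketch(rows, correction="fdr"):
    # Step 1: aggregate per gene (assumed rule: max cohort size, summed carriers).
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for r in rows:
        t = totals[r["GENE"]]
        t[0] = max(t[0], r["proband_count"])
        t[1] = max(t[1], r["control_count"])
        t[2] += r["proband_variant_count"]
        t[3] += r["control_variant_count"]
    genes = sorted(totals)
    # Step 2: Fisher's exact test on [[case carriers, control carriers],
    #                                 [case non-carriers, control non-carriers]].
    raw = [
        fisher_exact_two_sided(cv, kv, n_case - cv, n_ctrl - kv)
        for n_case, n_ctrl, cv, kv in (totals[g] for g in genes)
    ]
    # Step 3: multiple testing correction.
    if correction == "bonferroni":
        corrected = [min(1.0, p * len(raw)) for p in raw]
    else:
        corrected = benjamini_hochberg(raw)
    return [
        {"GENE": g, "raw_p_value": p, "corrected_p_value": q}
        for g, p, q in zip(genes, raw, corrected)
    ]


rows = [
    {"GENE": "A", "proband_count": 10, "control_count": 10,
     "proband_variant_count": 4, "control_variant_count": 0},
    {"GENE": "A", "proband_count": 10, "control_count": 10,
     "proband_variant_count": 3, "control_variant_count": 1},
    {"GENE": "B", "proband_count": 10, "control_count": 10,
     "proband_variant_count": 1, "control_variant_count": 1},
]
results = gene_burden_sketch(rows)
```

In this toy input, gene A (7 of 10 cases versus 1 of 10 controls) yields a small raw p-value, while gene B (1 versus 1) yields a p-value of 1.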

param df:

DataFrame with per-variant case/control counts. Must include columns: “GENE”, “proband_count”, “control_count”, “proband_variant_count”, “control_variant_count”, “proband_allele_count”, “control_allele_count”.

type df:

pd.DataFrame

type cfg:

Dict[str, Any]

param cfg:

Configuration dictionary with keys:

  • “gene_burden_mode”: str

    “samples” or “alleles” indicating the aggregation mode.

  • “correction_method”: str

    “fdr” or “bonferroni” for multiple testing correction.

  • “confidence_interval_method”: str (optional)

    Method for CI calculation. Defaults to “normal_approx”.

  • “confidence_interval_alpha”: float (optional)

    Significance level for CI, defaults to 0.05 for a 95% CI.
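A configuration dictionary using the keys listed above might look like this. The values are illustrative, and `variants_df` in the commented-out call is a hypothetical DataFrame that is assumed to contain the required columns.

```python
cfg = {
    "gene_burden_mode": "samples",                  # or "alleles"
    "correction_method": "fdr",                     # or "bonferroni"
    "confidence_interval_method": "normal_approx",  # optional
    "confidence_interval_alpha": 0.05,              # optional
}

# The confidence level is 1 - alpha, so alpha = 0.05 corresponds to a 95% CI.
confidence_level = 1 - cfg["confidence_interval_alpha"]

# Hypothetical call, not executed here:
# from variantcentrifuge.gene_burden import perform_gene_burden_analysis
# results = perform_gene_burden_analysis(variants_df, cfg)
```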

returns:

A DataFrame with gene-level burden results, including p-values, odds ratios, and confidence intervals.

The output includes:

  • “GENE”

  • “proband_count”, “control_count”

  • Either “proband_variant_count”, “control_variant_count” or “proband_allele_count”, “control_allele_count” depending on mode

  • “raw_p_value”

  • “corrected_p_value”

  • “odds_ratio”

  • “or_ci_lower”

  • “or_ci_upper”

rtype:

pd.DataFrame

Notes

If no fisher_exact implementation is available, p-values default to 1.0. Confidence intervals are computed by trying several methods in turn, with bounded fallback intervals in edge cases (Issue #31).
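One way such a default could be structured is an optional import guarded by a wrapper. This is a sketch under the assumption that fisher_exact refers to `scipy.stats.fisher_exact`. The wrapper name `fisher_p_value` is hypothetical and the module's own handling may differ.

```python
try:
    from scipy.stats import fisher_exact
except ImportError:  # SciPy not installed
    fisher_exact = None


def fisher_p_value(table):
    """Return the Fisher's exact p-value, or 1.0 when no test is available."""
    if fisher_exact is None:
        return 1.0
    _, p_value = fisher_exact(table)
    return float(p_value)


p = fisher_p_value([[7, 1], [3, 9]])
```

A default of 1.0 is conservative: no gene is reported as significant when the test cannot be run.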