Gene Burden Module

Statistical methods for gene burden analysis

Provides:

  • perform_gene_burden_analysis: Aggregates per-gene counts (samples or alleles), performs Fisher’s exact test, calculates confidence intervals, and applies multiple testing correction.

New Features (Issue #21):

  • Adds confidence intervals to the gene burden metrics (e.g., odds ratio).

  • Confidence interval calculation method and confidence level can be configured.

Updated for Issue #31:

  • Improved handling of edge cases such as an infinite or zero odds_ratio: structural zeros are now detected, and a continuity correction is applied to zero cells so that meaningful confidence intervals can be calculated.

  • Uses score method as primary CI calculation (more robust for sparse data).

  • Returns NaN for structural zeros where OR cannot be calculated.
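The edge-case handling above can be illustrated with a minimal, dependency-free sketch. It is not the module's implementation. The function name odds_ratio_with_ci is hypothetical. Treating an empty row or column as a structural zero is an assumption, and so is adding the correction to every cell once any cell is zero. The interval shown is the normal approximation on the log odds ratio, not the score method.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(a, b, c, d, alpha=0.05, continuity_correction=0.5):
    """Return (odds_ratio, ci_lower, ci_upper) for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper for illustration only.
    """
    nan = float("nan")
    # Assumed definition of a structural zero: an entire row or column is
    # empty, so no odds ratio can be estimated.
    if (a + b == 0) or (c + d == 0) or (a + c == 0) or (b + d == 0):
        return nan, nan, nan

    # Assumption: when any single cell is zero, the correction value is added
    # to all four cells (Haldane-Anscombe style) before computing anything.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + continuity_correction for x in (a, b, c, d))

    odds_ratio = (a * d) / (b * c)
    # Normal approximation on the log scale (Woolf's method).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)
```

In this sketch, odds_ratio_with_ci(10, 5, 5, 10) gives an odds ratio of 4.0 with a finite interval around it, and odds_ratio_with_ci(0, 0, 5, 10) returns NaN for all three values because the first row is empty.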

Configuration Additions:

  • “confidence_interval_method”: str

    Method for confidence interval calculation. Defaults to “normal_approx”.

  • “confidence_interval_alpha”: float

    Significance level for confidence interval calculation. Default: 0.05 for a 95% CI.

  • “continuity_correction”: float

    Value to add to zero cells for continuity correction. Default: 0.5.
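As a rough illustration of how these three keys and their documented defaults might be resolved from a user-supplied configuration, consider the sketch below. The helper resolve_ci_settings and the derived "confidence_level" entry are hypothetical and not part of the module. Only the key names and default values come from the list above.

```python
# Defaults as documented above.
DEFAULTS = {
    "confidence_interval_method": "normal_approx",
    "confidence_interval_alpha": 0.05,
    "continuity_correction": 0.5,
}


def resolve_ci_settings(cfg):
    """Fill in documented defaults for any CI key missing from cfg.

    Hypothetical helper for illustration only.
    """
    settings = {key: cfg.get(key, default) for key, default in DEFAULTS.items()}
    alpha = settings["confidence_interval_alpha"]
    if not 0.0 < alpha < 1.0:
        raise ValueError("confidence_interval_alpha must be between 0 and 1")
    # alpha = 0.05 corresponds to a 95% confidence interval.
    settings["confidence_level"] = 1.0 - alpha
    return settings


# Example: override only the significance level to request a 99% CI.
settings = resolve_ci_settings({"confidence_interval_alpha": 0.01})
```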

Outputs

In addition to existing columns (p-values, odds ratio), the result now includes:

  • “or_ci_lower”: Lower bound of the odds ratio confidence interval.

  • “or_ci_upper”: Upper bound of the odds ratio confidence interval.

variantcentrifuge.gene_burden.perform_gene_burden_analysis(df, cfg)[source]

Perform gene burden analysis for each gene.

Steps

  1. Aggregate variant counts per gene based on the chosen mode (samples or alleles).

  2. Perform Fisher’s exact test for each gene.

  3. Apply multiple testing correction (FDR or Bonferroni).

  4. Compute and add confidence intervals for the odds ratio.
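Steps 1 to 3 can be sketched in plain Python, without pandas or SciPy, to show the shape of the computation. This is an illustrative sketch and not the module's code. All function names are hypothetical. It assumes that per-variant counts are summed per gene, that "alleles" mode uses two alleles per sample, that the reference counts are the totals minus the carrier counts, and that "fdr" means Benjamini-Hochberg.

```python
from collections import defaultdict
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    if n == 0:
        return 1.0
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, p)


def adjust_p_values(p_values, method="fdr"):
    """Bonferroni or (assumed) Benjamini-Hochberg adjustment."""
    m = len(p_values)
    if method == "bonferroni":
        return [min(1.0, p * m) for p in p_values]
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def gene_burden_sketch(rows, mode="samples", correction_method="fdr"):
    """rows: list of dicts with the documented per-variant columns."""
    if mode == "samples":
        case_col, ctrl_col, scale = "proband_variant_count", "control_variant_count", 1
    else:
        case_col, ctrl_col, scale = "proband_allele_count", "control_allele_count", 2
    totals = defaultdict(lambda: [0, 0, 0, 0])  # case hits, ctrl hits, n cases, n ctrls
    for row in rows:  # step 1: aggregate per gene
        t = totals[row["GENE"]]
        t[0] += row[case_col]
        t[1] += row[ctrl_col]
        t[2], t[3] = row["proband_count"], row["control_count"]
    genes, p_values = sorted(totals), []
    for gene in genes:  # step 2: one Fisher's exact test per gene
        case_hits, ctrl_hits, n_case, n_ctrl = totals[gene]
        # Guard used in this sketch only: keep reference counts non-negative.
        case_ref = max(0, scale * n_case - case_hits)
        ctrl_ref = max(0, scale * n_ctrl - ctrl_hits)
        p_values.append(fisher_exact_two_sided(case_hits, ctrl_hits, case_ref, ctrl_ref))
    corrected = adjust_p_values(p_values, correction_method)  # step 3
    return [
        {"GENE": g, "raw_p_value": p, "corrected_p_value": q}
        for g, p, q in zip(genes, p_values, corrected)
    ]
```

The confidence interval of step 4 is left out here to keep the sketch short.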

param df:

DataFrame with per-variant case/control counts. Must include columns: “GENE”, “proband_count”, “control_count”, “proband_variant_count”, “control_variant_count”, “proband_allele_count”, “control_allele_count”.

type df:

pd.DataFrame

type cfg:

Dict[str, Any]

param cfg:

Configuration dictionary with keys:

  • “gene_burden_mode”: str

    “samples” or “alleles” indicating the aggregation mode.

  • “correction_method”: str

    “fdr” or “bonferroni” for multiple testing correction.

  • “confidence_interval_method”: str (optional)

    Method for CI calculation, e.g. “normal_approx”. Defaults to “normal_approx”.

  • “confidence_interval_alpha”: float (optional)

    Significance level for CI, defaults to 0.05 for a 95% CI.

returns:

A DataFrame with gene-level burden results, including p-values, odds ratios, and confidence intervals.

The output includes:

  • “GENE”

  • “proband_count”, “control_count”

  • Either “proband_variant_count”, “control_variant_count” or “proband_allele_count”, “control_allele_count” depending on mode

  • “raw_p_value”

  • “corrected_p_value”

  • “odds_ratio”

  • “or_ci_lower”

  • “or_ci_upper”

rtype:

pd.DataFrame

Notes

If fisher_exact is not available, p-values default to 1.0. Confidence interval calculation now tries multiple methods and falls back to alternative intervals in edge cases (Issue #31).
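One way the p-value fallback could look is sketched below. How the module actually detects a missing test is not stated here, so the guarded import and the helper name raw_p_value are assumptions. The only library call used is scipy.stats.fisher_exact, which returns the odds ratio and the p-value.

```python
try:
    from scipy.stats import fisher_exact  # real SciPy function
except ImportError:  # SciPy is not installed
    fisher_exact = None


def raw_p_value(table):
    """Return the Fisher's exact p-value, or 1.0 when no test is available.

    Hypothetical helper for illustration only.
    """
    if fisher_exact is None:
        # Conservative default: without a test, no gene appears significant.
        return 1.0
    # fisher_exact returns (odds ratio, p-value); keep only the p-value.
    return float(fisher_exact(table)[1])


p = raw_p_value([[3, 1], [1, 3]])
```

A default of 1.0 keeps the output columns populated while ensuring that no gene is reported as significant when the test could not be run.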