Association Testing Module
Modular rare variant association testing framework, provided by variantcentrifuge.association.
v0.15.0 introduces a plugin-style association testing system where each statistical test implements the AssociationTest ABC. Phase 18 delivers the core abstractions and FisherExactTest (bit-identical reimplementation of gene_burden.py’s Fisher test). Subsequent phases add burden regression (Phase 19), SKAT (Phases 20-21), and ACAT-O (Phase 22).
Public API
- AssociationTest : Abstract base class for all association tests
- TestResult : Dataclass holding per-gene test results
- AssociationConfig : Configuration dataclass with defaults mirroring gene_burden.py
- AssociationEngine : Orchestrator that dispatches tests, applies correction, and returns a DataFrame
- apply_correction : Standalone FDR/Bonferroni correction function
- class variantcentrifuge.association.AssociationConfig(correction_method='fdr', gene_burden_mode='samples', confidence_interval_method='normal_approx', confidence_interval_alpha=0.05, continuity_correction=0.5, covariate_file=None, covariate_columns=None, categorical_covariates=None, trait_type='binary', variant_weights='beta:1,25', variant_weight_params=None, missing_site_threshold=0.1, missing_sample_threshold=0.8, firth_max_iter=25, skat_backend='python', skat_method='SKAT', min_cases=200, max_case_control_ratio=20.0, min_case_carriers=10, diagnostics_output=None, pca_file=None, pca_tool=None, pca_components=10, coast_weights=None, coast_backend='python', coast_classification=None, gene_prior_weights=None, gene_prior_weight_column='weight', association_workers=1)
Bases: object
Configuration for the association analysis framework.
All fields have defaults that mirror the equivalent keys read from the cfg dict in gene_burden.py (lines 373-377), so existing workflows can transition without changing semantics.
Fields
- correction_method : str
Multiple-testing correction method. “fdr” (Benjamini-Hochberg) or “bonferroni”. Default: “fdr”.
- gene_burden_mode : str
Collapsing strategy for carrier counting. “samples” = unique carrier samples (CMC/CAST); “alleles” = max allele dosage summed across samples (preserves diploid constraint). Default: “samples”.
- confidence_interval_method : str
Method for odds ratio CI computation. Currently “normal_approx”, which tries score -> normal -> logit methods in sequence. Default: “normal_approx”.
- confidence_interval_alpha : float
Significance level for CIs. 0.05 gives 95% CIs. Default: 0.05.
- continuity_correction : float
Value added to each cell when a zero is present, to stabilise CI computation. Default: 0.5 (Haldane-Anscombe correction).
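The interaction between continuity_correction and confidence_interval_alpha can be illustrated with a minimal sketch. It assumes the logit (Woolf) interval only, which is one of the methods named above, and the function name odds_ratio_with_ci is hypothetical; it is not the package's implementation.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(a, b, c, d, alpha=0.05, continuity=0.5):
    """Odds ratio and logit (Woolf) CI for the 2x2 table [[a, b], [c, d]].

    a, b = case carriers, case non-carriers
    c, d = control carriers, control non-carriers
    Hypothetical helper for illustration only.
    """
    cells = [float(a), float(b), float(c), float(d)]
    # Haldane-Anscombe: add the correction to every cell, but only when
    # at least one cell is zero.
    if any(x == 0 for x in cells):
        cells = [x + continuity for x in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Two-sided critical value; alpha = 0.05 gives a 95% interval.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Zero control carriers: without the correction the odds ratio is undefined.
or_, lo, hi = odds_ratio_with_ci(5, 95, 0, 100)
print(f"OR={or_:.2f}, CI=({lo:.2f}, {hi:.2f})")
```

Without the correction, a zero cell would make the odds ratio or its standard error undefined, which is why the value is applied only when a zero is present.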
- covariate_file : str | None = None
Path to tab/CSV covariate file (first column = sample ID). None = no covariates.
- categorical_covariates : list[str] | None = None
Column names to one-hot encode. None = auto-detect (non-numeric with <=5 levels).
- trait_type : str = 'binary'
Phenotype scale: “binary” (logistic, Firth fallback) or “quantitative” (linear OLS).
- variant_weights : str = 'beta:1,25'
Weight scheme: “beta:a,b” (Beta MAF weights, SKAT convention) or “uniform”.
- variant_weight_params : dict | None = None
Extra parameters for weight schemes (e.g. {‘cadd_cap’: 40.0}).
- missing_site_threshold : float = 0.1
Variants with >threshold fraction missing site-wide are excluded before imputation.
- missing_sample_threshold : float = 0.8
Samples with >threshold fraction missing across kept variants are excluded.
- firth_max_iter : int = 25
Maximum Newton-Raphson iterations for Firth penalized logistic regression fallback.
- skat_backend : str = 'python'
SKAT computation backend: “python” (default), “r” (deprecated, R via rpy2), or “auto”.
- skat_method : str = 'SKAT'
SKAT variant to run: “SKAT” (default), “Burden” (burden-only), or “SKATO” (omnibus).
- max_case_control_ratio : float = 20.0
Cohort-level warning threshold: warn if n_controls/n_cases > this value.
- min_case_carriers : int = 10
Per-gene warning threshold: flag genes with case_carriers < this value.
- diagnostics_output : str | None = None
Path to diagnostics output directory. None = no diagnostics output.
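To make the “beta:a,b” weight scheme concrete, the sketch below parses such a spec and evaluates the Beta(a, b) density at each variant's minor allele frequency. The helper names (parse_weight_spec, beta_density, variant_weights_for) are hypothetical, and whether the package applies any further transformation to these weights is not stated here.

```python
import math


def parse_weight_spec(spec):
    """Parse 'beta:a,b' into ('beta', a, b) and 'uniform' into ('uniform',)."""
    if spec == "uniform":
        return ("uniform",)
    scheme, _, params = spec.partition(":")
    if scheme != "beta":
        raise ValueError(f"unknown weight scheme: {spec!r}")
    a, b = (float(x) for x in params.split(","))
    return ("beta", a, b)


def beta_density(maf, a, b):
    """Beta(a, b) probability density evaluated at a minor allele frequency."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * maf ** (a - 1) * (1 - maf) ** (b - 1)


def variant_weights_for(mafs, spec="beta:1,25"):
    """Return one weight per variant (hypothetical helper, illustration only)."""
    parsed = parse_weight_spec(spec)
    if parsed[0] == "uniform":
        return [1.0 for _ in mafs]
    _, a, b = parsed
    return [beta_density(m, a, b) for m in mafs]


# Rarer variants receive larger weights under Beta(1, 25).
weights = variant_weights_for([0.0005, 0.005, 0.05])
print(weights)
```

With a=1 and b=25 the density reduces to 25 * (1 - MAF)^24, so the weight falls steadily as the allele becomes more common.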
- __init__(correction_method='fdr', gene_burden_mode='samples', confidence_interval_method='normal_approx', confidence_interval_alpha=0.05, continuity_correction=0.5, covariate_file=None, covariate_columns=None, categorical_covariates=None, trait_type='binary', variant_weights='beta:1,25', variant_weight_params=None, missing_site_threshold=0.1, missing_sample_threshold=0.8, firth_max_iter=25, skat_backend='python', skat_method='SKAT', min_cases=200, max_case_control_ratio=20.0, min_case_carriers=10, diagnostics_output=None, pca_file=None, pca_tool=None, pca_components=10, coast_weights=None, coast_backend='python', coast_classification=None, gene_prior_weights=None, gene_prior_weight_column='weight', association_workers=1)
- Parameters:
correction_method (str)
gene_burden_mode (str)
confidence_interval_method (str)
confidence_interval_alpha (float)
continuity_correction (float)
trait_type (str)
variant_weights (str)
missing_site_threshold (float)
missing_sample_threshold (float)
firth_max_iter (int)
skat_backend (str)
skat_method (str)
min_cases (int)
max_case_control_ratio (float)
min_case_carriers (int)
pca_components (int)
coast_backend (str)
gene_prior_weight_column (str)
association_workers (int)
- pca_file : str | None = None
Path to pre-computed PCA file (PLINK .eigenvec, AKT output, or generic TSV).
- pca_tool : str | None = None
PCA computation tool: ‘akt’ to invoke AKT as subprocess. None = pre-computed file only.
- coast_weights : list[float] | None = None
Category weights for COAST allelic series (default: [1.0, 2.0, 3.0] for BMV, DMV, PTV).
- coast_backend : str = 'python'
COAST computation backend: “python” (default), “r” (deprecated, R via rpy2), or “auto”.
- coast_classification : str | None = None
Absolute path to a scoring/coast_classification/<model>/ directory. None = use built-in SIFT/PolyPhen hardcoded logic (backward-compatible default). Set by cli.py after resolving the --coast-classification model name to a path.
- gene_prior_weights : str | None = None
Path to gene-to-weight TSV file for weighted BH FDR correction. None = standard (unweighted) BH/Bonferroni (backward-compatible default).
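The two missingness thresholds documented above (missing_site_threshold and missing_sample_threshold) describe a two-stage filter: variants first, then samples across the variants that survive. A minimal sketch follows, assuming a samples x variants matrix with None for missing calls; the function name filter_missing and the data layout are assumptions, not the package's internals.

```python
def filter_missing(genotypes, site_threshold=0.1, sample_threshold=0.8):
    """Two-stage missingness filter on a samples x variants matrix.

    genotypes[i][j] is the dosage of sample i at variant j, or None when
    the call is missing. Returns (kept_sample_indices, kept_variant_indices).
    Hypothetical helper for illustration only.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0]) if n_samples else 0

    # Stage 1: drop variants whose missing fraction across all samples
    # exceeds site_threshold.
    kept_variants = [
        j for j in range(n_variants)
        if sum(row[j] is None for row in genotypes) / n_samples <= site_threshold
    ]
    if not kept_variants:
        return [], []

    # Stage 2: drop samples whose missing fraction across the kept
    # variants exceeds sample_threshold.
    kept_samples = [
        i for i, row in enumerate(genotypes)
        if sum(row[j] is None for j in kept_variants) / len(kept_variants)
        <= sample_threshold
    ]
    return kept_samples, kept_variants


genotypes = [
    [0, 1, None],
    [0, 0, None],
    [1, 0, 0],
    [None, None, 2],
]
# A looser site threshold than the default, so the toy matrix keeps variants.
samples, variants = filter_missing(genotypes, site_threshold=0.3)
print(samples, variants)
```

The order matters: sample missingness is measured only over the variants that survive the site filter, matching the "across kept variants" wording above.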
- class variantcentrifuge.association.AssociationEngine(tests, config)
Bases: object
Orchestrates association testing across multiple genes and multiple tests.
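Usage
>>> from variantcentrifuge.association import AssociationConfig, AssociationEngine
>>> config = AssociationConfig(gene_burden_mode="samples")
>>> engine = AssociationEngine.from_names(["fisher"], config)
>>> # gene_burden_data: list of per-gene dicts (see run_all)
>>> result_df = engine.run_all(gene_burden_data)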
- tests : list of AssociationTest
Instantiated test objects to run. Use from_names() for the common case.
- config : AssociationConfig
Configuration shared across all tests.
- __init__(tests, config)
- Parameters:
tests (list[AssociationTest])
config (AssociationConfig)
- classmethod from_names(test_names, config)
Construct engine from a list of test name strings.
- Parameters:
test_names (list of str) – Names of tests to run (e.g. [“fisher”]).
config (AssociationConfig) – Runtime configuration.
- Return type:
AssociationEngine
- Raises:
ValueError – If any test_name is not in the registry. The error message lists all available tests so users know what’s valid. Note: “acat_o” is NOT a valid test name here — ACAT-O is computed post-loop as a meta-test and does not run per-gene.
ImportError – If a test’s required dependencies are not installed. Raised eagerly (before any data processing) by check_dependencies().
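The lookup-then-validate behaviour described for from_names can be sketched with a toy registry. Everything here (REGISTRY, FakeFisher, FakeSkat, build_tests) is hypothetical and stands in for the real registry, whose contents are not shown on this page.

```python
class FakeFisher:
    """Stand-in test with no optional dependencies."""
    name = "fisher"

    def check_dependencies(self):
        pass


class FakeSkat:
    """Stand-in test whose optional dependency is pretended to be missing."""
    name = "skat"

    def check_dependencies(self):
        raise ImportError("skat requires an optional numerical backend")


# Hypothetical registry mapping test names to classes.
REGISTRY = {"fisher": FakeFisher, "skat": FakeSkat}


def build_tests(test_names):
    """Resolve names to test instances, failing before any data is touched."""
    unknown = [n for n in test_names if n not in REGISTRY]
    if unknown:
        available = ", ".join(sorted(REGISTRY))
        raise ValueError(
            f"Unknown test(s): {unknown}. Available tests: {available}"
        )
    tests = [REGISTRY[n]() for n in test_names]
    for test in tests:
        test.check_dependencies()  # eager check, may raise ImportError
    return tests


try:
    build_tests(["acat_o"])  # not a per-gene test, so not in the registry
except ValueError as exc:
    print(exc)
```

Validating names and dependencies at construction time means a typo or a missing library surfaces immediately rather than partway through a long run.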
- run_all(gene_burden_data)
Run all registered tests across all genes and return wide-format results.
FDR correction (ARCH-03): applied only to ACAT-O p-values across all genes. Primary test columns (fisher_pvalue, burden_pvalue, etc.) are uncorrected — they are diagnostic signal decomposition, not independent hypotheses. corrected_p_value on primary TestResults is always None.
- Parameters:
gene_burden_data (list of dict) – One dict per gene with keys: GENE, proband_count, control_count, proband_carrier_count, control_carrier_count, proband_allele_count, control_allele_count, n_qualifying_variants.
- Returns:
Wide-format results. One row per gene that has at least one test result (genes where all tests returned p_value=None are excluded). Columns: gene, n_cases, n_controls, n_variants, then per-test columns: {test}_pvalue (uncorrected), plus test-aware effect columns (e.g. fisher_or or logistic_burden_beta), and finally acat_o_pvalue and acat_o_qvalue.
- Return type:
pd.DataFrame
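The wide-format layout described for run_all can be pictured with plain dicts. This is only a sketch of the pivoting and exclusion rule; to_wide_rows is a hypothetical helper, the input dicts stand in for TestResult objects, and effect-size and ACAT-O columns are left out.

```python
def to_wide_rows(results):
    """Pivot per-test results into one row per gene.

    `results` is a list of dicts with keys gene, test_name, p_value,
    n_cases, n_controls, n_variants (a stand-in for TestResult objects).
    """
    rows = {}
    for r in results:
        row = rows.setdefault(r["gene"], {
            "gene": r["gene"],
            "n_cases": r["n_cases"],
            "n_controls": r["n_controls"],
            "n_variants": r["n_variants"],
        })
        # One {test}_pvalue column per test, holding the uncorrected p-value.
        row[f"{r['test_name']}_pvalue"] = r["p_value"]
    # Drop genes for which every test was skipped (all p-values None).
    return [
        row for row in rows.values()
        if any(v is not None for k, v in row.items() if k.endswith("_pvalue"))
    ]


common = {"n_cases": 100, "n_controls": 200}
results = [
    {"gene": "GENE1", "test_name": "fisher", "p_value": 0.01, "n_variants": 3, **common},
    {"gene": "GENE1", "test_name": "burden", "p_value": None, "n_variants": 3, **common},
    {"gene": "GENE2", "test_name": "fisher", "p_value": None, "n_variants": 0, **common},
    {"gene": "GENE2", "test_name": "burden", "p_value": None, "n_variants": 0, **common},
]
for row in to_wide_rows(results):
    print(row)
```

GENE1 is kept because at least one test produced a p-value; GENE2 is dropped because every test was skipped.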
- class variantcentrifuge.association.AssociationTest
Bases: ABC
Abstract base class for all association tests.
Subclasses implement a specific statistical test (Fisher, burden regression, SKAT, etc.) and are registered in AssociationEngine’s test registry. Each subclass is responsible for a single gene at a time; the engine handles iteration and correction.
- name : str (property)
Short, lowercase identifier used for test registry lookup and output column prefixes (e.g. “fisher”, “burden”, “skat”).
- run(gene, contingency_data, config) -> TestResult
Execute the test for one gene and return a TestResult.
- check_dependencies() -> None
Raise ImportError if required optional libraries are missing. The default implementation is a no-op; subclasses override as needed.
- abstract run(gene, contingency_data, config)
Run the association test for a single gene.
- Parameters:
gene (str) – Gene symbol being tested.
contingency_data (dict) –
Gene-level aggregated data. Keys available from gene_burden aggregation:
proband_count : int, total case samples
control_count : int, total control samples
proband_carrier_count : int, case carrier samples
control_carrier_count : int, control carrier samples
proband_allele_count : int, total alt alleles in cases
control_allele_count : int, total alt alleles in controls
n_qualifying_variants : int, variants passing filters
config (AssociationConfig) – Runtime configuration (correction method, mode, CI params).
- Returns:
Result with p_value=None when test is skipped (e.g. zero variants).
- Return type:
TestResult
- effect_column_names()
Column name suffixes for this test’s effect size output.
Returns a mapping of semantic role to column suffix. The engine uses these to build output column names as {test_name}_{suffix}.
Default returns OR-based naming (appropriate for Fisher’s exact test). Regression tests (burden, SKAT) override to return beta/SE naming.
- check_dependencies()
Verify that required optional dependencies are available.
- Raises:
ImportError – If a required library is not installed. Called eagerly at engine construction so users get a clear error before processing begins.
- Return type:
None
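The shape of a test implementation can be sketched without the package itself. SketchTest below is a stand-in for the AssociationTest interface described above, CarrierFisher is a hypothetical subclass, and the result is returned as a plain dict rather than a TestResult. The 2x2 table is built from carrier counts, which is an assumption about how the “samples” mode maps onto the contingency_data keys.

```python
import math
from abc import ABC, abstractmethod


class SketchTest(ABC):
    """Stand-in for the AssociationTest interface (illustration only)."""

    @property
    @abstractmethod
    def name(self): ...

    @abstractmethod
    def run(self, gene, contingency_data, config): ...

    def check_dependencies(self):
        return None  # default: no optional dependencies


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return math.comb(row1, x) * math.comb(n - row1, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-7))
    return min(1.0, total)


class CarrierFisher(SketchTest):
    @property
    def name(self):
        return "fisher"

    def run(self, gene, contingency_data, config=None):
        d = contingency_data
        if d["n_qualifying_variants"] == 0:
            # Skipped, not failed: p_value is None rather than 1.0.
            return {"gene": gene, "test_name": self.name, "p_value": None}
        a = d["proband_carrier_count"]
        c = d["control_carrier_count"]
        p = fisher_two_sided(a, d["proband_count"] - a,
                             c, d["control_count"] - c)
        return {"gene": gene, "test_name": self.name, "p_value": p}
```

A subclass handles exactly one gene per call; iteration over genes and any multiple-testing correction stay with the engine.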
- class variantcentrifuge.association.TestResult(gene, test_name, p_value, corrected_p_value, effect_size, ci_lower, ci_upper, se, n_cases, n_controls, n_variants, extra=<factory>)
Bases: object
Result from a single association test on a single gene.
Fields
- gene : str
Gene symbol tested.
- test_name : str
Short test identifier (e.g. “fisher”, “burden”, “skat”).
- p_value : float | None
Raw (uncorrected) p-value. None when the test is skipped (e.g. zero qualifying variants). None is NOT the same as 1.0 (failure vs skip).
- corrected_p_value : float | None
Multiple-testing-corrected p-value. Populated by AssociationEngine after all genes are tested. None until correction is applied.
- effect_size : float | None
Primary effect size estimate. For Fisher: odds ratio. For regression tests (Phase 19+): beta coefficient.
- ci_lower : float | None
Lower bound of the confidence interval for effect_size.
- ci_upper : float | None
Upper bound of the confidence interval for effect_size.
- se : float | None
Standard error of the effect size estimate. First-class field for regression tests (burden, SKAT); None for non-regression tests (Fisher).
- n_cases : int
Total number of case samples in the analysis.
- n_controls : int
Total number of control samples in the analysis.
- n_variants : int
Number of qualifying variants for this gene (after filtering).
- extra : dict
Test-specific ancillary data (e.g. contingency table, convergence flags). Not written to output unless a formatter explicitly accesses it.
- __init__(gene, test_name, p_value, corrected_p_value, effect_size, ci_lower, ci_upper, se, n_cases, n_controls, n_variants, extra=<factory>)
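Two details of TestResult are easy to misread: the extra=<factory> default and the meaning of p_value=None. The reduced stand-in below (SketchResult, a hypothetical class mirroring only a few fields) shows one way they might behave, assuming the factory is an empty dict created per instance.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchResult:
    """Reduced stand-in for TestResult; only a few fields are mirrored."""
    gene: str
    test_name: str
    p_value: Optional[float]
    n_variants: int
    # default_factory gives each instance its own dict (assumed behaviour).
    extra: dict = field(default_factory=dict)


results = [
    SketchResult("GENE_A", "fisher", 0.003, 4),
    SketchResult("GENE_B", "fisher", 1.0, 2),   # tested, no signal
    SketchResult("GENE_C", "fisher", None, 0),  # skipped, never tested
]
results[0].extra["table"] = [[3, 97], [0, 100]]

# Separate genes that were tested from genes that were skipped.
tested = [r for r in results if r.p_value is not None]
skipped = [r.gene for r in results if r.p_value is None]
print(len(tested), skipped)
```

Keeping None distinct from 1.0 lets downstream code tell "tested with no signal" apart from "never tested".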
- variantcentrifuge.association.apply_correction(pvals, method='fdr')
Apply multiple testing correction to a sequence of p-values.
Produces output identical to the inline smm.multipletests calls in gene_burden.py (lines 506-510) for the same inputs and method.
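- Parameters:
pvals – Raw p-values to correct.
method (str) – Correction method: “fdr” (Benjamini-Hochberg) or “bonferroni”. Default: “fdr”.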
- Returns:
Corrected p-values in the same order as input. If statsmodels is unavailable, returns the raw p-values unchanged (with a warning).
- Return type:
np.ndarray
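For readers who want to see what the two correction methods compute, here is a pure-Python sketch of Benjamini-Hochberg and Bonferroni adjustment. The function correct_pvalues is hypothetical and returns a list rather than an np.ndarray; the documented function delegates to statsmodels instead.

```python
def correct_pvalues(pvals, method="fdr"):
    """Adjust p-values, returning them in the same order as the input.

    Hypothetical helper for illustration; not the package's implementation.
    """
    n = len(pvals)
    if method == "bonferroni":
        # Multiply each p-value by the number of tests, capped at 1.
        return [min(1.0, p * n) for p in pvals]
    if method != "fdr":
        raise ValueError(f"unknown method: {method!r}")

    # Benjamini-Hochberg: scale each p-value by n / rank, then enforce
    # monotonicity by walking from the largest p-value down.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(correct_pvalues(raw, "fdr"))
print(correct_pvalues(raw, "bonferroni"))
```

Bonferroni never yields a smaller adjusted value than Benjamini-Hochberg for the same input, which is why “fdr” is the less conservative default.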