Association Testing Module¶

Modular rare variant association testing framework.

variantcentrifuge.association — Modular rare variant association framework.

v0.15.0 introduces a plugin-style association testing system where each statistical test implements the AssociationTest ABC. Phase 18 delivers the core abstractions and FisherExactTest (bit-identical reimplementation of gene_burden.py’s Fisher test). Subsequent phases add burden regression (Phase 19), SKAT (Phases 20-21), and ACAT-O (Phase 22).

Public API¶

AssociationTest : Abstract base class for all association tests
TestResult : Dataclass holding per-gene test results
AssociationConfig : Configuration dataclass with defaults mirroring gene_burden.py
AssociationEngine : Orchestrator: dispatches tests, applies correction, returns DataFrame
apply_correction : Standalone FDR/Bonferroni correction function

class variantcentrifuge.association.AssociationConfig(correction_method='fdr', gene_burden_mode='samples', confidence_interval_method='normal_approx', confidence_interval_alpha=0.05, continuity_correction=0.5, covariate_file=None, covariate_columns=None, categorical_covariates=None, trait_type='binary', variant_weights='beta:1,25', variant_weight_params=None, missing_site_threshold=0.1, missing_sample_threshold=0.8, firth_max_iter=25, skat_backend='python', skat_method='SKAT', min_cases=200, max_case_control_ratio=20.0, min_case_carriers=10, diagnostics_output=None, pca_file=None, pca_tool=None, pca_components=10, coast_weights=None, coast_backend='python', coast_classification=None, gene_prior_weights=None, gene_prior_weight_column='weight', association_workers=1)[source]¶

Bases: object

Configuration for the association analysis framework.

All fields have defaults that mirror the equivalent keys read from the cfg dict in gene_burden.py (lines 373-377), so existing workflows can transition without changing semantics.

Fields¶

correction_method (str): Multiple-testing correction method. “fdr” (Benjamini-Hochberg) or “bonferroni”. Default: “fdr”.
gene_burden_mode (str): Collapsing strategy for carrier counting. “samples” = unique carrier samples (CMC/CAST); “alleles” = max allele dosage summed across samples (preserves diploid constraint). Default: “samples”.
confidence_interval_method (str): Method for odds ratio CI computation. Currently only “normal_approx”, which tries the score, normal, and logit methods in that order. Default: “normal_approx”.
confidence_interval_alpha (float): Significance level for CIs. 0.05 gives 95% CIs. Default: 0.05.
continuity_correction (float): Value added to each cell when a zero is present, to stabilise CI computation. Default: 0.5 (Haldane-Anscombe correction).
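
How continuity_correction and confidence_interval_alpha interact can be seen in a small standalone sketch. It is not the library's code: the function name odds_ratio_with_ci is hypothetical, and only the logit (Woolf) interval is shown, whereas the documented “normal_approx” method also tries score and normal intervals first.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(a, b, c, d, alpha=0.05, continuity_correction=0.5):
    """Odds ratio and logit (Woolf) CI for the 2x2 table [[a, b], [c, d]].

    a/b = carrier/non-carrier cases, c/d = carrier/non-carrier controls.
    Hypothetical helper for illustration only.
    """
    cells = [float(a), float(b), float(c), float(d)]
    # Haldane-Anscombe: add the correction to every cell, but only if a zero is present.
    if any(x == 0 for x in cells):
        cells = [x + continuity_correction for x in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # alpha=0.05 -> 95% interval
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Zero control carriers: without the correction the odds ratio would be infinite.
or_, lo, hi = odds_ratio_with_ci(5, 95, 0, 100)
```

With a zero cell, every count is shifted by 0.5 before the ratio is taken, which keeps both the estimate and its standard error finite.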

correction_method: str = 'fdr'¶

gene_burden_mode: str = 'samples'¶

confidence_interval_method: str = 'normal_approx'¶

confidence_interval_alpha: float = 0.05¶

continuity_correction: float = 0.5¶

covariate_file: str | None = None¶: Path to tab/CSV covariate file (first column = sample ID). None = no covariates.

covariate_columns: list[str] | None = None¶: Subset of covariate columns to use. None = all columns.

categorical_covariates: list[str] | None = None¶: Column names to one-hot encode. None = auto-detect (non-numeric with <=5 levels).
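
The auto-detection rule (non-numeric with at most 5 levels) might look roughly like the sketch below. It assumes covariates have already been read into a dict of string columns; detect_categorical is a hypothetical name, and the library's handling of missing values may differ.

```python
def detect_categorical(columns, max_levels=5):
    """Return names of columns that are non-numeric with at most max_levels distinct values.

    columns: {column_name: list of raw string values}. Hypothetical helper.
    """

    def is_numeric(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    categorical = []
    for name, values in columns.items():
        non_missing = [v for v in values if v not in (None, "")]
        all_numeric = all(is_numeric(v) for v in non_missing)
        if not all_numeric and len(set(non_missing)) <= max_levels:
            categorical.append(name)
    return categorical


covariates = {
    "age": ["34", "51", "47", "29"],
    "sex": ["F", "M", "F", "M"],
    "batch": ["b1", "b2", "b3", "b1"],
}
detected = detect_categorical(covariates)
```

A numerically coded categorical variable (e.g. a batch labelled 1, 2, 3) would not be caught by a rule like this, which is the case for listing it in categorical_covariates explicitly.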

trait_type: str = 'binary'¶: Phenotype scale: “binary” (logistic, Firth fallback) or “quantitative” (linear OLS).

variant_weights: str = 'beta:1,25'¶: Weight scheme: “beta:a,b” (Beta MAF weights, SKAT convention) or “uniform”.
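
Under the SKAT convention, a “beta:a,b” weight is the Beta(a, b) density evaluated at the variant's minor allele frequency, so the default “beta:1,25” strongly up-weights rare variants. The sketch below illustrates this with the standard library only; parse_weight_spec and beta_maf_weight are hypothetical names, not the library's API.

```python
import math


def parse_weight_spec(spec):
    """Parse 'beta:a,b' into (a, b); return None for 'uniform'. Hypothetical helper."""
    if spec == "uniform":
        return None
    scheme, _, params = spec.partition(":")
    if scheme != "beta":
        raise ValueError(f"unsupported weight scheme: {spec!r}")
    a, b = (float(x) for x in params.split(","))
    return a, b


def beta_maf_weight(maf, a, b):
    """Beta(a, b) density at the minor allele frequency; requires 0 < maf < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(maf) + (b - 1) * math.log1p(-maf))


a, b = parse_weight_spec("beta:1,25")
rare = beta_maf_weight(0.001, a, b)    # close to the maximum weight of 25
common = beta_maf_weight(0.05, a, b)   # substantially down-weighted
```

For a = 1 and b = 25 the density reduces to 25 · (1 − MAF)^24, which is why the weight decays quickly as the allele becomes more common.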

variant_weight_params: dict | None = None¶: Extra parameters for weight schemes (e.g. {‘cadd_cap’: 40.0}).

missing_site_threshold: float = 0.1¶: Variants with >threshold fraction missing site-wide are excluded before imputation.

missing_sample_threshold: float = 0.8¶: Samples with >threshold fraction missing across kept variants are excluded.
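
The two missingness thresholds apply in sequence: variants are filtered first, then samples are assessed on the variants that remain. A minimal sketch, assuming genotypes are held as a list of per-sample rows with None marking a missing call (filter_missing is a hypothetical name):

```python
def filter_missing(genotypes, site_threshold=0.1, sample_threshold=0.8):
    """genotypes: rows = samples, columns = variants, None = missing call.

    Returns (kept_sample_indices, kept_variant_indices). Hypothetical helper.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0]) if genotypes else 0

    # Step 1: drop variants whose site-wide missing fraction exceeds the threshold.
    kept_variants = [
        j
        for j in range(n_variants)
        if sum(row[j] is None for row in genotypes) / n_samples <= site_threshold
    ]

    # Step 2: drop samples whose missing fraction across the kept variants exceeds the threshold.
    kept_samples = [
        i
        for i, row in enumerate(genotypes)
        if not kept_variants
        or sum(row[j] is None for j in kept_variants) / len(kept_variants) <= sample_threshold
    ]
    return kept_samples, kept_variants


calls = [
    [0, 1, None],
    [0, None, None],
    [1, 0, None],
    [0, 0, 1],
]
samples, variants = filter_missing(calls, site_threshold=0.25, sample_threshold=0.4)
```

In this toy matrix the third variant is dropped (75% missing), after which the second sample is dropped because half of its remaining calls are missing.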

firth_max_iter: int = 25¶: Maximum Newton-Raphson iterations for Firth penalized logistic regression fallback.

skat_backend: str = 'python'¶: SKAT computation backend: “python” (default), “r” (deprecated, R via rpy2), or “auto”.

skat_method: str = 'SKAT'¶: SKAT variant to run: “SKAT” (default), “Burden” (burden-only), or “SKATO” (omnibus).

min_cases: int = 200¶: Cohort-level warning threshold: warn if n_cases < this value.

max_case_control_ratio: float = 20.0¶: Cohort-level warning threshold: warn if n_controls/n_cases > this value.

min_case_carriers: int = 10¶: Per-gene warning threshold: flag genes with case_carriers < this value.
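
The three thresholds above only produce warnings or flags; they do not change which genes are tested. A sketch of the checks as described, with hypothetical function names and message wording:

```python
def cohort_warnings(n_cases, n_controls, min_cases=200, max_case_control_ratio=20.0):
    """Return human-readable warnings for the cohort-level thresholds."""
    messages = []
    if n_cases < min_cases:
        messages.append(f"only {n_cases} cases (< {min_cases})")
    if n_cases > 0 and n_controls / n_cases > max_case_control_ratio:
        messages.append(
            f"control:case ratio {n_controls / n_cases:.1f} exceeds {max_case_control_ratio}"
        )
    return messages


def low_carrier_genes(case_carriers_by_gene, min_case_carriers=10):
    """Return genes whose case carrier count falls below the per-gene threshold."""
    return sorted(g for g, n in case_carriers_by_gene.items() if n < min_case_carriers)


msgs = cohort_warnings(n_cases=150, n_controls=4500)
flagged = low_carrier_genes({"GENE_A": 3, "GENE_B": 12})
```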

diagnostics_output: str | None = None¶: Path to diagnostics output directory. None = no diagnostics output.

__init__(correction_method='fdr', gene_burden_mode='samples', confidence_interval_method='normal_approx', confidence_interval_alpha=0.05, continuity_correction=0.5, covariate_file=None, covariate_columns=None, categorical_covariates=None, trait_type='binary', variant_weights='beta:1,25', variant_weight_params=None, missing_site_threshold=0.1, missing_sample_threshold=0.8, firth_max_iter=25, skat_backend='python', skat_method='SKAT', min_cases=200, max_case_control_ratio=20.0, min_case_carriers=10, diagnostics_output=None, pca_file=None, pca_tool=None, pca_components=10, coast_weights=None, coast_backend='python', coast_classification=None, gene_prior_weights=None, gene_prior_weight_column='weight', association_workers=1)¶

Parameters:

correction_method (str)
gene_burden_mode (str)
confidence_interval_method (str)
confidence_interval_alpha (float)
continuity_correction (float)
covariate_file (Optional[str])
covariate_columns (Optional[list[str]])
categorical_covariates (Optional[list[str]])
trait_type (str)
variant_weights (str)
variant_weight_params (Optional[dict])
missing_site_threshold (float)
missing_sample_threshold (float)
firth_max_iter (int)
skat_backend (str)
skat_method (str)
min_cases (int)
max_case_control_ratio (float)
min_case_carriers (int)
diagnostics_output (Optional[str])
pca_file (Optional[str])
pca_tool (Optional[str])
pca_components (int)
coast_weights (Optional[list[float]])
coast_backend (str)
coast_classification (Optional[str])
gene_prior_weights (Optional[str])
gene_prior_weight_column (str)
association_workers (int)

pca_file: str | None = None¶: Path to pre-computed PCA file (PLINK .eigenvec, AKT output, or generic TSV).

pca_tool: str | None = None¶: PCA computation tool: ‘akt’ to invoke AKT as subprocess. None = pre-computed file only.

pca_components: int = 10¶: Number of principal components to use. Default: 10. Warn if >20.

coast_weights: list[float] | None = None¶: Category weights for COAST allelic series (default: [1.0, 2.0, 3.0] for BMV, DMV, PTV).

coast_backend: str = 'python'¶: COAST computation backend: “python” (default), “r” (deprecated, R via rpy2), or “auto”.

coast_classification: str | None = None¶: Absolute path to a scoring/coast_classification/<model>/ directory. None = use built-in SIFT/PolyPhen hardcoded logic (backward-compatible default). Set by cli.py after resolving the --coast-classification model name to a path.

gene_prior_weights: str | None = None¶: Path to gene-to-weight TSV file for weighted BH FDR correction. None = standard (unweighted) BH/Bonferroni (backward-compatible default).

gene_prior_weight_column: str = 'weight'¶: Column name in the weight file containing weight values. Default: ‘weight’.

association_workers: int = 1¶: Number of parallel worker processes for gene-level association analysis. Default: 1 (sequential). Set > 1 for parallel execution via ProcessPoolExecutor. Set -1 for os.cpu_count(). Only effective when all registered tests have parallel_safe=True.
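
The rules for association_workers can be summarised as a small resolution function. This is a sketch of the documented behaviour only; resolve_workers is a hypothetical name, and the treatment of values such as 0 is an assumption.

```python
import os


def resolve_workers(association_workers, tests_parallel_safe):
    """Effective number of worker processes.

    tests_parallel_safe: one boolean per registered test (its parallel_safe flag).
    """
    # Parallel execution only takes effect when every test is parallel-safe.
    if not all(tests_parallel_safe):
        return 1
    if association_workers == -1:
        return os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, association_workers)  # assumption: non-positive values fall back to 1


n_parallel = resolve_workers(4, [True, True])
n_fallback = resolve_workers(4, [True, False])
```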

class variantcentrifuge.association.AssociationEngine(tests, config)[source]¶

Bases: object

Orchestrates association testing across multiple genes and multiple tests.

Usage¶

>>> config = AssociationConfig(gene_burden_mode="samples")
>>> engine = AssociationEngine.from_names(["fisher"], config)
>>> result_df = engine.run_all(gene_burden_data)

tests (list of AssociationTest) – Instantiated test objects to run. Use from_names() for the common case.
config (AssociationConfig) – Configuration shared across all tests.

__init__(tests, config)[source]¶

Parameters:

tests (list[AssociationTest])
config (AssociationConfig)

classmethod from_names(test_names, config)[source]¶

Construct engine from a list of test name strings.

Parameters:

test_names (list of str) – Names of tests to run (e.g. [“fisher”]).
config (AssociationConfig) – Runtime configuration.

Return type:

AssociationEngine

Raises:

ValueError – If any test_name is not in the registry. The error message lists all available tests so users know what’s valid. Note: “acat_o” is NOT a valid test name here — ACAT-O is computed post-loop as a meta-test and does not run per-gene.
ImportError – If a test’s required dependencies are not installed. Raised eagerly (before any data processing) by check_dependencies().

run_all(gene_burden_data)[source]¶

Run all registered tests across all genes and return wide-format results.

FDR correction (ARCH-03): applied only to ACAT-O p-values across all genes. Primary test columns (fisher_pvalue, burden_pvalue, etc.) are uncorrected — they are diagnostic signal decomposition, not independent hypotheses. corrected_p_value on primary TestResults is always None.

Parameters:

gene_burden_data (list of dict) – One dict per gene with keys: GENE, proband_count, control_count, proband_carrier_count, control_carrier_count, proband_allele_count, control_allele_count, n_qualifying_variants.

Returns:

Wide-format results. One row per gene that has at least one test result (genes where all tests returned p_value=None are excluded). Columns: gene, n_cases, n_controls, n_variants, then per-test columns: {test}_pvalue (uncorrected), plus test-aware effect columns (e.g. fisher_or or logistic_burden_beta), and finally acat_o_pvalue and acat_o_qvalue.

Return type:

pd.DataFrame
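
The shape of the wide-format output can be illustrated without the library. The sketch below builds {test}_pvalue columns, drops genes with no usable result, and combines the per-test p-values with an equal-weight Cauchy combination, which is the usual definition of ACAT. Function names and gene labels are hypothetical, and the engine's actual weighting and edge-case handling may differ.

```python
import math


def cauchy_combination(pvalues):
    """Equal-weight Cauchy combination (ACAT) of the non-missing p-values."""
    pvals = [p for p in pvalues if p is not None]
    if not pvals:
        return None
    statistic = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / len(pvals)
    return 0.5 - math.atan(statistic) / math.pi


def to_wide_rows(per_gene_pvalues):
    """per_gene_pvalues: {gene: {test_name: p_value or None}} -> list of row dicts."""
    rows = []
    for gene, by_test in per_gene_pvalues.items():
        if all(p is None for p in by_test.values()):
            continue  # genes where every test was skipped are left out
        row = {"gene": gene}
        for test_name, p in by_test.items():
            row[f"{test_name}_pvalue"] = p
        row["acat_o_pvalue"] = cauchy_combination(by_test.values())
        rows.append(row)
    return rows


rows = to_wide_rows(
    {
        "GENE_A": {"fisher": 0.5, "skat": 0.5},
        "GENE_B": {"fisher": None, "skat": None},
    }
)
```

Multiple-testing correction would then be applied across genes to the acat_o_pvalue column only, which matches the description of acat_o_qvalue above.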

class variantcentrifuge.association.AssociationTest[source]¶

Bases: ABC

Abstract base class for all association tests.

Subclasses implement a specific statistical test (Fisher, burden regression, SKAT, etc.) and are registered in AssociationEngine’s test registry. Each subclass is responsible for a single gene at a time; the engine handles iteration and correction.

name : str (property): Short, lowercase identifier used for test registry lookup and output column prefixes (e.g. “fisher”, “burden”, “skat”).

run(gene, contingency_data, config) → TestResult[source]¶: Execute the test for one gene and return a TestResult.

check_dependencies() → None[source]¶: Raise ImportError if required optional libraries are missing. The default implementation is a no-op; subclasses override as needed.

abstract property name: str¶: Short lowercase identifier for this test (e.g. ‘fisher’).

abstract run(gene, contingency_data, config)[source]¶

Run the association test for a single gene.

Parameters:

gene (str) – Gene symbol being tested.
contingency_data (dict) – Gene-level aggregated data. Keys available from gene_burden aggregation:

- proband_count : int, total case samples
- control_count : int, total control samples
- proband_carrier_count : int, case carrier samples
- control_carrier_count : int, control carrier samples
- proband_allele_count : int, total alt alleles in cases
- control_allele_count : int, total alt alleles in controls
- n_qualifying_variants : int, variants passing filters
config (AssociationConfig) – Runtime configuration (correction method, mode, CI params).

Returns:

Result with p_value=None when test is skipped (e.g. zero variants).

Return type:

TestResult

effect_column_names()[source]¶

Column name suffixes for this test’s effect size output.

Returns a mapping of semantic role to column suffix. The engine uses these to build output column names as {test_name}_{suffix}.

Default returns OR-based naming (appropriate for Fisher’s exact test). Regression tests (burden, SKAT) override to return beta/SE naming.

Return type:

dict[str, str | None]

check_dependencies()[source]¶

Verify that required optional dependencies are available.

Raises:

ImportError – If a required library is not installed. Called eagerly at engine construction so users get a clear error before processing begins.

Return type:

None

prepare(gene_count)[source]¶

Called by the engine before the per-gene loop.

Default is a no-op. Subclasses override to set up progress logging, emit large-panel warnings, initialize timers, etc.

Parameters:

gene_count (int) – Total number of genes that will be processed.

Return type:

None

finalize()[source]¶

Called by the engine after the per-gene loop completes.

Default is a no-op. Subclasses override to log aggregate timing, release resources, or perform post-run cleanup.

Return type:

None
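
To make the plugin contract concrete, the sketch below mirrors the interface described above using stand-in classes, so it runs without variantcentrifuge installed. SketchResult, SketchTest and SketchFisher are hypothetical; SketchResult is a trimmed version of TestResult, and the carrier-based two-sided Fisher test is a plain textbook implementation, not the library's FisherExactTest.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from math import comb
from typing import Optional


@dataclass
class SketchResult:
    """Trimmed stand-in for TestResult (only the fields this example needs)."""
    gene: str
    test_name: str
    p_value: Optional[float]
    n_variants: int
    extra: dict = field(default_factory=dict)


class SketchTest(ABC):
    """Stand-in for the AssociationTest contract: a name plus a per-gene run()."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def run(self, gene, contingency_data, config) -> SketchResult: ...

    def check_dependencies(self) -> None:
        """No-op by default; subclasses raise ImportError if something is missing."""


class SketchFisher(SketchTest):
    @property
    def name(self) -> str:
        return "fisher"

    def run(self, gene, contingency_data, config=None) -> SketchResult:
        n_var = contingency_data["n_qualifying_variants"]
        if n_var == 0:
            # Skipped test: p_value is None, not 1.0.
            return SketchResult(gene, self.name, None, 0)
        a = contingency_data["proband_carrier_count"]
        c = contingency_data["control_carrier_count"]
        n1 = contingency_data["proband_count"]
        n2 = contingency_data["control_count"]
        k = a + c  # total carriers; margins are fixed under the null
        total = comb(n1 + n2, k)

        def prob(x):
            return comb(n1, x) * comb(n2, k - x) / total

        observed = prob(a)
        # Two-sided p-value: sum over tables no more probable than the observed one.
        p = sum(
            prob(x)
            for x in range(max(0, k - n2), min(k, n1) + 1)
            if prob(x) <= observed * (1 + 1e-7)
        )
        table = [[a, n1 - a], [c, n2 - c]]
        return SketchResult(gene, self.name, min(1.0, p), n_var, extra={"table": table})


result = SketchFisher().run(
    "GENE_A",
    {
        "proband_count": 4,
        "control_count": 4,
        "proband_carrier_count": 3,
        "control_carrier_count": 1,
        "n_qualifying_variants": 2,
    },
)
```

The engine side of the contract then reduces to calling run() once per gene and collecting the results; iteration and correction stay outside the test class.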

class variantcentrifuge.association.TestResult(gene, test_name, p_value, corrected_p_value, effect_size, ci_lower, ci_upper, se, n_cases, n_controls, n_variants, extra=<factory>)[source]¶

Bases: object

Result from a single association test on a single gene.

Fields¶

gene (str): Gene symbol tested.
test_name (str): Short test identifier (e.g. “fisher”, “burden”, “skat”).
p_value (float | None): Raw (uncorrected) p-value. None when the test is skipped (e.g. zero qualifying variants). None is NOT the same as 1.0 (failure vs skip).
corrected_p_value (float | None): Multiple-testing-corrected p-value. Populated by AssociationEngine after all genes are tested. None until correction is applied.
effect_size (float | None): Primary effect size estimate. For Fisher: odds ratio. For regression tests (Phase 19+): beta coefficient.
ci_lower (float | None): Lower bound of the confidence interval for effect_size.
ci_upper (float | None): Upper bound of the confidence interval for effect_size.
se (float | None): Standard error of the effect size estimate. First-class field for regression tests (burden, SKAT); None for non-regression tests (Fisher).
n_cases (int): Total number of case samples in the analysis.
n_controls (int): Total number of control samples in the analysis.
n_variants (int): Number of qualifying variants for this gene (after filtering).
extra (dict): Test-specific ancillary data (e.g. contingency table, convergence flags). Not written to output unless a formatter explicitly accesses it.
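
Why a skipped test must be reported as None rather than 1.0 becomes clear once a correction is applied: each placeholder 1.0 counts as an extra hypothesis and penalises the genes that were actually tested. A small illustration with a hand-written Bonferroni correction and made-up gene labels:

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values, capped at 1.0."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


raw = {"GENE_A": 0.01, "GENE_B": None, "GENE_C": None, "GENE_D": None}

# Skipped genes are excluded, so only one hypothesis is corrected.
tested = {gene: p for gene, p in raw.items() if p is not None}
corrected_when_skipped = dict(zip(tested, bonferroni(list(tested.values()))))

# Coding skips as 1.0 quadruples the number of hypotheses.
corrected_when_padded = bonferroni([1.0 if p is None else p for p in raw.values()])
```

Here GENE_A keeps an adjusted p-value of 0.01 when the skipped genes are dropped, but it rises to 0.04 when they are padded with 1.0.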

gene: str¶

test_name: str¶

p_value: float | None¶

corrected_p_value: float | None¶

effect_size: float | None¶

ci_lower: float | None¶

ci_upper: float | None¶

se: float | None¶

n_cases: int¶

n_controls: int¶

n_variants: int¶

extra: dict[str, Any]¶

__init__(gene, test_name, p_value, corrected_p_value, effect_size, ci_lower, ci_upper, se, n_cases, n_controls, n_variants, extra=<factory>)¶

Parameters:

gene (str)
test_name (str)
p_value (float | None)
corrected_p_value (float | None)
effect_size (float | None)
ci_lower (float | None)
ci_upper (float | None)
se (float | None)
n_cases (int)
n_controls (int)
n_variants (int)
extra (dict[str, Any])

variantcentrifuge.association.apply_correction(pvals, method='fdr')[source]¶

Apply multiple testing correction to a sequence of p-values.

Produces output identical to the inline smm.multipletests calls in gene_burden.py (lines 506-510) for the same inputs and method.

Parameters:

pvals (list of float or np.ndarray) – Raw p-values to correct. Must be in [0, 1].
method (str) – Correction method: “fdr” (Benjamini-Hochberg, default) or “bonferroni”. Any other value is treated as “fdr”.

Returns:

Corrected p-values in the same order as input. If statsmodels is unavailable, returns the raw p-values unchanged (with a warning).

Return type:

np.ndarray
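
For readers unfamiliar with the “fdr” option, the Benjamini-Hochberg adjustment can be written in a few lines of pure Python. This sketch only shows the procedure; the function described above delegates to statsmodels, and benjamini_hochberg is a hypothetical name.

```python
def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by m / rank, and the running minimum ensures that a smaller raw p-value never ends up with a larger adjusted value than a bigger one.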