Association Testing Module

Modular rare variant association testing framework.

variantcentrifuge.association — Modular rare variant association framework.

v0.15.0 introduces a plugin-style association testing system where each statistical test implements the AssociationTest ABC. Phase 18 delivers the core abstractions and FisherExactTest (bit-identical reimplementation of gene_burden.py’s Fisher test). Subsequent phases add burden regression (Phase 19), SKAT (Phases 20-21), and ACAT-O (Phase 22).

Public API

AssociationTest : Abstract base class for all association tests
TestResult : Dataclass holding per-gene test results
AssociationConfig : Configuration dataclass with defaults mirroring gene_burden.py
AssociationEngine : Orchestrator: dispatches tests, applies correction, returns DataFrame
apply_correction : Standalone FDR/Bonferroni correction function

class variantcentrifuge.association.AssociationConfig(correction_method='fdr', gene_burden_mode='samples', confidence_interval_method='normal_approx', confidence_interval_alpha=0.05, continuity_correction=0.5, covariate_file=None, covariate_columns=None, categorical_covariates=None, trait_type='binary', variant_weights='beta:1,25', variant_weight_params=None, missing_site_threshold=0.1, missing_sample_threshold=0.8, firth_max_iter=25, skat_backend='python', skat_method='SKAT', min_cases=200, max_case_control_ratio=20.0, min_case_carriers=10, diagnostics_output=None, pca_file=None, pca_tool=None, pca_components=10, coast_weights=None, coast_backend='python', coast_classification=None, gene_prior_weights=None, gene_prior_weight_column='weight', association_workers=1)

Bases: object

Configuration for the association analysis framework.

All fields have defaults that mirror the equivalent keys read from the cfg dict in gene_burden.py (lines 373-377), so existing workflows can transition without changing semantics.

Fields

correction_method : str

Multiple-testing correction method. “fdr” (Benjamini-Hochberg) or “bonferroni”. Default: “fdr”.

gene_burden_mode : str

Collapsing strategy for carrier counting. “samples” = unique carrier samples (CMC/CAST); “alleles” = max allele dosage summed across samples (preserves diploid constraint). Default: “samples”.

confidence_interval_method : str

Method for odds ratio CI computation. Currently “normal_approx” which tries score -> normal -> logit methods in sequence. Default: “normal_approx”.

confidence_interval_alpha : float

Significance level for CIs. 0.05 gives 95% CIs. Default: 0.05.

continuity_correction : float

Value added to each cell when a zero is present, to stabilise CI computation. Default: 0.5 (Haldane-Anscombe correction).
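
As a rough illustration of how continuity_correction and confidence_interval_alpha interact, the sketch below adds the correction to every cell of a 2x2 table only when a zero is present, then builds a logit-scale (Woolf) interval for the odds ratio. The function name is hypothetical, and only the logit interval is shown; the score -> normal -> logit fallback chain mentioned above is not reproduced here.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(a, b, c, d, alpha=0.05, continuity_correction=0.5):
    """Odds ratio and logit-scale CI for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper for illustration; not part of variantcentrifuge.
    """
    cells = [float(a), float(b), float(c), float(d)]
    # Apply the correction to every cell, but only when some cell is zero.
    if any(x == 0 for x in cells):
        cells = [x + continuity_correction for x in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR), Woolf's method.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # alpha=0.05 gives the two-sided 95% normal quantile.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# 8 of 100 cases and 0 of 100 controls carry a qualifying variant.
or_, lo, hi = odds_ratio_with_ci(8, 92, 0, 100)
print(f"OR={or_:.2f}, CI=({lo:.2f}, {hi:.2f})")
```

Without the correction, the zero control-carrier cell would make the odds ratio infinite and the standard error undefined, which is the case the default of 0.5 is meant to stabilise.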

Parameters:
  • correction_method (str)
  • gene_burden_mode (str)
  • confidence_interval_method (str)
  • confidence_interval_alpha (float)
  • continuity_correction (float)
  • covariate_file (Optional[str])
  • covariate_columns (Optional[list[str]])
  • categorical_covariates (Optional[list[str]])
  • trait_type (str)
  • variant_weights (str)
  • variant_weight_params (Optional[dict])
  • missing_site_threshold (float)
  • missing_sample_threshold (float)
  • firth_max_iter (int)
  • skat_backend (str)
  • skat_method (str)
  • min_cases (int)
  • max_case_control_ratio (float)
  • min_case_carriers (int)
  • diagnostics_output (Optional[str])
  • pca_file (Optional[str])
  • pca_tool (Optional[str])
  • pca_components (int)
  • coast_weights (Optional[list[float]])
  • coast_backend (str)
  • coast_classification (Optional[str])
  • gene_prior_weights (Optional[str])
  • gene_prior_weight_column (str)
  • association_workers (int)

correction_method: str = 'fdr'
gene_burden_mode: str = 'samples'
confidence_interval_method: str = 'normal_approx'
confidence_interval_alpha: float = 0.05
continuity_correction: float = 0.5
covariate_file: str | None = None

Path to tab/CSV covariate file (first column = sample ID). None = no covariates.

covariate_columns: list[str] | None = None

Subset of covariate columns to use. None = all columns.

categorical_covariates: list[str] | None = None

Column names to one-hot encode. None = auto-detect (non-numeric with <=5 levels).
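
The auto-detect rule above (non-numeric with <=5 levels) can be sketched in plain Python as follows. The function names are hypothetical, and dropping the first level as the reference category is an assumption made here to avoid collinearity; the module's actual encoding may differ.

```python
def detect_categorical(columns, max_levels=5):
    """Return names of columns that are non-numeric with at most max_levels levels.

    columns: dict mapping column name -> list of raw string values.
    Hypothetical helper for illustration only.
    """
    def is_numeric(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    detected = []
    for name, values in columns.items():
        non_numeric = not all(is_numeric(v) for v in values)
        if non_numeric and len(set(values)) <= max_levels:
            detected.append(name)
    return detected


def one_hot(values):
    """One-hot encode labels, dropping the first sorted level as reference (assumption)."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels[1:]}


covariates = {
    "age": ["34", "51", "47"],
    "sex": ["F", "M", "F"],
    "batch": ["b1", "b2", "b1"],
}
print(detect_categorical(covariates))
print(one_hot(covariates["sex"]))
```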

trait_type: str = 'binary'

Phenotype scale: “binary” (logistic, Firth fallback) or “quantitative” (linear OLS).

variant_weights: str = 'beta:1,25'

Weight scheme: “beta:a,b” (Beta MAF weights, SKAT convention) or “uniform”.
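
To make the “beta:a,b” weight specification concrete, here is a minimal sketch that parses the string and evaluates the Beta(a, b) density at each variant's minor allele frequency, which is the usual SKAT weighting. The helper names are hypothetical, and whether the module applies further scaling to these weights is not stated in this reference.

```python
import math


def parse_weight_spec(spec):
    """Parse 'beta:a,b' into ('beta', a, b), or 'uniform' into ('uniform', None, None)."""
    if spec == "uniform":
        return ("uniform", None, None)
    scheme, _, params = spec.partition(":")
    if scheme != "beta":
        raise ValueError(f"Unknown weight scheme: {spec!r}")
    a, b = (float(x) for x in params.split(","))
    return ("beta", a, b)


def beta_maf_weight(maf, a, b):
    """Beta(a, b) density at maf; assumes 0 < maf < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(maf) + (b - 1) * math.log(1 - maf))


def variant_weights(mafs, spec="beta:1,25"):
    """Per-variant weights for a list of minor allele frequencies."""
    scheme, a, b = parse_weight_spec(spec)
    if scheme == "uniform":
        return [1.0 for _ in mafs]
    return [beta_maf_weight(m, a, b) for m in mafs]


# Rarer variants receive larger weights under beta:1,25.
print(variant_weights([0.001, 0.01, 0.05]))
```

With a=1 and b=25 the density decreases monotonically in the MAF, so rare variants dominate the gene-level statistic.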

variant_weight_params: dict | None = None

Extra parameters for weight schemes (e.g. {‘cadd_cap’: 40.0}).

missing_site_threshold: float = 0.1

Variants with >threshold fraction missing site-wide are excluded before imputation.

missing_sample_threshold: float = 0.8

Samples with >threshold fraction missing across kept variants are excluded.
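
The two missingness thresholds are applied in order: variants first, then samples over the variants that remain. The sketch below shows that ordering on a small genotype dictionary; the data layout (None for a missing call) and the function name are assumptions for illustration.

```python
def filter_by_missingness(genotypes, site_threshold=0.1, sample_threshold=0.8):
    """Return (kept_variant_indices, kept_sample_ids).

    genotypes: dict sample_id -> list of dosages, with None marking a missing call.
    Hypothetical helper for illustration only.
    """
    samples = list(genotypes)
    n_variants = len(genotypes[samples[0]]) if samples else 0

    # Step 1: drop variants whose missing fraction across samples exceeds the threshold.
    kept_variants = []
    for j in range(n_variants):
        missing = sum(genotypes[s][j] is None for s in samples)
        if missing / len(samples) <= site_threshold:
            kept_variants.append(j)

    # Step 2: drop samples whose missing fraction across the kept variants
    # exceeds the threshold.
    kept_samples = []
    for s in samples:
        if not kept_variants:
            kept_samples.append(s)
            continue
        missing = sum(genotypes[s][j] is None for j in kept_variants)
        if missing / len(kept_variants) <= sample_threshold:
            kept_samples.append(s)
    return kept_variants, kept_samples


genotypes = {
    "S1": [0, 1, None],
    "S2": [0, None, None],
    "S3": [1, 0, None],
    "S4": [None, None, 0],
}
# A looser site threshold is used here because the toy cohort has only four samples.
print(filter_by_missingness(genotypes, site_threshold=0.25))
```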

firth_max_iter: int = 25

Maximum Newton-Raphson iterations for Firth penalized logistic regression fallback.

skat_backend: str = 'python'

SKAT computation backend: “python” (default), “r” (deprecated, R via rpy2), or “auto”.

skat_method: str = 'SKAT'

SKAT variant to run: “SKAT” (default), “Burden” (burden-only), or “SKATO” (omnibus).

min_cases: int = 200

Cohort-level warning threshold: warn if n_cases < this value.

max_case_control_ratio: float = 20.0

Cohort-level warning threshold: warn if n_controls/n_cases > this value.

min_case_carriers: int = 10

Per-gene warning threshold: flag genes with case_carriers < this value.
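
The three thresholds above amount to simple comparisons. A minimal sketch, with hypothetical function names and message wording, might look like this:

```python
def cohort_warnings(n_cases, n_controls, min_cases=200, max_case_control_ratio=20.0):
    """Return human-readable cohort-level warnings (illustrative wording only)."""
    messages = []
    if n_cases < min_cases:
        messages.append(f"n_cases={n_cases} is below min_cases={min_cases}")
    if n_cases > 0 and n_controls / n_cases > max_case_control_ratio:
        ratio = n_controls / n_cases
        messages.append(
            f"control:case ratio {ratio:.1f} exceeds {max_case_control_ratio}"
        )
    return messages


def low_carrier_genes(case_carriers_by_gene, min_case_carriers=10):
    """Return sorted gene symbols whose case carrier count is below the threshold."""
    return sorted(
        gene for gene, n in case_carriers_by_gene.items() if n < min_case_carriers
    )


print(cohort_warnings(n_cases=150, n_controls=4000))
print(low_carrier_genes({"BRCA2": 14, "PKD1": 3, "TTN": 9}))
```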

diagnostics_output: str | None = None

Path to diagnostics output directory. None = no diagnostics output.

pca_file: str | None = None

Path to pre-computed PCA file (PLINK .eigenvec, AKT output, or generic TSV).

pca_tool: str | None = None

PCA computation tool: ‘akt’ to invoke AKT as subprocess. None = pre-computed file only.

pca_components: int = 10

Number of principal components to use. Default: 10. Warn if >20.

coast_weights: list[float] | None = None

Category weights for COAST allelic series (default: [1.0, 2.0, 3.0] for BMV, DMV, PTV).

coast_backend: str = 'python'

COAST computation backend: “python” (default), “r” (deprecated, R via rpy2), or “auto”.

coast_classification: str | None = None

Absolute path to a scoring/coast_classification/<model>/ directory. None = use built-in SIFT/PolyPhen hardcoded logic (backward-compatible default). Set by cli.py after resolving the --coast-classification model name to a path.

gene_prior_weights: str | None = None

Path to gene-to-weight TSV file for weighted BH FDR correction. None = standard (unweighted) BH/Bonferroni (backward-compatible default).

gene_prior_weight_column: str = 'weight'

Column name in the weight file containing weight values. Default: ‘weight’.

association_workers: int = 1

Number of parallel worker processes for gene-level association analysis. Default: 1 (sequential). Set > 1 for parallel execution via ProcessPoolExecutor. Set -1 for os.cpu_count(). Only effective when all registered tests have parallel_safe=True.
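
The rules for association_workers can be summarised in a small resolver. This is a sketch under the assumption that a request for more than one worker silently falls back to sequential execution when any test is not parallel-safe; the function name is hypothetical.

```python
import os


def resolve_worker_count(association_workers, tests_parallel_safe):
    """Translate the configured worker count into an effective one.

    tests_parallel_safe: iterable of bools, one per registered test.
    Hypothetical helper for illustration only.
    """
    if association_workers == -1:
        # os.cpu_count() may return None on some platforms.
        requested = os.cpu_count() or 1
    else:
        requested = max(1, association_workers)
    # Parallel execution is only effective when every test is parallel-safe.
    if requested > 1 and not all(tests_parallel_safe):
        return 1
    return requested


print(resolve_worker_count(4, [True, True]))
print(resolve_worker_count(4, [True, False]))
```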

class variantcentrifuge.association.AssociationEngine(tests, config)

Bases: object

Orchestrates association testing across multiple genes and multiple tests.

Usage

>>> config = AssociationConfig(gene_burden_mode="samples")
>>> engine = AssociationEngine.from_names(["fisher"], config)
>>> result_df = engine.run_all(gene_burden_data)

Parameters:
  • tests (list of AssociationTest) – Instantiated test objects to run. Use from_names() for the common case.

  • config (AssociationConfig) – Configuration shared across all tests.

classmethod from_names(test_names, config)

Construct engine from a list of test name strings.

Parameters:
  • test_names (list of str) – Names of tests to run (e.g. [“fisher”]).

  • config (AssociationConfig) – Runtime configuration.

Return type:

AssociationEngine

Raises:
  • ValueError – If any test_name is not in the registry. The error message lists all available tests so users know what’s valid. Note: “acat_o” is NOT a valid test name here — ACAT-O is computed post-loop as a meta-test and does not run per-gene.

  • ImportError – If a test’s required dependencies are not installed. Raised eagerly (before any data processing) by check_dependencies().

run_all(gene_burden_data)

Run all registered tests across all genes and return wide-format results.

FDR correction (ARCH-03): applied only to ACAT-O p-values across all genes. Primary test columns (fisher_pvalue, burden_pvalue, etc.) are uncorrected — they are diagnostic signal decomposition, not independent hypotheses. corrected_p_value on primary TestResults is always None.
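
For orientation, the sketch below shows the standard Cauchy combination (ACAT) that an omnibus p-value of this kind is typically built on: each primary p-value is mapped to a Cauchy variate, the variates are averaged, and the average is mapped back to a p-value. Equal weights and the skipping of None entries are assumptions made here; this reference does not state how the engine weights or filters its inputs.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine p-values with the Cauchy combination (ACAT) statistic.

    None entries (skipped tests) are ignored. Returns None if nothing remains.
    Inputs are assumed to lie strictly between 0 and 1.
    Hypothetical helper for illustration only.
    """
    if weights is None:
        weights = [1.0] * len(pvalues)
    pairs = [(p, w) for p, w in zip(pvalues, weights) if p is not None]
    if not pairs:
        return None
    total_weight = sum(w for _, w in pairs)
    statistic = 0.0
    for p, w in pairs:
        if p < 1e-16:
            # Tail approximation of tan((0.5 - p) * pi) for very small p.
            statistic += w / (p * math.pi)
        else:
            statistic += w * math.tan((0.5 - p) * math.pi)
    statistic /= total_weight
    return 0.5 - math.atan(statistic) / math.pi


# One gene with Fisher and burden p-values; the SKAT test was skipped.
print(cauchy_combine([0.01, 0.5, None]))
```

The resulting per-gene omnibus p-values are what would then be passed to the FDR step across genes.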

Parameters:

gene_burden_data (list of dict) – One dict per gene with keys: GENE, proband_count, control_count, proband_carrier_count, control_carrier_count, proband_allele_count, control_allele_count, n_qualifying_variants.

Returns:

Wide-format results. One row per gene that has at least one test result (genes where all tests returned p_value=None are excluded). Columns: gene, n_cases, n_controls, n_variants, then per-test columns: {test}_pvalue (uncorrected), plus test-aware effect columns (e.g. fisher_or or logistic_burden_beta), and finally acat_o_pvalue and acat_o_qvalue.

Return type:

pd.DataFrame
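
The wide-format layout described above can be illustrated without pandas. The sketch below pivots long per-test results into one row per gene with {test}_pvalue columns and drops genes where every test was skipped. The input dictionaries stand in for TestResult objects and the function name is hypothetical; effect columns and the ACAT-O columns are omitted for brevity.

```python
def to_wide_rows(results):
    """Pivot per-test result dicts into one row per gene.

    results: list of dicts with keys gene, test_name, p_value,
    n_cases, n_controls, n_variants. Hypothetical helper for illustration.
    """
    rows = {}
    for r in results:
        row = rows.setdefault(
            r["gene"],
            {
                "gene": r["gene"],
                "n_cases": r["n_cases"],
                "n_controls": r["n_controls"],
                "n_variants": r["n_variants"],
            },
        )
        row[f"{r['test_name']}_pvalue"] = r["p_value"]

    # Keep only genes with at least one non-None p-value.
    return [
        row
        for row in rows.values()
        if any(v is not None for k, v in row.items() if k.endswith("_pvalue"))
    ]


results = [
    {"gene": "BRCA2", "test_name": "fisher", "p_value": 0.001,
     "n_cases": 100, "n_controls": 200, "n_variants": 4},
    {"gene": "BRCA2", "test_name": "burden", "p_value": 0.004,
     "n_cases": 100, "n_controls": 200, "n_variants": 4},
    {"gene": "TTN", "test_name": "fisher", "p_value": None,
     "n_cases": 100, "n_controls": 200, "n_variants": 0},
]
for row in to_wide_rows(results):
    print(row)
```

A list of rows in this shape can be handed directly to the pandas DataFrame constructor.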

class variantcentrifuge.association.AssociationTest

Bases: ABC

Abstract base class for all association tests.

Subclasses implement a specific statistical test (Fisher, burden regression, SKAT, etc.) and are registered in AssociationEngine’s test registry. Each subclass is responsible for a single gene at a time; the engine handles iteration and correction.

name : str (property)

Short, lowercase identifier used for test registry lookup and output column prefixes (e.g. “fisher”, “burden”, “skat”).

run(gene, contingency_data, config) -> TestResult

Execute the test for one gene and return a TestResult.

check_dependencies() -> None

Raise ImportError if required optional libraries are missing. The default implementation is a no-op; subclasses override as needed.

abstract property name: str

Short lowercase identifier for this test (e.g. ‘fisher’).

abstract run(gene, contingency_data, config)

Run the association test for a single gene.

Parameters:
  • gene (str) – Gene symbol being tested.

  • contingency_data (dict) –

    Gene-level aggregated data. Keys available from gene_burden aggregation:

    • proband_count : int, total case samples

    • control_count : int, total control samples

    • proband_carrier_count : int, case carrier samples

    • control_carrier_count : int, control carrier samples

    • proband_allele_count : int, total alt alleles in cases

    • control_allele_count : int, total alt alleles in controls

    • n_qualifying_variants : int, variants passing filters

  • config (AssociationConfig) – Runtime configuration (correction method, mode, CI params).

Returns:

Result with p_value=None when test is skipped (e.g. zero variants).

Return type:

TestResult
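
To show how the contingency_data keys might feed a test, the sketch below builds a 2x2 table for either gene_burden_mode and computes a two-sided Fisher's exact p-value in pure Python. It is not the module's FisherExactTest: the function names are hypothetical, and the “alleles” table assumes two alleles per sample so that reference counts are 2 * sample count minus the alt allele count.

```python
from math import comb


def build_table(data, mode="samples"):
    """Return (case_alt, case_ref, control_alt, control_ref) for the chosen mode."""
    if mode == "samples":
        case_alt, case_total = data["proband_carrier_count"], data["proband_count"]
        ctrl_alt, ctrl_total = data["control_carrier_count"], data["control_count"]
    elif mode == "alleles":
        # Assumption: diploid samples, so allele totals are 2 * sample count.
        case_alt, case_total = data["proband_allele_count"], 2 * data["proband_count"]
        ctrl_alt, ctrl_total = data["control_allele_count"], 2 * data["control_count"]
    else:
        raise ValueError(f"Unknown gene_burden_mode: {mode!r}")
    return case_alt, case_total - case_alt, ctrl_alt, ctrl_total - ctrl_alt


def fisher_exact_two_sided(a, b, c, d):
    """Sum the hypergeometric probabilities of all tables no more likely than observed."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, support) if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)


def fisher_p_value(data, mode="samples"):
    """Return None (skipped) when the gene has no qualifying variants."""
    if data["n_qualifying_variants"] == 0:
        return None
    return fisher_exact_two_sided(*build_table(data, mode))


gene_data = {
    "proband_count": 100, "control_count": 200,
    "proband_carrier_count": 12, "control_carrier_count": 5,
    "proband_allele_count": 13, "control_allele_count": 5,
    "n_qualifying_variants": 4,
}
print(fisher_p_value(gene_data, "samples"), fisher_p_value(gene_data, "alleles"))
```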

effect_column_names()

Column name suffixes for this test’s effect size output.

Returns a mapping of semantic role to column suffix. The engine uses these to build output column names as {test_name}_{suffix}.

Default returns OR-based naming (appropriate for Fisher’s exact test). Regression tests (burden, SKAT) override to return beta/SE naming.

Return type:

dict[str, str | None]

check_dependencies()

Verify that required optional dependencies are available.

Raises:

ImportError – If a required library is not installed. Called eagerly at engine construction so users get a clear error before processing begins.

Return type:

None

prepare(gene_count)

Called by the engine before the per-gene loop.

Default is a no-op. Subclasses override to set up progress logging, emit large-panel warnings, initialize timers, etc.

Parameters:

gene_count (int) – Total number of genes that will be processed.

Return type:

None

finalize()

Called by the engine after the per-gene loop completes.

Default is a no-op. Subclasses override to log aggregate timing, release resources, or perform post-run cleanup.

Return type:

None

class variantcentrifuge.association.TestResult(gene, test_name, p_value, corrected_p_value, effect_size, ci_lower, ci_upper, se, n_cases, n_controls, n_variants, extra=<factory>)

Bases: object

Result from a single association test on a single gene.

Fields

gene : str

Gene symbol tested.

test_name : str

Short test identifier (e.g. “fisher”, “burden”, “skat”).

p_value : float | None

Raw (uncorrected) p-value. None when test is skipped (e.g. zero qualifying variants) — None is NOT the same as 1.0 (failure vs skip).

corrected_p_value : float | None

Multiple-testing-corrected p-value. Populated by AssociationEngine after all genes are tested. None until correction is applied.

effect_size : float | None

Primary effect size estimate. For Fisher: odds ratio. For regression tests (Phase 19+): beta coefficient.

ci_lower : float | None

Lower bound of the confidence interval for effect_size.

ci_upper : float | None

Upper bound of the confidence interval for effect_size.

se : float | None

Standard error of the effect size estimate. First-class field for regression tests (burden, SKAT); None for non-regression tests (Fisher).

n_cases : int

Total number of case samples in the analysis.

n_controls : int

Total number of control samples in the analysis.

n_variants : int

Number of qualifying variants for this gene (after filtering).

extra : dict

Test-specific ancillary data (e.g. contingency table, convergence flags). Not written to output unless a formatter explicitly accesses it.

Parameters:
  • gene (str)
  • test_name (str)
  • p_value (float | None)
  • corrected_p_value (float | None)
  • effect_size (float | None)
  • ci_lower (float | None)
  • ci_upper (float | None)
  • se (float | None)
  • n_cases (int)
  • n_controls (int)
  • n_variants (int)
  • extra (dict[str, Any])

gene: str
test_name: str
p_value: float | None
corrected_p_value: float | None
effect_size: float | None
ci_lower: float | None
ci_upper: float | None
se: float | None
n_cases: int
n_controls: int
n_variants: int
extra: dict[str, Any]
variantcentrifuge.association.apply_correction(pvals, method='fdr')

Apply multiple testing correction to a sequence of p-values.

Produces output identical to the inline smm.multipletests calls in gene_burden.py (lines 506-510) for the same inputs and method.

Parameters:
  • pvals (list of float or np.ndarray) – Raw p-values to correct. Must be in [0, 1].

  • method (str) – Correction method: “fdr” (Benjamini-Hochberg, default) or “bonferroni”. Any other value is treated as “fdr”.

Returns:

Corrected p-values in the same order as input. If statsmodels is unavailable, returns the raw p-values unchanged (with a warning).

Return type:

np.ndarray
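
For readers who want to see what the two correction methods compute, here is a dependency-free sketch of Bonferroni and Benjamini-Hochberg adjustment. It is written in plain Python for illustration and returns lists; apply_correction itself is documented above as delegating to statsmodels and returning an np.ndarray. The function names are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * n / rank over all equal or larger ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pvals))
print(bonferroni(pvals))
```

The backward pass is what keeps the adjusted values monotone in the raw p-values, so a gene with a smaller raw p-value never ends up with a larger adjusted one.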