# Association Testing Guide

This guide covers all association testing functionality in VariantCentrifuge v0.16.0, including
Fisher's exact test, logistic and linear burden tests, SKAT/SKAT-O, COAST (allelic series),
the ACAT-O omnibus, and gene-level FDR weighting. It assumes VariantCentrifuge is already installed and you have a multi-sample
VCF annotated with functional predictions and population allele frequencies.

---

## Quick Start

Run a Fisher's exact test on all genes in your gene file:

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --output-file results.tsv
```

Output: `results.association.tsv` with columns:

```
gene  n_cases  n_controls  n_variants  fisher_pvalue  fisher_or  fisher_or_ci_lower  fisher_or_ci_upper  acat_o_pvalue  acat_o_qvalue  warnings
GENE1 312      488         7           0.00031         3.82       1.87                7.78                0.00031         0.0047
GENE2 312      488         3           0.41            1.23       0.74                2.05                0.41            1.0
...
```

**Primary significance measure:** `acat_o_qvalue` — this is the FDR-corrected omnibus
p-value. Use it for all significance decisions.

---

## Setup: Covariates

### Covariate File Format

Covariate files are TSV or CSV with sample IDs in the first column and a header row:

```
sample_id   age   sex   batch   gfr
SAMPLE_001  45    F     batch1  62.3
SAMPLE_002  51    M     batch1  88.1
SAMPLE_003  38    F     batch2  45.7
...
```

- First column: sample IDs matching VCF sample IDs exactly (case-sensitive)
- Delimiter: auto-detected (`.tsv`/`.tab` → tab, `.csv` → comma, otherwise sniffed)
- Header row required

### CLI Flags

```bash
--covariate-file covariates.tsv     # Path to covariate file
--covariates age,sex,batch          # Use only these columns (default: all)
--categorical-covariates sex,batch  # Force one-hot encoding for these columns
```

**Auto-detection of categorical columns:** Non-numeric columns with 5 or fewer unique values are
automatically one-hot encoded. Override with `--categorical-covariates` if needed (e.g., a batch
column with many unique integer codes that is still categorical).

### Example with Covariates

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests logistic_burden \
  --covariate-file covariates.tsv \
  --covariates age,sex,batch \
  --categorical-covariates sex,batch \
  --output-file results.tsv
```

:::{note}
`--covariate-file` requires `--perform-association`. Specifying it without enabling association
analysis raises a validation error at startup.
:::

---

## Setup: PCA

Population stratification is a major confounder in case-control association studies. Include
principal components as covariates to correct for it.

### Supported PCA File Formats

VariantCentrifuge auto-detects three PCA file formats:

**PLINK .eigenvec with header** (line starts with `#FID` or `FID`):
```
#FID  IID         PC1       PC2       PC3
FAM1  SAMPLE_001  0.0142    -0.0231   0.0089
FAM1  SAMPLE_002  -0.0305   0.0118    -0.0042
```

**PLINK .eigenvec without header** (two non-numeric ID columns then numeric):
```
FAM1  SAMPLE_001  0.0142    -0.0231   0.0089
FAM1  SAMPLE_002  -0.0305   0.0118    -0.0042
```

**AKT stdout / generic TSV** (one sample ID column then numeric):
```
sample_id   PC1       PC2       PC3
SAMPLE_001  0.0142    -0.0231   0.0089
SAMPLE_002  -0.0305   0.0118    -0.0042
```

PLINK format uses IID (column 2) as the sample identifier — this matches VCF sample names.

### CLI Flags

```bash
--pca-file ancestry.eigenvec    # Pre-computed PCA file
--pca-components 10             # Number of PCs to include (default: 10, warn if >20)
--pca-tool akt                  # Invoke AKT as subprocess to compute PCA on-the-fly
```

PCA columns are automatically appended to the covariate matrix before any regression. If you
specify both `--covariate-file` and `--pca-file`, PCs are added as additional columns.

:::{warning}
Using `--pca-tool akt` requires AKT to be in your PATH. A missing AKT binary raises a hard error
(not a silent skip) to prevent undetected misconfiguration.
:::

:::{tip}
Using more than 20 PCs (`--pca-components >20`) triggers a warning. Over-conditioning on PCs can
absorb true signal from rare variant tests.
:::

---

## Simple Tests: Fisher's Exact Test

Fisher's exact test uses a 2×2 contingency table (carrier vs non-carrier in cases vs controls)
to detect carrier enrichment. It requires no distributional assumptions and works at any sample
size, making it an appropriate first-pass screen for all cohorts. It does not adjust for
covariates.

**When to use:** Any cohort size; first-pass screening; when covariate adjustment is not required.

**CLI name:** `fisher` (default when `--perform-association` is specified without
`--association-tests`)

### CLI Example

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests fisher \
  --output-file results.tsv
```

### Output Columns

| Column | Description |
|--------|-------------|
| `fisher_pvalue` | Raw (uncorrected) Fisher p-value |
| `fisher_or` | Odds ratio |
| `fisher_or_ci_lower` | OR 95% CI lower bound |
| `fisher_or_ci_upper` | OR 95% CI upper bound |

### Collapsing Modes

The `gene_burden_mode` setting controls how variants are collapsed per sample:

- `samples` (default): Binary carrier indicator — a sample is a carrier if it has ≥1 qualifying
  variant allele. This is the CMC/CAST approach.
- `alleles`: Maximum dosage across variants per sample (0, 1, or 2). More sensitive to dosage
  effects.

Set via JSON config (see [JSON Config](#tuning-json-config)):
```json
{ "association": { "gene_burden_mode": "alleles" } }
```

---

## Simple Tests: Burden Tests

Burden tests collapse all qualifying variants in a gene into a single weighted burden score per
sample, then regress phenotype on that score. This approach maximises power when variants in a
gene all act in the same direction.

### Logistic Burden Test (Binary Traits)

The logistic burden test fits a logistic regression model with the weighted burden score (plus
optional covariates) as predictors and case/control status as outcome. It automatically falls
back to Firth penalised likelihood when standard logistic regression fails to converge or when
quasi-complete separation is detected (BSE > 100).

**When to use:** Binary trait (case/control), moderate-to-large cohort (≥50 cases), covariate
adjustment required.

**CLI name:** `logistic_burden`

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests logistic_burden \
  --trait-type binary \
  --covariate-file covariates.tsv \
  --covariates age,sex,batch \
  --categorical-covariates sex,batch \
  --output-file results.tsv
```

**Output columns:**

| Column | Description |
|--------|-------------|
| `logistic_burden_pvalue` | Raw p-value |
| `logistic_burden_beta` | Beta coefficient (log-odds per unit burden) |
| `logistic_burden_se` | Standard error of beta |
| `logistic_burden_beta_ci_lower` | Beta 95% CI lower bound |
| `logistic_burden_beta_ci_upper` | Beta 95% CI upper bound |

:::{note}
Burden tests report **beta + SE**, not odds ratios. The per-unit burden OR would be numerically
close to 1.0 due to beta-distributed variant weights and is not interpretable. Beta matches the
SKAT/SAIGE-GENE convention.
:::

**Separation warning codes:** `PERFECT_SEPARATION`, `QUASI_SEPARATION`, `FIRTH_CONVERGE_FAIL`,
`ZERO_CARRIERS_ONE_GROUP`, `LOW_CARRIER_COUNT`. These appear in the `warnings` column.

### Linear Burden Test (Quantitative Traits)

The linear burden test fits an OLS regression with the weighted burden score (plus optional
covariates) as predictors and a continuous phenotype as outcome.

**When to use:** Quantitative trait (biomarker, laboratory measurement), covariate adjustment
required.

**CLI name:** `linear_burden`

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column gfr \
  --perform-association \
  --association-tests linear_burden \
  --trait-type quantitative \
  --covariate-file covariates.tsv \
  --covariates age,sex,batch \
  --output-file results.tsv
```

**Output columns:**

| Column | Description |
|--------|-------------|
| `linear_burden_pvalue` | Raw p-value |
| `linear_burden_beta` | Beta coefficient |
| `linear_burden_se` | Standard error |
| `linear_burden_beta_ci_lower` | Beta 95% CI lower bound |
| `linear_burden_beta_ci_upper` | Beta 95% CI upper bound |

---

## Advanced Tests: SKAT / SKAT-O

The Sequence Kernel Association Test (SKAT) is a score-based quadratic kernel test that does not
assume all variants in a gene have the same direction of effect. It is most powerful when effects
are heterogeneous — some variants protective, others harmful, or only a subset truly causal.
SKAT-O (optimal SKAT) mixes SKAT and burden test statistics to maximise power over a range of
effect architectures, estimating the optimal mixing parameter rho.

**When to use:** Large cohorts (≥100 cases recommended); when effect direction is uncertain or
heterogeneous.

**CLI name:** `skat` (activates SKAT-O by default)

### Backends

| Backend | CLI Flag | Notes |
|---------|----------|-------|
| Pure Python (default) | `--skat-backend python` | numpy/scipy; thread-safe; no R required |
| Auto | `--skat-backend auto` | Selects Python (same as `python` in v0.15.0) |
| R (deprecated) | `--skat-backend r` | rpy2 + R SKAT package; not thread-safe; raises DeprecationWarning |

The Python backend uses the compiled Davies method (qfc.c via ctypes) for exact p-values with a
saddlepoint approximation fallback for extreme values, and Liu fallback as a last resort. Results
are validated to match the R SKAT package within 10%.

### CLI Example

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests skat \
  --trait-type binary \
  --covariate-file covariates.tsv \
  --pca-file ancestry.eigenvec \
  --output-file results.tsv
```

### Output Columns

| Column | Description |
|--------|-------------|
| `skat_o_pvalue` | Raw SKAT-O p-value |
| `skat_o_rho` | Optimal rho (mixing parameter between SKAT and burden) |
| `acat_v_p` | ACAT-V per-variant score (internal; feeds ACAT-O) |

SKAT produces no effect size estimate. The p-value is the primary output.

### ACAT-V

When SKAT runs, ACAT-V (Aggregated Cauchy Association Test — Variant) is automatically computed
as a per-variant score combination. It is not a separate test to activate; its p-value is stored
in `acat_v_p` and fed into the ACAT-O omnibus alongside the SKAT p-value.

---

## Advanced Tests: COAST

COAST (Combined Omnibus Association Test for allelic Series) tests the hypothesis that variants
with greater predicted functional impact have larger effect sizes, following the ordered
alternative PTV > DMV > BMV. It combines six burden components (BMV sum/max, DMV sum/max, PTV
sum/max) and an allelic SKAT via Cauchy combination. This test is appropriate when you have a
biological prior that PTVs are most damaging, with missense variants spanning a range of
functional severity.

**Reference:** McCaw et al. AJHG 2023.

**When to use:** When a monotone effect size trend (PTVs most damaging) is expected; requires
SIFT and PolyPhen predictions in VCF annotations.

**CLI name:** `coast`

### Variant Classification

COAST classifies qualifying variants into three categories:

| Category | Code | Criteria |
|----------|------|---------|
| **PTV** (Protein-Truncating) | 3 | HIGH impact AND effect in {stop_gained, frameshift_variant, splice_acceptor_variant, splice_donor_variant} |
| **DMV** (Deleterious Missense) | 2 | Missense + (SIFT: D OR PolyPhen: P/D) — either prediction suffices |
| **BMV** (Benign Missense) | 1 | Missense + (SIFT: T) + (PolyPhen: B) — both required |
| Unclassified | 0 | Missense without SIFT/PolyPhen predictions — excluded from COAST only |

Annotation columns tried (first found wins):
- SIFT: `dbNSFP_SIFT_pred`, `SIFT_pred`, `sift_pred`
- PolyPhen: `dbNSFP_Polyphen2_HDIV_pred`, `dbNSFP_Polyphen2_HVAR_pred`, `Polyphen2_HDIV_pred`, `Polyphen2_HVAR_pred`, `polyphen2_hdiv_pred`

Unclassified variants (code 0) are excluded from COAST but remain in SKAT and burden tests.

### Backends

| Backend | CLI Flag | Notes |
|---------|----------|-------|
| Pure Python (default) | `--coast-backend python` | No R required; thread-safe; matches R AllelicSeries |
| Auto | `--coast-backend auto` | Probes rpy2 AND AllelicSeries R package; falls back to Python |
| R (deprecated) | `--coast-backend r` | Requires rpy2 + R AllelicSeries package; raises DeprecationWarning |

### CLI Example

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests coast \
  --trait-type binary \
  --covariate-file covariates.tsv \
  --coast-weights 1,2,3 \
  --output-file results.tsv
```

### COAST Weights

`--coast-weights BMV,DMV,PTV` sets the ordered alternative weights. Default is `1,2,3`
(PTVs three times the weight of BMVs). Adjust based on biological expectation.

### Output Columns

| Column | Description |
|--------|-------------|
| `coast_pvalue` | Raw COAST omnibus p-value |
| `coast_burden_pvalue` | Cauchy of the 6 burden components |
| `coast_skat_pvalue` | Allelic SKAT p-value |
| `coast_n_bmv` | Number of BMV variants |
| `coast_n_dmv` | Number of DMV variants |
| `coast_n_ptv` | Number of PTV variants |

---

## Combining Tests: ACAT-O

ACAT-O (Aggregated Cauchy Association Test — Omnibus) combines all primary test p-values for a
gene into a single omnibus p-value using the Cauchy combination. This captures signal regardless
of which test is most powerful for a given gene. The Cauchy combination is particularly robust to
correlation between test statistics.

**References:** Liu et al. 2019 AJHG; Liu and Xie 2020 JASA.

ACAT-O is **not** a test you activate separately — it is computed automatically after all
selected tests have run. When SKAT is active, ACAT-V is also included in the combination.

### FDR Correction Strategy

A **single** Benjamini-Hochberg FDR correction pass is applied to the ACAT-O p-values across all
genes. Individual test p-values (`fisher_pvalue`, `skat_o_pvalue`, etc.) are **not** corrected.

:::{warning}
Do not apply additional correction to individual test p-values, and do not apply FDR separately
per test. Both approaches are statistically incorrect given this design. Use
`acat_o_qvalue` as your primary significance measure.
:::

### Output Columns

| Column | Description |
|--------|-------------|
| `acat_o_pvalue` | Raw (uncorrected) ACAT-O omnibus p-value |
| `acat_o_qvalue` | FDR-corrected (Benjamini-Hochberg) ACAT-O p-value |

### Pass-Through Behaviour

When only one test is active, `acat_o_pvalue` equals that test's raw p-value (pass-through).
FDR correction is still applied across genes. This is by design — the omnibus is always a safe
primary measure.

---

## Tuning: Variant Weights

Variant weights control the contribution of each qualifying variant to the burden score in
logistic/linear burden tests and to the SKAT kernel matrix. Well-chosen weights improve power
by upweighting likely pathogenic variants.

### Weight Schemes

| Scheme | CLI Spec | Description |
|--------|---------|-------------|
| Beta(MAF) | `beta:1,25` | Default. Beta(MAF; a=1, b=25) — rare variants upweighted; matches R SKAT convention |
| Beta custom | `beta:a,b` | Custom Beta distribution parameters, e.g. `beta:1,1` for flat weighting |
| Uniform | `uniform` | All qualifying variants receive weight 1.0 |
| CADD | `cadd` | Beta(MAF) × min(CADD_phred / 40, 1.0); requires `CADD_phred` or `dbNSFP_CADD_phred` |
| REVEL | `revel` | Beta(MAF) × REVEL_score [0,1]; requires `REVEL_score` or `dbNSFP_REVEL_score` |
| Combined | `combined` | Beta(MAF) × functional score; prefers CADD, falls back to REVEL |

### CLI Flags

```bash
--variant-weights beta:1,25              # Weight scheme (default: beta:1,25)
--variant-weight-params '{"cadd_cap":30}'  # JSON string with extra parameters
```

`--variant-weight-params` accepts a JSON string for scheme-specific tuning. Current supported
parameter: `cadd_cap` (float, default 40.0) — cap CADD phred score before normalisation.

### Fallback Behaviour for Missing Annotations

When functional annotations are absent:
- LoF (HIGH impact) variants: functional weight 1.0 (conservative, no down-weighting)
- Missense without SIFT/PolyPhen/CADD/REVEL: functional weight 1.0
- A warning is logged with per-category missing annotation counts

:::{tip}
Start with the default `beta:1,25` weights. Switch to `combined` or `cadd` only when CADD
annotations are present in your VCF and you want to differentiate variant functional severity
beyond MAF alone.
:::

---

## Tuning: Diagnostics

Enable diagnostics to assess test calibration:

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests fisher,skat,logistic_burden \
  --covariate-file covariates.tsv \
  --diagnostics-output diagnostics/ \
  --output-file results.tsv
```

### Output Files

| File | Format | Description |
|------|--------|-------------|
| `lambda_gc.tsv` | TSV: `test_name`, `lambda_gc`, `n_tests` | Genomic inflation factor per test |
| `qq_data.tsv` | TSV: `test`, `expected_neg_log10_p`, `observed_neg_log10_p` | QQ plot data (sorted ascending by expected) |
| `summary.txt` | Plain text | Human-readable summary: sample sizes, per-gene warnings, lambda_GC values |
| `qq_plot.png` | PNG (optional) | QQ plot — written only if `matplotlib` is installed |

### Interpreting lambda_GC

The genomic inflation factor lambda_GC is the ratio of the median observed chi-squared statistic
to the expected median (0.455 for chi2 with 1 df):

| Value | Interpretation |
|-------|---------------|
| ~1.0 | Well-calibrated — test statistics match the null distribution |
| >1.05 | Inflation — investigate population stratification, cryptic relatedness, or true polygenic signal |
| <0.95 | Deflation — may indicate overly conservative test or sparse variants |
| Flagged `[UNRELIABLE: n < 100]` | Fewer than 100 tested genes — lambda_GC is unreliable at this scale |

:::{note}
lambda_GC is unreliable when fewer than 100 genes were tested. For small gene panels, skip
lambda_GC and rely on individual gene results directly.
:::

:::{note}
For Fisher's exact test, lambda_GC is computed as a diagnostic indicator only — it is **not**
used to correct or adjust Fisher p-values. Use it to detect gross calibration issues, not as a
correction factor.
:::

---

## Tuning: Gene-Level FDR Weighting

By default, Benjamini-Hochberg FDR correction treats all genes equally. When you have prior
biological knowledge about which genes are more likely to be associated with your phenotype, you
can provide per-gene weights to increase statistical power for those genes while maintaining
overall FDR control.

This implements the **weighted Benjamini-Hochberg** procedure (Genovese et al. 2006), which
divides each gene's p-value by its weight before BH ranking. Genes with higher weights are
effectively given a larger share of the FDR budget, making them easier to detect. The overall
FDR guarantee is preserved as long as weights average to 1.0, which VariantCentrifuge enforces
automatically through renormalization.

### How It Works

1. You provide a TSV file mapping gene names to positive weights (higher = more likely relevant)
2. At correction time, weights are renormalized so their mean across *tested* genes equals 1.0
3. Each gene's ACAT-O p-value is divided by its normalized weight before BH ranking
4. Genes absent from the weight file receive weight 1.0 (neutral)

The net effect: high-weight genes become easier to call significant, low-weight genes become
harder, and unweighted genes are slightly penalized to compensate. If your priors are correct,
you gain power. If wrong, you lose some — but FDR is still controlled at the stated level.

### Weight File Format

A two-column TSV with a header row. The first column is the gene name (must match gene names in
the association output exactly), and the second column is the weight (positive number):

```
gene	weight
PKD1	3.0
PKD2	2.5
IFT140	2.0
BRCA1	1.8
BRCA2	1.8
TP53	1.5
TTN	0.5
```

Weights do not need to sum or average to any particular value — renormalization is automatic. A
weight of 2.0 means "I believe this gene is twice as likely to matter as a typical gene." A
weight below 1.0 means "this gene is less likely to matter" (e.g., TTN, which produces many
rare variants due to its size but is often a nuisance signal).

### Deriving Weights

There is no single correct way to derive weights. The choice depends on your study and what
prior information is available. Here are practical approaches:

**Gene constraint scores (recommended starting point):**
Use gnomAD pLI or LOEUF scores, which quantify how intolerant a gene is to loss-of-function
variation. Highly constrained genes are more likely to be disease-relevant:

```python
# Example: convert pLI to weights
# pLI ranges from 0 to 1; higher = more constrained
import pandas as pd
df = pd.read_csv("gnomad_constraint.tsv", sep="\t")
df["weight"] = 1.0 + 2.0 * df["pLI"]  # range: 1.0 to 3.0
df[["gene", "weight"]].to_csv("gene_weights.tsv", sep="\t", index=False)
```

**GWAS proximity:**
Genes near genome-wide significant GWAS loci for your phenotype get higher weights:

```
gene	weight
PKD1	3.0    # GWAS hit for kidney disease
PKD2	2.5    # GWAS hit for kidney disease
UMOD	2.0    # GWAS hit for kidney function
APOL1	2.0    # Known risk gene in specific populations
```

**Clinical gene panels:**
Genes on an established diagnostic panel for your phenotype get higher weights, non-panel genes
get neutral or lower weights:

```python
panel_genes = set(open("panel_genes.txt").read().split())
# Panel genes: weight 2.0, others: weight 1.0
weights = {g: 2.0 if g in panel_genes else 1.0 for g in all_genes}
```

**Literature-based:**
Manually assign weights based on published evidence. This is subjective but often the most
practical approach for small targeted studies.

**Combining sources:**
Multiply independent evidence sources (each normalized to mean ~1.0):

```python
weight = pli_weight * panel_weight * gwas_weight
```

:::{tip}
Keep weights moderate (0.5–5.0 range). Extreme weights (e.g., 100) concentrate nearly all FDR
budget on one gene and leave almost nothing for discoveries in other genes. If you are certain
a gene is causal, you probably do not need a statistical test for it.
:::

:::{warning}
A weight of 0 is not allowed (division by zero). Use a small positive value like 0.1 if you
want to strongly down-weight a gene.
:::

### CLI Flags

```bash
--gene-prior-weights weights.tsv         # Path to gene-weight TSV file
--gene-prior-weight-column weight        # Column name for weights (default: "weight")
```

### CLI Example

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests logistic_burden \
  --correction-method fdr \
  --gene-prior-weights gene_weights.tsv \
  --diagnostics-output diagnostics/ \
  --output-file results.tsv
```

### Output

When `--gene-prior-weights` is provided, the association output gains one additional column:

| Column | Description |
|--------|-------------|
| `fdr_weight` | Normalized weight used for this gene (after renormalization to mean=1.0) |

When `--gene-prior-weights` is **not** provided, the `fdr_weight` column is absent (backward
compatible).

### Diagnostics

When both `--gene-prior-weights` and `--diagnostics-output` are set, an additional diagnostics
file is written:

**`fdr_weight_diagnostics.tsv`** — per-gene breakdown:

| Column | Description |
|--------|-------------|
| `gene` | Gene name |
| `raw_weight` | Weight from the input file (1.0 if absent) |
| `normalized_weight` | Weight after renormalization to mean=1.0 |
| `unweighted_q` | q-value without weighting (standard BH) |
| `weighted_q` | q-value with weighting (weighted BH) |
| `significance_change` | `gained` if newly significant, `lost` if no longer significant, blank if unchanged |

The `significance_change` column highlights genes where weighting changed the significance
call at FDR < 0.05. This is the primary way to assess the impact of your weight choices.

### Coverage Warnings

If more than 50% of tested genes are absent from the weight file, a warning is logged:

```
WARNING: 85% of tested genes (148/174) are missing from gene weight file — weights may not be informative
```

This is not an error — missing genes receive weight 1.0 (neutral) — but it suggests the weight
file may be too narrow to meaningfully redistribute FDR budget.

### Effective Number of Tests

The diagnostics output includes the effective number of tests, computed as
`sum(w)^2 / sum(w^2)`. When all weights are equal, this equals the actual number of tests. When
weights vary, it is lower, reflecting the reduced multiple testing burden from concentrating
power on fewer genes. This metric helps assess how much the weights deviate from uniform.

---

## Tuning: JSON Config

All association parameters can be set via the main JSON config file (passed with `-c`/`--config`).
Use this for reproducible analyses or when parameters are too numerous for the command line.

### Annotated Example

```json
{
  "association": {
    "association_tests": ["fisher", "logistic_burden", "skat", "coast"],

    "correction_method": "fdr",
    "gene_burden_mode": "samples",
    "trait_type": "binary",

    "covariate_file": "covariates.tsv",
    "covariate_columns": ["age", "sex", "batch"],
    "categorical_covariates": ["sex", "batch"],

    "pca_file": "ancestry.eigenvec",
    "pca_components": 10,

    "variant_weights": "beta:1,25",
    "variant_weight_params": {"cadd_cap": 40.0},

    "coast_weights": [1, 2, 3],

    "gene_prior_weights": "gene_weights.tsv",
    "gene_prior_weight_column": "weight",

    "skat_backend": "python",
    "coast_backend": "python",

    "diagnostics_output": "diagnostics/",

    "min_cases": 200,
    "max_case_control_ratio": 20.0,
    "min_case_carriers": 10,

    "confidence_interval_alpha": 0.05,
    "continuity_correction": 0.5,

    "missing_site_threshold": 0.10,
    "missing_sample_threshold": 0.80,

    "firth_max_iter": 25
  }
}
```

### All 29 Valid Keys

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `association_tests` | list of str | `["fisher"]` | Tests to run: `fisher`, `logistic_burden`, `linear_burden`, `skat`, `coast` |
| `correction_method` | str | `"fdr"` | `"fdr"` (Benjamini-Hochberg) or `"bonferroni"` |
| `gene_burden_mode` | str | `"samples"` | `"samples"` (CMC/CAST carrier) or `"alleles"` (max dosage) |
| `trait_type` | str | `"binary"` | `"binary"` (case/control) or `"quantitative"` (continuous) |
| `covariate_file` | str | `null` | Path to covariate TSV/CSV |
| `covariate_columns` | list of str | `null` | Subset of covariate columns (null = all) |
| `categorical_covariates` | list of str | `null` | Columns to one-hot encode (null = auto-detect) |
| `pca_file` | str | `null` | Pre-computed PCA file (PLINK .eigenvec, AKT, or generic TSV) |
| `pca_tool` | str | `null` | `"akt"` to invoke AKT as subprocess |
| `pca_components` | int | `10` | Number of PCs to include (warn if >20) |
| `variant_weights` | str | `"beta:1,25"` | Weight scheme: `beta:a,b`, `uniform`, `cadd`, `revel`, `combined` |
| `variant_weight_params` | dict | `null` | Extra weight params, e.g. `{"cadd_cap": 30}` |
| `coast_weights` | list of float | `[1,2,3]` | COAST category weights: `[BMV, DMV, PTV]` |
| `gene_prior_weights` | str | `null` | Path to gene-weight TSV for weighted FDR correction |
| `gene_prior_weight_column` | str | `"weight"` | Column name in gene-weight file containing weights |
| `skat_backend` | str | `"python"` | `"python"`, `"auto"`, or `"r"` (deprecated) |
| `coast_backend` | str | `"python"` | `"python"`, `"auto"`, or `"r"` (deprecated) |
| `diagnostics_output` | str | `null` | Path to diagnostics directory |
| `min_cases` | int | `200` | Warn if case count below this |
| `max_case_control_ratio` | float | `20.0` | Warn if control-to-case ratio exceeds this |
| `min_case_carriers` | int | `10` | Per-gene warn if case carriers below this |
| `confidence_interval_alpha` | float | `0.05` | CI significance level (0.05 = 95% CI) |
| `continuity_correction` | float | `0.5` | Haldane-Anscombe continuity correction for zero cells in Fisher |
| `missing_site_threshold` | float | `0.10` | Exclude variants with >10% missing genotypes site-wide |
| `missing_sample_threshold` | float | `0.80` | Exclude samples with >80% missing genotypes |
| `skat_method` | str | `"SKAT"` | SKAT variant: `"SKAT"` (default), `"Burden"`, or `"SKATO"` (omnibus) |
| `confidence_interval_method` | str | `"normal_approx"` | Odds ratio CI method: tries score → normal → logit in sequence |
| `association_workers` | int | `1` | Number of parallel workers for association testing |
| `firth_max_iter` | int | `25` | Newton-Raphson iterations for Firth penalised likelihood |

### Precedence

CLI flags override JSON config values, which override `AssociationConfig` defaults:

```
CLI args > JSON "association" section > AssociationConfig defaults
```

Pass the config file via the existing `-c`/`--config` flag — there is no separate
`--association-config` flag.

:::{note}
Unknown keys in the `"association"` section raise a `ValueError` at startup with a list of
valid keys. This prevents silent misconfiguration from typos.
:::

---

## Test Selection Reference

Use this table to choose the right test(s) for your study:

| Test | CLI Name | Trait | Sample Size | What It Detects | Computational Cost | When to Use |
|------|---------|-------|-------------|-----------------|-------------------|-------------|
| Fisher's Exact | `fisher` | Binary | Any | Carrier enrichment (collapsed) | Negligible | First-pass screen; always appropriate |
| Logistic Burden | `logistic_burden` | Binary | ≥50 cases | Cumulative rare variant burden with covariate adjustment | Low | Binary trait with covariates |
| Linear Burden | `linear_burden` | Quantitative | ≥50 | Continuous phenotype burden | Low | Quantitative phenotype |
| SKAT/SKAT-O | `skat` | Binary or Quantitative | ≥100 cases | Heterogeneous effects (not all variants causal or same direction) | Moderate | Large cohorts; uncertain effect direction |
| COAST | `coast` | Binary or Quantitative | ≥100 cases | Ordered allelic series (PTV > DMV > BMV) | Moderate | Biological prior of monotone effect sizes; requires SIFT/PolyPhen |
| ACAT-O | (automatic) | Both | Depends on active tests | Omnibus combining all active tests | Negligible | Always present; primary significance measure |

**Recommended starting point for most studies:**

```bash
--association-tests fisher,logistic_burden,skat
```

This covers carrier enrichment (Fisher), covariate-adjusted regression (logistic burden), and
heterogeneous effects (SKAT-O), with ACAT-O combining all three.

Add `coast` when you have:
- SIFT and PolyPhen annotations in your VCF
- A biological prior that functional severity predicts effect magnitude
- Sufficient sample size (≥100 cases)

Use `linear_burden` instead of `logistic_burden` when:
- Your phenotype is a continuous measurement (e.g., GFR, blood pressure, biomarker level)
- You set `--trait-type quantitative`

---

## Troubleshooting

### R not found / rpy2 not installed

**Error:** `ImportError: rpy2 not installed` or `ImportError: Cannot find R installation`

**Solution:** Use the default Python backend — it requires no R installation:

```bash
--skat-backend python    # explicit
--coast-backend python   # explicit
# (These are the defaults in v0.15.0; you don't need to specify them)
```

The R backend is deprecated as of v0.15.0. If you specifically need R validation, install rpy2
and the R SKAT/AllelicSeries packages, then use `--skat-backend r`. A DeprecationWarning is
logged.

### Low Sample Size Warnings

**Warning in `warnings` column:** `LOW_CASE_COUNT`, `LOW_CARRIER_COUNT`, `IMBALANCED_COHORT`

These are triggered by thresholds in `AssociationConfig`:

| Warning | Default Threshold | Config Key |
|---------|------------------|------------|
| Total cases below threshold | 200 | `min_cases` |
| Per-gene case carriers below threshold | 10 | `min_case_carriers` |
| Control-to-case ratio exceeds threshold | 20.0 | `max_case_control_ratio` |

Adjust thresholds in JSON config if your study design intentionally has fewer cases. The analysis
still runs — these are warnings, not failures.

### Logistic Burden Convergence Failure

**Warning:** `FIRTH_CONVERGE_FAIL` in the `warnings` column, `logistic_burden_pvalue` = `None`

**What happened:** Both standard logistic regression and Firth penalised likelihood failed to
converge for this gene. Common cause: extreme imbalance (all carriers are cases), tiny carrier
count, or collinear covariates.

**Solutions:**
- Check `coast_n_ptv`/`coast_n_dmv`/`coast_n_bmv` (or `n_variants`) for the gene — very few
  carriers cannot support regression
- Remove highly correlated covariates
- Use Fisher's exact test for genes with very few carriers (it does not converge-fail)

### Convergence warnings `PERFECT_SEPARATION` or `QUASI_SEPARATION`

These appear when all (or nearly all) carriers in a gene are cases or controls. The Firth
fallback handles this automatically. If Firth also fails (`FIRTH_CONVERGE_FAIL`), the p-value
is set to `None` and the gene is excluded from ACAT-O.

### lambda_GC Inflation (lambda_GC > 1.05)

**In `diagnostics/lambda_gc.tsv`:** `lambda_gc` > 1.05

**Possible causes and actions:**

1. **Population stratification** (most common): Add PCA components via `--pca-file`
2. **Cryptic relatedness**: Verify your cohort does not include related individuals in
   case and control groups
3. **True polygenic signal**: If the trait is highly polygenic, mild inflation is expected
4. **Batch effects**: Add batch indicator to covariates via `--covariates` and
   `--categorical-covariates`

If `n_tests < 100`, the lambda_GC estimate is flagged as unreliable — do not over-interpret it.

### COAST Fails: `NO_CLASSIFIABLE_VARIANTS`

**In output:** `coast_pvalue` = `None`, and `coast_skip_reason` extra key contains `NO_CLASSIFIABLE_VARIANTS`

**What happened:** All variants in this gene received classification code 0 — they are missense
variants without SIFT or PolyPhen predictions. COAST requires at least one BMV, DMV, and PTV.

**Solutions:**
- Annotate your VCF with dbNSFP (provides `dbNSFP_SIFT_pred` and `dbNSFP_Polyphen2_HDIV_pred`)
- COAST requires PTVs — if a gene has only missense variants, COAST is not applicable
- Use SKAT or logistic burden instead

Check which annotation columns are present:
```bash
bcftools view -h input.vcf.gz | grep "##INFO" | grep -E "SIFT|Polyphen|PolyPhen"
```

### Sample ID Mismatch

**Error:** `ValueError` during covariate or PCA loading, mentioning sample IDs

**What happened:** Sample IDs in the covariate/PCA file do not match VCF sample IDs exactly
(comparison is case-sensitive).

**Fix:** Get the exact VCF sample IDs and use them in the first column of your covariate file:

```bash
bcftools query -l input.vcf.gz > vcf_samples.txt
head vcf_samples.txt
```

Ensure the first column of `covariates.tsv` matches these names exactly.

### ACAT-O Equals Single Test P-value

**What you see:** `acat_o_pvalue` is identical to `fisher_pvalue` for all genes.

**This is correct behavior.** When only one test is active, the Cauchy combination of a single
p-value returns that p-value unchanged (pass-through). FDR correction still runs across genes.

To get a true omnibus combination, enable multiple tests:
```bash
--association-tests fisher,logistic_burden,skat
```

### Missing ACAT-O Columns

ACAT-O columns are always present in the output when `--perform-association` is active. If you
do not see them, check that the output file is `results.association.tsv` (not `results.tsv`
itself — the association results go to a separate `.association.tsv` file).

---

## Comprehensive Example

The following command runs Fisher, logistic burden, SKAT-O, and COAST together with
covariate adjustment, PCA, functional weights, gene-level FDR weighting, and diagnostics:

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file input.vcf.gz \
  --phenotype-file phenotypes.tsv \
  --phenotype-sample-column sample_id \
  --phenotype-value-column case_control \
  --perform-association \
  --association-tests fisher,logistic_burden,skat,coast \
  --trait-type binary \
  --covariate-file covariates.tsv \
  --covariates age,sex,batch \
  --categorical-covariates sex,batch \
  --pca-file ancestry.eigenvec \
  --pca-components 10 \
  --variant-weights beta:1,25 \
  --coast-weights 1,2,3 \
  --gene-prior-weights gene_weights.tsv \
  --diagnostics-output diagnostics/ \
  --output-file results.tsv
```

Expected output files:

```
results.association.tsv       # Gene-level association results (includes fdr_weight column)
diagnostics/
  lambda_gc.tsv               # Genomic inflation factors
  qq_data.tsv                 # QQ plot data
  summary.txt                 # Human-readable summary
  qq_plot.png                 # QQ plot (if matplotlib installed)
  fdr_weight_diagnostics.tsv  # Per-gene weight impact analysis (when --gene-prior-weights set)
```

Representative output (first 3 columns omitted for brevity):

```
gene    n_cases  n_controls  n_variants  fisher_pvalue  ...  acat_o_qvalue  warnings
GENE1   312      488         7           0.00031             0.0047
GENE2   312      488         3           0.41                1.0
GENE3   312      488         12          8.2e-06             1.2e-04             LOW_CARRIER_COUNT
```