Usage Guide¶
Basic Usage¶
The most basic command to run VariantCentrifuge:
variantcentrifuge \
--gene-name BICC1 \
--vcf-file path/to/your.vcf \
--output-file output.tsv
Pipeline Flow¶
VariantCentrifuge processes variants through a multi-stage pipeline. The stages executed depend on the options you provide:
flowchart TD
VCF["VCF File"] --> BED["Gene BED Creation"]
BED --> PRE{"bcftools prefilter?"}
PRE -->|yes| BCF["bcftools Prefilter"]
PRE -->|no| EXT["Variant Extraction"]
BCF --> EXT
EXT --> FILT["SnpSift Filter"]
FILT --> SPLIT{"Split snpEff lines?"}
SPLIT -->|yes| SPL["Annotation Splitter"]
SPLIT -->|no| FE["Field Extraction"]
SPL --> FE
FE --> SORT["Data Sorting"]
SORT --> GT{"Genotype replacement?"}
GT -->|yes| REP["Genotype Replacement"]
GT -->|no| DF["DataFrame Loading"]
REP --> DF
DF --> INH{"Inheritance analysis?"}
INH -->|yes| PED["Inheritance Analysis"]
INH -->|no| ANN["Custom Annotations"]
PED --> ANN
ANN --> SCR{"Scoring model?"}
SCR -->|yes| SCORE["Variant Scoring"]
SCR -->|no| STAT["Statistics"]
SCORE --> STAT
STAT --> GB{"Gene burden?"}
GB -->|yes| BURDEN["Gene Burden Analysis"]
GB -->|no| FF{"Final filter?"}
BURDEN --> FF
FF -->|yes| FFILT["Final Filter"]
FF -->|no| PSE{"Pseudonymize?"}
FFILT --> PSE
PSE -->|yes| PSEUDO["Pseudonymization"]
PSE -->|no| OUT["TSV Output"]
PSEUDO --> OUT
OUT --> XLSX{"Excel?"}
OUT --> HTML{"HTML report?"}
OUT --> IGV{"IGV report?"}
XLSX -->|yes| XL["Excel Report"]
HTML -->|yes| HR["HTML Report"]
IGV -->|yes| IR["IGV Report"]
style VCF fill:#4a90d9,color:#fff
style OUT fill:#2ecc71,color:#fff
style XL fill:#2ecc71,color:#fff
style HR fill:#2ecc71,color:#fff
style IR fill:#2ecc71,color:#fff
Command Line Options¶
General Options¶
| Flag | Default | Description |
|---|---|---|
| `--version` | — | Show version and exit |
|  |  | Logging level: |
|  | — | Write logs to file in addition to stderr |
|  | — | Path to JSON configuration file |
Core Input/Output¶
| Flag | Default | Description |
|---|---|---|
| `--vcf-file` | required | Input VCF file (can be gzip-compressed) |
| `--output-file` | — | Output TSV file path, or |
| `--output-dir` |  | Directory for intermediate and final output |
| `--xlsx` |  | Also produce Excel output |
|  |  | Keep all intermediate files after run |
|  |  | Create timestamped |
Gene Selection¶
| Flag | Default | Description |
|---|---|---|
| `--gene-name` | — | Gene name(s) (space-separated string) |
| `--gene-file` | — | File with gene names, one per line |
|  | — | Comma-separated transcript IDs to filter for (e.g., |
|  | — | File with transcript IDs, one per line |
Filtering Options¶
| Flag | Default | Description |
|---|---|---|
|  | from config | snpEff reference database (e.g., |
|  | from config | SnpSift filter expression |
| `--bcftools-prefilter` | — | bcftools expression for early variant pre-filtering during extraction |
| `--preset` | — | Apply named filter preset from config (repeatable, combined with AND). Use commas within a single |
|  |  | Field profile for annotation database compatibility (e.g., |
|  | — | List available field profiles and exit |
|  |  | Apply SnpSift filters after scoring (allows filtering on computed scores) |
| `--final-filter` | — | Pandas `query`-style filter applied to the final output table |
|  | — | Split multi-annotation lines: |
Tumor-Normal Filtering¶
These flags configure somatic variant presets (`somatic`, `loh`, `tumor_only`) by expanding template variables:
| Flag | Default | Description |
|---|---|---|
| `--tumor-sample-index` |  | 0-based index of tumor sample in VCF |
| `--normal-sample-index` |  | 0-based index of normal sample in VCF |
| `--tumor-dp-min` |  | Minimum read depth for tumor sample |
| `--normal-dp-min` |  | Minimum read depth for normal sample |
| `--tumor-af-min` |  | Minimum allele frequency for tumor sample |
| `--normal-af-max` |  | Maximum allele frequency for normal sample |
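Conceptually, these thresholds combine into a single pass/fail check per variant: adequate depth in both samples, clear evidence in the tumor, and near-absence of the allele in the normal. The sketch below is illustrative only; the cutoff values shown are placeholders, not the tool's defaults:

```python
def passes_somatic_thresholds(
    tumor_dp: int,
    normal_dp: int,
    tumor_af: float,
    normal_af: float,
    tumor_dp_min: int = 30,      # placeholder cutoffs, not the tool's defaults
    normal_dp_min: int = 20,
    tumor_af_min: float = 0.05,
    normal_af_max: float = 0.02,
) -> bool:
    """Require depth in both samples, allele evidence in the tumor,
    and near-absence of the allele in the normal sample."""
    return (
        tumor_dp >= tumor_dp_min
        and normal_dp >= normal_dp_min
        and tumor_af >= tumor_af_min
        and normal_af <= normal_af_max
    )
```

A variant with tumor AF 0.10 and normal AF 0.00 at good depth passes; a germline variant with similar AF in both samples fails the normal-AF check.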
VCF Annotation Inspection¶
| Flag | Default | Description |
|---|---|---|
|  |  | Print VCF INFO/FORMAT fields and exit |
|  | — | Filter annotation output by substring (case-insensitive) |
|  |  | Output format: |
|  |  | Show only INFO fields (mutually exclusive with |
|  |  | Show only FORMAT fields (mutually exclusive with |
Field Extraction & Formatting¶
| Flag | Default | Description |
|---|---|---|
|  | from config | Fields to extract with SnpSift extractFields |
|  |  | Skip genotype replacement step |
|  | — | Add blank column(s) to final output (repeatable) |
|  |  | Disable adding URL link columns |
|  |  | Add |
|  | — | Strip this substring from all sample names |
Genotype Analysis¶
| Flag | Default | Description |
|---|---|---|
|  | — | Genotype filter: |
|  | — | TSV with per-gene genotype rules (columns: |
|  | — | Append extra FORMAT fields (e.g., |
|  |  | Delimiter between genotype and extra fields |
Inheritance Analysis¶
| Flag | Default | Description |
|---|---|---|
| `--ped` | — | PED file defining family structure for inheritance analysis |
| `--inheritance-mode` |  | Output format: |
|  |  | Use original compound het implementation instead of vectorized (10-50x slower) |
Inheritance analysis is triggered when `--ped` or `--inheritance-mode` is specified. Without `--ped`, all samples are treated as affected singletons.
Supported inheritance patterns: de novo, autosomal dominant (AD), autosomal recessive (AR), X-linked recessive (XLR), X-linked dominant (XLD), compound heterozygous, mitochondrial.
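To illustrate what one of these patterns means at the genotype level: a de novo candidate requires an alternate allele in the affected child that is absent from both parents. The sketch below is a simplification, not VariantCentrifuge's actual implementation (which also handles missing calls, quality, and X-linked special cases):

```python
def is_de_novo_candidate(child_gt: str, father_gt: str, mother_gt: str) -> bool:
    """Naive de novo check on VCF GT strings like '0/1' or '1|1'.
    Ignores genotype quality, missing parental calls, and phasing subtleties."""
    def carries_alt(gt: str) -> bool:
        # Treat '|' and '/' separators alike; '.' means a missing allele call.
        alleles = gt.replace("|", "/").split("/")
        return any(a not in ("0", ".") for a in alleles)

    return (
        carries_alt(child_gt)
        and not carries_alt(father_gt)
        and not carries_alt(mother_gt)
    )
```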
ClinVar PM5 Annotation¶
| Flag | Default | Description |
|---|---|---|
|  | — | Path to pre-built PM5 lookup table. Adds |
Phenotype & Sample Groups¶
| Flag | Default | Description |
|---|---|---|
| `--phenotype-file` | — | Path to phenotype file ( |
| `--phenotype-sample-column` | — | Column name for sample IDs in phenotype file |
| `--phenotype-value-column` | — | Column name for phenotype values |
|  | — | Comma-separated HPO terms defining case group |
|  | — | Comma-separated HPO terms defining control group |
|  | — | File with HPO terms for case group |
|  | — | File with HPO terms for control group |
|  | — | Comma-separated sample IDs for case group |
|  | — | Comma-separated sample IDs for control group |
|  | — | File with sample IDs for case group |
|  | — | File with sample IDs for control group |
Statistical Analysis¶
| Flag | Default | Description |
|---|---|---|
| `--perform-gene-burden` |  | Run gene burden analysis |
|  |  | Gene burden mode: |
|  |  | Multiple testing correction: |
|  |  | Skip statistics computation step |
|  | — | File to write analysis statistics |
|  | — | Path to custom statistics configuration JSON |
Association Testing¶
For modular rare variant association testing beyond the basic Fisher's exact test in `--perform-gene-burden`:
| Flag | Default | Description |
|---|---|---|
|  |  | Run the modular association testing framework |
|  |  | Comma-separated tests: |
|  |  | SKAT backend: |
|  |  | COAST backend: |
|  | — | TSV/CSV covariate file (first column = sample ID, header required) |
|  | — | Comma-separated covariate column names to include (default: all columns) |
|  | — | Comma-separated columns to treat as categorical (auto-detected otherwise) |
|  |  | Trait type for burden tests: |
|  | — | PCA file (PLINK |
|  | — | Set to |
|  |  | Number of PCA components to include as covariates |
|  |  | Variant weight scheme: |
|  | — | JSON string of weight scheme parameters (e.g., |
|  | — | COAST category weights as comma-separated floats (BMV,DMV,PTV) |
|  | — | Directory for diagnostics output: |
For detailed usage, test selection guidance, and examples, see the Association Testing Guide.
Scoring & Custom Annotations¶
| Flag | Default | Description |
|---|---|---|
| `--scoring-config-path` | — | Directory containing scoring model ( |
| `--annotate-bed` | — | BED file for region annotation (repeatable) |
| `--annotate-gene-list` | — | Gene list file — adds yes/no column per file (repeatable) |
| `--annotate-json-genes` | — | JSON gene data file (repeatable, requires `--json-gene-mapping`) |
| `--json-gene-mapping` | — | JSON string mapping fields: |
|  |  | Add each dataField as its own column instead of |
Reporting & Visualization¶
| Flag | Default | Description |
|---|---|---|
| `--html-report` |  | Generate interactive HTML report with filtering, charts, and summary dashboard |
| `--igv` |  | Enable IGV.js integration (requires `--bam-mapping-file`) |
| `--bam-mapping-file` | — | TSV/CSV mapping sample IDs to BAM files |
| `--igv-reference` | — | Genome reference for IGV (e.g., |
|  | — | Local FASTA file for IGV (overrides `--igv-reference`) |
|  | — | Ideogram file for chromosome visualization |
|  |  | Flanking region in bp for IGV reports |
Performance & Processing¶
| Flag | Default | Description |
|---|---|---|
| `--threads` |  | CPU cores for parallel processing ( |
|  |  | Disable chunked processing (may cause memory issues on large files) |
|  |  | Force chunked processing even for small files |
|  |  | Memory for external sort (e.g., |
|  |  | Parallel threads for sorting |
|  |  | Method: |
| `--max-memory-gb` | auto-detected | Maximum memory in GB for processing |
|  |  | Force inheritance analysis even if exceeding memory limits |
|  |  | Column creation: |
|  |  | Fraction of allocated memory to use (0–1) |
|  |  | Fraction of safe memory for inheritance analysis |
Checkpoint & Resume¶
| Flag | Default | Description |
|---|---|---|
| `--enable-checkpoint` |  | Enable checkpoint tracking for pipeline state |
| `--resume` |  | Resume from last successful checkpoint |
|  | — | Restart from a specific pipeline stage |
|  |  | Use SHA256 checksums for file validation (recommended for production) |
| `--show-checkpoint-status` | — | Show checkpoint status and exit |
|  | — | List all available stages for current configuration and exit |
|  | — | List completed stages from checkpoint file and exit |
|  | — | Interactively select resume point |
See the Resume System documentation for details.
Data Privacy¶
| Flag | Default | Description |
|---|---|---|
| `--pseudonymize` |  | Enable sample pseudonymization |
|  |  | Schema: |
|  |  | Prefix for sequential schema |
|  | — | Custom pattern for |
|  |  | Metadata field for categorical schema |
|  | — | Path to save mapping table (required with `--pseudonymize`) |
|  |  | Also create pseudonymized PED file |
See Privacy and Pseudonymization for details.
Miscellaneous¶
| Flag | Default | Description |
|---|---|---|
|  |  | Compress intermediate TSV files (fast level-1 gzip) |
|  | — | Disable intermediate compression |
Examples¶
Basic Gene Analysis¶
variantcentrifuge \
--gene-name BRCA1 \
--vcf-file samples.vcf.gz \
--output-file brca1_variants.tsv
Filtered Analysis with Presets¶
variantcentrifuge \
--gene-file cancer_genes.txt \
--vcf-file samples.vcf.gz \
--preset rare,coding \
--html-report \
--xlsx \
--output-file cancer_variants.tsv
Family Trio with Inheritance Analysis¶
variantcentrifuge \
--gene-file disease_genes.txt \
--vcf-file trio.vcf.gz \
--ped family.ped \
--inheritance-mode columns \
--preset rare,coding \
--html-report \
--output-file trio_analysis.tsv
Comprehensive Analysis with Reports¶
variantcentrifuge \
--gene-name BRCA1 \
--vcf-file samples.vcf.gz \
--phenotype-file patient_data.tsv \
--phenotype-sample-column "sample_id" \
--phenotype-value-column "disease_status" \
--perform-gene-burden \
--html-report \
--xlsx \
--output-file brca1_analysis.tsv
IGV Integration¶
variantcentrifuge \
--gene-name TP53 \
--vcf-file samples.vcf.gz \
--igv \
--bam-mapping-file bam_files.tsv \
--igv-reference hg38 \
--html-report \
--output-file tp53_variants.tsv
Variant Scoring¶
variantcentrifuge \
--gene-file kidney_genes.txt \
--vcf-file patient.vcf.gz \
--scoring-config-path scoring/nephro_candidate_score \
--preset rare,coding \
--html-report \
--output-file scored_variants.tsv
Tumor-Normal Somatic Analysis¶
variantcentrifuge \
--gene-file oncogenes.txt \
--vcf-file tumor_normal.vcf.gz \
--preset somatic,coding \
--tumor-sample-index 1 \
--normal-sample-index 0 \
--tumor-dp-min 30 \
--tumor-af-min 0.05 \
--normal-af-max 0.02 \
--html-report \
--output-file somatic_variants.tsv
Custom Annotations¶
# Annotate with JSON gene information
variantcentrifuge \
--gene-name BRCA1 \
--vcf-file samples.vcf.gz \
--annotate-json-genes gene_metadata.json \
--json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","inheritance","function"]}' \
--output-file annotated_variants.tsv
# Multiple annotation sources
variantcentrifuge \
--gene-file cancer_genes.txt \
--vcf-file samples.vcf.gz \
--annotate-bed cancer_hotspots.bed \
--annotate-gene-list actionable_genes.txt \
--annotate-json-genes gene_panels.json \
--json-gene-mapping '{"identifier":"symbol","dataFields":["panel_name","evidence_level"]}' \
--html-report \
--output-file multi_annotated.tsv
Advanced Filtering¶
# Pre-filter with bcftools for performance
variantcentrifuge \
--gene-file large_gene_list.txt \
--vcf-file large_cohort.vcf.gz \
--bcftools-prefilter 'FILTER="PASS" && INFO/AC<10' \
--preset rare,coding \
--output-file filtered_variants.tsv
# Final filter using pandas query syntax
variantcentrifuge \
--gene-file cancer_genes.txt \
--vcf-file samples.vcf.gz \
--preset rare,coding \
--scoring-config-path scoring/nephro_candidate_score \
--final-filter 'nephro_candidate_score > 5 and IMPACT == "HIGH"' \
--output-file high_priority_variants.tsv
# Filter on inheritance patterns
variantcentrifuge \
--gene-file disease_genes.txt \
--vcf-file trio.vcf.gz \
--ped family.ped \
--inheritance-mode columns \
--final-filter 'Inheritance_Pattern in ["de_novo", "compound_heterozygous"]' \
--output-file denovo_and_compound_het.tsv
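Since `--final-filter` uses pandas query syntax, the expressions above behave like `DataFrame.query` applied to the final variant table. A toy illustration (the column values here are made up):

```python
import pandas as pd

# Miniature stand-in for the final output table.
variants = pd.DataFrame({
    "nephro_candidate_score": [7.2, 3.1, 9.0],
    "IMPACT": ["HIGH", "HIGH", "MODERATE"],
    "Inheritance_Pattern": ["de_novo", "autosomal_dominant", "compound_heterozygous"],
})

# Equivalent of --final-filter 'nephro_candidate_score > 5 and IMPACT == "HIGH"'
high_priority = variants.query('nephro_candidate_score > 5 and IMPACT == "HIGH"')

# Equivalent of --final-filter 'Inheritance_Pattern in ["de_novo", "compound_heterozygous"]'
selected = variants.query('Inheritance_Pattern in ["de_novo", "compound_heterozygous"]')
```

Only the first row survives the score/impact filter (score 9.0 has `MODERATE` impact), and the `in` filter keeps the first and third rows.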
Checkpoint and Resume¶
# Run large analysis with checkpoint tracking
variantcentrifuge \
--gene-file all_protein_coding_genes.txt \
--vcf-file large_cohort.vcf.gz \
--enable-checkpoint \
--threads 16 \
--preset rare,coding \
--html-report \
--output-file all_genes_analysis.tsv
# If interrupted, resume from last checkpoint
variantcentrifuge \
--gene-file all_protein_coding_genes.txt \
--vcf-file large_cohort.vcf.gz \
--enable-checkpoint \
--resume \
--threads 16 \
--preset rare,coding \
--html-report \
--output-file all_genes_analysis.tsv
# Check status of a previous run
variantcentrifuge \
--show-checkpoint-status \
--output-dir previous_analysis/
Docker¶
All examples above work inside the Docker container by mounting your data directory:
docker run --rm -v ./data:/data \
ghcr.io/scholl-lab/variantcentrifuge:latest \
--gene-name BRCA1 \
--vcf-file /data/input.vcf.gz \
--preset rare,coding \
--html-report \
--output-file /data/output.tsv
See the Installation Guide for setup details.
Input File Formats¶
VCF Files¶
Standard VCF format (v4.0 or later)
Can be compressed with gzip (.vcf.gz)
Should be annotated with snpEff for optimal functionality
Gene Files¶
Text file with one gene name per line:
BRCA1
BRCA2
TP53
ATM
PED Files¶
Standard PLINK pedigree format (tab-separated, 6 columns):
#Family Individual Father Mother Sex Affected
FAM001 proband father mother 1 2
FAM001 father 0 0 1 1
FAM001 mother 0 0 2 1
Sex: 1=male, 2=female, 0=unknown
Affected: 1=unaffected, 2=affected, 0=unknown
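Given this layout, a PED file can be read in a few lines of Python. This sketch (not part of VariantCentrifuge) shows how the six columns and the sex/affected codes map to records:

```python
SEX = {"1": "male", "2": "female", "0": "unknown"}
STATUS = {"1": "unaffected", "2": "affected", "0": "unknown"}

def parse_ped(path: str) -> list:
    """Read a 6-column PLINK PED file, skipping comments and blank lines."""
    records = []
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            fam, ind, father, mother, sex, affected = line.split()[:6]
            records.append({
                "family": fam,
                "individual": ind,
                "father": None if father == "0" else father,   # '0' means founder
                "mother": None if mother == "0" else mother,
                "sex": SEX.get(sex, "unknown"),
                "status": STATUS.get(affected, "unknown"),
            })
    return records
```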
Sample Mapping Files¶
Tab-separated file for genotype replacement:
original_id new_id
sample_001 Patient_A
sample_002 Patient_B
sample_003 Control_001
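Applying such a mapping amounts to a dictionary lookup, with IDs that have no entry passing through unchanged. A minimal sketch of the idea (not the tool's internal code):

```python
import csv

def load_sample_mapping(path: str) -> dict:
    """Read a two-column TSV (original_id, new_id) into a dict, skipping the header row."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    return {original: new for original, new in rows[1:]}

def rename_samples(sample_ids: list, mapping: dict) -> list:
    """Replace known sample IDs; unknown IDs are left untouched."""
    return [mapping.get(s, s) for s in sample_ids]
```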
Phenotype Files¶
Tab or comma-separated file with sample information:
sample_id disease_status age sex
Patient_A case 45 F
Patient_B case 52 M
Control_001 control 48 F
BAM Mapping Files¶
For IGV integration, provide a mapping from sample IDs to BAM file paths:
sample_id bam_path
Patient_A /path/to/patient_a.bam
Patient_B /path/to/patient_b.bam
Control_001 /path/to/control_001.bam
JSON Gene Files¶
For gene annotation, provide a JSON file containing an array of gene objects:
[
{
"gene_symbol": "BRCA1",
"panel": "HereditaryCancer",
"inheritance": "AD",
"function": "DNA repair"
},
{
"gene_symbol": "TP53",
"panel": "HereditaryCancer",
"inheritance": "AD",
"function": "Tumor suppressor"
}
]
The --json-gene-mapping parameter specifies:
`identifier` — the field containing the gene symbol (e.g., "gene_symbol")
`dataFields` — array of fields to include as annotations (e.g., ["panel", "inheritance", "function"])
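A quick way to sanity-check that a gene JSON file matches the mapping you pass via `--json-gene-mapping` is to verify every record carries the identifier and data fields. This is an ad-hoc helper, not part of the tool:

```python
import json

def check_gene_json(json_path: str, mapping: dict) -> list:
    """Return a list of problems: records missing the identifier or any dataField."""
    with open(json_path) as fh:
        genes = json.load(fh)
    problems = []
    for i, record in enumerate(genes):
        if mapping["identifier"] not in record:
            problems.append(f"record {i}: missing identifier field '{mapping['identifier']}'")
        for field in mapping["dataFields"]:
            if field not in record:
                problems.append(f"record {i}: missing data field '{field}'")
    return problems
```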
Output Files¶
Main Output¶
TSV file — Tab-separated variant table with extracted fields
XLSX file — Excel format (if `--xlsx` specified)
Metadata file — Analysis parameters and tool versions
Optional Outputs¶
HTML report — Interactive variant browser with filtering, summary dashboard, and charts (if `--html-report` specified)
IGV reports — Individual variant visualization (if `--igv` specified)
Gene burden results — Statistical analysis (if `--perform-gene-burden` specified)
Pseudonymization mapping — Sample ID mapping table (if `--pseudonymize` specified)
Configuration¶
See the Configuration Guide for detailed information about setting up configuration files and customizing VariantCentrifuge behavior.
Troubleshooting¶
Common Issues¶
No variants found:
Check that your VCF file contains variants in the specified gene regions
Verify gene names are correct and match your reference annotation
Review filter expressions — they may be too restrictive
External tool errors:
Ensure all required tools are installed and in PATH
Check that snpEff database matches your VCF reference
Verify file permissions and disk space
Memory issues:
Use `--bcftools-prefilter` to reduce data early
Try `--genotype-replacement-method chunked-vectorized`
Set `--max-memory-gb` to limit memory usage
Use `--threads 1` to reduce concurrent memory pressure
Getting Help¶
Use `variantcentrifuge --help` for command-line options
Check the API Reference for detailed function documentation
Report issues on GitHub