Annotation Strategies¶
This guide provides recommended strategies for annotating your VCF files before using VariantCentrifuge. Proper annotation is crucial for effective variant filtering and analysis.
Overview¶
VariantCentrifuge relies on annotated VCF fields to filter and analyze variants. The quality and comprehensiveness of your annotations directly impact the effectiveness of your analysis. This guide covers:
Recommended annotation tools and databases
Step-by-step annotation workflows
Database selection guidelines
Quality control and validation
Common pitfalls and solutions
Recommended Tools¶
Core Annotation Tools¶
SnpEff¶
Purpose: Functional effect prediction and gene annotation
Key Features:
Predicts variant effects (HIGH, MODERATE, LOW, MODIFIER impact)
Provides gene, transcript, and protein annotations
Supports multiple genome builds and species
Generates comprehensive ANN fields
Installation:
# Via conda/mamba
mamba install -c bioconda snpeff
# Download databases
snpEff download GRCh38.99 # or your preferred version
Basic Usage:
java -jar snpEff.jar GRCh38.99 input.vcf > annotated.vcf
SnpSift¶
Purpose: Database annotation and field extraction
Key Features:
Adds population frequency data (gnomAD, ExAC, 1000G)
Clinical significance annotations (ClinVar)
Pathogenicity predictions (CADD, REVEL, SIFT, PolyPhen)
Flexible field extraction and filtering
Installation:
# Usually bundled with SnpEff
mamba install -c bioconda snpsift
bcftools¶
Purpose: VCF manipulation and annotation
Key Features:
High-performance VCF processing
Custom annotation from BED/VCF files
Header manipulation and normalization
Variant normalization and filtering
Installation:
mamba install -c bioconda bcftools
Database Selection¶
Population Frequency Databases¶
gnomAD (Genome Aggregation Database)
Current version: v4.1 (recommended)
Coverage: 730,947 exomes and 76,215 genomes (v4.x, GRCh38); the older v2.1 release (GRCh37) contains 125,748 exomes and 15,708 genomes
Best for: General population frequency filtering
Fields: gnomAD_exomes_AF, gnomAD_genomes_AF, gnomAD_*_AC
ExAC (Exome Aggregation Consortium)
Status: Legacy (replaced by gnomAD)
Use only if: gnomAD unavailable for your analysis
1000 Genomes Project
Best for: Population-specific frequency data
Limitation: Smaller sample size than gnomAD
Clinical Databases¶
ClinVar
Purpose: Clinical significance annotations
Updated: Weekly, with archived monthly releases
Fields: ClinVar_CLNSIG, ClinVar_CLNDN
Critical for: Clinical variant interpretation
HGMD (Human Gene Mutation Database)
Purpose: Known disease mutations
License: Commercial (HGMD Professional); a less current public version is free for registered academic users
Best for: Known pathogenic variant identification
Pathogenicity Prediction¶
CADD (Combined Annotation Dependent Depletion)
Range: 0-99 (higher = more deleterious)
Threshold: ≥20 (top 1% most deleterious)
Best for: Genome-wide pathogenicity scoring
REVEL (Rare Exome Variant Ensemble Learner)
Range: 0-1 (higher = more pathogenic)
Threshold: ≥0.5 for likely pathogenic
Best for: Missense variant prediction
dbNSFP
Contains: Multiple prediction scores (SIFT, PolyPhen, CADD, REVEL, etc.)
Advantage: Comprehensive collection
Best for: Comparative scoring
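As a rough illustration of how the CADD and REVEL thresholds above can be combined, the following sketch flags variants in a tab-separated table. It assumes the columns CHROM, POS, REF, ALT, CADD_phred and REVEL_score in that order, with a header line and a single numeric value (or ".") per score. The helper name flag_deleterious and the FLAG labels are invented for this example and are not part of any tool mentioned here.

```shell
# flag_deleterious: hypothetical helper, not part of SnpSift or VariantCentrifuge.
# Reads a tab-separated table (header on line 1) with the columns
# CHROM POS REF ALT CADD_phred REVEL_score and appends a FLAG column.
flag_deleterious() {
    awk -F'\t' -v OFS='\t' -v cadd_min=20 -v revel_min=0.5 '
        NR == 1 { print $0, "FLAG"; next }
        {
            # Missing scores ("." or empty) are never treated as passing
            has_cadd  = ($5 != "." && $5 != "")
            has_revel = ($6 != "." && $6 != "")
            if (!has_cadd && !has_revel)
                flag = "no_score"
            else if ((has_cadd && $5 + 0 >= cadd_min) || (has_revel && $6 + 0 >= revel_min))
                flag = "candidate"
            else
                flag = "below_threshold"
            print $0, flag
        }
    '
}

# Usage (file name is a placeholder):
# flag_deleterious < scores.tsv
```

dbNSFP can report several values per variant, for example one per transcript. Such lists would need to be reduced to a single value before a check like this is meaningful.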
End-to-End Annotation Workflows¶
Workflow 1: Comprehensive Research Annotation¶
This workflow provides comprehensive annotation suitable for most research applications.
Step 1: Variant Normalization¶
# Normalize variants (split multi-allelic, left-align)
bcftools norm -m-both -f reference.fasta input.vcf.gz > normalized.vcf
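One quick way to confirm that the split worked is to count records that still list more than one ALT allele. The sketch below is a minimal awk check that assumes an uncompressed VCF on standard input; the helper name count_multiallelic is made up for illustration.

```shell
# count_multiallelic: hypothetical helper. Reads an uncompressed VCF on stdin
# and prints the number of data records whose ALT column lists several alleles.
count_multiallelic() {
    awk -F'\t' '
        /^#/ { next }          # skip header lines
        $5 ~ /,/ { n++ }       # a comma in ALT means a multi-allelic record
        END { print n + 0 }    # n + 0 prints 0 when nothing matched
    '
}

# Usage: a result of 0 suggests every record was split
# count_multiallelic < normalized.vcf
```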
Step 2: Functional Annotation with SnpEff¶
# Annotate with SnpEff
java -Xmx8g -jar snpEff.jar GRCh38.99 normalized.vcf > snpeff_annotated.vcf
# Alternative with custom options
java -Xmx8g -jar snpEff.jar -v -stats snpeff_summary.html GRCh38.99 normalized.vcf > snpeff_annotated.vcf
Step 3: Database Annotation with SnpSift¶
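# Add gnomAD frequencies (-name prefixes the copied INFO fields, e.g. AF -> gnomAD_exomes_AF)
java -jar SnpSift.jar annotate -name gnomAD_exomes_ gnomad.exomes.r4.0.sites.vcf.gz snpeff_annotated.vcf > gnomad_annotated.vcf
# Add ClinVar annotations (CLNSIG -> ClinVar_CLNSIG)
java -jar SnpSift.jar annotate -name ClinVar_ clinvar.vcf.gz gnomad_annotated.vcf > clinvar_annotated.vcf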
# Add dbNSFP predictions
java -jar SnpSift.jar dbnsfp -db dbNSFP4.1a.txt.gz clinvar_annotated.vcf > final_annotated.vcf
Step 4: Quality Control¶
# Check annotation completeness
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' final_annotated.vcf | head -10
# Generate annotation statistics
java -jar SnpSift.jar extractFields final_annotated.vcf CHROM POS REF ALT "ANN[0].IMPACT" "gnomAD_exomes_AF" > annotation_check.tsv
Workflow 2: Clinical Diagnostic Annotation¶
Optimized for clinical variant interpretation with emphasis on known pathogenic variants.
Step 1: Core Annotation¶
# SnpEff with clinical focus
java -Xmx8g -jar snpEff.jar -canon GRCh38.99 input.vcf > clinical_snpeff.vcf
Step 2: Clinical Database Priority¶
# ClinVar (highest priority for clinical interpretation); -name prefixes the copied fields, e.g. CLNSIG -> ClinVar_CLNSIG
java -jar SnpSift.jar annotate -name ClinVar_ -info "CLNSIG,CLNDN,CLNREVSTAT" clinvar.vcf.gz clinical_snpeff.vcf > clinical_clinvar.vcf
# HGMD (if available)
java -jar SnpSift.jar annotate hgmd.vcf.gz clinical_clinvar.vcf > clinical_hgmd.vcf
# gnomAD for population frequency (gnomAD v4 names the second field AF_grpmax instead of AF_popmax)
java -jar SnpSift.jar annotate -name gnomAD_exomes_ -info "AF,AF_popmax" gnomad.exomes.vcf.gz clinical_hgmd.vcf > clinical_annotated.vcf
Step 3: Pathogenicity Scores¶
# Add CADD and REVEL via dbNSFP
java -jar SnpSift.jar dbnsfp -f CADD_phred,REVEL_score,SIFT_score,Polyphen2_HDIV_score -db dbNSFP4.1a.txt.gz clinical_annotated.vcf > clinical_final.vcf
Workflow 3: Cancer Genomics Annotation¶
Focused on somatic variant annotation with cancer-relevant databases.
Step 1: Somatic-Specific Annotation¶
# SnpEff with all transcripts (cancer analysis often needs multiple isoforms, so -canon is omitted)
java -Xmx8g -jar snpEff.jar -v GRCh38.99 somatic.vcf > cancer_snpeff.vcf
Step 2: Cancer Databases¶
# COSMIC (Catalogue of Somatic Mutations in Cancer)
java -jar SnpSift.jar annotate cosmic.vcf.gz cancer_snpeff.vcf > cancer_cosmic.vcf
# OncoKB annotations (if available)
# Custom annotation scripts may be needed
# Population frequencies (for germline contamination assessment)
java -jar SnpSift.jar annotate gnomad.genomes.vcf.gz cancer_cosmic.vcf > cancer_annotated.vcf
VariantCentrifuge Configuration Examples¶
Example 1: Rare Disease Analysis¶
{
"reference": "GRCh38.99",
"filters": "",
"fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF gnomAD_genomes_AF ClinVar_CLNSIG GEN[*].GT",
"presets": {
"rare_pathogenic": "(((gnomAD_exomes_AF < 0.001) | (na gnomAD_exomes_AF)) & ((gnomAD_genomes_AF < 0.001) | (na gnomAD_genomes_AF))) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE') | (ClinVar_CLNSIG =~ '[Pp]athogenic'))"
}
}
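The frequency half of the rare_pathogenic preset keeps a variant when each gnomAD frequency is either below 0.001 or missing. The sketch below mirrors that logic in awk so it can be tried on a small extract. It assumes tab-separated lines of CHROM, POS, REF, ALT, gnomAD_exomes_AF and gnomAD_genomes_AF without a header; the helper name rare_in_gnomad is hypothetical.

```shell
# rare_in_gnomad: hypothetical helper mirroring the frequency clause of the
# "rare_pathogenic" preset. Input: tab-separated lines of
# CHROM POS REF ALT gnomAD_exomes_AF gnomAD_genomes_AF (no header).
# Prints only the lines that pass.
rare_in_gnomad() {
    awk -F'\t' -v max_af=0.001 '
        # A missing value ("." or empty) counts as rare, like the "na" test
        function rare(af) { return (af == "." || af == "" || af + 0 < max_af) }
        rare($5) && rare($6)
    '
}

# Usage (file name is a placeholder):
# rare_in_gnomad < frequencies.tsv
```

Treating missing frequencies as rare keeps variants that gnomAD never observed, which is usually the intent in rare disease work, but it also keeps variants that simply failed to annotate.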
Example 2: Population Genetics Study¶
{
"reference": "GRCh38.99",
"fields_to_extract": "CHROM POS REF ALT ID ANN[0].GENE ANN[0].IMPACT gnomAD_exomes_AF gnomAD_exomes_AC gnomAD_genomes_AF gnomAD_genomes_AC GEN[*].GT GEN[*].DP",
"presets": {
"coding_variants": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
"high_quality": "(GEN[*].DP >= 20) & (QUAL >= 30)"
}
}
Example 3: Clinical Exome Analysis¶
{
"reference": "GRCh38.99",
"fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].FEATUREID ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF ClinVar_CLNSIG ClinVar_CLNDN GEN[*].GT",
"presets": {
"clinical_significance": "(ClinVar_CLNSIG =~ '[Pp]athogenic') | (ClinVar_CLNSIG =~ 'Uncertain_significance')",
"rare_coding": "(((gnomAD_exomes_AF < 0.01) | (na gnomAD_exomes_AF)) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))"
}
}
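An unanchored pattern such as [Pp]athogenic also matches ClinVar values that merely contain the word, for example Conflicting_interpretations_of_pathogenicity. Whether those records should pass is a project decision. The sketch below shows a stricter alternative, tested here with grep -E only; the helper name is_pathogenic is hypothetical, and the pattern has not been verified inside a SnpSift expression.

```shell
# is_pathogenic: hypothetical helper. Succeeds when a ClinVar CLNSIG value
# contains "Pathogenic" or "Likely_pathogenic" as a whole term, where terms
# are separated by "/", "|" or ",".
is_pathogenic() {
    printf '%s\n' "$1" | grep -Eq '(^|[/|,])(Likely_)?[Pp]athogenic($|[/|,])'
}

# Show the verdict for a few typical CLNSIG values
for value in Pathogenic Pathogenic/Likely_pathogenic \
             Conflicting_interpretations_of_pathogenicity Uncertain_significance; do
    if is_pathogenic "$value"; then verdict="match"; else verdict="no match"; fi
    printf '%s\t%s\n' "$value" "$verdict"
done
```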
Quality Control and Validation¶
Annotation Completeness Check¶
# Check annotation rates (bcftools query prints "." for missing values)
bcftools query -f '%CHROM\t%POS\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' annotated.vcf | \
awk -F'\t' '{total++; if($3!=".") ann++; if($4!=".") gnomad++}
END{if(total==0){print "No variants found"; exit}
print "Total variants:", total; print "SnpEff annotated:", ann+0, "(" (ann/total*100) "%)"; print "gnomAD annotated:", gnomad+0, "(" (gnomad/total*100) "%)"}'
Validate Database Versions¶
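# Check tool and database versions recorded in the VCF header
# (SnpEff and SnpSift log their version and command line there)
bcftools view -h annotated.vcf | grep -iE '^##(SnpEff|SnpSift|INFO.*database)'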
Common Issues and Solutions¶
Issue: Low annotation rates¶
Symptoms: Many variants lack annotations
Causes: Reference genome mismatch, outdated databases
Solutions:
Verify genome build consistency (GRCh37 vs GRCh38)
Update annotation databases
Check chromosome naming (chr1 vs 1)
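A chromosome naming mismatch can be spotted without any special tooling. The sketch below is one possible check, assuming an uncompressed VCF on standard input; the helper name contig_style is invented for this example.

```shell
# contig_style: hypothetical helper. Reads an uncompressed VCF on stdin and
# reports whether data records use "chr"-prefixed contig names.
contig_style() {
    awk -F'\t' '
        /^#/ { next }                  # skip header lines
        $1 ~ /^chr/ { prefixed++; next }
        { bare++ }
        END {
            if (prefixed && bare) print "mixed"
            else if (prefixed)    print "chr-prefixed"
            else if (bare)        print "unprefixed"
            else                  print "empty"
        }
    '
}

# Usage: the two reports should agree before annotating
# contig_style < input.vcf
# zcat database.vcf.gz | contig_style
```

If the styles differ, bcftools annotate --rename-chrs with a two-column mapping file can bring one file in line with the other.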
Issue: Inconsistent population frequencies¶
Symptoms: Unexpected frequency distributions
Causes: Population stratification, database version differences
Solutions:
Use population-specific frequency data
Validate against known common variants
Check for ancestry-specific databases
Issue: Missing pathogenicity predictions¶
Symptoms: Empty CADD/REVEL scores
Causes: Variant type limitations, database coverage
Solutions:
Use multiple prediction tools
Consider variant type-specific predictors
Manual curation for critical variants
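Before changing tools, it can help to see whether the missing scores are concentrated in one variant class, which would point to predictor coverage rather than a pipeline problem (REVEL, for instance, is defined only for missense SNVs). The sketch below tallies missing values by class. It assumes tab-separated lines of CHROM, POS, REF, ALT and a single score column, with one ALT allele per line and no header; the helper name missing_by_type is hypothetical.

```shell
# missing_by_type: hypothetical helper. Input: tab-separated lines of
# CHROM POS REF ALT SCORE (no header, one ALT allele per line).
# Prints, per variant class, how many records lack a score.
missing_by_type() {
    awk -F'\t' '
        {
            # Single-base REF and ALT is an SNV; everything else is grouped together
            kind = (length($3) == 1 && length($4) == 1) ? "SNV" : "non-SNV"
            total[kind]++
            if ($5 == "." || $5 == "") missing[kind]++
        }
        END {
            n = split("SNV non-SNV", order, " ")
            for (i = 1; i <= n; i++) {
                k = order[i]
                if (total[k]) printf "%s\t%d/%d missing\n", k, missing[k], total[k]
            }
        }
    '
}

# Usage (file name is a placeholder):
# missing_by_type < revel_scores.tsv
```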
Performance Optimization¶
Memory Management¶
# Increase Java heap size for large VCFs
java -Xmx16g -jar snpEff.jar ...
# Use streaming processing for very large files
bcftools annotate -a database.vcf.gz ...
Parallel Processing¶
# Split VCF by chromosome for parallel annotation (input.vcf.gz must be indexed, e.g. with tabix)
for chr in {1..22} X Y; do
bcftools view -r chr${chr} input.vcf.gz | \
java -jar snpEff.jar GRCh38.99 /dev/stdin > chr${chr}_annotated.vcf &
done
wait
# Merge results and write bgzip-compressed output
bcftools concat chr*_annotated.vcf | bcftools sort -Oz -o final_annotated.vcf.gz -
Database Preparation¶
# Create tabix indexes for faster annotation
bgzip database.vcf
tabix -p vcf database.vcf.gz
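# Prepare custom annotation files: verify REF alleles against the reference first
# (--check-ref is an option of bcftools norm, not of bcftools annotate)
bcftools norm --check-ref e -f reference.fasta custom_annotations.vcf.gz > /dev/null
bcftools annotate -a custom_annotations.vcf.gz ...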
Custom Gene Annotations with VariantCentrifuge¶
In addition to standard VCF annotations, VariantCentrifuge provides built-in functionality to add custom annotations during analysis. These annotations are applied after variant extraction and are included in the final output.
JSON Gene Annotations¶
The --annotate-json-genes feature allows you to integrate structured gene metadata from JSON files directly into your variant analysis.
JSON File Format¶
Create a JSON file containing an array of gene objects:
[
{
"gene_symbol": "BRCA1",
"panel": "HereditaryCancer",
"inheritance": "AD",
"function": "DNA repair",
"disease_association": "Breast/Ovarian cancer",
"actionability": "Tier1"
},
{
"gene_symbol": "TP53",
"panel": "HereditaryCancer",
"inheritance": "AD",
"function": "Tumor suppressor",
"disease_association": "Li-Fraumeni syndrome",
"actionability": "Tier1"
},
{
"gene_symbol": "MLH1",
"panel": "Lynch",
"inheritance": "AD",
"function": "DNA mismatch repair",
"disease_association": "Lynch syndrome",
"actionability": "Tier1"
}
]
Field Mapping Configuration¶
Use the --json-gene-mapping parameter to specify how JSON fields map to annotations:
{
"identifier": "gene_symbol",
"dataFields": ["panel", "inheritance", "actionability"]
}
identifier: The JSON field containing the gene symbol
dataFields: Array of fields to include in the Custom_Annotation column
Usage Example¶
variantcentrifuge \
--gene-file cancer_genes.txt \
--vcf-file patient.vcf.gz \
--annotate-json-genes gene_metadata.json \
--json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","inheritance","actionability"]}' \
--output-file annotated_variants.tsv
This will add annotations like panel=HereditaryCancer;inheritance=AD;actionability=Tier1 to the Custom_Annotation column for variants in matching genes.
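Downstream scripts often need a single value out of that semicolon-separated string. The sketch below shows one way to pull it out with awk, assuming the key=value;key=value layout shown above; the helper name annotation_value is invented for this example and is not a VariantCentrifuge command.

```shell
# annotation_value: hypothetical helper. Prints the value stored under a key in
# a "key=value;key=value" string, or "." when the key is absent.
annotation_value() {
    printf '%s\n' "$1" | awk -F';' -v key="$2" '
        {
            for (i = 1; i <= NF; i++) {
                eq = index($i, "=")            # position of the first "="
                if (eq > 0 && substr($i, 1, eq - 1) == key) {
                    print substr($i, eq + 1)
                    found = 1
                    exit
                }
            }
        }
        END { if (!found) print "." }
    '
}

# Prints: AD
annotation_value 'panel=HereditaryCancer;inheritance=AD;actionability=Tier1' inheritance
```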
Adding JSON Annotations as Separate Columns¶
For a more structured output suitable for direct analysis, you can add each JSON data field as its own column instead of bundling them into the Custom_Annotation field. This is enabled with the --json-genes-as-columns flag.
Usage Example:
variantcentrifuge \
--gene-name GENE \
--vcf-file input.vcf.gz \
--annotate-json-genes gene_data.json \
--json-gene-mapping '{"identifier":"symbol","dataFields":["ngs","actionability"]}' \
--json-genes-as-columns \
--output-file variants_with_columns.tsv
This command will produce a TSV with two new columns, ngs and actionability, populated with the corresponding values from gene_data.json.
BED File Annotations¶
Annotate variants with genomic regions using BED files:
variantcentrifuge \
--gene-name GENE \
--vcf-file input.vcf.gz \
--annotate-bed hotspots.bed \
--annotate-bed regulatory_regions.bed \
--output-file output.tsv
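When preparing or spot-checking region files, keep in mind that BED intervals are 0-based and half-open while VCF positions are 1-based, so off-by-one errors are easy to introduce. The sketch below illustrates the coordinate rule only and is not how VariantCentrifuge implements --annotate-bed. It assumes a tab-separated BED file and tab-separated CHROM and POS on standard input; the helper name in_bed is hypothetical, and the linear scan is only suitable for small files.

```shell
# in_bed: hypothetical helper. First argument: a non-empty BED file.
# Stdin: tab-separated CHROM and POS (1-based, as in VCF).
# Appends the names of overlapping BED intervals, or "." when there is none.
in_bed() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                                   # first file: load BED intervals
            if ($0 ~ /^(#|track|browser)/) next       # skip BED header lines
            n++
            chrom[n] = $1 ""; start[n] = $2 + 0; stop[n] = $3 + 0
            label[n] = (NF >= 4 ? $4 : "region" n)    # fall back to a numbered label
            next
        }
        {
            hits = ""
            pos = $2 + 0
            for (i = 1; i <= n; i++)
                # BED is 0-based half-open, so POS overlaps when start < POS <= end
                if ($1 == chrom[i] && pos > start[i] && pos <= stop[i])
                    hits = hits (hits == "" ? "" : ",") label[i]
            print $0, (hits == "" ? "." : hits)
        }
    ' "$1" -
}

# Usage (file names are placeholders):
# cut -f1,2 variants.tsv | in_bed hotspots.bed
```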
Gene List Annotations¶
Check if variants affect genes in custom lists:
variantcentrifuge \
--gene-file all_genes.txt \
--vcf-file input.vcf.gz \
--annotate-gene-list actionable_genes.txt \
--annotate-gene-list drug_targets.txt \
--output-file output.tsv
Combined Annotation Strategy¶
For comprehensive analysis, combine multiple annotation sources:
variantcentrifuge \
--gene-file disease_genes.txt \
--vcf-file patient.vcf.gz \
--annotate-bed known_hotspots.bed \
--annotate-gene-list clinically_actionable.txt \
--annotate-json-genes gene_database.json \
--json-gene-mapping '{"identifier":"symbol","dataFields":["panel","evidence","notes"]}' \
--preset rare,coding \
--html-report \
--output-file comprehensive_analysis.tsv
Integration with Scoring¶
Custom annotations can be used in variant scoring formulas. For example, if you annotate with actionability levels, you can create scoring formulas that prioritize Tier1 actionable variants.
Best Practices Summary¶
Plan your annotation strategy based on your analysis goals
Use the latest database versions when possible
Validate annotation completeness before analysis
Document your annotation workflow for reproducibility
Test on small datasets before processing large cohorts
Keep database versions consistent across related analyses
Consider population-specific databases for diverse cohorts
Backup original VCF files before annotation
Monitor resource usage for large-scale annotations
Validate critical variants manually when needed
Use custom annotations to integrate project-specific gene metadata
Combine annotation sources for comprehensive variant characterization
By following these annotation strategies, you’ll ensure that VariantCentrifuge has access to high-quality, comprehensive variant annotations for effective filtering and analysis.