Annotation Strategies¶

This guide provides recommended strategies for annotating your VCF files before using VariantCentrifuge. Proper annotation is crucial for effective variant filtering and analysis.

Overview¶

VariantCentrifuge relies on annotated VCF fields to filter and analyze variants. The quality and comprehensiveness of your annotations directly impact the effectiveness of your analysis. This guide covers:

Recommended annotation tools and databases
Step-by-step annotation workflows
Database selection guidelines
Quality control and validation
Common pitfalls and solutions

Recommended Tools¶

Core Annotation Tools¶

SnpEff¶

Purpose: Functional effect prediction and gene annotation

Key Features:

Predicts variant effects (HIGH, MODERATE, LOW, MODIFIER impact)
Provides gene, transcript, and protein annotations
Supports multiple genome builds and species
Generates comprehensive ANN fields

Installation:

# Via conda/mamba
mamba install -c bioconda snpeff

# Download databases
snpEff download GRCh38.99  # or your preferred version

Basic Usage:

java -jar snpEff.jar GRCh38.99 input.vcf > annotated.vcf

SnpSift¶

Purpose: Database annotation and field extraction

Key Features:

Adds population frequency data (gnomAD, ExAC, 1000G)
Clinical significance annotations (ClinVar)
Pathogenicity predictions (CADD, REVEL, SIFT, PolyPhen)
Flexible field extraction and filtering

Installation:

# Usually bundled with SnpEff
mamba install -c bioconda snpsift

bcftools¶

Purpose: VCF manipulation and annotation

Key Features:

High-performance VCF processing
Custom annotation from BED/VCF files
Header manipulation and normalization
Variant normalization and filtering

Installation:

mamba install -c bioconda bcftools

Database Selection¶

Population Frequency Databases¶

gnomAD (Genome Aggregation Database)

Current version: v4.1 (recommended)
Coverage: 730,947 exomes and 76,215 genomes in v4.1 (the frequently quoted 125,748 exomes and 15,708 genomes are the v2.1 counts)
Best for: General population frequency filtering
Fields: gnomAD_exomes_AF, gnomAD_genomes_AF, gnomAD_*_AC

ExAC (Exome Aggregation Consortium)

Status: Legacy (replaced by gnomAD)
Use only if: gnomAD unavailable for your analysis

1000 Genomes Project

Best for: Population-specific frequency data
Limitation: Smaller sample size than gnomAD

Clinical Databases¶

ClinVar

Purpose: Clinical significance annotations
Updated: Weekly, with an archived monthly release
Fields: ClinVar_CLNSIG, ClinVar_CLNDN
Critical for: Clinical variant interpretation

HGMD (Human Gene Mutation Database)

Purpose: Known disease mutations
License: Commercial (HGMD Professional); a less current public version is free for registered academic and non-profit users
Best for: Known pathogenic variant identification

Pathogenicity Prediction¶

CADD (Combined Annotation Dependent Depletion)

Range: PHRED-scaled scores from 0 to 99 (higher = more deleterious)
Threshold: ≥20 (top 1% most deleterious); ≥30 corresponds to the top 0.1%
Best for: Genome-wide pathogenicity scoring
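
The PHRED scaling is what links the ≥20 threshold to the "top 1%" figure: a scaled score of s corresponds to roughly the top 10^(-s/10) fraction of ranked substitutions. A minimal Python sketch of that arithmetic follows; the function name is illustrative and not part of CADD or VariantCentrifuge.

```python
def cadd_phred_to_top_fraction(phred: float) -> float:
    """Approximate fraction of ranked variants at or above a PHRED-scaled CADD score."""
    return 10 ** (-phred / 10)


# Print the fraction implied by a few commonly used cut-offs
for score in (10, 20, 30):
    fraction = cadd_phred_to_top_fraction(score)
    print(f"CADD >= {score}: top {fraction:.1%} of ranked variants")
```

This can help when choosing a cut-off: each additional 10 points narrows the selection by another factor of ten.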

REVEL (Rare Exome Variant Ensemble Learner)

Range: 0-1 (higher = more pathogenic)
Threshold: ≥0.5 for likely pathogenic
Best for: Missense variant prediction

dbNSFP

Contains: Multiple prediction scores (SIFT, PolyPhen, CADD, REVEL, etc.)
Advantage: Comprehensive collection
Best for: Comparative scoring

End-to-End Annotation Workflows¶

Workflow 1: Comprehensive Research Annotation¶

This workflow provides comprehensive annotation suitable for most research applications.

Step 1: Variant Normalization¶

# Normalize variants (split multi-allelic, left-align)
bcftools norm -m-both -f reference.fasta input.vcf.gz > normalized.vcf
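
To make the effect of splitting concrete, here is a simplified Python sketch of what happens to the CHROM/POS/REF/ALT columns when a multi-allelic record is split. It is an illustration only: it does not left-align against the reference, and it ignores the INFO and FORMAT fields that bcftools also rewrites. The function name is hypothetical.

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Yield one (chrom, pos, ref, alt) tuple per ALT allele, trimming shared flanks."""
    for alt in alts.split(","):
        r, a, p = ref, alt, pos
        # Trim identical trailing bases, keeping at least one base per allele
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim identical leading bases, shifting the position accordingly
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        yield chrom, p, r, a


# One record with three ALT alleles becomes three biallelic records
for record in split_multiallelic("chr1", 100, "CTT", "C,CT,CTTT"):
    print(*record, sep="\t")
```

Splitting matters because database lookups match on the exact REF/ALT pair; an unsplit `C,CT,CTTT` record would not match a database entry that lists only one of those alleles.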

Step 2: Functional Annotation with SnpEff¶

# Annotate with SnpEff
java -Xmx8g -jar snpEff.jar GRCh38.99 normalized.vcf > snpeff_annotated.vcf

# Alternative with custom options
java -Xmx8g -jar snpEff.jar -v -stats snpeff_summary.html GRCh38.99 normalized.vcf > snpeff_annotated.vcf

Step 3: Database Annotation with SnpSift¶

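# Add gnomAD frequencies (-name prefixes the added INFO fields, giving gnomAD_exomes_AF and gnomAD_exomes_AC)
java -jar SnpSift.jar annotate -name gnomAD_exomes_ -info AF,AC gnomad.exomes.r4.0.sites.vcf.gz snpeff_annotated.vcf > gnomad_annotated.vcf

# Add ClinVar annotations (prefixed as ClinVar_CLNSIG and ClinVar_CLNDN)
java -jar SnpSift.jar annotate -name ClinVar_ -info CLNSIG,CLNDN clinvar.vcf.gz gnomad_annotated.vcf > clinvar_annotated.vcf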

# Add dbNSFP predictions
java -jar SnpSift.jar dbnsfp -db dbNSFP4.1a.txt.gz clinvar_annotated.vcf > final_annotated.vcf

Step 4: Quality Control¶

# Check annotation completeness
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' final_annotated.vcf | head -10

# Generate annotation statistics
java -jar SnpSift.jar extractFields final_annotated.vcf CHROM POS REF ALT "ANN[0].IMPACT" "gnomAD_exomes_AF" > annotation_check.tsv

Workflow 2: Clinical Diagnostic Annotation¶

Optimized for clinical variant interpretation with emphasis on known pathogenic variants.

Step 1: Core Annotation¶

# SnpEff with clinical focus
java -Xmx8g -jar snpEff.jar -canon GRCh38.99 input.vcf > clinical_snpeff.vcf

Step 2: Clinical Database Priority¶

# ClinVar (highest priority for clinical interpretation); -name yields ClinVar_CLNSIG etc.
java -jar SnpSift.jar annotate -name ClinVar_ -info "CLNSIG,CLNDN,CLNREVSTAT" clinvar.vcf.gz clinical_snpeff.vcf > clinical_clinvar.vcf

# HGMD (if available)
java -jar SnpSift.jar annotate hgmd.vcf.gz clinical_clinvar.vcf > clinical_hgmd.vcf

# gnomAD for population frequency (gnomAD v4 names the popmax field AF_grpmax; adjust -info to your release)
java -jar SnpSift.jar annotate -name gnomAD_exomes_ -info "AF,AF_popmax" gnomad.exomes.vcf.gz clinical_hgmd.vcf > clinical_annotated.vcf

Step 3: Pathogenicity Scores¶

# Add CADD and REVEL via dbNSFP
java -jar SnpSift.jar dbnsfp -f CADD_phred,REVEL_score,SIFT_score,Polyphen2_HDIV_score -db dbNSFP4.1a.txt.gz clinical_annotated.vcf > clinical_final.vcf

Workflow 3: Cancer Genomics Annotation¶

Focused on somatic variant annotation with cancer-relevant databases.

Step 1: Somatic-Specific Annotation¶

# SnpEff with all transcripts (cancer analysis often needs multiple isoforms, so -canon is omitted)
java -Xmx8g -jar snpEff.jar -v GRCh38.99 somatic.vcf > cancer_snpeff.vcf

Step 2: Cancer Databases¶

# COSMIC (Catalogue of Somatic Mutations in Cancer)
java -jar SnpSift.jar annotate cosmic.vcf.gz cancer_snpeff.vcf > cancer_cosmic.vcf

# OncoKB annotations (if available)
# Custom annotation scripts may be needed

# Population frequencies (for germline contamination assessment); -name yields gnomAD_genomes_AF etc.
java -jar SnpSift.jar annotate -name gnomAD_genomes_ gnomad.genomes.vcf.gz cancer_cosmic.vcf > cancer_annotated.vcf

VariantCentrifuge Configuration Examples¶

Example 1: Rare Disease Analysis¶

{
  "reference": "GRCh38.99",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF gnomAD_genomes_AF ClinVar_CLNSIG GEN[*].GT",
  "presets": {
    "rare_pathogenic": "(((gnomAD_exomes_AF < 0.001) | (na gnomAD_exomes_AF)) & ((gnomAD_genomes_AF < 0.001) | (na gnomAD_genomes_AF))) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE') | (ClinVar_CLNSIG =~ '[Pp]athogenic'))"
  }
}
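
The `(AF < 0.001) | (na AF)` pattern in the preset above keeps variants that are absent from gnomAD, which for rare disease work are often the most interesting ones. A minimal Python sketch of the same logic, assuming missing values arrive as `.`, an empty string, or `NA` in an extracted table (the helper names are illustrative):

```python
from typing import Optional


def parse_af(raw: str) -> Optional[float]:
    """Convert an extracted AF string to float; '.', '' and 'NA' count as missing."""
    return None if raw in (".", "", "NA") else float(raw)


def is_rare(af: Optional[float], threshold: float = 0.001) -> bool:
    """Treat a missing frequency as rare, mirroring `(AF < t) | (na AF)`."""
    return af is None or af < threshold


def is_rare_in_both(exomes_raw: str, genomes_raw: str, threshold: float = 0.001) -> bool:
    """A variant passes only if it is rare (or absent) in exomes AND genomes."""
    return is_rare(parse_af(exomes_raw), threshold) and is_rare(parse_af(genomes_raw), threshold)


print(is_rare_in_both(".", "0.0002"))   # absent from exomes, rare in genomes
print(is_rare_in_both("0.05", "."))     # common in exomes
```

Dropping the `na` clause would silently discard every variant that gnomAD has never observed, so it is worth keeping unless you specifically want database-confirmed variants only.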

Example 2: Population Genetics Study¶

{
  "reference": "GRCh38.99",
  "fields_to_extract": "CHROM POS REF ALT ID ANN[0].GENE ANN[0].IMPACT gnomAD_exomes_AF gnomAD_exomes_AC gnomAD_genomes_AF gnomAD_genomes_AC GEN[*].GT GEN[*].DP",
  "presets": {
    "coding_variants": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
    "high_quality": "(GEN[*].DP >= 20) & (QUAL >= 30)"
  }
}

Example 3: Clinical Exome Analysis¶

{
  "reference": "GRCh38.99",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].FEATUREID ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF ClinVar_CLNSIG ClinVar_CLNDN GEN[*].GT",
  "presets": {
    "clinical_significance": "(ClinVar_CLNSIG =~ '[Pp]athogenic') | (ClinVar_CLNSIG =~ 'Uncertain_significance')",
    "rare_coding": "(((gnomAD_exomes_AF < 0.01) | (na gnomAD_exomes_AF)) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))"
  }
}
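
One caveat about the `[Pp]athogenic` pattern used in these presets: as a substring match it also hits ClinVar's conflicting category, whose label ends in "pathogenicity". The Python sketch below demonstrates this with `re.search`; it assumes SnpSift's `=~` behaves as a substring-style regex match, which you may want to confirm on your own data before relying on the stricter pattern.

```python
import re

# A few CLNSIG values as they appear in ClinVar VCF releases
CLNSIG_VALUES = [
    "Pathogenic",
    "Likely_pathogenic",
    "Pathogenic/Likely_pathogenic",
    "Conflicting_classifications_of_pathogenicity",
    "Uncertain_significance",
    "Benign",
]

broad = re.compile(r"[Pp]athogenic")
# The negative lookahead rejects "pathogenicity" while keeping "pathogenic"
strict = re.compile(r"[Pp]athogenic(?!ity)")

broad_hits = [v for v in CLNSIG_VALUES if broad.search(v)]
strict_hits = [v for v in CLNSIG_VALUES if strict.search(v)]

print("broad :", broad_hits)
print("strict:", strict_hits)
```

Whether conflicting classifications should pass the filter is a project decision; the point is to make that choice deliberately.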

Quality Control and Validation¶

Annotation Completeness Check¶

# Check annotation rates
bcftools query -f '%CHROM\t%POS\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' annotated.vcf | \
awk 'BEGIN{total=0; ann=0; gnomad=0}
     {total++; if($3!="") ann++; if($4!="") gnomad++}
     END{print "Total variants:", total; print "SnpEff annotated:", ann, "(" ann/total*100 "%)"; print "gnomAD annotated:", gnomad, "(" gnomad/total*100 "%)"}'
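
If you prefer to run this check from Python, for example inside a notebook or a test suite, the same counting can be sketched as below. It assumes tab-separated `bcftools query` output in which a missing tag is written as `.`; the function name and the column mapping are illustrative.

```python
def annotation_rates(lines, columns):
    """Return {name: percent of rows with a non-missing value in that column}.

    `columns` maps a label to a 0-based column index in the tab-separated input.
    """
    total = 0
    present = {name: 0 for name in columns}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        total += 1
        for name, idx in columns.items():
            # "." is the missing-value placeholder; also guard against short rows
            if idx < len(fields) and fields[idx] not in (".", ""):
                present[name] += 1
    return {name: (100.0 * n / total if total else 0.0) for name, n in present.items()}


sample = [
    "chr1\t100\tA|missense_variant|MODERATE\t0.0001\n",
    "chr1\t200\tG|synonymous_variant|LOW\t.\n",
    "chr2\t300\t.\t.\n",
    "chr2\t400\tT|stop_gained|HIGH\t0.02\n",
]
print(annotation_rates(sample, {"SnpEff": 2, "gnomAD": 3}))
```

A low SnpEff rate usually points to a build or contig naming problem, whereas a gnomAD rate well below 100% is expected because novel variants are not in the database.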

Validate Database Versions¶

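# Check tool versions and database paths recorded in the VCF header
# (SnpEff and SnpSift write ##SnpEffVersion, ##SnpEffCmd, ##SnpSiftVersion and ##SnpSiftCmd lines)
bcftools view -h annotated.vcf | grep -E "^##(SnpEff|SnpSift)(Version|Cmd)"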

Common Issues and Solutions¶

Issue: Low annotation rates¶

Symptoms: Many variants lack annotations
Causes: Reference genome mismatch, outdated databases
Solutions:

Verify genome build consistency (GRCh37 vs GRCh38)
Update annotation databases
Check chromosome naming (chr1 vs 1)
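
Chromosome naming mismatches are easy to detect before annotation. The following Python sketch classifies the contig names found in the data lines of two uncompressed VCFs and reports whether they agree; the function names are illustrative, and for compressed files you would first obtain the names with a tool such as bcftools.

```python
def read_contigs(vcf_lines):
    """Collect distinct CHROM values from the data lines of an uncompressed VCF."""
    return {
        line.split("\t", 1)[0]
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }


def contig_style(names):
    """Classify contig names as 'chr' (chr1), 'plain' (1), 'mixed', or 'empty'."""
    names = set(names)
    if not names:
        return "empty"
    prefixed = sum(1 for name in names if name.startswith("chr"))
    if prefixed == len(names):
        return "chr"
    return "plain" if prefixed == 0 else "mixed"


sample_vcf = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\n", "chr1\t100\n", "chrX\t5\n"]
database_vcf = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\n", "1\t100\n", "X\t5\n"]

sample_style = contig_style(read_contigs(sample_vcf))
database_style = contig_style(read_contigs(database_vcf))
if sample_style != database_style:
    print(f"Naming mismatch: sample uses '{sample_style}', database uses '{database_style}'")
```

When the styles differ, renaming the contigs of one file (for instance with `bcftools annotate --rename-chrs`) before annotation typically restores the expected annotation rate.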

Issue: Inconsistent population frequencies¶

Symptoms: Unexpected frequency distributions
Causes: Population stratification, database version differences
Solutions:

Use population-specific frequency data
Validate against known common variants
Check for ancestry-specific databases

Issue: Missing pathogenicity predictions¶

Symptoms: Empty CADD/REVEL scores
Causes: Variant type limitations, database coverage
Solutions:

Use multiple prediction tools
Consider variant type-specific predictors
Manual curation for critical variants

Performance Optimization¶

Memory Management¶

# Increase Java heap size for large VCFs
java -Xmx16g -jar snpEff.jar ...

# Use streaming processing for very large files
bcftools annotate -a database.vcf.gz ...

Parallel Processing¶

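# Split VCF by chromosome for parallel annotation
# (input.vcf.gz must be bgzip-compressed and indexed for -r; this starts one JVM per chromosome)
for chr in {1..22} X Y; do
    bcftools view -r chr${chr} input.vcf.gz | \
    java -jar snpEff.jar GRCh38.99 /dev/stdin > chr${chr}_annotated.vcf &
done
wait

# Merge results (sort restores genomic order; -Oz writes the compressed output the .gz name implies)
bcftools concat chr*_annotated.vcf | bcftools sort -Oz -o final_annotated.vcf.gz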
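
Note that a shell glob such as `chr*_annotated.vcf` expands lexicographically (chr1, chr10, chr11, ...), which is why the merged file needs sorting. If you would rather pass the files in genomic order, a natural sort key can be sketched in Python as follows; the file naming pattern is taken from the loop above and the helper name is illustrative.

```python
import re

# Ranks for the non-numeric chromosomes, placed after the autosomes
SPECIAL = {"X": 23, "Y": 24, "M": 25, "MT": 25}


def chrom_sort_key(filename: str):
    """Sort key placing chr1..chr22 numerically, then X, Y and M/MT."""
    match = re.match(r"(?:chr)?([0-9]+|X|Y|MT?)(?=[_.])", filename)
    if not match:
        return (1, 0, filename)  # unrecognized names go last, alphabetically
    label = match.group(1)
    rank = int(label) if label.isdigit() else SPECIAL[label]
    return (0, rank, filename)


files = ["chr10_annotated.vcf", "chr2_annotated.vcf", "chrX_annotated.vcf", "chr1_annotated.vcf"]
ordered = sorted(files, key=chrom_sort_key)
print(" ".join(ordered))
```

The printed list can be passed directly as arguments to the concatenation step.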

Database Preparation¶

# Create tabix indexes for faster annotation
bgzip database.vcf
tabix -p vcf database.vcf.gz

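# Check REF alleles of custom annotation files before use
# (--check-ref is an option of bcftools norm, not bcftools annotate; "e" exits on the first mismatch)
bcftools norm --check-ref e -f reference.fasta custom_annotations.vcf.gz > /dev/null
bcftools annotate -a custom_annotations.vcf.gz ...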

Custom Gene Annotations with VariantCentrifuge¶

In addition to standard VCF annotations, VariantCentrifuge provides built-in functionality to add custom annotations during analysis. These annotations are applied after variant extraction and are included in the final output.

JSON Gene Annotations¶

The --annotate-json-genes feature allows you to integrate structured gene metadata from JSON files directly into your variant analysis.

JSON File Format¶

Create a JSON file containing an array of gene objects:

[
  {
    "gene_symbol": "BRCA1",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "DNA repair",
    "disease_association": "Breast/Ovarian cancer",
    "actionability": "Tier1"
  },
  {
    "gene_symbol": "TP53",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "Tumor suppressor",
    "disease_association": "Li-Fraumeni syndrome",
    "actionability": "Tier1"
  },
  {
    "gene_symbol": "MLH1",
    "panel": "Lynch",
    "inheritance": "AD",
    "function": "DNA mismatch repair",
    "disease_association": "Lynch syndrome",
    "actionability": "Tier1"
  }
]

Field Mapping Configuration¶

Use the --json-gene-mapping parameter to specify how JSON fields map to annotations:

{
  "identifier": "gene_symbol",
  "dataFields": ["panel", "inheritance", "actionability"]
}

identifier: The JSON field containing the gene symbol
dataFields: Array of fields to include in the Custom_Annotation column

Usage Example¶

variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file patient.vcf.gz \
  --annotate-json-genes gene_metadata.json \
  --json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","inheritance","actionability"]}' \
  --output-file annotated_variants.tsv

This will add annotations like panel=HereditaryCancer;inheritance=AD;actionability=Tier1 to the Custom_Annotation column for variants in matching genes.
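
To preview what the Custom_Annotation value will look like for a given gene, the mapping can be mimicked in a few lines of Python. This is a sketch of the documented output format only, not VariantCentrifuge's internal implementation; the function name is hypothetical.

```python
import json

gene_records = json.loads("""
[
  {"gene_symbol": "BRCA1", "panel": "HereditaryCancer", "inheritance": "AD",
   "function": "DNA repair", "actionability": "Tier1"},
  {"gene_symbol": "MLH1", "panel": "Lynch", "inheritance": "AD",
   "function": "DNA mismatch repair", "actionability": "Tier1"}
]
""")

mapping = {"identifier": "gene_symbol", "dataFields": ["panel", "inheritance", "actionability"]}


def build_custom_annotation(gene, records, mapping):
    """Return 'field=value;...' for the first record matching `gene`, else ''."""
    identifier = mapping["identifier"]
    for record in records:
        if record.get(identifier) == gene:
            # Only the fields listed in dataFields are emitted, in that order
            return ";".join(
                f"{field}={record[field]}" for field in mapping["dataFields"] if field in record
            )
    return ""


print(build_custom_annotation("BRCA1", gene_records, mapping))
```

Running a check like this on your JSON file before a full analysis is a quick way to catch a misspelled identifier or data field name.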

Adding JSON Annotations as Separate Columns¶

For a more structured output suitable for direct analysis, you can add each JSON data field as its own column instead of bundling them into the Custom_Annotation field. This is enabled with the --json-genes-as-columns flag.

Usage Example:

variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --annotate-json-genes gene_data.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["ngs","actionability"]}' \
  --json-genes-as-columns \
  --output-file variants_with_columns.tsv

This command will produce a TSV with two new columns, ngs and actionability, populated with the corresponding values from gene_data.json.
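
The column layout can be previewed with a short Python sketch. The gene column name `GENE`, the sample rows, and the `ngs` values are invented for illustration, and the code mimics the described output rather than reproducing VariantCentrifuge's own logic.

```python
def add_json_columns(rows, records, mapping, gene_column="GENE"):
    """Return new row dicts with one extra column per entry in mapping['dataFields']."""
    identifier = mapping["identifier"]
    by_gene = {rec[identifier]: rec for rec in records if identifier in rec}
    result = []
    for row in rows:
        meta = by_gene.get(row.get(gene_column), {})
        # Genes without metadata receive empty strings so every row has the same columns
        extra = {field: str(meta.get(field, "")) for field in mapping["dataFields"]}
        result.append({**row, **extra})
    return result


records = [{"symbol": "BRCA1", "ngs": "panel_v3", "actionability": "Tier1"}]
mapping = {"identifier": "symbol", "dataFields": ["ngs", "actionability"]}
rows = [
    {"CHROM": "chr17", "POS": "43045712", "GENE": "BRCA1"},
    {"CHROM": "chr2", "POS": "178525989", "GENE": "TTN"},
]

for row in add_json_columns(rows, records, mapping):
    print("\t".join(row.values()))
```

Separate columns are easier to filter and sort in spreadsheet or dataframe tools than a single semicolon-delimited field.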

BED File Annotations¶

Annotate variants with genomic regions using BED files:

variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --annotate-bed hotspots.bed \
  --annotate-bed regulatory_regions.bed \
  --output-file output.tsv
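
When preparing BED files, keep in mind that BED coordinates are 0-based and half-open, whereas VCF positions are 1-based. Off-by-one errors at region boundaries are a common result. The Python sketch below shows the general overlap test under those conventions; it illustrates the coordinate arithmetic and is not taken from VariantCentrifuge's code.

```python
def overlaps_bed(pos: int, ref: str, bed_start: int, bed_end: int) -> bool:
    """True if a VCF variant (1-based POS) overlaps a BED interval (0-based, half-open)."""
    var_start = pos - 1               # convert the VCF position to 0-based
    var_end = var_start + len(ref)    # exclusive end covering every REF base
    return var_start < bed_end and bed_start < var_end


# The BED interval 100-200 covers the 1-based positions 101 through 200
for pos in (100, 101, 200, 201):
    print(pos, overlaps_bed(pos, "A", 100, 200))

# A deletion whose REF allele spans into the region also counts as overlapping
print(overlaps_bed(99, "ACG", 100, 200))
```

If a hotspot defined by a single 1-based position N is missing from your results, check that the BED line reads `N-1` to `N` rather than `N` to `N`.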

Gene List Annotations¶

Check if variants affect genes in custom lists:

variantcentrifuge \
  --gene-file all_genes.txt \
  --vcf-file input.vcf.gz \
  --annotate-gene-list actionable_genes.txt \
  --annotate-gene-list drug_targets.txt \
  --output-file output.tsv

Combined Annotation Strategy¶

For comprehensive analysis, combine multiple annotation sources:

variantcentrifuge \
  --gene-file disease_genes.txt \
  --vcf-file patient.vcf.gz \
  --annotate-bed known_hotspots.bed \
  --annotate-gene-list clinically_actionable.txt \
  --annotate-json-genes gene_database.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["panel","evidence","notes"]}' \
  --preset rare,coding \
  --html-report \
  --output-file comprehensive_analysis.tsv

Integration with Scoring¶

Custom annotations can be used in variant scoring formulas. For example, if you annotate with actionability levels, you can create scoring formulas that prioritize Tier1 actionable variants.
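
As a rough illustration of the idea, the Python sketch below parses a Custom_Annotation string and boosts a base score for Tier1 genes. The weights, the function names, and the notion of a numeric base score are all hypothetical; VariantCentrifuge's actual scoring configuration format is not shown here and should be taken from its scoring documentation.

```python
# Hypothetical multipliers per actionability tier
TIER_WEIGHTS = {"Tier1": 2.0, "Tier2": 1.5}


def parse_custom_annotation(value: str) -> dict:
    """Split 'key=value;key=value' into a dict, ignoring malformed items."""
    return dict(item.split("=", 1) for item in value.split(";") if "=" in item)


def prioritized_score(base_score: float, custom_annotation: str) -> float:
    """Scale a base score by the weight of the gene's actionability tier (default 1.0)."""
    tier = parse_custom_annotation(custom_annotation).get("actionability")
    return base_score * TIER_WEIGHTS.get(tier, 1.0)


print(prioritized_score(0.4, "panel=HereditaryCancer;inheritance=AD;actionability=Tier1"))
print(prioritized_score(0.4, ""))
```

A multiplicative weight keeps the ranking within a tier unchanged while moving actionable genes up the list.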

Best Practices Summary¶

Plan your annotation strategy based on your analysis goals
Use the latest database versions when possible
Validate annotation completeness before analysis
Document your annotation workflow for reproducibility
Test on small datasets before processing large cohorts
Keep database versions consistent across related analyses
Consider population-specific databases for diverse cohorts
Backup original VCF files before annotation
Monitor resource usage for large-scale annotations
Validate critical variants manually when needed
Use custom annotations to integrate project-specific gene metadata
Combine annotation sources for comprehensive variant characterization

By following these annotation strategies, you’ll ensure that VariantCentrifuge has access to high-quality, comprehensive variant annotations for effective filtering and analysis.