Annotation Strategies

This guide provides recommended strategies for annotating your VCF files before using VariantCentrifuge. Proper annotation is crucial for effective variant filtering and analysis.

Overview

VariantCentrifuge relies on annotated VCF fields to filter and analyze variants. The quality and comprehensiveness of your annotations directly impact the effectiveness of your analysis. This guide covers:

  • Recommended annotation tools and databases

  • Step-by-step annotation workflows

  • Database selection guidelines

  • Quality control and validation

  • Common pitfalls and solutions

End-to-End Annotation Workflows

Workflow 1: Comprehensive Research Annotation

This workflow provides comprehensive annotation suitable for most research applications.

Step 1: Variant Normalization

# Normalize variants (split multi-allelic, left-align)
bcftools norm -m-both -f reference.fasta input.vcf.gz > normalized.vcf
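
After splitting with -m-both, no record should carry more than one ALT allele. The following is a minimal sketch for confirming that, assuming an uncompressed VCF on stdin. The function name count_multiallelic is illustrative and is not part of bcftools or VariantCentrifuge.

```shell
# count_multiallelic: hypothetical helper. Reads a plain-text VCF on stdin and
# prints the number of data records whose ALT column still contains a comma.
count_multiallelic() {
    awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }'
}

# Example: count_multiallelic < normalized.vcf   (0 is expected after -m-both)
```

A non-zero count suggests the normalization step was skipped or was run on a different file.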

Step 2: Functional Annotation with SnpEff

# Annotate with SnpEff
java -Xmx8g -jar snpEff.jar GRCh38.99 normalized.vcf > snpeff_annotated.vcf

# Alternative with custom options
java -Xmx8g -jar snpEff.jar -v -stats snpeff_summary.html GRCh38.99 normalized.vcf > snpeff_annotated.vcf

Step 3: Database Annotation with SnpSift

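# Add gnomAD frequencies (-name prefixes the added INFO fields, e.g. gnomAD_exomes_AF)
java -jar SnpSift.jar annotate -name gnomAD_exomes_ gnomad.exomes.r4.0.sites.vcf.gz snpeff_annotated.vcf > gnomad_annotated.vcf

# Add ClinVar annotations (fields become ClinVar_CLNSIG, ClinVar_CLNDN, ...)
java -jar SnpSift.jar annotate -name ClinVar_ clinvar.vcf.gz gnomad_annotated.vcf > clinvar_annotated.vcf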

# Add dbNSFP predictions
java -jar SnpSift.jar dbnsfp -db dbNSFP4.1a.txt.gz clinvar_annotated.vcf > final_annotated.vcf

Step 4: Quality Control

# Check annotation completeness
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' final_annotated.vcf | head -10

# Generate annotation statistics
java -jar SnpSift.jar extractFields final_annotated.vcf CHROM POS REF ALT "ANN[0].IMPACT" "gnomAD_exomes_AF" > annotation_check.tsv
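
The extracted table can be summarized to see how impacts are distributed and how many rows lack one. This is a rough sketch, assuming annotation_check.tsv has one header line followed by the six columns requested above, with the impact in column 5. The helper name impact_counts is hypothetical.

```shell
# impact_counts: hypothetical helper. Reads the extractFields TSV on stdin,
# skips the header line, and tallies the values found in column 5 (impact).
# Empty or "." values are reported as MISSING.
impact_counts() {
    awk -F'\t' 'NR > 1 {
                    k = ($5 == "" || $5 == ".") ? "MISSING" : $5
                    c[k]++
                }
                END { for (k in c) print k "\t" c[k] }' | sort
}

# Example: impact_counts < annotation_check.tsv
```

A large MISSING count usually points to a genome build or chromosome naming mismatch rather than to genuinely unannotatable variants.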

Workflow 2: Clinical Diagnostic Annotation

Optimized for clinical variant interpretation with emphasis on known pathogenic variants.

Step 1: Core Annotation

# SnpEff with clinical focus
java -Xmx8g -jar snpEff.jar -canon GRCh38.99 input.vcf > clinical_snpeff.vcf

Step 2: Clinical Database Priority

# ClinVar (highest priority for clinical interpretation)
java -jar SnpSift.jar annotate -info "CLNSIG,CLNDN,CLNREVSTAT" clinvar.vcf.gz clinical_snpeff.vcf > clinical_clinvar.vcf

# HGMD (if available)
java -jar SnpSift.jar annotate hgmd.vcf.gz clinical_clinvar.vcf > clinical_hgmd.vcf

# gnomAD for population frequency
java -jar SnpSift.jar annotate -info "AF,AF_popmax" gnomad.exomes.vcf.gz clinical_hgmd.vcf > clinical_annotated.vcf

Step 3: Pathogenicity Scores

# Add CADD and REVEL via dbNSFP
java -jar SnpSift.jar dbnsfp -f CADD_phred,REVEL_score,SIFT_score,Polyphen2_HDIV_score -db dbNSFP4.1a.txt.gz clinical_annotated.vcf > clinical_final.vcf

Workflow 3: Cancer Genomics Annotation

Focused on somatic variant annotation with cancer-relevant databases.

Step 1: Somatic-Specific Annotation

# SnpEff with all transcripts (cancer analysis often needs multiple isoforms)
java -Xmx8g -jar snpEff.jar -v GRCh38.99 somatic.vcf > cancer_snpeff.vcf

Step 2: Cancer Databases

# COSMIC (Catalogue of Somatic Mutations in Cancer)
java -jar SnpSift.jar annotate cosmic.vcf.gz cancer_snpeff.vcf > cancer_cosmic.vcf

# OncoKB annotations (if available)
# Custom annotation scripts may be needed

# Population frequencies (for germline contamination assessment)
java -jar SnpSift.jar annotate gnomad.genomes.vcf.gz cancer_cosmic.vcf > cancer_annotated.vcf

VariantCentrifuge Configuration Examples

Example 1: Rare Disease Analysis

{
  "reference": "GRCh38.99",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF gnomAD_genomes_AF ClinVar_CLNSIG GEN[*].GT",
  "presets": {
    "rare_pathogenic": "(((gnomAD_exomes_AF < 0.001) | (na gnomAD_exomes_AF)) & ((gnomAD_genomes_AF < 0.001) | (na gnomAD_genomes_AF))) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE') | (ClinVar_CLNSIG =~ '[Pp]athogenic'))"
  }
}
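
The frequency part of the rare_pathogenic preset keeps a variant when its allele frequency is below the threshold or when no frequency is recorded at all, so variants absent from gnomAD are not discarded. The same logic can be sketched outside VariantCentrifuge as follows, assuming a two-column TSV (variant ID, allele frequency) with one frequency value per line. The helper name keep_rare is hypothetical.

```shell
# keep_rare: hypothetical helper. Usage: keep_rare MAX_AF < id_and_af.tsv
# Prints the ID (column 1) of every row whose AF (column 2) is missing
# ("" or ".") or numerically below MAX_AF.
keep_rare() {
    awk -F'\t' -v max="$1" \
        '($2 == "" || $2 == ".") || ($2 + 0 < max + 0) { print $1 }'
}

# Example: keep_rare 0.001 < id_and_af.tsv
```

Dropping the missing-value branch would silently remove novel variants, which are often the most interesting ones in a rare disease analysis.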

Example 2: Population Genetics Study

{
  "reference": "GRCh38.99",
  "fields_to_extract": "CHROM POS REF ALT ID ANN[0].GENE ANN[0].IMPACT gnomAD_exomes_AF gnomAD_exomes_AC gnomAD_genomes_AF gnomAD_genomes_AC GEN[*].GT GEN[*].DP",
  "presets": {
    "coding_variants": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
    "high_quality": "(GEN[*].DP >= 20) & (QUAL >= 30)"
  }
}

Example 3: Clinical Exome Analysis

{
  "reference": "GRCh38.99",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].FEATUREID ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF ClinVar_CLNSIG ClinVar_CLNDN GEN[*].GT",
  "presets": {
    "clinical_significance": "(ClinVar_CLNSIG =~ '[Pp]athogenic') | (ClinVar_CLNSIG =~ 'Uncertain_significance')",
    "rare_coding": "(((gnomAD_exomes_AF < 0.01) | (na gnomAD_exomes_AF)) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))"
  }
}
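
The pattern [Pp]athogenic used in these presets matches more ClinVar values than it may appear to. Assuming the =~ operator matches substrings the way grep does, it also matches Conflicting_interpretations_of_pathogenicity. The sketch below shows the difference between the loose pattern and a stricter one. Both helper names are hypothetical, and the stricter pattern is only one possible choice.

```shell
# matches_pathogenic: hypothetical helper. Prints the CLNSIG values (one per
# line on stdin) that contain the substring matched by [Pp]athogenic.
matches_pathogenic() {
    grep -E '[Pp]athogenic'
}

# matches_pathogenic_strict: same idea, but the match must not be followed by
# a lowercase letter, which excludes "...pathogenicity".
matches_pathogenic_strict() {
    grep -E '[Pp]athogenic($|[^a-z])'
}

# Example:
#   printf '%s\n' Pathogenic Conflicting_interpretations_of_pathogenicity \
#       | matches_pathogenic_strict
```

Whether conflicting entries should pass the filter is a project decision. The point is to make that choice deliberately.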

Quality Control and Validation

Annotation Completeness Check

# Check annotation rates (bcftools query prints "." for a missing tag)
bcftools query -f '%CHROM\t%POS\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' annotated.vcf | \
awk -F'\t' '{if ($3 != ".") ann++; if ($4 != ".") gnomad++}
     END {if (NR == 0) {print "No variants found"; exit}
          printf "Total variants: %d\n", NR
          printf "SnpEff annotated: %d (%.1f%%)\n", ann, ann / NR * 100
          printf "gnomAD annotated: %d (%.1f%%)\n", gnomad, gnomad / NR * 100}'

Validate Database Versions

# Check database versions in VCF headers
bcftools view -h annotated.vcf | grep "##INFO.*database"
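
Tool provenance is often easier to find than database versions, because SnpEff, SnpSift and bcftools each write their own version and command lines into the header. This is a small sketch that pulls out just those lines, assuming the VCF header arrives on stdin. The helper name annotation_provenance is hypothetical.

```shell
# annotation_provenance: hypothetical helper. Reads VCF header lines on stdin
# and prints the version/command lines written by SnpEff, SnpSift and bcftools.
annotation_provenance() {
    grep -E '^##(SnpEff|SnpSift)(Version|Cmd)=|^##bcftools_[A-Za-z]+(Version|Command)='
}

# Example: bcftools view -h annotated.vcf | annotation_provenance
```

The recorded command lines include the database file names that were passed in, which helps when file names carry the release number.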

Common Issues and Solutions

Issue: Low annotation rates

Symptoms: Many variants lack annotations

Causes: Reference genome mismatch, inconsistent chromosome naming, outdated databases

Solutions:

  • Verify genome build consistency (GRCh37 vs GRCh38)

  • Update annotation databases

  • Check chromosome naming (chr1 vs 1)
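
The chromosome naming check can be scripted. The sketch below reports which style the data lines of a VCF use, assuming plain-text VCF on stdin. The helper name chrom_style is hypothetical.

```shell
# chrom_style: hypothetical helper. Reads a plain-text VCF on stdin and reports
# whether its data lines use "chr"-prefixed contig names, bare names, or both.
chrom_style() {
    awk -F'\t' '!/^#/ { if ($1 ~ /^chr/) p++; else b++ }
        END {
            if (p && b) print "mixed"
            else if (p) print "chr-prefixed"
            else if (b) print "bare"
            else print "empty"
        }'
}

# Example: compare the sample VCF with an annotation database
#   chrom_style < normalized.vcf
#   gzip -dc clinvar.vcf.gz | chrom_style
```

If the two results differ, annotation by position finds no matches even when both files use the same genome build.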

Issue: Inconsistent population frequencies

Symptoms: Unexpected frequency distributions

Causes: Population stratification, database version differences

Solutions:

  • Use population-specific frequency data

  • Validate against known common variants

  • Check for ancestry-specific databases

Issue: Missing pathogenicity predictions

Symptoms: Empty CADD/REVEL scores

Causes: Variant type limitations, database coverage

Solutions:

  • Use multiple prediction tools

  • Consider variant type-specific predictors

  • Manual curation for critical variants
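
To see whether missing scores are explained by variant type, it can help to break the missing rate down by SNV versus everything else, since dbNSFP covers single-nucleotide variants and REVEL scores missense changes only. This is a minimal sketch, assuming a headerless three-column TSV (REF, ALT, score). The helper name missing_by_type is hypothetical.

```shell
# missing_by_type: hypothetical helper. stdin is a headerless TSV with REF, ALT
# and one score column. Prints "<type> <missing>/<total>" per variant type,
# where a score of "" or "." counts as missing.
missing_by_type() {
    awk -F'\t' '{
            t = (length($1) == 1 && length($2) == 1) ? "SNV" : "INDEL_OR_MNV"
            total[t]++
            if ($3 == "" || $3 == ".") miss[t]++
        }
        END { for (t in total) print t "\t" (miss[t] + 0) "/" total[t] }' | sort
}

# Example (column choice is illustrative):
#   java -jar SnpSift.jar extractFields final_annotated.vcf REF ALT dbNSFP_REVEL_score \
#       | tail -n +2 | missing_by_type
```

If nearly all missing scores fall in the INDEL_OR_MNV row, the gap reflects predictor coverage rather than an annotation failure.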

Performance Optimization

Memory Management

# Increase Java heap size for large VCFs
java -Xmx16g -jar snpEff.jar ...

# Use streaming processing for very large files
bcftools annotate -a database.vcf.gz ...

Parallel Processing

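# Split VCF by chromosome for parallel annotation
# (input.vcf.gz must be bgzip-compressed and indexed for -r to work)
for chr in {1..22} X Y; do
    bcftools view -r "chr${chr}" input.vcf.gz | \
    java -jar snpEff.jar GRCh38.99 /dev/stdin > "chr${chr}_annotated.vcf" &
done
wait

# Merge results (sort restores chromosome order; -Oz writes compressed output to match the .vcf.gz name)
bcftools concat chr*_annotated.vcf | bcftools sort -Oz -o final_annotated.vcf.gz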

Database Preparation

# Create tabix indexes for faster annotation
bgzip database.vcf
tabix -p vcf database.vcf.gz

# Prepare custom annotation files: verify REF alleles against the reference first
bcftools norm --check-ref e -f reference.fasta custom_annotations.vcf.gz ...

Custom Gene Annotations with VariantCentrifuge

In addition to standard VCF annotations, VariantCentrifuge provides built-in functionality to add custom annotations during analysis. These annotations are applied after variant extraction and are included in the final output.

JSON Gene Annotations

The --annotate-json-genes feature allows you to integrate structured gene metadata from JSON files directly into your variant analysis.

JSON File Format

Create a JSON file containing an array of gene objects:

[
  {
    "gene_symbol": "BRCA1",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "DNA repair",
    "disease_association": "Breast/Ovarian cancer",
    "actionability": "Tier1"
  },
  {
    "gene_symbol": "TP53",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "Tumor suppressor",
    "disease_association": "Li-Fraumeni syndrome",
    "actionability": "Tier1"
  },
  {
    "gene_symbol": "MLH1",
    "panel": "Lynch",
    "inheritance": "AD",
    "function": "DNA mismatch repair",
    "disease_association": "Lynch syndrome",
    "actionability": "Tier1"
  }
]

Field Mapping Configuration

Use the --json-gene-mapping parameter to specify how JSON fields map to annotations:

{
  "identifier": "gene_symbol",
  "dataFields": ["panel", "inheritance", "actionability"]
}

  • identifier: The JSON field containing the gene symbol

  • dataFields: Array of fields to include in the Custom_Annotation column

Usage Example

variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file patient.vcf.gz \
  --annotate-json-genes gene_metadata.json \
  --json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","inheritance","actionability"]}' \
  --output-file annotated_variants.tsv

This will add annotations like panel=HereditaryCancer;inheritance=AD;actionability=Tier1 to the Custom_Annotation column for variants in matching genes.

Adding JSON Annotations as Separate Columns

For a more structured output suitable for direct analysis, you can add each JSON data field as its own column instead of bundling them into the Custom_Annotation field. This is enabled with the --json-genes-as-columns flag.

Usage Example:

variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --annotate-json-genes gene_data.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["ngs","actionability"]}' \
  --json-genes-as-columns \
  --output-file variants_with_columns.tsv

This command will produce a TSV with two new columns, ngs and actionability, populated with the corresponding values from gene_data.json.

BED File Annotations

Annotate variants with genomic regions using BED files:

variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --annotate-bed hotspots.bed \
  --annotate-bed regulatory_regions.bed \
  --output-file output.tsv
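
When preparing BED files, keep in mind that BED intervals are 0-based and half-open while VCF positions are 1-based, so a variant at POS p lies inside an interval when start < p <= end. The sketch below applies that convention to a single position. It illustrates the coordinate rule only and is not a description of how VariantCentrifuge computes overlaps internally. The helper name bed_hits and the four-column BED layout (chrom, start, end, name) are assumptions.

```shell
# bed_hits: hypothetical helper. Usage: bed_hits CHROM POS < regions.bed
# Prints the name column (4th) of every BED interval containing the 1-based
# position POS, using BED's 0-based, half-open coordinates.
bed_hits() {
    awk -F'\t' -v c="$1" -v p="$2" \
        '$1 == c && $2 + 0 < p + 0 && p + 0 <= $3 + 0 { print $4 }'
}

# Example: bed_hits chr17 43045700 < hotspots.bed
```

An off-by-one error in a hand-built BED file shows up as variants at region boundaries being annotated or missed unexpectedly.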

Gene List Annotations

Check if variants affect genes in custom lists:

variantcentrifuge \
  --gene-file all_genes.txt \
  --vcf-file input.vcf.gz \
  --annotate-gene-list actionable_genes.txt \
  --annotate-gene-list drug_targets.txt \
  --output-file output.tsv
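
Gene lists exported from spreadsheets often carry Windows line endings, stray whitespace, or duplicates, any of which can cause symbols to go unmatched. This is a small cleanup sketch, assuming the list holds one gene symbol per line (an assumption about the file format, not something stated above). The helper name clean_gene_list is hypothetical.

```shell
# clean_gene_list: hypothetical helper. Reads a gene list on stdin, removes
# carriage returns, trims leading/trailing blanks, drops empty lines, and
# prints the unique symbols in sorted order. Case is left unchanged because
# some symbols (e.g. C9orf72) contain lowercase letters.
clean_gene_list() {
    tr -d '\r' |
        awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); if ($0 != "") print }' |
        sort -u
}

# Example: clean_gene_list < actionable_genes.txt > actionable_genes.clean.txt
```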

Combined Annotation Strategy

For comprehensive analysis, combine multiple annotation sources:

variantcentrifuge \
  --gene-file disease_genes.txt \
  --vcf-file patient.vcf.gz \
  --annotate-bed known_hotspots.bed \
  --annotate-gene-list clinically_actionable.txt \
  --annotate-json-genes gene_database.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["panel","evidence","notes"]}' \
  --preset rare,coding \
  --html-report \
  --output-file comprehensive_analysis.tsv

Integration with Scoring

Custom annotations can be used in variant scoring formulas. For example, if you annotate with actionability levels, you can create scoring formulas that prioritize Tier1 actionable variants.

Best Practices Summary

  1. Plan your annotation strategy based on your analysis goals

  2. Use the latest database versions when possible

  3. Validate annotation completeness before analysis

  4. Document your annotation workflow for reproducibility

  5. Test on small datasets before processing large cohorts

  6. Keep database versions consistent across related analyses

  7. Consider population-specific databases for diverse cohorts

  8. Back up original VCF files before annotation

  9. Monitor resource usage for large-scale annotations

  10. Validate critical variants manually when needed

  11. Use custom annotations to integrate project-specific gene metadata

  12. Combine annotation sources for comprehensive variant characterization

By following these annotation strategies, you’ll ensure that VariantCentrifuge has access to high-quality, comprehensive variant annotations for effective filtering and analysis.