# Annotation Strategies

This guide provides recommended strategies for annotating your VCF files before using VariantCentrifuge. Proper annotation is crucial for effective variant filtering and analysis.

## Overview

VariantCentrifuge relies on annotated VCF fields to filter and analyze variants. The quality and comprehensiveness of your annotations directly impact the effectiveness of your analysis. This guide covers:

- Recommended annotation tools and databases
- Step-by-step annotation workflows
- Database selection guidelines
- Quality control and validation
- Common pitfalls and solutions

## Recommended Tools

### Core Annotation Tools

#### SnpEff
**Purpose:** Functional effect prediction and gene annotation

**Key Features:**
- Predicts variant effects (HIGH, MODERATE, LOW, MODIFIER impact)
- Provides gene, transcript, and protein annotations
- Supports multiple genome builds and species
- Generates comprehensive ANN fields

**Installation:**
```bash
# Via conda/mamba
mamba install -c bioconda snpeff

# Download databases
snpEff download GRCh38.99  # or your preferred version
```

**Basic Usage:**
```bash
# A conda install also provides a `snpEff` wrapper script that can replace `java -jar snpEff.jar`
java -jar snpEff.jar GRCh38.99 input.vcf > annotated.vcf
```

#### SnpSift
**Purpose:** Database annotation and field extraction

**Key Features:**
- Adds population frequency data (gnomAD, ExAC, 1000G)
- Clinical significance annotations (ClinVar)
- Pathogenicity predictions (CADD, REVEL, SIFT, PolyPhen)
- Flexible field extraction and filtering

**Installation:**
```bash
# Included in the SnpEff download; bioconda also provides it as a separate package
mamba install -c bioconda snpsift
```

#### bcftools
**Purpose:** VCF manipulation and annotation

**Key Features:**
- High-performance VCF processing
- Custom annotation from BED/VCF files
- Header manipulation and normalization
- Variant normalization and filtering

**Installation:**
```bash
mamba install -c bioconda bcftools
```

### Database Selection

#### Population Frequency Databases

**gnomAD (Genome Aggregation Database)**
- **Current version:** v4.1 (recommended)
- **Coverage:** 730,947 exomes, 76,215 genomes (the often-quoted 125,748 exomes and 15,708 genomes describe v2.1.1)
- **Best for:** General population frequency filtering
- **Fields:** `gnomAD_exomes_AF`, `gnomAD_genomes_AF`, `gnomAD_*_AC`

**ExAC (Exome Aggregation Consortium)**
- **Status:** Legacy (replaced by gnomAD)
- **Use only if:** gnomAD unavailable for your analysis

**1000 Genomes Project**
- **Best for:** Population-specific frequency data
- **Limitation:** Smaller sample size than gnomAD

#### Clinical Databases

**ClinVar**
- **Purpose:** Clinical significance annotations
- **Updated:** Weekly, with archived monthly releases
- **Fields:** `ClinVar_CLNSIG`, `ClinVar_CLNDN`
- **Critical for:** Clinical variant interpretation

**HGMD (Human Gene Mutation Database)**
- **Purpose:** Known disease mutations
- **License:** Commercial (HGMD Professional); a public version with delayed content is available to registered academic users
- **Best for:** Known pathogenic variant identification

#### Pathogenicity Prediction

**CADD (Combined Annotation Dependent Depletion)**
- **Range:** PHRED-scaled score, roughly 1-99 (higher = more deleterious)
- **Threshold:** ≥20 (top 1% most deleterious of all possible SNVs)
- **Best for:** Genome-wide pathogenicity scoring

**REVEL (Rare Exome Variant Ensemble Learner)**
- **Range:** 0-1 (higher = more pathogenic)
- **Threshold:** ≥0.5 for likely pathogenic
- **Best for:** Missense variant prediction

**dbNSFP**
- **Contains:** Multiple prediction scores (SIFT, PolyPhen, CADD, REVEL, etc.)
- **Advantage:** Comprehensive collection
- **Best for:** Comparative scoring
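
The thresholds above can be applied to an extracted table with a few lines of awk. The following is a minimal sketch, not part of VariantCentrifuge: the function name `flag_deleterious` is hypothetical, and it assumes a tab-separated table with a header row whose score columns are named `dbNSFP_CADD_phred` and `dbNSFP_REVEL_score`, each holding a single value per row.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep rows whose CADD or REVEL score passes a threshold.
# Reads a tab-separated table with a header row on stdin.
# Missing scores ("." or empty) never pass; they are not treated as zero.
# Usage: flag_deleterious [CADD_MIN] [REVEL_MIN] < table.tsv
flag_deleterious() {
    awk -F'\t' -v cadd_min="${1:-20}" -v revel_min="${2:-0.5}" '
        NR == 1 {
            # Locate the score columns by name so column order does not matter.
            for (i = 1; i <= NF; i++) {
                if ($i == "dbNSFP_CADD_phred")  cadd = i
                if ($i == "dbNSFP_REVEL_score") revel = i
            }
            print
            next
        }
        {
            c = (cadd  ? $cadd  : ".")
            r = (revel ? $revel : ".")
            pass_cadd  = (c != "." && c != "" && c + 0 >= cadd_min + 0)
            pass_revel = (r != "." && r != "" && r + 0 >= revel_min + 0)
            if (pass_cadd || pass_revel) print
        }
    '
}

# Example on inline data: only the first and third variants pass.
printf 'CHROM\tPOS\tdbNSFP_CADD_phred\tdbNSFP_REVEL_score\n1\t100\t25.3\t.\n1\t200\t12.1\t0.2\n2\t300\t.\t0.81\n' |
    flag_deleterious 20 0.5
```

Treating a missing score as "not passing" rather than as zero keeps unscored variants (for example indels without a REVEL score) visibly separate from variants that were scored and found benign.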

## End-to-End Annotation Workflows

### Workflow 1: Comprehensive Research Annotation

This workflow provides comprehensive annotation suitable for most research applications.

#### Step 1: Variant Normalization
```bash
# Normalize variants (split multi-allelic, left-align)
bcftools norm -m-both -f reference.fasta input.vcf.gz > normalized.vcf
```
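
A quick sanity check after normalization is to confirm that no record still carries several comma-separated ALT alleles. The sketch below uses plain awk on uncompressed VCF text; the function name `count_multiallelic` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count VCF records that still carry more than one ALT allele.
# Reads uncompressed VCF text on stdin; header lines (starting with "#") are skipped.
count_multiallelic() {
    awk -F'\t' '
        /^#/ { next }
        $5 ~ /,/ { n++ }
        END { print n + 0 }
    '
}

# Example on inline data: the second record has two ALT alleles.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t50\tPASS\t.\n1\t200\t.\tC\tT,G\t50\tPASS\t.\n' |
    count_multiallelic
```

Run as `count_multiallelic < normalized.vcf`, the count would be expected to be 0 once multi-allelic sites have been split with `-m-both`.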

#### Step 2: Functional Annotation with SnpEff
```bash
# Annotate with SnpEff
java -Xmx8g -jar snpEff.jar GRCh38.99 normalized.vcf > snpeff_annotated.vcf

# Alternative with custom options
java -Xmx8g -jar snpEff.jar -v -stats snpeff_summary.html GRCh38.99 normalized.vcf > snpeff_annotated.vcf
```

#### Step 3: Database Annotation with SnpSift
```bash
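# Add gnomAD frequencies (-name prefixes the copied INFO fields, e.g. AF -> gnomAD_exomes_AF)
java -jar SnpSift.jar annotate -name gnomAD_exomes_ -info AF,AC gnomad.exomes.r4.0.sites.vcf.gz snpeff_annotated.vcf > gnomad_annotated.vcf

# Add ClinVar annotations (CLNSIG -> ClinVar_CLNSIG, CLNDN -> ClinVar_CLNDN)
java -jar SnpSift.jar annotate -name ClinVar_ -info CLNSIG,CLNDN clinvar.vcf.gz gnomad_annotated.vcf > clinvar_annotated.vcf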

# Add dbNSFP predictions
java -jar SnpSift.jar dbnsfp -db dbNSFP4.1a.txt.gz clinvar_annotated.vcf > final_annotated.vcf
```

#### Step 4: Quality Control
```bash
# Check annotation completeness
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' final_annotated.vcf | head -10

# Extract key fields into a table for inspection
java -jar SnpSift.jar extractFields final_annotated.vcf CHROM POS REF ALT "ANN[0].IMPACT" "gnomAD_exomes_AF" > annotation_check.tsv
```

### Workflow 2: Clinical Diagnostic Annotation

Optimized for clinical variant interpretation with emphasis on known pathogenic variants.

#### Step 1: Core Annotation
```bash
# SnpEff restricted to canonical transcripts (-canon) for consistent clinical reporting
java -Xmx8g -jar snpEff.jar -canon GRCh38.99 input.vcf > clinical_snpeff.vcf
```

#### Step 2: Clinical Database Priority
```bash
# ClinVar (highest priority for clinical interpretation)
java -jar SnpSift.jar annotate -info "CLNSIG,CLNDN,CLNREVSTAT" clinvar.vcf.gz clinical_snpeff.vcf > clinical_clinvar.vcf

# HGMD (if available)
java -jar SnpSift.jar annotate hgmd.vcf.gz clinical_clinvar.vcf > clinical_hgmd.vcf

# gnomAD for population frequency (gnomAD v4 names the second field AF_grpmax instead of AF_popmax)
java -jar SnpSift.jar annotate -info "AF,AF_popmax" gnomad.exomes.vcf.gz clinical_hgmd.vcf > clinical_annotated.vcf
```

#### Step 3: Pathogenicity Scores
```bash
# Add CADD and REVEL via dbNSFP
java -jar SnpSift.jar dbnsfp -f CADD_phred,REVEL_score,SIFT_score,Polyphen2_HDIV_score -db dbNSFP4.1a.txt.gz clinical_annotated.vcf > clinical_final.vcf
```

### Workflow 3: Cancer Genomics Annotation

Focused on somatic variant annotation with cancer-relevant databases.

#### Step 1: Somatic-Specific Annotation
```bash
# SnpEff with all transcripts (cancer analysis often needs multiple isoforms, so -canon is omitted)
java -Xmx8g -jar snpEff.jar -v GRCh38.99 somatic.vcf > cancer_snpeff.vcf
```

#### Step 2: Cancer Databases
```bash
# COSMIC (Catalogue of Somatic Mutations in Cancer)
java -jar SnpSift.jar annotate cosmic.vcf.gz cancer_snpeff.vcf > cancer_cosmic.vcf

# OncoKB annotations (if available)
# Custom annotation scripts may be needed

# Population frequencies (for germline contamination assessment)
java -jar SnpSift.jar annotate gnomad.genomes.vcf.gz cancer_cosmic.vcf > cancer_annotated.vcf
```

## VariantCentrifuge Configuration Examples

### Example 1: Rare Disease Analysis

```json
{
  "reference": "GRCh38.99",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF gnomAD_genomes_AF ClinVar_CLNSIG GEN[*].GT",
  "presets": {
    "rare_pathogenic": "(((gnomAD_exomes_AF < 0.001) | (na gnomAD_exomes_AF)) & ((gnomAD_genomes_AF < 0.001) | (na gnomAD_genomes_AF))) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE') | (ClinVar_CLNSIG =~ '[Pp]athogenic'))"
  }
}
```

### Example 2: Population Genetics Study

```json
{
  "reference": "GRCh38.99",
  "fields_to_extract": "CHROM POS REF ALT ID ANN[0].GENE ANN[0].IMPACT gnomAD_exomes_AF gnomAD_exomes_AC gnomAD_genomes_AF gnomAD_genomes_AC GEN[*].GT GEN[*].DP",
  "presets": {
    "coding_variants": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
    "high_quality": "(GEN[*].DP >= 20) & (QUAL >= 30)"
  }
}
```

### Example 3: Clinical Exome Analysis

```json
{
  "reference": "GRCh38.99",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].FEATUREID ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF ClinVar_CLNSIG ClinVar_CLNDN GEN[*].GT",
  "presets": {
    "clinical_significance": "(ClinVar_CLNSIG =~ '[Pp]athogenic') | (ClinVar_CLNSIG =~ 'Uncertain_significance')",
    "rare_coding": "(((gnomAD_exomes_AF < 0.01) | (na gnomAD_exomes_AF)) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))"
  }
}
```
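
The presets in this section match `ClinVar_CLNSIG` against the pattern `[Pp]athogenic`. As a rough illustration of what such a pattern selects, the sketch below runs it through `grep -E` over representative CLNSIG values. It assumes the `=~` operator performs an unanchored regular-expression search, as `grep` does; the helper name `matches` and the stricter pattern are illustrative, not part of VariantCentrifuge.

```bash
#!/usr/bin/env bash
# Representative CLNSIG values as they appear in ClinVar VCF releases.
clnsig_values='Pathogenic
Likely_pathogenic
Pathogenic/Likely_pathogenic
Uncertain_significance
Likely_benign
Benign
Conflicting_classifications_of_pathogenicity'

# Hypothetical helper: list the values that an unanchored regex would select.
matches() {
    printf '%s\n' "$clnsig_values" | grep -E -- "$1"
}

echo "Loose pattern [Pp]athogenic selects:"
matches '[Pp]athogenic'

echo "Stricter pattern selects:"
matches '(^|[/|,])(Likely_)?[Pp]athogenic($|[/|,])'
```

The loose pattern also selects the conflicting-classification entries because "pathogenic" is a substring of "pathogenicity". ClinVar spells out `Uncertain_significance` rather than using the abbreviation VUS, so a pattern such as `VUS` selects nothing from this list.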

## Quality Control and Validation

### Annotation Completeness Check

```bash
# Check annotation rates (bcftools query prints "." for missing INFO fields)
bcftools query -f '%CHROM\t%POS\t%INFO/ANN\t%INFO/gnomAD_exomes_AF\n' annotated.vcf | \
awk -F'\t' '{total++; if($3!=".") ann++; if($4!=".") gnomad++}
     END{if(total==0){print "No variants found"; exit}
         print "Total variants:", total; print "SnpEff annotated:", ann+0, "(" ann/total*100 "%)"; print "gnomAD annotated:", gnomad+0, "(" gnomad/total*100 "%)"}'
```
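
The same idea generalizes to any extracted table, for example the output of `SnpSift extractFields`. The following sketch reports a fill rate for every column; the function name `annotation_rates` is hypothetical, and it assumes a tab-separated table with a header row in which missing values are either empty or `.`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report, per column, how many rows carry a value.
# Reads a tab-separated table with a header row on stdin.
# A cell counts as missing when it is empty or ".".
annotation_rates() {
    awk -F'\t' '
        NR == 1 { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
        {
            total++
            for (i = 1; i <= ncol; i++)
                if ($i != "" && $i != ".") filled[i]++
        }
        END {
            for (i = 1; i <= ncol; i++) {
                pct = (total > 0) ? 100 * filled[i] / total : 0
                printf "%s\t%d/%d\t%.1f%%\n", name[i], filled[i], total, pct
            }
        }
    '
}

# Example on inline data with one missing impact and two missing frequencies.
printf 'CHROM\tPOS\tANN[0].IMPACT\tgnomAD_exomes_AF\n1\t100\tHIGH\t0.001\n1\t200\tMODERATE\t.\n2\t300\tLOW\t\n1\t400\t.\t0.2\n' |
    annotation_rates
```

A low fill rate for a frequency column is not necessarily an error, since variants absent from gnomAD are expected to be unannotated; a low rate for `ANN` fields more often points to a build or contig-naming mismatch.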

### Validate Database Versions

```bash
# Check tool and database provenance recorded in the VCF header
bcftools view -h annotated.vcf | grep -iE "snpeff|snpsift|gnomad|clinvar|dbnsfp"
```

### Common Issues and Solutions

#### Issue: Low annotation rates
**Symptoms:** Many variants lack annotations

**Causes:** Reference genome mismatch, outdated databases

**Solutions:**
- Verify genome build consistency (GRCh37 vs GRCh38)
- Update annotation databases
- Check chromosome naming (chr1 vs 1)
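
Contig naming can be checked directly from the VCF text. The sketch below is a minimal helper, with the hypothetical name `chrom_style`, that reports which naming style the data lines of a VCF use; running it on both the input VCF and the annotation database shows whether they agree.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a VCF uses "chr"-prefixed contig names.
# Reads uncompressed VCF text on stdin and looks only at data lines.
chrom_style() {
    awk -F'\t' '
        /^#/ { next }
        $1 ~ /^chr/ { prefixed++; next }
        { bare++ }
        END {
            if (prefixed && bare) print "mixed"
            else if (prefixed)    print "chr-prefixed"
            else if (bare)        print "bare"
            else                  print "empty"
        }
    '
}

# Example on inline data: all records use the "chr" prefix.
printf '#CHROM\tPOS\nchr1\t100\nchr2\t200\n' | chrom_style
```

If the two files disagree, `bcftools annotate --rename-chrs` with a two-column mapping file (old name, new name) is one way to bring them in line before annotation.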

#### Issue: Inconsistent population frequencies
**Symptoms:** Unexpected frequency distributions

**Causes:** Population stratification, database version differences

**Solutions:**
- Use population-specific frequency data
- Validate against known common variants
- Check for ancestry-specific databases

#### Issue: Missing pathogenicity predictions
**Symptoms:** Empty CADD/REVEL scores

**Causes:** Variant type limitations, database coverage

**Solutions:**
- Use multiple prediction tools
- Consider variant type-specific predictors
- Manual curation for critical variants

## Performance Optimization

### Memory Management
```bash
# Increase Java heap size for large VCFs
java -Xmx16g -jar snpEff.jar ...

# Use streaming processing for very large files
bcftools annotate -a database.vcf.gz ...
```

### Parallel Processing
```bash
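# Split VCF by chromosome for parallel annotation (input.vcf.gz must be indexed)
# -noStats stops the concurrent jobs from overwriting each other's summary files
for chr in {1..22} X Y; do
    bcftools view -r chr${chr} input.vcf.gz | \
    java -jar snpEff.jar -noStats GRCh38.99 /dev/stdin > chr${chr}_annotated.vcf &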
done
wait

# Merge results and write bgzip-compressed output
bcftools concat chr*_annotated.vcf | bcftools sort -Oz -o final_annotated.vcf.gz
```
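
Launching one background job per chromosome starts 24 Java processes at once, which can exhaust memory on smaller machines. One way to cap concurrency is `xargs -P`. The sketch below is an assumption-laden illustration: `per_chrom` is a hypothetical helper, the contig names assume the `chr` prefix used above, and `-P` is a widely available (GNU and BSD) but non-POSIX `xargs` option.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command once per chromosome, at most MAX_JOBS at a time.
# Usage: per_chrom MAX_JOBS COMMAND [ARGS...]
# The chromosome name is appended as the last argument of each invocation.
per_chrom() {
    max_jobs="$1"
    shift
    awk 'BEGIN { for (i = 1; i <= 22; i++) print "chr" i; print "chrX"; print "chrY" }' |
        xargs -n 1 -P "$max_jobs" "$@"
}

# Dry run: print the region names instead of launching annotation jobs.
# Output order is not guaranteed because the jobs run concurrently.
per_chrom 4 echo region
```

In practice the command would be a small wrapper script (for example a hypothetical `annotate_one.sh` that takes the region as its only argument and runs the `bcftools view | snpEff` pipeline for it), invoked as `per_chrom 4 ./annotate_one.sh`.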

### Database Preparation
```bash
# Create tabix indexes for faster annotation
bgzip database.vcf
tabix -p vcf database.vcf.gz

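# Prepare custom annotation files: confirm REF alleles match the reference before use
bcftools norm --check-ref e -f reference.fasta custom_annotations.vcf.gz -Oz -o custom_annotations.checked.vcf.gz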
```

## Custom Gene Annotations with VariantCentrifuge

In addition to standard VCF annotations, VariantCentrifuge provides built-in functionality to add custom annotations during analysis. These annotations are applied after variant extraction and are included in the final output.

### JSON Gene Annotations

The `--annotate-json-genes` feature allows you to integrate structured gene metadata from JSON files directly into your variant analysis.

#### JSON File Format

Create a JSON file containing an array of gene objects:

```json
[
  {
    "gene_symbol": "BRCA1",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "DNA repair",
    "disease_association": "Breast/Ovarian cancer",
    "actionability": "Tier1"
  },
  {
    "gene_symbol": "TP53",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "Tumor suppressor",
    "disease_association": "Li-Fraumeni syndrome",
    "actionability": "Tier1"
  },
  {
    "gene_symbol": "MLH1",
    "panel": "Lynch",
    "inheritance": "AD",
    "function": "DNA mismatch repair",
    "disease_association": "Lynch syndrome",
    "actionability": "Tier1"
  }
]
```

#### Field Mapping Configuration

Use the `--json-gene-mapping` parameter to specify how JSON fields map to annotations:

```json
{
  "identifier": "gene_symbol",
  "dataFields": ["panel", "inheritance", "actionability"]
}
```

- `identifier`: The JSON field containing the gene symbol
- `dataFields`: Array of fields to include in the Custom_Annotation column

#### Usage Example

```bash
variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file patient.vcf.gz \
  --annotate-json-genes gene_metadata.json \
  --json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","inheritance","actionability"]}' \
  --output-file annotated_variants.tsv
```

This will add annotations like `panel=HereditaryCancer;inheritance=AD;actionability=Tier1` to the Custom_Annotation column for variants in matching genes.

#### Adding JSON Annotations as Separate Columns

For a more structured output suitable for direct analysis, you can add each JSON data field as its own column instead of bundling them into the `Custom_Annotation` field. This is enabled with the `--json-genes-as-columns` flag.

**Usage Example:**

```bash
variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --annotate-json-genes gene_data.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["ngs","actionability"]}' \
  --json-genes-as-columns \
  --output-file variants_with_columns.tsv
```

This command will produce a TSV with two new columns, `ngs` and `actionability`, populated with the corresponding values from `gene_data.json`.

### BED File Annotations

Annotate variants with genomic regions using BED files:

```bash
variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --annotate-bed hotspots.bed \
  --annotate-bed regulatory_regions.bed \
  --output-file output.tsv
```
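
When preparing BED files, keep in mind that BED intervals are 0-based and half-open, while VCF positions are 1-based. How VariantCentrifuge resolves overlaps internally is not described here; the sketch below only illustrates the standard coordinate convention, using a hypothetical helper named `bed_hits`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list BED intervals that contain a 1-based VCF position.
# BED intervals are 0-based and half-open, so a BED row (start, end) covers
# the 1-based positions start+1 through end.
# Usage: bed_hits CHROM POS < regions.bed
bed_hits() {
    awk -F'\t' -v chrom="$1" -v pos="$2" '
        /^(#|track|browser)/ { next }
        ($1 "") == (chrom "") && pos + 0 > $2 + 0 && pos + 0 <= $3 + 0 {
            # Print the interval name if the BED file has a fourth column.
            print (NF >= 4 ? $4 : $1 ":" $2 "-" $3)
        }
    '
}

# Example on inline data: position 101 is the first base covered by hotspotA.
printf 'chr7\t100\t200\thotspotA\nchr7\t150\t300\n' | bed_hits chr7 101
```

A variant at position 100 would not be reported for the `chr7 100 200` interval, which is a common source of off-by-one surprises when BED files are written by hand from VCF coordinates.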

### Gene List Annotations

Check if variants affect genes in custom lists:

```bash
variantcentrifuge \
  --gene-file all_genes.txt \
  --vcf-file input.vcf.gz \
  --annotate-gene-list actionable_genes.txt \
  --annotate-gene-list drug_targets.txt \
  --output-file output.tsv
```

### Combined Annotation Strategy

For comprehensive analysis, combine multiple annotation sources:

```bash
variantcentrifuge \
  --gene-file disease_genes.txt \
  --vcf-file patient.vcf.gz \
  --annotate-bed known_hotspots.bed \
  --annotate-gene-list clinically_actionable.txt \
  --annotate-json-genes gene_database.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["panel","evidence","notes"]}' \
  --preset rare,coding \
  --html-report \
  --output-file comprehensive_analysis.tsv
```

### Integration with Scoring

Custom annotations can be used in variant scoring formulas. For example, if you annotate with `actionability` levels, you can create scoring formulas that prioritize Tier1 actionable variants.

## Best Practices Summary

1. **Plan your annotation strategy** based on your analysis goals
2. **Use the latest database versions** when possible
3. **Validate annotation completeness** before analysis
4. **Document your annotation workflow** for reproducibility
5. **Test on small datasets** before processing large cohorts
6. **Keep database versions consistent** across related analyses
7. **Consider population-specific databases** for diverse cohorts
8. **Backup original VCF files** before annotation
9. **Monitor resource usage** for large-scale annotations
10. **Validate critical variants manually** when needed
11. **Use custom annotations** to integrate project-specific gene metadata
12. **Combine annotation sources** for comprehensive variant characterization

By following these annotation strategies, you'll ensure that VariantCentrifuge has access to high-quality, comprehensive variant annotations for effective filtering and analysis.