Rare Disease Analysis Workflow¶
This guide provides a comprehensive workflow for analyzing rare disease variants using VariantCentrifuge, from initial data preparation through final interpretation.
Overview¶
Rare disease variant analysis requires careful filtering to identify potentially causative variants while minimizing false positives. This workflow focuses on:
Identifying rare, high-impact variants
Prioritizing variants in known disease genes
Leveraging clinical databases for variant interpretation
Generating comprehensive reports for clinical review
Prerequisites¶
Annotated VCF files (see Annotation Strategies)
Disease gene lists (e.g., OMIM, ClinGen, custom panels)
Patient phenotype information
Family information (if available)
Step-by-Step Workflow¶
Step 1: Prepare Gene Lists¶
# Create disease-specific gene lists
echo "BRCA1\nBRCA2\nTP53\nATM\nCHEK2" > breast_cancer_genes.txt
# Or use comprehensive panels
wget https://ftp.clinicalgenome.org/ClinGen_gene_curation_list.tsv
awk -F'\t' '$3 == "Breast cancer" {print $1}' ClinGen_gene_curation_list.tsv > clinical_breast_genes.txt
Step 2: Configure Analysis¶
Create a rare disease-specific configuration:
{
"reference": "GRCh38.99",
"filters": "",
"fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].FEATUREID ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score gnomAD_exomes_AF gnomAD_genomes_AF ClinVar_CLNSIG ClinVar_CLNDN HGMD_CLASS GEN[*].GT GEN[*].DP",
"interval_expand": 20,
"perform_gene_burden": true,
"presets": {
"rare_pathogenic": "(((gnomAD_exomes_AF < 0.001) | (na gnomAD_exomes_AF)) & ((gnomAD_genomes_AF < 0.001) | (na gnomAD_genomes_AF))) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE') | ((ClinVar_CLNSIG =~ '[Pp]athogenic') & !(ClinVar_CLNSIG =~ '[Cc]onflicting')))",
"ultra_rare": "(((gnomAD_exomes_AC <= 2) | (na gnomAD_exomes_AC)) & ((gnomAD_genomes_AC <= 2) | (na gnomAD_genomes_AC)))",
"high_confidence": "(GEN[*].DP >= 20) & (GEN[*].GQ >= 20) & (QUAL >= 30)",
"protein_affecting": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
"clinical_significance": "((ClinVar_CLNSIG =~ '[Pp]athogenic') | (ClinVar_CLNSIG =~ 'VUS') | (ClinVar_CLNSIG =~ '[Ll]ikely'))"
}
}
Step 3: Run Initial Analysis¶
# Single patient analysis
variantcentrifuge \
--config rare_disease_config.json \
--gene-file disease_genes.txt \
--vcf-file patient_001.vcf.gz \
--preset rare_pathogenic,high_confidence \
--output-file patient_001_rare_variants.tsv \
--html-report \
--xlsx
Step 4: Inheritance Pattern Analysis¶
For family-based analysis:
# Trio analysis (proband + parents)
variantcentrifuge \
--config rare_disease_config.json \
--gene-file disease_genes.txt \
--vcf-file family_trio.vcf.gz \
--preset ultra_rare,protein_affecting \
--samples-file trio_samples.txt \
--phenotype-file family_phenotypes.tsv \
--phenotype-sample-column "sample_id" \
--phenotype-value-column "affected_status" \
--output-file trio_analysis.tsv \
--perform-gene-burden \
--html-report
Trio samples file (trio_samples.txt
):
proband_001 Proband
father_001 Father
mother_001 Mother
Family phenotypes file (family_phenotypes.tsv
):
sample_id affected_status age sex
phenotype_details
proband_001 affected 25 F intellectual disability, seizures
father_001 unaffected 55 M normal
mother_001 unaffected 52 F normal
Step 5: Multi-Gene Panel Analysis¶
# Comprehensive gene panel analysis
variantcentrifuge \
--config rare_disease_config.json \
--gene-file comprehensive_panel.txt \
--vcf-file patient_001.vcf.gz \
--preset clinical_significance \
--filters "(GEN[*].DP >= 15) & (QUAL >= 20)" \
--output-file patient_001_panel.tsv \
--igv \
--bam-mapping-file patient_bams.tsv \
--igv-reference hg38 \
--html-report
Interpretation Guidelines¶
Variant Prioritization¶
Pathogenic/Likely Pathogenic in ClinVar
Highest priority for known disease variants
Review evidence and conflicting interpretations
Loss-of-Function in Disease Genes
Nonsense, frameshift, splice site variants
High confidence if gene is known for haploinsufficiency
Missense with Strong Predictions
CADD > 20, REVEL > 0.7
Affecting conserved domains
Novel or ultra-rare (AC ≤ 2)
Splice Region Variants
Within 2bp of exon boundaries
Novel splice predictions
Experimental validation recommended
Filtering Strategy¶
# Tier 1: Known pathogenic variants
variantcentrifuge \
--preset clinical_significance \
--filters "(ClinVar_CLNSIG =~ '[Pp]athogenic') & !(ClinVar_CLNSIG =~ '[Cc]onflicting')" \
--gene-file disease_genes.txt \
--vcf-file patient.vcf.gz \
--output-file tier1_pathogenic.tsv
# Tier 2: High-impact rare variants
variantcentrifuge \
--preset ultra_rare,protein_affecting \
--filters "(ANN[ANY].IMPACT has 'HIGH') & (GEN[*].DP >= 20)" \
--gene-file disease_genes.txt \
--vcf-file patient.vcf.gz \
--output-file tier2_high_impact.tsv
# Tier 3: Moderate impact with strong predictions
variantcentrifuge \
--preset rare_pathogenic \
--filters "(ANN[ANY].IMPACT has 'MODERATE') & ((dbNSFP_CADD_phred >= 25) | (dbNSFP_REVEL_score >= 0.7))" \
--gene-file disease_genes.txt \
--vcf-file patient.vcf.gz \
--output-file tier3_moderate_predicted.tsv
Case Studies¶
Case 1: Intellectual Disability¶
Patient: 8-year-old with developmental delay and seizures
# Analysis focusing on neurodevelopmental genes
variantcentrifuge \
--config rare_disease_config.json \
--gene-file intellectual_disability_genes.txt \
--vcf-file patient_ID001.vcf.gz \
--preset ultra_rare,protein_affecting \
--output-file ID001_neurodevelopmental.tsv \
--html-report
Key findings:
De novo nonsense variant in SCN1A (Dravet syndrome)
Ultra-rare missense variant in STXBP1 with CADD=28
Multiple VUS requiring functional studies
Case 2: Cardiomyopathy¶
Patient: 35-year-old with hypertrophic cardiomyopathy
# Cardiomyopathy gene panel analysis
variantcentrifuge \
--config rare_disease_config.json \
--gene-file cardiomyopathy_genes.txt \
--vcf-file patient_CM002.vcf.gz \
--preset clinical_significance \
--filters "(ClinVar_CLNSIG =~ '[Pp]athogenic|VUS') | ((gnomAD_exomes_AF < 0.0001) & (ANN[ANY].IMPACT has 'HIGH|MODERATE'))" \
--output-file CM002_cardiomyopathy.tsv \
--igv \
--html-report
Key findings:
Pathogenic variant in MYBPC3 (known HCM gene)
Family segregation analysis recommended
Cascade screening for relatives
Quality Control¶
Sample Quality Metrics¶
# Check coverage and quality metrics
variantcentrifuge \
--preset high_confidence \
--filters "(GEN[*].DP >= 20) & (GEN[*].GQ >= 20)" \
--fields "CHROM POS REF ALT GEN[*].DP GEN[*].GQ GEN[*].AD" \
--gene-file disease_genes.txt \
--vcf-file patient.vcf.gz \
--output-file quality_check.tsv
Validation Requirements¶
Sanger sequencing for pathogenic/likely pathogenic variants
Family segregation when family samples available
Functional studies for novel VUS in critical genes
CNV analysis for genes with known deletion/duplication syndromes
Reporting Templates¶
Clinical Report Structure¶
Patient Information
Demographics and phenotype
Family history
Indication for testing
Methods
Sequencing platform and coverage
Analysis pipeline and filters
Gene panel composition
Results
Pathogenic/likely pathogenic variants
Variants of uncertain significance
Negative findings in relevant genes
Interpretation
Clinical significance
Inheritance pattern
Recommendations
Automated Report Generation¶
# Generate clinical report with key findings
variantcentrifuge \
--config clinical_report_config.json \
--gene-file clinical_panel.txt \
--vcf-file patient.vcf.gz \
--preset clinical_significance \
--phenotype-file patient_phenotype.tsv \
--output-file clinical_report.tsv \
--html-report \
--xlsx
Best Practices¶
Use validated gene panels for specific conditions
Apply appropriate frequency thresholds based on disease prevalence
Consider inheritance patterns in filtering strategy
Validate critical findings with orthogonal methods
Document analysis parameters for reproducibility
Regular database updates for clinical annotations
Multidisciplinary review of complex cases
Genetic counseling for positive findings
Common Pitfalls¶
Over-reliance on prediction tools without experimental validation
Ignoring population-specific frequencies in diverse cohorts
Inadequate coverage assessment in critical gene regions
Missing structural variants in single nucleotide variant analysis
Insufficient phenotype documentation for variant interpretation
By following this workflow, you can systematically analyze rare disease variants and generate clinically actionable reports using VariantCentrifuge.