# Variant Scoring Guide
This guide covers how to create and use custom variant scoring models in VariantCentrifuge.
## Overview

VariantCentrifuge's scoring system allows you to apply custom mathematical models to score variants based on their annotations. This is useful for:

- Prioritizing variants based on multiple criteria
- Implementing published scoring algorithms
- Creating disease-specific variant rankings
- Combining multiple evidence types into a single score
## How Scoring Works

The scoring system:

1. Loads configuration files that define variable mappings and formulas
2. Maps VCF annotation columns to formula variables
3. Applies the formulas to calculate scores
4. Adds score columns to the output
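These four steps can be approximated in plain pandas. The sketch below is illustrative only, not VariantCentrifuge's internal code; the annotation columns and variable names follow the examples later in this guide, and the two-variable formula is made up for brevity.

```python
import pandas as pd

# 1. Annotations extracted from the VCF, one row per variant
df = pd.DataFrame({
    "dbNSFP_CADD_phred": [25.0, None, 5.0],
    "dbNSFP_gnomAD_exomes_AF": [0.0001, 0.2, None],
})

# 2. Map VCF columns to formula variables, filling defaults for missing values
mapping = {
    "dbNSFP_CADD_phred": ("cadd_score", 0.0),
    "dbNSFP_gnomAD_exomes_AF": ("af_exomes", 0.0),
}
for column, (variable, default) in mapping.items():
    df[variable] = pd.to_numeric(df[column], errors="coerce").fillna(default)

# 3. Apply the formula; 4. the result becomes a new score column in the output
df["my_variant_score"] = df.eval("(cadd_score / 40) * 0.5 + (1 - af_exomes) * 0.5")
print(df[["cadd_score", "af_exomes", "my_variant_score"]])
```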
## Configuration Structure

A scoring configuration consists of two JSON files in a directory:

### 1. Variable Assignment Configuration

`variable_assignment_config.json` maps VCF columns to formula variables:

```json
{
  "variables": {
    "VCF_COLUMN_NAME": "formula_variable|default:value"
  }
}
```

### 2. Formula Configuration

`formula_config.json` defines the scoring formulas:

```json
{
  "formulas": [
    {
      "score_name": "formula_expression"
    }
  ]
}
```
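Both files sit side by side in the directory passed to `--scoring-config-path`. For the custom model built in the next section, the layout would look like this:

```text
scoring/my_custom_score/
├── variable_assignment_config.json
└── formula_config.json
```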
## Creating a Custom Scoring Model

### Step 1: Identify Required Annotations

First, determine which VCF annotations you need for your scoring model:

- Allele frequencies (e.g., `dbNSFP_gnomAD_exomes_AF`)
- Pathogenicity scores (e.g., `dbNSFP_CADD_phred`, `dbNSFP_REVEL_score`)
- Variant effects (e.g., `ANN[0].EFFECT`, `ANN[0].IMPACT`)
- Conservation scores (e.g., `dbNSFP_phyloP100way_vertebrate`)
### Step 2: Create Variable Mappings

Create a directory for your scoring configuration:

```bash
mkdir -p scoring/my_custom_score
```

Create `variable_assignment_config.json`:

```json
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "af_exomes|default:0.0",
    "dbNSFP_CADD_phred": "cadd_score|default:0.0",
    "dbNSFP_REVEL_score": "revel_score|default:0.0",
    "ANN[0].IMPACT": "impact|default:'MODIFIER'"
  }
}
```
### Step 3: Define Your Formula

Create `formula_config.json`:

```json
{
  "formulas": [
    {
      "my_variant_score": "(cadd_score / 40) * 0.4 + (revel_score * 0.3) + ((1 - af_exomes) * 0.3)"
    }
  ]
}
```
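As a quick sanity check, the formula can be evaluated with pandas for a single made-up variant (the annotation values below are illustrative, not from a real VCF):

```python
import pandas as pd

# One illustrative variant: CADD 30, REVEL 0.8, gnomAD exome AF 0.0001
df = pd.DataFrame({"cadd_score": [30.0], "revel_score": [0.8], "af_exomes": [0.0001]})
df["my_variant_score"] = df.eval(
    "(cadd_score / 40) * 0.4 + (revel_score * 0.3) + ((1 - af_exomes) * 0.3)"
)
print(df["my_variant_score"].iloc[0])  # 0.30 + 0.24 + 0.29997 ≈ 0.84
```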
### Step 4: Test Your Scoring

```bash
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file test.vcf.gz \
  --scoring-config-path scoring/my_custom_score \
  --output-file scored_test.tsv
```
## Formula Syntax

Formulas use pandas `eval` syntax with these capabilities:

### Basic Operations

- Arithmetic: `+`, `-`, `*`, `/`, `**`
- Comparisons: `==`, `!=`, `<`, `>`, `<=`, `>=`
- Logical: `&` (and), `|` (or), `~` (not)
### Conditional Logic

```python
# If-then-else using multiplication
"(impact == 'HIGH') * 1.0 + (impact == 'MODERATE') * 0.5"

# Multiple conditions
"((af < 0.01) & (cadd > 20)) * 1.0"
```
### String Operations

```python
# String matching
"consequence.str.contains('missense') * 0.5"

# Exact match
"(consequence == 'stop_gained') * 1.0"
```
### Mathematical Functions

```python
# Exponential (for logistic regression)
"1 / (1 + 2.718281828459045 ** (-linear_combination))"

# Min/Max bounds
"cadd_score.clip(0, 40) / 40"
```
## Example: Logistic Regression Model

Here's a complete example implementing a logistic regression model:

### Variable Mappings

```json
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "gnomad_af|default:0.0",
    "dbNSFP_CADD_phred": "cadd|default:0.0",
    "ANN[0].EFFECT": "effect|default:''",
    "ANN[0].IMPACT": "impact|default:''"
  }
}
```

### Logistic Formula

```json
{
  "formulas": [
    {
      "pathogenicity_score": "1 / (1 + 2.718281828459045 ** (-(2.5 + (cadd - 15) * 0.1 + (gnomad_af * -50) + ((effect == 'missense_variant') * 0.5) + ((impact == 'HIGH') * 2))))"
    }
  ]
}
```
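To get a feel for the scores this produces, the same expression can be evaluated in plain Python for a couple of made-up variants (the coefficients are from the formula above; the variant values are illustrative):

```python
import math

def pathogenicity_score(cadd, gnomad_af, effect, impact):
    # Same linear combination and logistic transform as the formula above
    linear = (2.5 + (cadd - 15) * 0.1 + gnomad_af * -50
              + (effect == "missense_variant") * 0.5 + (impact == "HIGH") * 2)
    return 1 / (1 + math.exp(-linear))

# Rare, high-impact missense variant: score near 1
print(pathogenicity_score(28.0, 1e-5, "missense_variant", "HIGH"))  # ~0.998

# Common, low-CADD synonymous variant: score well below 0.5
print(pathogenicity_score(5.0, 0.05, "synonymous_variant", "LOW"))  # ~0.27
```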
## Advanced Techniques

### Multiple Scores

You can calculate multiple scores in one configuration; formulas are evaluated in order, so a later formula can reference scores defined earlier (as `combined_score` does below):

```json
{
  "formulas": [
    {
      "conservation_score": "(phylop + phastcons) / 2"
    },
    {
      "pathogenicity_score": "(cadd / 40) * 0.5 + (revel * 0.5)"
    },
    {
      "combined_score": "conservation_score * 0.3 + pathogenicity_score * 0.7"
    }
  ]
}
```
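A hedged sketch of why this works: if each formula is evaluated in turn and written back as a column, later expressions see the earlier scores as ordinary columns. The loop below is an illustration in plain pandas, not VariantCentrifuge's internal implementation (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"phylop": [7.5], "phastcons": [0.9],
                   "cadd": [30.0], "revel": [0.7]})

formulas = [
    ("conservation_score", "(phylop + phastcons) / 2"),
    ("pathogenicity_score", "(cadd / 40) * 0.5 + (revel * 0.5)"),
    ("combined_score", "conservation_score * 0.3 + pathogenicity_score * 0.7"),
]

# Evaluate in order: each new column is available to the formulas after it
for name, expression in formulas:
    df[name] = df.eval(expression)

print(df[["conservation_score", "pathogenicity_score", "combined_score"]])
```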
### Handling Missing Data

Use default values in variable mappings:

```json
{
  "variables": {
    "dbNSFP_CADD_phred": "cadd|default:10.0",
    "ClinVar_CLNSIG": "clinvar|default:'not_provided'"
  }
}
```
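In pandas terms, a default behaves like filling missing values before the formula runs. A minimal illustration (not the tool's actual code; the annotation values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "dbNSFP_CADD_phred": [25.0, None],
    "ClinVar_CLNSIG": ["Pathogenic", None],
})

# Variants without an annotation fall back to the defaults from the mapping above
df["cadd"] = df["dbNSFP_CADD_phred"].fillna(10.0)
df["clinvar"] = df["ClinVar_CLNSIG"].fillna("not_provided")
print(df[["cadd", "clinvar"]])
```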
### Categorical Variables

Convert categories to numeric values:

```json
{
  "formulas": [
    {
      "impact_numeric": "(impact == 'HIGH') * 4 + (impact == 'MODERATE') * 3 + (impact == 'LOW') * 2 + (impact == 'MODIFIER') * 1"
    }
  ]
}
```
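A quick way to confirm the mapping is to run the same expression on a tiny DataFrame (illustrative; `engine="python"` is passed for the string comparisons):

```python
import pandas as pd

df = pd.DataFrame({"impact": ["HIGH", "MODERATE", "LOW", "MODIFIER"]})
df["impact_numeric"] = df.eval(
    "(impact == 'HIGH') * 4 + (impact == 'MODERATE') * 3 "
    "+ (impact == 'LOW') * 2 + (impact == 'MODIFIER') * 1",
    engine="python",
)
print(df)  # HIGH -> 4, MODERATE -> 3, LOW -> 2, MODIFIER -> 1
```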
## Real-World Example: Nephro Variant Score

The included `scoring/nephro_variant_score` configuration implements a sophisticated model for kidney disease variants:

```bash
# Use the nephro variant score
variantcentrifuge \
  --gene-file kidney_genes.txt \
  --vcf-file patient.vcf.gz \
  --scoring-config-path scoring/nephro_variant_score \
  --preset rare,coding \
  --html-report \
  --output-file kidney_analysis.tsv
```

This model:

- Uses logistic regression with multiple predictors
- Handles gnomAD frequencies from exomes and genomes
- Incorporates CADD scores
- Considers variant consequences (missense, frameshift, etc.)
- Accounts for predicted impact levels
## Best Practices

1. **Validate Your Formulas**
   - Test on known pathogenic and benign variants
   - Check that score distributions make sense
   - Verify handling of missing data

2. **Document Your Model**
   - Include comments explaining the rationale
   - Document expected score ranges
   - Provide interpretation guidelines

3. **Version Control**
   - Track scoring configurations in git
   - Tag releases of scoring models
   - Document changes between versions

4. **Performance Considerations**
   - Keep formulas reasonably simple
   - Avoid deeply nested conditions
   - Test on large datasets
## Troubleshooting

### Common Issues

**"Column not found" errors**

- Check that the VCF has the required annotations
- Verify that column names match exactly
- Use appropriate default values

**Formula syntax errors**

- Test formulas incrementally
- Check that parentheses are balanced
- Verify that string operations use proper syntax

**Unexpected scores**

- Check that default values are appropriate
- Verify numeric conversions
- Test edge cases (0, 1, missing values)
### Debugging Tips

Keep intermediate TSV files to inspect data:

```bash
variantcentrifuge --keep-intermediates ...
```

Start with simple formulas and build complexity:

```
{"test_score": "cadd"}                               // Start simple
{"test_score": "cadd / 40"}                          // Add normalization
{"test_score": "(cadd / 40) * 0.5 + (revel * 0.5)"}  // Combine
```

Use the Python REPL to test pandas expressions:

```python
import pandas as pd

df = pd.DataFrame({"cadd": [10, 20, 30]})
df.eval("cadd / 40")
```
## Integration with Other Features

Scoring works seamlessly with other VariantCentrifuge features:

- **Filtering**: Apply filters before or after scoring
- **Gene Burden**: Use scores in burden testing
- **Reports**: Scores appear in HTML reports
- **IGV**: Visualize high-scoring variants
## Contributing Scoring Models

If you develop a useful scoring model, consider contributing it:

1. Create a well-documented configuration
2. Include example use cases
3. Provide validation results
4. Submit via GitHub pull request
## Further Reading

- Configuration Guide - General configuration options
- API Reference - Technical documentation
- pandas.eval documentation - Formula syntax reference