# Variant Scoring Guide
This guide covers how to create and use custom variant scoring models in VariantCentrifuge.
## Overview

VariantCentrifuge's scoring system allows you to apply custom mathematical models to score variants based on their annotations. This is useful for:

- Prioritizing variants based on multiple criteria
- Implementing published scoring algorithms
- Creating disease-specific variant rankings
- Combining multiple evidence types into a single score
## How Scoring Works

The scoring system:

1. Loads configuration files that define variable mappings and formulas
2. Maps VCF annotation columns to formula variables
3. Applies the formulas to calculate scores
4. Adds score columns to the output
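These four steps can be approximated in plain pandas. The sketch below is illustrative only, not VariantCentrifuge's internal code; the annotation columns and variable names follow the examples later in this guide, and the two-variable formula is made up for brevity.

```python
import pandas as pd

# 1. Annotations extracted from the VCF, one row per variant
df = pd.DataFrame({
    "dbNSFP_CADD_phred": [25.0, None, 5.0],
    "dbNSFP_gnomAD_exomes_AF": [0.0001, 0.2, None],
})

# 2. Map VCF columns to formula variables, filling defaults for missing values
mapping = {
    "dbNSFP_CADD_phred": ("cadd_score", 0.0),
    "dbNSFP_gnomAD_exomes_AF": ("af_exomes", 0.0),
}
for column, (variable, default) in mapping.items():
    df[variable] = pd.to_numeric(df[column], errors="coerce").fillna(default)

# 3. Apply the formula; 4. the result becomes a new score column in the output
df["my_variant_score"] = df.eval("(cadd_score / 40) * 0.5 + (1 - af_exomes) * 0.5")
print(df[["cadd_score", "af_exomes", "my_variant_score"]])
```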
## Configuration Structure

A scoring configuration consists of two JSON files in a directory:

### 1. Variable Assignment Configuration

`variable_assignment_config.json` maps VCF columns to formula variables:

```json
{
  "variables": {
    "VCF_COLUMN_NAME": "formula_variable|default:value"
  }
}
```

### 2. Formula Configuration

`formula_config.json` defines the scoring formulas:

```json
{
  "formulas": [
    {
      "score_name": "formula_expression"
    }
  ]
}
```
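Both files sit side by side in the directory passed to `--scoring-config-path`. For the custom model built in the next section, the layout would look like this:

```text
scoring/my_custom_score/
├── variable_assignment_config.json
└── formula_config.json
```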
## Creating a Custom Scoring Model

### Step 1: Identify Required Annotations

First, determine which VCF annotations you need for your scoring model:

- Allele frequencies (e.g., `dbNSFP_gnomAD_exomes_AF`)
- Pathogenicity scores (e.g., `dbNSFP_CADD_phred`, `dbNSFP_REVEL_score`)
- Variant effects (e.g., `ANN[0].EFFECT`, `ANN[0].IMPACT`)
- Conservation scores (e.g., `dbNSFP_phyloP100way_vertebrate`)
### Step 2: Create Variable Mappings

Create a directory for your scoring configuration:

```bash
mkdir -p scoring/my_custom_score
```

Create `variable_assignment_config.json`:

```json
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "af_exomes|default:0.0",
    "dbNSFP_CADD_phred": "cadd_score|default:0.0",
    "dbNSFP_REVEL_score": "revel_score|default:0.0",
    "ANN[0].IMPACT": "impact|default:'MODIFIER'"
  }
}
```
### Step 3: Define Your Formula

Create `formula_config.json`:

```json
{
  "formulas": [
    {
      "my_variant_score": "(cadd_score / 40) * 0.4 + (revel_score * 0.3) + ((1 - af_exomes) * 0.3)"
    }
  ]
}
```
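As a quick sanity check, the formula can be evaluated with pandas for a single made-up variant (the annotation values below are illustrative, not from a real VCF):

```python
import pandas as pd

# One illustrative variant: CADD 30, REVEL 0.8, gnomAD exome AF 0.0001
df = pd.DataFrame({"cadd_score": [30.0], "revel_score": [0.8], "af_exomes": [0.0001]})
df["my_variant_score"] = df.eval(
    "(cadd_score / 40) * 0.4 + (revel_score * 0.3) + ((1 - af_exomes) * 0.3)"
)
print(df["my_variant_score"].iloc[0])  # 0.30 + 0.24 + 0.29997 ≈ 0.84
```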
### Step 4: Test Your Scoring

```bash
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file test.vcf.gz \
  --scoring-config-path scoring/my_custom_score \
  --output-file scored_test.tsv
```
## Formula Syntax

Formulas use pandas `eval` syntax with these capabilities:

### Basic Operations

- Arithmetic: `+`, `-`, `*`, `/`, `**`
- Comparisons: `==`, `!=`, `<`, `>`, `<=`, `>=`
- Logical: `&` (and), `|` (or), `~` (not)
### Conditional Logic

```python
# If-then-else using multiplication
"(impact == 'HIGH') * 1.0 + (impact == 'MODERATE') * 0.5"

# Multiple conditions
"((af < 0.01) & (cadd > 20)) * 1.0"
```
### String Operations

```python
# String matching
"consequence.str.contains('missense') * 0.5"

# Exact match
"(consequence == 'stop_gained') * 1.0"
```
### Mathematical Functions

```python
# Exponential (for logistic regression)
"1 / (1 + 2.718281828459045 ** (-linear_combination))"

# Min/Max bounds
"cadd_score.clip(0, 40) / 40"
```
## Example: Logistic Regression Model

Here's a complete example implementing a logistic regression model:

### Variable Mappings

```json
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "gnomad_af|default:0.0",
    "dbNSFP_CADD_phred": "cadd|default:0.0",
    "ANN[0].EFFECT": "effect|default:''",
    "ANN[0].IMPACT": "impact|default:''"
  }
}
```

### Logistic Formula

```json
{
  "formulas": [
    {
      "pathogenicity_score": "1 / (1 + 2.718281828459045 ** (-(2.5 + (cadd - 15) * 0.1 + (gnomad_af * -50) + ((effect == 'missense_variant') * 0.5) + ((impact == 'HIGH') * 2))))"
    }
  ]
}
```
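To get a feel for the scores this produces, the same expression can be evaluated in plain Python for a couple of made-up variants (the coefficients are from the formula above; the variant values are illustrative):

```python
import math

def pathogenicity_score(cadd, gnomad_af, effect, impact):
    # Same linear combination and logistic transform as the formula above
    linear = (2.5 + (cadd - 15) * 0.1 + gnomad_af * -50
              + (effect == "missense_variant") * 0.5 + (impact == "HIGH") * 2)
    return 1 / (1 + math.exp(-linear))

# Rare, high-impact missense variant: score near 1
print(pathogenicity_score(28.0, 1e-5, "missense_variant", "HIGH"))  # ~0.998

# Common, low-CADD synonymous variant: score well below 0.5
print(pathogenicity_score(5.0, 0.05, "synonymous_variant", "LOW"))  # ~0.27
```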
## Advanced Techniques

### Multiple Scores

You can calculate multiple scores in one configuration; formulas are evaluated in order, so a later formula can reference scores defined earlier (as `combined_score` does below):

```json
{
  "formulas": [
    {
      "conservation_score": "(phylop + phastcons) / 2"
    },
    {
      "pathogenicity_score": "(cadd / 40) * 0.5 + (revel * 0.5)"
    },
    {
      "combined_score": "conservation_score * 0.3 + pathogenicity_score * 0.7"
    }
  ]
}
```
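A hedged sketch of why this works: if each formula is evaluated in turn and written back as a column, later expressions see the earlier scores as ordinary columns. The loop below is an illustration in plain pandas, not VariantCentrifuge's internal implementation (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"phylop": [7.5], "phastcons": [0.9],
                   "cadd": [30.0], "revel": [0.7]})

formulas = [
    ("conservation_score", "(phylop + phastcons) / 2"),
    ("pathogenicity_score", "(cadd / 40) * 0.5 + (revel * 0.5)"),
    ("combined_score", "conservation_score * 0.3 + pathogenicity_score * 0.7"),
]

# Evaluate in order: each new column is available to the formulas after it
for name, expression in formulas:
    df[name] = df.eval(expression)

print(df[["conservation_score", "pathogenicity_score", "combined_score"]])
```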
### Handling Missing Data

Use default values in variable mappings:

```json
{
  "variables": {
    "dbNSFP_CADD_phred": "cadd|default:10.0",
    "ClinVar_CLNSIG": "clinvar|default:'not_provided'"
  }
}
```
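In pandas terms, a default behaves like filling missing values before the formula runs. A minimal illustration (not the tool's actual code; the annotation values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "dbNSFP_CADD_phred": [25.0, None],
    "ClinVar_CLNSIG": ["Pathogenic", None],
})

# Variants without an annotation fall back to the defaults from the mapping above
df["cadd"] = df["dbNSFP_CADD_phred"].fillna(10.0)
df["clinvar"] = df["ClinVar_CLNSIG"].fillna("not_provided")
print(df[["cadd", "clinvar"]])
```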
### Categorical Variables

Convert categories to numeric values:

```json
{
  "formulas": [
    {
      "impact_numeric": "(impact == 'HIGH') * 4 + (impact == 'MODERATE') * 3 + (impact == 'LOW') * 2 + (impact == 'MODIFIER') * 1"
    }
  ]
}
```
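A quick way to confirm the mapping is to run the same expression on a tiny DataFrame (illustrative; `engine="python"` is passed for the string comparisons):

```python
import pandas as pd

df = pd.DataFrame({"impact": ["HIGH", "MODERATE", "LOW", "MODIFIER"]})
df["impact_numeric"] = df.eval(
    "(impact == 'HIGH') * 4 + (impact == 'MODERATE') * 3 "
    "+ (impact == 'LOW') * 2 + (impact == 'MODIFIER') * 1",
    engine="python",
)
print(df)  # HIGH -> 4, MODERATE -> 3, LOW -> 2, MODIFIER -> 1
```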
## Real-World Example: Nephro Variant Score

The included `scoring/nephro_variant_score` configuration implements a sophisticated model for kidney disease variants:

```bash
# Use the nephro variant score
variantcentrifuge \
  --gene-file kidney_genes.txt \
  --vcf-file patient.vcf.gz \
  --scoring-config-path scoring/nephro_variant_score \
  --preset rare,coding \
  --html-report \
  --output-file kidney_analysis.tsv
```

This model:

- Uses logistic regression with multiple predictors
- Handles gnomAD frequencies from exomes and genomes
- Incorporates CADD scores
- Considers variant consequences (missense, frameshift, etc.)
- Accounts for predicted impact levels
## Best Practices

1. **Validate Your Formulas**
   - Test on known pathogenic and benign variants
   - Check that score distributions make sense
   - Verify handling of missing data

2. **Document Your Model**
   - Include comments explaining the rationale
   - Document expected score ranges
   - Provide interpretation guidelines

3. **Version Control**
   - Track scoring configurations in git
   - Tag releases of scoring models
   - Document changes between versions

4. **Performance Considerations**
   - Keep formulas reasonably simple
   - Avoid deeply nested conditions
   - Test on large datasets
## Troubleshooting

### Common Issues

**"Column not found" errors**

- Check that the VCF has the required annotations
- Verify that column names match exactly
- Use appropriate default values

**Formula syntax errors**

- Test formulas incrementally
- Check that parentheses are balanced
- Verify that string operations use proper syntax

**Unexpected scores**

- Check that default values are appropriate
- Verify numeric conversions
- Test edge cases (0, 1, missing values)
### Debugging Tips

Keep intermediate TSV files to inspect data:

```bash
variantcentrifuge --keep-intermediates ...
```

Start with simple formulas and build complexity:

```
{"test_score": "cadd"}                               // Start simple
{"test_score": "cadd / 40"}                          // Add normalization
{"test_score": "(cadd / 40) * 0.5 + (revel * 0.5)"}  // Combine
```

Use the Python REPL to test pandas expressions:

```python
import pandas as pd

df = pd.DataFrame({"cadd": [10, 20, 30]})
df.eval("cadd / 40")
```
## Integration with Other Features

Scoring works seamlessly with other VariantCentrifuge features:

- **Filtering**: Apply filters before or after scoring
- **Gene Burden**: Use scores in burden testing
- **Reports**: Scores appear in HTML reports
- **IGV**: Visualize high-scoring variants
## Contributing Scoring Models

If you develop a useful scoring model, consider contributing it:

1. Create a well-documented configuration
2. Include example use cases
3. Provide validation results
4. Submit via GitHub pull request
## Further Reading

- Configuration Guide - General configuration options
- API Reference - Technical documentation
- pandas.eval documentation - Formula syntax reference