Configuration¶

VariantCentrifuge uses a JSON configuration file (config.json) to set default parameters and define reusable filter presets. This allows for flexible, reproducible analysis workflows.

Configuration File Location¶

VariantCentrifuge looks for configuration files in the following order:

1. File specified with the --config command-line option
2. config.json in the current working directory
3. Default configuration included with the package
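
The lookup order above can be read as a first-match search. The following is a minimal sketch of that idea, not VariantCentrifuge's own code: `resolve_config_path` and its parameters are hypothetical names, and the location of the packaged default is passed in rather than discovered.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(cli_path: Optional[str], cwd: Path, packaged_default: Path) -> Path:
    """Return the first configuration file found, in the documented order."""
    # 1. An explicit --config value always wins.
    if cli_path:
        return Path(cli_path)
    # 2. Otherwise use config.json from the working directory, if present.
    local = cwd / "config.json"
    if local.is_file():
        return local
    # 3. Fall back to the default shipped with the package.
    return packaged_default
```

Calling `resolve_config_path(None, Path.cwd(), some_default_path)` in a directory without a config.json would return `some_default_path`.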

Configuration Structure¶

Required Keys¶

These parameters must be provided either in the config file or via command-line arguments:

reference (str) - Reference genome database for snpEff (e.g., “GRCh37.75”, “GRCh38.99”)
filters (str) - SnpSift filter expression to select variants
fields_to_extract (str) - Space-separated list of fields to extract via SnpSift

Optional Keys¶

Configuration options with default values:

interval_expand (int) - Number of bases to expand around genes. Default: 0
add_chr (bool) - Add “chr” prefix to chromosome names. Default: true
debug_level (str) - Logging level: “DEBUG”, “INFO”, “WARN”, “ERROR”. Default: “INFO”
no_stats (bool) - Skip statistics computation. Default: false
perform_gene_burden (bool) - Perform gene burden analysis. Default: false
gene_burden_mode (str) - “samples” or “alleles”. Default: “alleles”
correction_method (str) - “fdr” or “bonferroni” for multiple testing correction. Default: “fdr”
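
To make the effect of the defaults concrete, the sketch below overlays a user configuration on the default values listed above. It is only an illustration: `DEFAULTS` and `with_defaults` are hypothetical names, and the way VariantCentrifuge merges values internally may differ.

```python
import json

# Defaults copied from the option list above.
DEFAULTS = {
    "interval_expand": 0,
    "add_chr": True,
    "debug_level": "INFO",
    "no_stats": False,
    "perform_gene_burden": False,
    "gene_burden_mode": "alleles",
    "correction_method": "fdr",
}


def with_defaults(user_config: dict) -> dict:
    """Overlay user-supplied values on the documented defaults."""
    return {**DEFAULTS, **user_config}


# A user config that overrides two optional keys and leaves the rest alone.
user = json.loads('{"reference": "GRCh37.75", "interval_expand": 1000, "add_chr": false}')
config = with_defaults(user)
```

Here `config["interval_expand"]` and `config["add_chr"]` come from the user file, while `config["debug_level"]` falls back to "INFO".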

IGV Integration¶

igv_enabled (bool) - Enable IGV.js integration. Default: false
bam_mapping_file (str) - Path to TSV/CSV file mapping sample IDs to BAM files (required if igv_enabled=true)
igv_reference (str) - Genome reference identifier for IGV (e.g., ‘hg19’, ‘hg38’)
igv_fasta (str) - Path to local FASTA file for IGV reports (takes precedence over igv_reference)
igv_ideogram (str) - Path to local ideogram file for IGV visualization
igv_flanking (int) - Flanking bases around variants for IGV. Default: 50

Variant Scoring¶

scoring_config_path (str) - Path to directory containing scoring configuration files (variable_assignment_config.json and formula_config.json)

Filter Presets¶

The configuration system supports predefined filter presets that can be combined and reused. Presets are defined in the presets section:

{
  "presets": {
    "rare": "(((dbNSFP_gnomAD_exomes_AF[0] < 0.0001) | (na dbNSFP_gnomAD_exomes_AC[0])) & ((dbNSFP_gnomAD_genomes_AF[0] < 0.0001) | (na dbNSFP_gnomAD_genomes_AC[0])))",
    "coding": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
    "pathogenic": "(((dbNSFP_clinvar_clnsig =~ '[Pp]athogenic') & !(dbNSFP_clinvar_clnsig =~ '[Cc]onflicting')) | ((ClinVar_CLNSIG =~ '[Pp]athogenic') & !(ClinVar_CLNSIG =~ '[Cc]onflicting')))"
  }
}

Using Presets¶

Presets can be used with the --preset command-line option:

# Single preset
variantcentrifuge --preset rare --gene-name BRCA1 --vcf-file input.vcf

# Multiple presets (combined with AND)
variantcentrifuge --preset rare,coding --gene-name BRCA1 --vcf-file input.vcf

# Combine presets with custom filters
variantcentrifuge --preset rare --filters "QUAL >= 100" --gene-name BRCA1 --vcf-file input.vcf
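
Conceptually, combining presets amounts to joining their expressions with the SnpSift AND operator. The sketch below shows one way this could be done. `build_filter` is a hypothetical helper, and the exact string VariantCentrifuge builds (parenthesization, ordering) is an assumption.

```python
from typing import Optional

# Preset text copied from the "presets" example above.
PRESETS = {
    "coding": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
}


def build_filter(preset_names: str, presets: dict, custom_filter: Optional[str] = None) -> str:
    """Join the named presets (and an optional custom filter) with AND."""
    parts = []
    for name in preset_names.split(","):
        name = name.strip()
        if name not in presets:
            raise KeyError(f"Unknown preset: {name}")
        # Wrap each expression so operator precedence cannot leak between parts.
        parts.append(f"({presets[name]})")
    if custom_filter:
        parts.append(f"({custom_filter})")
    return " & ".join(parts)


expression = build_filter("coding", PRESETS, custom_filter="QUAL >= 100")
```

The resulting `expression` is the coding preset and the custom filter, each wrapped in parentheses and joined by `&`.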

Built-in Presets¶

The default configuration ships with ready-made presets, including:

Rarity Filters¶

super_rare - AC ≤ 2 in gnomAD exomes and genomes
rare - AF < 0.0001 in gnomAD exomes and genomes
1percent - AF < 0.01 in gnomAD exomes and genomes
5percent - AF < 0.05 in gnomAD exomes and genomes

Impact Filters¶

high - High impact variants only
moderate - Moderate impact variants only
coding - High OR moderate impact variants
high_or_moderate_or_low - High, moderate, or low impact variants

Clinical Significance¶

pathogenic - ClinVar pathogenic variants
not_benign - Exclude ClinVar benign variants
pathogenic_or_rare - Pathogenic OR rare variants

Quality Filters¶

not_artefact - Quality-based artifact filtering
mutect2_TvsN_pass - Tumor-vs-normal Mutect2 filters, restricted to PASS variants

Example Configurations¶

Basic Research Configuration¶

{
  "reference": "GRCh37.75",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_gnomAD_exomes_AC GEN[*].GT",
  "interval_expand": 1000,
  "add_chr": false,
  "perform_gene_burden": true
}

Clinical Analysis Configuration¶

{
  "reference": "GRCh38.99",
  "filters": "((dbNSFP_gnomAD_exomes_AF[0] < 0.01) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score ClinVar_CLNSIG GEN[*].GT",
  "add_chr": true,
  "igv_enabled": true,
  "igv_reference": "hg38"
}

Somatic Variant Configuration¶

{
  "reference": "GRCh38.99", 
  "filters": "(GEN[0].AF < 0.03) & (GEN[1].AF >= 0.05) & (FILTER = 'PASS')",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P GEN[*].GT GEN[*].AF GEN[*].DP",
  "presets": {
    "somatic_quality": "(GEN[0].AF < 0.03) & (GEN[1].AF >= 0.05) & (GEN[*].DP >= 50)",
    "cosmic_or_rare": "(((dbNSFP_gnomAD_exomes_AC[0] <= 2) | (na dbNSFP_gnomAD_exomes_AC[0])) | (exists ID & ID =~ 'COS'))"
  }
}

External Database Links¶

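Configure links to external databases that will be added to HTML reports. The {CHROM}, {POS}, {REF}, and {ALT} placeholders are filled in for each variant. Note that the SpliceAI, Franklin, and Varsome templates in this example are pinned to hg19, so adjust them when working with GRCh38 data: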

{
  "links": {
    "SpliceAI": "https://spliceailookup.broadinstitute.org/#variant={CHROM}-{POS}-{REF}-{ALT}&hg=19",
    "Franklin": "https://franklin.genoox.com/clinical-db/variant/snp/{CHROM}-{POS}-{REF}-{ALT}-hg19",
    "Varsome": "https://varsome.com/variant/hg19/{CHROM}-{POS}-{REF}-{ALT}",
    "gnomAD": "https://gnomad.broadinstitute.org/variant/{CHROM}-{POS}-{REF}-{ALT}",
    "ClinVar": "https://www.ncbi.nlm.nih.gov/clinvar/?term={CHROM}-{POS}-{REF}-{ALT}"
  }
}
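
As an illustration of how such templates expand, the sketch below fills the placeholders for one variant with Python's `str.format`. `render_links` and the example variant are hypothetical, and whether VariantCentrifuge uses the same substitution mechanism internally is an assumption.

```python
# Two templates copied from the "links" example above.
LINKS = {
    "gnomAD": "https://gnomad.broadinstitute.org/variant/{CHROM}-{POS}-{REF}-{ALT}",
    "Varsome": "https://varsome.com/variant/hg19/{CHROM}-{POS}-{REF}-{ALT}",
}


def render_links(links: dict, variant: dict) -> dict:
    """Fill the {CHROM}, {POS}, {REF} and {ALT} placeholders for one variant."""
    return {name: template.format(**variant) for name, template in links.items()}


# Hypothetical variant, used only to show the substitution.
variant = {"CHROM": "17", "POS": 43045712, "REF": "T", "ALT": "C"}
urls = render_links(LINKS, variant)
```

With these inputs, `urls["gnomAD"]` ends in `/variant/17-43045712-T-C`.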

HTML Report Customization¶

Control which columns are hidden by default in HTML reports, and how wide columns are displayed before their content is truncated:

{
  "html_report_default_hidden_columns": [
    "QUAL", "AC", "FEATUREID", "AA_POS", "AA_LEN"
  ],
  "html_report_truncate_settings": {
    "default_max_width_px": 120,
    "columns_for_hover_expand": ["HGVS_P", "HGVS_C", "EFFECT"],
    "column_specific_max_widths_px": {
      "GT": 250
    }
  }
}
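
One way to read these settings is as a per-column lookup: a column-specific width overrides the default width, and the two lists act as simple membership flags. The sketch below encodes that reading. `column_display` is a hypothetical helper, and the semantics are inferred from the key names rather than taken from the report code.

```python
# Settings copied from the example above.
REPORT_CONFIG = {
    "html_report_default_hidden_columns": ["QUAL", "AC", "FEATUREID", "AA_POS", "AA_LEN"],
    "html_report_truncate_settings": {
        "default_max_width_px": 120,
        "columns_for_hover_expand": ["HGVS_P", "HGVS_C", "EFFECT"],
        "column_specific_max_widths_px": {"GT": 250},
    },
}


def column_display(column: str, config: dict) -> dict:
    """Describe how one column would be shown under the settings above."""
    truncate = config.get("html_report_truncate_settings", {})
    widths = truncate.get("column_specific_max_widths_px", {})
    return {
        "hidden": column in config.get("html_report_default_hidden_columns", []),
        # A column-specific width takes priority over the default width.
        "max_width_px": widths.get(column, truncate.get("default_max_width_px")),
        "hover_expand": column in truncate.get("columns_for_hover_expand", []),
    }
```

Under this reading, GT gets a 250 px limit, HGVS_P keeps the 120 px default but expands on hover, and QUAL starts hidden.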

Configuration Validation¶

VariantCentrifuge validates your configuration and reports the following problems:

Missing required keys (reference, filters, fields_to_extract) trigger clear error messages
Invalid preset references are caught at startup
Filter expression syntax is validated when possible
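
The first two checks, plus the IGV requirement described earlier, can be sketched as follows. This is a minimal illustration under stated assumptions: `validate_config` is a hypothetical function, it expects the configuration after command-line values have been merged in, and the message wording is invented for the example.

```python
REQUIRED_KEYS = ("reference", "filters", "fields_to_extract")


def validate_config(config: dict, requested_presets=()) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Required keys must be present after merging file and command-line values.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"Missing required key: {key}")
    # Every preset requested via --preset must be defined in the presets section.
    known = config.get("presets", {})
    for name in requested_presets:
        if name not in known:
            problems.append(f"Unknown preset: {name}")
    # IGV integration needs a sample-to-BAM mapping file.
    if config.get("igv_enabled") and not config.get("bam_mapping_file"):
        problems.append("igv_enabled is true but bam_mapping_file is not set")
    return problems
```

Collecting all problems in a list, rather than stopping at the first one, lets a user fix a configuration in a single pass.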

Variant Scoring Configuration¶

VariantCentrifuge supports custom variant scoring through configuration files. The scoring system requires two JSON files in the same directory, the one given as scoring_config_path:

Variable Assignment Configuration¶

variable_assignment_config.json maps VCF annotation fields to formula variables:

{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "gnomade_variant|default:0.0",
    "dbNSFP_gnomAD_genomes_AF": "gnomadg_variant|default:0.0",
    "dbNSFP_CADD_phred": "cadd_phred_variant|default:0.0",
    "ANN[0].EFFECT": "consequence_terms_variant|default:''",
    "ANN[0].IMPACT": "impact_variant|default:''"
  }
}

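Each mapping specifies:

Key: The column name in your VCF data (for example, dbNSFP_CADD_phred)
Value: The variable name used in formulas, optionally followed by |default:<value> to supply a default value (for example, cadd_phred_variant|default:0.0)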
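
The value syntax can be illustrated with a small parser. The sketch below is an assumption about how such strings might be split, not the scoring module's actual code: `parse_assignment` is a hypothetical name, and converting numeric defaults to float is a choice made for the example.

```python
def parse_assignment(spec: str):
    """Split 'name|default:value' into (name, default); default is None if absent."""
    name, _, rest = spec.partition("|")
    default = None
    if rest.startswith("default:"):
        raw = rest[len("default:"):]
        try:
            # Numeric defaults such as 0.0 become floats.
            default = float(raw)
        except ValueError:
            # Anything else is treated as a string; surrounding quotes are dropped.
            default = raw.strip("'\"")
    return name, default


numeric = parse_assignment("cadd_phred_variant|default:0.0")
text = parse_assignment("impact_variant|default:''")
```

Here `numeric` is `("cadd_phred_variant", 0.0)` and `text` is `("impact_variant", "")`.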

Formula Configuration¶

formula_config.json contains the scoring formulas:

{
  "formulas": [
    {
      "score_name": "formula_expression"
    }
  ]
}

Formulas are written in pandas eval syntax, refer to the variable names defined in variable_assignment_config.json, and can include:

Mathematical operations
Conditional logic ((condition) * value)
String operations (.str.contains())
Any valid pandas expression

Example: Nephro Variant Score¶

A complete scoring configuration for kidney disease variants:

variable_assignment_config.json:
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "gnomade_variant|default:0.0",
    "dbNSFP_gnomAD_genomes_AF": "gnomadg_variant|default:0.0",
    "dbNSFP_CADD_phred": "cadd_phred_variant|default:0.0",
    "ANN[0].EFFECT": "consequence_terms_variant|default:''",
    "ANN[0].IMPACT": "impact_variant|default:''"
  }
}

formula_config.json:
{
  "formulas": [
    {
      "nephro_variant_score": "1 / (1 + 2.718281828459045 ** (-((-36.30796) + ((gnomade_variant - 0.00658) / 0.05959) * (-309.33539) + ((gnomadg_variant - 0.02425) / 0.11003) * (-2.54581) + (((consequence_terms_variant == 'missense_variant') * 1.0 - 0.24333) / 0.42909) * (-1.14313) + ((cadd_phred_variant - 12.47608) / 11.78359) * 2.68520 + ((((impact_variant == 'HIGH') * 4 + (impact_variant == 'MODERATE') * 3 + (impact_variant == 'LOW') * 2 + (impact_variant == 'MODIFIER') * 1) - 2.49999) / 1.11804) * 3.14822)))"
    }
  ]
}
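
The formula above is a logistic function applied to a weighted sum of rescaled inputs. To make that structure easier to follow, the sketch below restates it for a single variant in plain Python. It is a reading aid rather than the scoring module's implementation: the function and `IMPACT_RANK` are hypothetical names, `math.exp` stands in for the explicit `2.718281828459045 **`, and the example input values are made up.

```python
import math

# Numeric rank assigned to each impact level in the formula.
IMPACT_RANK = {"HIGH": 4, "MODERATE": 3, "LOW": 2, "MODIFIER": 1}


def nephro_variant_score(gnomade, gnomadg, consequence, cadd_phred, impact):
    """Scalar re-statement of the nephro_variant_score formula above."""
    # (condition) * 1.0 in the formula: 1.0 for missense variants, else 0.0.
    is_missense = 1.0 if consequence == "missense_variant" else 0.0
    # An impact value matching none of the four levels contributes 0.
    impact_rank = IMPACT_RANK.get(impact, 0)
    # Each term is (value - offset) / scale * coefficient, as in the formula.
    linear = (
        -36.30796
        + (gnomade - 0.00658) / 0.05959 * -309.33539
        + (gnomadg - 0.02425) / 0.11003 * -2.54581
        + (is_missense - 0.24333) / 0.42909 * -1.14313
        + (cadd_phred - 12.47608) / 11.78359 * 2.68520
        + (impact_rank - 2.49999) / 1.11804 * 3.14822
    )
    # Logistic transform maps the linear term into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical variant absent from gnomAD, missense, CADD 25, MODERATE impact.
score = nephro_variant_score(0.0, 0.0, "missense_variant", 25.0, "MODERATE")
```

Because the gnomAD coefficients are negative and the CADD coefficient is positive, the score falls as population frequency rises and increases with CADD.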

Using Scoring¶

To apply scoring to your analysis:

variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --scoring-config-path /path/to/scoring/config/dir \
  --output-file scored_variants.tsv

The scoring module will:

Load the configuration files from the specified directory
Map VCF columns to formula variables
Apply the formula to calculate scores
Add score columns to the output
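
The first of these steps can be sketched with the standard library alone. The example below only shows reading the two files named earlier from one directory. `load_scoring_config` is a hypothetical helper, not the module's API, and it performs no validation beyond what `json.load` and the key lookups enforce.

```python
import json
from pathlib import Path


def load_scoring_config(config_dir):
    """Read both scoring files from one directory and return (variables, formulas)."""
    config_dir = Path(config_dir)
    # Column-to-variable mappings, keyed by VCF column name.
    with open(config_dir / "variable_assignment_config.json") as handle:
        variables = json.load(handle)["variables"]
    # List of {score_name: formula_expression} objects.
    with open(config_dir / "formula_config.json") as handle:
        formulas = json.load(handle)["formulas"]
    return variables, formulas
```

A missing file raises `FileNotFoundError`, which is a convenient early signal that the directory passed to --scoring-config-path is incomplete.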

Best Practices¶

Use version control - Store your configuration files in version control alongside your analysis scripts
Environment-specific configs - Use different configurations for development, testing, and production
Document custom presets - Explain complex filter logic alongside the configuration (for example, in a README), since standard JSON has no comment syntax
Test filters - Validate filter expressions on small datasets before large analyses
Modular approach - Use presets to build complex filters from simpler components
Test scoring formulas - Validate scoring on known variants before production use