# Configuration

VariantCentrifuge uses a JSON configuration file (`config.json`) to set default parameters and define reusable filter presets. This allows for flexible, reproducible analysis workflows.

## Configuration File Location

VariantCentrifuge looks for configuration files in the following order:

1. File specified with `--config` command-line option
2. `config.json` in the current working directory
3. Default configuration included with the package

## Configuration Structure

### Required Keys

These parameters must be provided either in the config file or via command-line arguments:

- **`reference`** (`str`) - Reference genome database for snpEff (e.g., "GRCh37.75", "GRCh38.99")
- **`filters`** (`str`) - SnpSift filter expression to select variants
- **`fields_to_extract`** (`str`) - Space-separated list of fields to extract via SnpSift

### Optional Keys

Configuration options with default values:

- **`interval_expand`** (`int`) - Number of bases to expand around genes. *Default: 0*
- **`add_chr`** (`bool`) - Add "chr" prefix to chromosome names. *Default: true*
- **`debug_level`** (`str`) - Logging level: "DEBUG", "INFO", "WARN", "ERROR". *Default: "INFO"*
- **`no_stats`** (`bool`) - Skip statistics computation. *Default: false*
- **`perform_gene_burden`** (`bool`) - Perform gene burden analysis. *Default: false*
- **`gene_burden_mode`** (`str`) - "samples" or "alleles". *Default: "alleles"*
- **`correction_method`** (`str`) - "fdr" or "bonferroni" for multiple testing correction. *Default: "fdr"*

### IGV Integration

- **`igv_enabled`** (`bool`) - Enable IGV.js integration. *Default: false*
- **`bam_mapping_file`** (`str`) - Path to TSV/CSV file mapping sample IDs to BAM files (required if igv_enabled=true)
- **`igv_reference`** (`str`) - Genome reference identifier for IGV (e.g., 'hg19', 'hg38')
- **`igv_fasta`** (`str`) - Path to local FASTA file for IGV reports (takes precedence over igv_reference)
- **`igv_ideogram`** (`str`) - Path to local ideogram file for IGV visualization
- **`igv_flanking`** (`int`) - Flanking bases around variants for IGV. *Default: 50*

### Variant Scoring

- **`scoring_config_path`** (`str`) - Path to directory containing scoring configuration files (variable_assignment_config.json and formula_config.json)

## Filter Presets

The configuration system supports predefined filter presets that can be combined and reused. Presets are defined in the `presets` section:

```json
{
  "presets": {
    "rare": "(((dbNSFP_gnomAD_exomes_AF[0] < 0.0001) | (na dbNSFP_gnomAD_exomes_AC[0])) & ((dbNSFP_gnomAD_genomes_AF[0] < 0.0001) | (na dbNSFP_gnomAD_genomes_AC[0])))",
    "coding": "((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))",
    "pathogenic": "(((dbNSFP_clinvar_clnsig =~ '[Pp]athogenic') & !(dbNSFP_clinvar_clnsig =~ '[Cc]onflicting')) | ((ClinVar_CLNSIG =~ '[Pp]athogenic') & !(ClinVar_CLNSIG =~ '[Cc]onflicting')))"
  }
}
```

### Using Presets

Presets can be used with the `--preset` command-line option:

```bash
# Single preset
variantcentrifuge --preset rare --gene-name BRCA1 --vcf-file input.vcf

# Multiple presets (combined with AND)
variantcentrifuge --preset rare,coding --gene-name BRCA1 --vcf-file input.vcf

# Combine presets with custom filters
variantcentrifuge --preset rare --filters "QUAL >= 100" --gene-name BRCA1 --vcf-file input.vcf
```

### Built-in Presets

The default configuration includes many useful presets:

#### Rarity Filters
- **`super_rare`** - AC ≤ 2 in gnomAD exomes and genomes
- **`rare`** - AF < 0.0001 in gnomAD exomes and genomes  
- **`1percent`** - AF < 0.001 in gnomAD exomes and genomes
- **`5percent`** - AF < 0.05 in gnomAD exomes and genomes

#### Impact Filters
- **`high`** - High impact variants only
- **`moderate`** - Moderate impact variants only
- **`coding`** - High OR moderate impact variants
- **`high_or_moderate_or_low`** - Any protein-coding impact

#### Clinical Significance
- **`pathogenic`** - ClinVar pathogenic variants
- **`not_benign`** - Exclude ClinVar benign variants
- **`pathogenic_or_rare`** - Pathogenic OR rare variants

#### Quality Filters
- **`not_artefact`** - Quality-based artifact filtering
- **`mutect2_TvsN_pass`** - Tumor vs Normal Mutect2 filters with PASS only

## Example Configurations

### Basic Research Configuration

```json
{
  "reference": "GRCh37.75",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_gnomAD_exomes_AC GEN[*].GT",
  "interval_expand": 1000,
  "add_chr": false,
  "perform_gene_burden": true
}
```

### Clinical Analysis Configuration

```json
{
  "reference": "GRCh38.99",
  "filters": "((dbNSFP_gnomAD_exomes_AF[0] < 0.01) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_CADD_phred dbNSFP_REVEL_score ClinVar_CLNSIG GEN[*].GT",
  "add_chr": true,
  "igv_enabled": true,
  "igv_reference": "hg38"
}
```

### Somatic Variant Configuration

```json
{
  "reference": "GRCh38.99", 
  "filters": "(GEN[0].AF < 0.03) & (GEN[1].AF >= 0.05) & (FILTER = 'PASS')",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P GEN[*].GT GEN[*].AF GEN[*].DP",
  "presets": {
    "somatic_quality": "(GEN[0].AF < 0.03) & (GEN[1].AF >= 0.05) & (GEN[*].DP >= 50)",
    "cosmic_or_rare": "(((dbNSFP_gnomAD_exomes_AC[0] <= 2) | (na dbNSFP_gnomAD_exomes_AC[0])) | (exists ID & ID =~ 'COS'))"
  }
}
```

## External Database Links

Configure links to external databases that will be added to HTML reports:

```json
{
  "links": {
    "SpliceAI": "https://spliceailookup.broadinstitute.org/#variant={CHROM}-{POS}-{REF}-{ALT}&hg=19",
    "Franklin": "https://franklin.genoox.com/clinical-db/variant/snp/{CHROM}-{POS}-{REF}-{ALT}-hg19",
    "Varsome": "https://varsome.com/variant/hg19/{CHROM}-{POS}-{REF}-{ALT}",
    "gnomAD": "https://gnomad.broadinstitute.org/variant/{CHROM}-{POS}-{REF}-{ALT}",
    "ClinVar": "https://www.ncbi.nlm.nih.gov/clinvar/?term={CHROM}-{POS}-{REF}-{ALT}"
  }
}
```

## HTML Report Customization

Control which columns are hidden by default in HTML reports:

```json
{
  "html_report_default_hidden_columns": [
    "QUAL", "AC", "FEATUREID", "AA_POS", "AA_LEN"
  ],
  "html_report_truncate_settings": {
    "default_max_width_px": 120,
    "columns_for_hover_expand": ["HGVS_P", "HGVS_C", "EFFECT"],
    "column_specific_max_widths_px": {
      "GT": 250
    }
  }
}
```

## Configuration Validation

VariantCentrifuge validates your configuration and provides helpful error messages:

- Missing required keys will trigger clear error messages
- Invalid preset references will be caught at startup
- Filter expression syntax is validated when possible

## Variant Scoring Configuration

VariantCentrifuge supports custom variant scoring through configuration files. The scoring system requires two JSON files in a directory:

### Variable Assignment Configuration

`variable_assignment_config.json` maps VCF annotation fields to formula variables:

```json
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "gnomade_variant|default:0.0",
    "dbNSFP_gnomAD_genomes_AF": "gnomadg_variant|default:0.0",
    "dbNSFP_CADD_phred": "cadd_phred_variant|default:0.0",
    "ANN[0].EFFECT": "consequence_terms_variant|default:''",
    "ANN[0].IMPACT": "impact_variant|default:''"
  }
}
```

Each mapping specifies:
- **Key**: The column name in your VCF data
- **Value**: The variable name for the formula, with optional default value

### Formula Configuration

`formula_config.json` contains the scoring formulas:

```json
{
  "formulas": [
    {
      "score_name": "formula_expression"
    }
  ]
}
```

Formulas use pandas eval syntax and can include:
- Mathematical operations
- Conditional logic (`(condition) * value`)
- String operations (`.str.contains()`)
- Any valid pandas expression

### Example: Nephro Variant Score

A complete scoring configuration for kidney disease variants:

```json
// variable_assignment_config.json
{
  "variables": {
    "dbNSFP_gnomAD_exomes_AF": "gnomade_variant|default:0.0",
    "dbNSFP_gnomAD_genomes_AF": "gnomadg_variant|default:0.0",
    "dbNSFP_CADD_phred": "cadd_phred_variant|default:0.0",
    "ANN[0].EFFECT": "consequence_terms_variant|default:''",
    "ANN[0].IMPACT": "impact_variant|default:''"
  }
}

// formula_config.json
{
  "formulas": [
    {
      "nephro_variant_score": "1 / (1 + 2.718281828459045 ** (-((-36.30796) + ((gnomade_variant - 0.00658) / 0.05959) * (-309.33539) + ((gnomadg_variant - 0.02425) / 0.11003) * (-2.54581) + (((consequence_terms_variant == 'missense_variant') * 1.0 - 0.24333) / 0.42909) * (-1.14313) + ((cadd_phred_variant - 12.47608) / 11.78359) * 2.68520 + ((((impact_variant == 'HIGH') * 4 + (impact_variant == 'MODERATE') * 3 + (impact_variant == 'LOW') * 2 + (impact_variant == 'MODIFIER') * 1) - 2.49999) / 1.11804) * 3.14822)))"
    }
  ]
}
```

### Using Scoring

To apply scoring to your analysis:

```bash
variantcentrifuge \
  --gene-name GENE \
  --vcf-file input.vcf.gz \
  --scoring-config-path /path/to/scoring/config/dir \
  --output-file scored_variants.tsv
```

The scoring module will:
1. Load the configuration files from the specified directory
2. Map VCF columns to formula variables
3. Apply the formula to calculate scores
4. Add score columns to the output

## Best Practices

1. **Use version control** - Store your configuration files in version control alongside your analysis scripts
2. **Environment-specific configs** - Use different configurations for development, testing, and production
3. **Document custom presets** - Add comments explaining complex filter logic
4. **Test filters** - Validate filter expressions on small datasets before large analyses
5. **Modular approach** - Use presets to build complex filters from simpler components
6. **Test scoring formulas** - Validate scoring on known variants before production use