Performance Optimization

Tips for optimizing VariantCentrifuge performance with large datasets.

Hardware Recommendations

  • Memory: 16GB+ RAM for large VCF files

  • Storage: SSD for intermediate files

  • CPU: Multi-core for parallel processing

Performance Features

bcftools Pre-filtering

The --bcftools-prefilter option allows you to apply fast bcftools filters during the initial variant extraction step, significantly reducing the amount of data processed by subsequent steps:

# Filter out low-quality and common variants early
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file large_cohort.vcf.gz \
  --bcftools-prefilter 'FILTER="PASS" && QUAL>20 && INFO/AC<10' \
  --preset rare,coding \
  --output-file results.tsv

Benefits:

  • Reduces memory usage by filtering early in the pipeline

  • Speeds up SnpSift filtering and field extraction

  • Particularly effective for large cohort VCFs

bcftools Filter Syntax:

  • FILTER="PASS" - Only PASS variants

  • QUAL>20 - Quality score threshold

  • INFO/AC<10 - Allele count less than 10

  • INFO/AF<0.01 - Allele frequency less than 1%

  • Combine with && (AND) or || (OR)

Optimized Filtering Workflow

For maximum performance with large datasets:

# Combine bcftools pre-filter with late filtering
variantcentrifuge \
  --gene-file large_gene_list.txt \
  --vcf-file cohort_1000_samples.vcf.gz \
  --bcftools-prefilter 'FILTER="PASS" && INFO/AC<20' \
  --late-filtering \
  --preset rare,coding \
  --scoring-config-path scoring/my_scores \
  --filters "score > 0.5" \
  --output-file high_score_rare.tsv

Large Dataset Strategies

Chromosome-based Processing

# Split by chromosome
for chr in {1..22} X Y; do
    bcftools view -r chr${chr} large_cohort.vcf.gz | \
    variantcentrifuge \
        --gene-file genes.txt \
        --vcf-file /dev/stdin \
        --output-file chr${chr}_results.tsv &
done
wait

# Merge results
cat chr*_results.tsv > combined_results.tsv

Memory Management

# Increase Java heap size
export JAVA_OPTS="-Xmx16g"

# Use streaming for large files
bcftools view large.vcf.gz | variantcentrifuge --vcf-file /dev/stdin ...