Performance Optimization¶
Tips for optimizing VariantCentrifuge performance with large datasets.
Hardware Recommendations¶
Memory: 16GB+ RAM for large VCF files
Storage: SSD for intermediate files
CPU: Multi-core for parallel processing
Performance Features¶
bcftools Pre-filtering¶
The --bcftools-prefilter option allows you to apply fast bcftools filters during the initial variant extraction step, significantly reducing the amount of data processed by subsequent steps:
# Filter out low-quality and common variants early
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file large_cohort.vcf.gz \
  --bcftools-prefilter 'FILTER="PASS" && QUAL>20 && INFO/AC<10' \
  --preset rare,coding \
  --output-file results.tsv
Benefits:
Reduces memory usage by filtering early in the pipeline
Speeds up SnpSift filtering and field extraction
Particularly effective for large cohort VCFs
bcftools Filter Syntax:
FILTER="PASS" - Only PASS variants
QUAL>20 - Quality score threshold
INFO/AC<10 - Allele count less than 10
INFO/AF<0.01 - Allele frequency less than 1%
Combine conditions with && (AND) or || (OR)
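Before committing to a long pipeline run, it can help to preview how many variants a given expression keeps. A quick check with plain bcftools, reusing the same expression via its -i/--include option (a sketch; large_cohort.vcf.gz stands in for your input):
# Count variants that would survive the pre-filter (-H suppresses the header)
bcftools view -H -i 'FILTER="PASS" && QUAL>20 && INFO/AC<10' large_cohort.vcf.gz | wc -l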
Optimized Filtering Workflow¶
For maximum performance with large datasets:
# Combine bcftools pre-filter with late filtering
variantcentrifuge \
  --gene-file large_gene_list.txt \
  --vcf-file cohort_1000_samples.vcf.gz \
  --bcftools-prefilter 'FILTER="PASS" && INFO/AC<20' \
  --late-filtering \
  --preset rare,coding \
  --scoring-config-path scoring/my_scores \
  --filters "score > 0.5" \
  --output-file high_score_rare.tsv
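To confirm the pre-filter is paying off on your data, a simple before/after timing run is usually enough. A sketch, reusing the flags from the example above (baseline.tsv and prefiltered.tsv are hypothetical output names):
# Baseline run without the pre-filter
time variantcentrifuge \
  --gene-file large_gene_list.txt \
  --vcf-file cohort_1000_samples.vcf.gz \
  --preset rare,coding \
  --output-file baseline.tsv
# Same run with early bcftools filtering
time variantcentrifuge \
  --gene-file large_gene_list.txt \
  --vcf-file cohort_1000_samples.vcf.gz \
  --bcftools-prefilter 'FILTER="PASS" && INFO/AC<20' \
  --preset rare,coding \
  --output-file prefiltered.tsv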
Large Dataset Strategies¶
Chromosome-based Processing¶
# Split by chromosome (launches one background job per chromosome)
for chr in {1..22} X Y; do
  # -r chr${chr} assumes 'chr'-prefixed contig names; use -r ${chr} for b37-style VCFs
  bcftools view -r chr${chr} large_cohort.vcf.gz | \
    variantcentrifuge \
      --gene-file genes.txt \
      --vcf-file /dev/stdin \
      --output-file chr${chr}_results.tsv &
done
wait
# Merge results, keeping the header line from the first file only
head -n 1 chr1_results.tsv > combined_results.tsv
# tail -q suppresses per-file name banners (GNU coreutils)
tail -q -n +2 chr*_results.tsv >> combined_results.tsv
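Launching all 24 chromosome jobs at once can exceed the memory budget of a single machine. One alternative is to cap concurrency with xargs -P (a sketch, assuming GNU xargs; adjust -P 4 to your core and RAM budget):
# Process chromosomes four at a time instead of all at once
printf '%s\n' {1..22} X Y | xargs -P 4 -I{} sh -c '
  bcftools view -r chr{} large_cohort.vcf.gz |
    variantcentrifuge \
      --gene-file genes.txt \
      --vcf-file /dev/stdin \
      --output-file chr{}_results.tsv
'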
Memory Management¶
# Increase Java heap size
export JAVA_OPTS="-Xmx16g"
# Use streaming for large files
bcftools view large.vcf.gz | variantcentrifuge --vcf-file /dev/stdin ...
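If you are unsure how much RAM a run actually needs, GNU time (the /usr/bin/time binary, not the shell builtin) reports peak resident memory on Linux (macOS uses -l instead of -v). A quick check might look like this; the flags shown mirror the examples above:
# Record runtime statistics, including maximum resident set size
/usr/bin/time -v variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file large_cohort.vcf.gz \
  --output-file results.tsv 2> runtime_stats.txt
grep 'Maximum resident set size' runtime_stats.txt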