Performance Optimization¶
Tips for optimizing VariantCentrifuge performance with large datasets.
Hardware Recommendations¶
- Memory: 16GB+ RAM for large VCF files
- Storage: SSD for intermediate files
- CPU: multi-core for parallel processing
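Before launching a large analysis, it is worth confirming the machine actually meets these recommendations. A quick sketch using standard Linux tools (paths shown are illustrative):

# Check available RAM, CPU cores, and free disk space
free -h          # total and available memory
nproc            # CPU cores usable for --threads
df -h /tmp .     # free space for intermediate and output files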
Performance Features¶
Checkpoint and Resume System¶
The checkpoint system provides significant performance benefits for long-running analyses:
# Enable checkpoints for long-running analyses
variantcentrifuge \
    --gene-file large_gene_list.txt \
    --vcf-file cohort_1000_samples.vcf.gz \
    --enable-checkpoint \
    --threads 16 \
    --preset rare,coding \
    --output-file results.tsv
# Resume if interrupted
variantcentrifuge \
    --gene-file large_gene_list.txt \
    --vcf-file cohort_1000_samples.vcf.gz \
    --enable-checkpoint \
    --resume \
    --threads 16 \
    --preset rare,coding \
    --output-file results.tsv
Benefits:
- Avoid re-running expensive stages after interruptions
- Develop iteratively by resuming from specific stages
- Optimize report generation without reprocessing data
- Reduce computational waste in cluster environments
See the Resume System documentation for detailed information.
bcftools Pre-filtering¶
The --bcftools-prefilter option allows you to apply fast bcftools filters during the initial variant extraction step, significantly reducing the amount of data processed by subsequent steps:
# Filter out low-quality and common variants early
variantcentrifuge \
    --gene-file genes.txt \
    --vcf-file large_cohort.vcf.gz \
    --bcftools-prefilter 'FILTER="PASS" && QUAL>20 && INFO/AC<10' \
    --preset rare,coding \
    --output-file results.tsv
Benefits:
- Reduces memory usage by filtering early in the pipeline
- Speeds up SnpSift filtering and field extraction
- Particularly effective for large cohort VCFs
bcftools Filter Syntax:
- FILTER="PASS" - only PASS variants
- QUAL>20 - quality score threshold
- INFO/AC<10 - allele count less than 10
- INFO/AF<0.01 - allele frequency less than 1%
- Combine conditions with && (AND) or || (OR)
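Before committing to a full run, you can sanity-check a prefilter expression by applying it with bcftools directly and counting surviving records. This is a quick sketch using plain bcftools, not a VariantCentrifuge option:

# Compare record counts with and without the prefilter expression
bcftools view -H large_cohort.vcf.gz | wc -l
bcftools view -H -i 'FILTER="PASS" && QUAL>20 && INFO/AC<10' large_cohort.vcf.gz | wc -l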
Optimized Filtering Workflow¶
For maximum performance with large datasets:
# Combine bcftools pre-filter with late filtering
variantcentrifuge \
    --gene-file large_gene_list.txt \
    --vcf-file cohort_1000_samples.vcf.gz \
    --bcftools-prefilter 'FILTER="PASS" && INFO/AC<20' \
    --late-filtering \
    --preset rare,coding \
    --scoring-config-path scoring/my_scores \
    --filters "score > 0.5" \
    --output-file high_score_rare.tsv
Large Dataset Strategies¶
Chromosome-based Processing¶
# Split by chromosome (adjust the "chr" prefix to match your reference naming)
for chr in {1..22} X Y; do
    bcftools view -r chr${chr} large_cohort.vcf.gz | \
    variantcentrifuge \
        --gene-file genes.txt \
        --vcf-file /dev/stdin \
        --output-file chr${chr}_results.tsv &
done
wait
# Merge results, keeping only the first file's header line
awk 'FNR==1 && NR!=1 {next} {print}' chr*_results.tsv > combined_results.tsv
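The loop above launches all chromosomes at once, which can oversubscribe a shared machine. One way to cap concurrency is xargs -P; a sketch under the same file-naming assumptions as above:

# Process at most 4 chromosomes at a time
printf '%s\n' {1..22} X Y | xargs -P 4 -I{} sh -c '
    bcftools view -r chr{} large_cohort.vcf.gz |
    variantcentrifuge \
        --gene-file genes.txt \
        --vcf-file /dev/stdin \
        --output-file chr{}_results.tsv
'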
Memory Management¶
# Increase Java heap size
export JAVA_OPTS="-Xmx16g"
# Use streaming for large files
bcftools view large.vcf.gz | variantcentrifuge --vcf-file /dev/stdin ...
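To see how much heap a run actually needs before tuning -Xmx, GNU time can report the peak memory use of a pilot run. A minimal sketch (time_report.txt is just an arbitrary file name):

# Record peak memory usage of a pilot run
/usr/bin/time -v variantcentrifuge \
    --gene-file genes.txt \
    --vcf-file input.vcf.gz \
    --output-file pilot.tsv 2> time_report.txt
grep 'Maximum resident set size' time_report.txt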
Storage Optimization¶
Archive Results¶
For easy storage and transfer of analysis results, use the --archive-results option to create a compressed archive:
# Create timestamped archive after analysis
variantcentrifuge \
    --gene-file genes.txt \
    --vcf-file input.vcf.gz \
    --html-report \
    --xlsx \
    --archive-results \
    --output-file results.tsv
# Result: variantcentrifuge_results_input_20241217_143052.tar.gz
Benefits:
- Automatic compression reduces storage requirements by 50-80%
- Timestamped archives for version tracking
- Single file for easy transfer and backup
- Archive placed outside the output directory to avoid recursion
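The archive is a standard gzipped tarball, so its contents can be listed or restored with plain tar (the file name below is the example output from above):

# List archive contents without extracting
tar -tzf variantcentrifuge_results_input_20241217_143052.tar.gz
# Extract into a separate directory
mkdir -p restored_results
tar -xzf variantcentrifuge_results_input_20241217_143052.tar.gz -C restored_results/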
Intermediate File Management¶
# Delete intermediate files automatically (default behavior)
variantcentrifuge ... # intermediates deleted after success
# Keep intermediates for debugging
variantcentrifuge --keep-intermediates ...
# Combine with archiving for a complete workflow:
# archive everything, then intermediates are cleaned up after success
variantcentrifuge \
    --gene-file genes.txt \
    --vcf-file input.vcf.gz \
    --archive-results \
    --output-file results.tsv
Checkpoint and Resume¶
VariantCentrifuge includes a checkpoint system that tracks pipeline progress and allows resumption after interruptions. This is particularly useful for long-running analyses on large datasets.
Basic Checkpoint Usage¶
Enable checkpointing with the --enable-checkpoint flag:
# Run with checkpoint tracking
variantcentrifuge \
    --gene-file large_gene_list.txt \
    --vcf-file large_cohort.vcf.gz \
    --enable-checkpoint \
    --threads 8 \
    --output-file results.tsv
If the pipeline is interrupted (e.g., system crash, job timeout), resume from the last checkpoint:
# Resume from checkpoint
variantcentrifuge \
    --gene-file large_gene_list.txt \
    --vcf-file large_cohort.vcf.gz \
    --enable-checkpoint \
    --resume \
    --threads 8 \
    --output-file results.tsv
How Checkpointing Works¶
The checkpoint system:
- Tracks each major pipeline step (gene BED creation, variant extraction, filtering, etc.)
- Saves state to .variantcentrifuge_state.json in the output directory
- Records input/output files, parameters, and completion status for each step
- Validates that the configuration hasn’t changed between runs
- Skips already-completed steps when resuming
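The state file is plain JSON, so it can be pretty-printed for a quick look at which steps have completed. Its exact schema is internal to VariantCentrifuge, so treat it as read-only (output_dir stands in for your run's output directory):

# Inspect the checkpoint state (do not edit by hand)
python -m json.tool output_dir/.variantcentrifuge_state.json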
Checkpoint Features¶
Parallel Processing Support¶
Checkpoints work seamlessly with parallel processing (--threads):
- Tracks individual chunk processing in parallel runs
- Properly handles the merge step after parallel processing
- Maintains thread-safe state updates
File Integrity Checking¶
Optional file checksum validation ensures data integrity:
# Enable checksum validation (slower but more reliable)
variantcentrifuge \
    --gene-file genes.txt \
    --vcf-file input.vcf.gz \
    --enable-checkpoint \
    --checkpoint-checksum \
    --output-file results.tsv
Status Inspection¶
Check the status of a previous run without resuming:
# Show checkpoint status
variantcentrifuge \
    --show-checkpoint-status \
    --output-dir previous_analysis/
Tracked Pipeline Steps¶
The checkpoint system tracks these major steps:
1. Gene BED creation - converting gene names to genomic regions
2. Parallel processing (if using --threads):
   - BED file splitting
   - Individual chunk processing
   - Chunk merging
3. TSV sorting - sorting extracted variants by gene
4. Genotype replacement - converting genotypes to sample IDs
5. Phenotype integration - adding phenotype data
6. Variant analysis - computing statistics and scores
7. Final output - writing results and reports
Best Practices¶
Use for long-running analyses: checkpointing is particularly beneficial for:
- Large cohort VCFs (>100 samples)
- Extensive gene lists (>100 genes)
- Complex scoring and annotation pipelines

Combine with job schedulers: checkpointing is ideal for HPC environments:
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=32G

# --resume is safe to use even on the first run
variantcentrifuge \
    --gene-file all_genes.txt \
    --vcf-file cohort.vcf.gz \
    --enable-checkpoint \
    --resume \
    --threads 16 \
    --output-file results.tsv
Monitor progress: The log shows which steps are skipped:
INFO: Skipping gene BED creation (already completed)
INFO: Skipping chunk merging (already completed)
INFO: Resuming from genotype replacement...
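If the run's output is captured to a file, those messages are easy to pull out with grep; a sketch, assuming the log was redirected to pipeline.log (the file name is your choice, not a fixed convention):

# Summarize skipped and resumed steps from a captured log
grep -E 'Skipping|Resuming' pipeline.log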
Limitations¶
- Configuration must remain identical between runs (same filters, fields, etc.)
- Pipeline version must match (no resume across VariantCentrifuge updates)
- Intermediate files must not be manually modified
- Output directory structure must remain intact
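Given these constraints, a quick sanity check before resuming can save a failed attempt. A minimal sketch, assuming the earlier run wrote to previous_analysis/:

# Confirm the state file survives, then review checkpoint status
test -f previous_analysis/.variantcentrifuge_state.json \
    && echo "state file present" \
    || echo "no state file - cannot resume"
variantcentrifuge \
    --show-checkpoint-status \
    --output-dir previous_analysis/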
Troubleshooting Checkpoint Issues¶
“Configuration has changed, cannot resume”
- Ensure all command-line arguments match the original run
- Check that config files haven’t been modified

“Output file validation failed”
- An intermediate file may have been corrupted or deleted
- Remove .variantcentrifuge_state.json to start fresh
- Use --checkpoint-checksum for better validation

“Pipeline version mismatch”
- VariantCentrifuge was updated between runs
- Complete the analysis with the original version or start fresh
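When none of these fixes apply, the fallback noted above is to start fresh: remove the saved state and rerun without --resume.

# Start fresh: drop the saved state, then rerun from scratch
rm output_dir/.variantcentrifuge_state.json
variantcentrifuge \
    --gene-file genes.txt \
    --vcf-file input.vcf.gz \
    --enable-checkpoint \
    --output-file results.tsv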