# Usage Guide

## Basic Usage

The most basic command to run VariantCentrifuge:

```bash
variantcentrifuge \
  --gene-name BICC1 \
  --vcf-file path/to/your.vcf \
  --output-file output.tsv
```

## Command Line Options

### Required Arguments

- `--vcf-file` - Input VCF file (can be compressed with gzip)
- `--output-file` - Output TSV file path

### Gene Selection (choose one)

- `--gene-name GENE` - Single gene name
- `--gene-file GENES.TXT` - File containing multiple genes (one per line)

### Configuration

- `--config CONFIG_FILE` - Load custom parameters from JSON config file
- `--reference REFERENCE` - snpEff reference database (overrides config)
- `--filters "FILTER_EXPRESSION"` - Custom SnpSift filters (overrides config)
- `--fields "FIELD_LIST"` - Custom fields to extract (overrides config)

### Filtering Options

- `--bcftools-prefilter "EXPRESSION"` - Apply bcftools pre-filter during variant extraction for performance
- `--late-filtering` - Apply SnpSift filters after scoring and annotation (allows filtering on computed columns)
- `--final-filter "EXPRESSION"` - Apply pandas query expression on final results (filter on any column including scores)

### Input/Output Options

- `--samples-file SAMPLES.TXT` - Sample ID mapping for genotype replacement
- `--phenotype-file PHENO.TSV` - Phenotype data file
- `--phenotype-sample-column` - Column name for sample IDs in phenotype file
- `--phenotype-value-column` - Column name for phenotype values
- `--xlsx` - Convert final output TSV to XLSX format
- `--keep-intermediates` - Retain intermediate files after successful run
- `--archive-results` - Create a compressed tar.gz archive of the entire results directory after pipeline completion

### Analysis Options

- `--perform-gene-burden` - Run gene burden analysis
- `--html-report` - Generate interactive HTML report
- `--igv` - Enable IGV.js integration (requires additional options)
- `--bam-mapping-file` - TSV/CSV file mapping sample IDs to BAM files
- `--igv-reference` - Genome reference for IGV (e.g., 'hg19', 'hg38')

### Scoring Options

- `--scoring-config-path` - Path to scoring configuration directory containing variable_assignment_config.json and formula_config.json

### Annotation Options

- `--annotate-bed BED_FILE` - Annotate variants with genomic regions from BED files (can specify multiple)
- `--annotate-gene-list GENE_LIST` - Check if variants affect genes in custom gene lists (can specify multiple)
- `--annotate-json-genes JSON_FILE` - Annotate variants with gene information from JSON file
- `--json-gene-mapping MAPPING` - Specify JSON field mapping for gene annotations (required with --annotate-json-genes)
- `--json-genes-as-columns` - Output JSON gene data as separate columns instead of appending to Custom_Annotation column

### Other Options

- `--version` - Show version and exit
- `--help` - Show help message

## Examples

### Basic Gene Analysis

```bash
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file samples.vcf.gz \
  --output-file brca1_variants.tsv
```

### Multiple Genes with Custom Filters

```bash
variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file samples.vcf.gz \
  --filters "(( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))" \
  --output-file cancer_variants.tsv \
  --xlsx
```

### Comprehensive Analysis with Reports

```bash
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file samples.vcf.gz \
  --samples-file sample_mapping.txt \
  --phenotype-file patient_data.tsv \
  --phenotype-sample-column "sample_id" \
  --phenotype-value-column "disease_status" \
  --perform-gene-burden \
  --html-report \
  --xlsx \
  --output-file brca1_analysis.tsv
```

### IGV Integration

```bash
variantcentrifuge \
  --gene-name TP53 \
  --vcf-file samples.vcf.gz \
  --igv \
  --bam-mapping-file bam_files.tsv \
  --igv-reference hg38 \
  --html-report \
  --output-file tp53_variants.tsv
```

### Variant Scoring

```bash
# Apply custom scoring model to variants
variantcentrifuge \
  --gene-file kidney_genes.txt \
  --vcf-file patient.vcf.gz \
  --scoring-config-path scoring/nephro_variant_score \
  --preset rare,coding \
  --html-report \
  --output-file scored_variants.tsv
```

### Custom Annotations

```bash
# Annotate with JSON gene information
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file samples.vcf.gz \
  --annotate-json-genes gene_metadata.json \
  --json-gene-mapping '{"identifier":"gene_symbol","dataFields":["panel","inheritance","function"]}' \
  --output-file annotated_variants.tsv

# Multiple annotation sources
variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file samples.vcf.gz \
  --annotate-bed cancer_hotspots.bed \
  --annotate-gene-list actionable_genes.txt \
  --annotate-json-genes gene_panels.json \
  --json-gene-mapping '{"identifier":"symbol","dataFields":["panel_name","evidence_level"]}' \
  --html-report \
  --output-file multi_annotated.tsv
```

### Advanced Filtering

```bash
# Pre-filter with bcftools for performance (e.g., only PASS variants with AC < 10)
variantcentrifuge \
  --gene-file large_gene_list.txt \
  --vcf-file large_cohort.vcf.gz \
  --bcftools-prefilter 'FILTER="PASS" && INFO/AC<10' \
  --preset rare,coding \
  --output-file filtered_variants.tsv

# Late filtering to filter on computed scores
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file samples.vcf.gz \
  --scoring-config-path scoring/nephro_variant_score \
  --late-filtering \
  --filters "inheritance_score > 0.5" \
  --output-file high_score_variants.tsv

# Final filter using pandas query syntax
variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file samples.vcf.gz \
  --preset rare,coding \
  --scoring-config-path scoring/cancer_variant_score \
  --final-filter 'pathogenicity_score > 0.8 and IMPACT == "HIGH"' \
  --output-file high_priority_variants.tsv

# Complex final filter with inheritance patterns
variantcentrifuge \
  --gene-name BRCA1 \
  --vcf-file trio.vcf.gz \
  --ped family.ped \
  --inheritance-mode columns \
  --final-filter 'Inheritance_Pattern in ["de_novo", "compound_heterozygous"] and Inheritance_Confidence > 0.8' \
  --output-file denovo_and_compound_het.tsv
```

## Input File Formats

### VCF Files

- Standard VCF format (v4.0 or later)
- Can be compressed with gzip (.vcf.gz)
- Should be annotated with snpEff for optimal functionality

### Gene Files

Text file with one gene name per line:

```
BRCA1
BRCA2
TP53
ATM
```

### Sample Mapping Files

Tab-separated file for genotype replacement:

```
original_id	new_id
sample_001	Patient_A
sample_002	Patient_B
sample_003	Control_001
```

### Phenotype Files

Tab or comma-separated file with sample information:

```
sample_id	disease_status	age	sex
Patient_A	case	45	F
Patient_B	case	52	M
Control_001	control	48	F
```

### BAM Mapping Files

For IGV integration, provide a mapping from sample IDs to BAM file paths:

```
sample_id	bam_path
Patient_A	/path/to/patient_a.bam
Patient_B	/path/to/patient_b.bam
Control_001	/path/to/control_001.bam
```

### JSON Gene Files

For gene annotation, provide a JSON file containing an array of gene objects:

```json
[
  {
    "gene_symbol": "BRCA1",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "DNA repair"
  },
  {
    "gene_symbol": "TP53",
    "panel": "HereditaryCancer",
    "inheritance": "AD",
    "function": "Tumor suppressor"
  }
]
```

The `--json-gene-mapping` parameter specifies:

- `identifier`: The field containing the gene symbol (e.g., "gene_symbol")
- `dataFields`: Array of fields to include as annotations (e.g., ["panel", "inheritance", "function"])

## Output Files

### Main Output

- **TSV file** - Tab-separated variant table with extracted fields
- **XLSX file** - Excel format (if `--xlsx` specified)
- **Metadata file** - Analysis parameters and tool versions

### Optional Outputs

- **HTML report** - Interactive variant browser (if `--html-report` specified)
- **IGV reports** - Individual variant visualization (if `--igv` specified)
- **Gene burden results** - Statistical analysis (if `--perform-gene-burden` specified)

## Configuration

See the [Configuration Guide](configuration.md) for detailed information about setting up configuration files and customizing VariantCentrifuge behavior.

## Troubleshooting

### Common Issues

1. **No variants found:**
   - Check that your VCF file contains variants in the specified gene regions
   - Verify gene names are correct and match your reference annotation
   - Review filter expressions - they may be too restrictive

2. **External tool errors:**
   - Ensure all required tools are installed and in PATH
   - Check that snpEff database matches your VCF reference
   - Verify file permissions and disk space

3. **Memory issues:**
   - Large VCF files may require more memory
   - Consider filtering your VCF file beforehand to reduce size
   - Use `--keep-intermediates` to debug intermediate file sizes

### Getting Help

- Use `variantcentrifuge --help` for command-line options
- Check the [API Reference](api/index.md) for detailed function documentation
- Report issues on [GitHub](https://github.com/scholl-lab/variantcentrifuge/issues)
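
For the "External tool errors" item under Common Issues, a quick way to confirm that the external tools are reachable is to ask the shell where it finds them. The following is a minimal sketch, not part of VariantCentrifuge: `missing_tools` is a hypothetical helper, and the executable names `bcftools`, `snpEff`, and `SnpSift` are assumptions based on the tools this guide mentions; your installation may expose them under different names.

```bash
#!/usr/bin/env bash
# Sketch: report which external tools cannot be found on PATH.
# missing_tools is a hypothetical helper, not a VariantCentrifuge command.

# Print the name of every argument that does not resolve to a command on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Assumed executable names; adjust them to match your installation.
missing="$(missing_tools bcftools snpEff SnpSift)"
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing" >&2
else
  echo "All listed tools were found on PATH."
fi
```

If a tool is reported as missing, install it or add its directory to `PATH` before re-running the pipeline.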