Cohort Analysis Guide

This guide explains how to perform cohort-level variant analysis using VariantCentrifuge, including aggregating results from multiple samples and generating comprehensive cohort reports.

Overview

VariantCentrifuge supports both single-sample and cohort-level analysis workflows. While the main CLI processes individual VCF files, the cohort analysis tools allow you to:

  • Aggregate results from multiple single-sample analyses

  • Identify recurrently mutated genes across the cohort

  • Generate interactive HTML reports for cohort-wide visualization

  • Apply dynamic filtering and explore variant patterns

  • Compute cohort-level statistics and gene burden analyses

Cohort Analysis Workflow

Step 1: Single-Sample Analysis

First, run VariantCentrifuge on each sample in your cohort:

# Example: Process multiple samples
for sample in APA12 APA13 APA15 APA18; do
    variantcentrifuge \
        --gene-file cancer_genes.txt \
        --vcf-file "${sample}_TvsN.filtered.annotated.vcf.gz" \
        --output-file "output/${sample}_variants.tsv" \
        --xlsx \
        --html-report
done

This creates individual analysis results for each sample that can later be aggregated.
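
Before moving on, it can help to confirm that every sample actually produced an output file. A minimal Python sketch, using the sample names and output paths from the loop above:

from pathlib import Path

samples = ["APA12", "APA13", "APA15", "APA18"]
missing = [s for s in samples if not Path(f"output/{s}_variants.tsv").exists()]
if missing:
    raise SystemExit(f"Missing outputs for: {', '.join(missing)}")
print("All per-sample TSVs present.")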

Step 2: Cohort Report Generation

Use the create_cohort_report.py script to aggregate individual sample results into a comprehensive cohort report.

Prerequisites

  • Python 3 with pandas and Jinja2 installed

  • Collection of VariantCentrifuge TSV output files

  • Consistent analysis parameters across all samples

Usage

python scripts/create_cohort_report.py \
  --input-pattern 'output/*_variants.tsv' \
  --output-dir 'cohort_analysis' \
  --report-name 'Cancer Gene Cohort Analysis' \
  --sample-regex 'output/([^/]+)_variants'

Command-Line Arguments

  • --input-pattern (Required): Glob pattern to find input TSV files

  • --output-dir (Required): Directory for output files (created if needed)

  • --report-name (Optional): Custom report title (default: "Variant Cohort Analysis")

  • --sample-regex (Required): Regex to extract sample IDs from file paths

Sample ID Extraction

The --sample-regex parameter uses a Python regular expression with one capturing group to extract sample identifiers:

# For file: output/variantcentrifuge_analysis/APA12_TvsN.filtered.annotated/APA12_TvsN.filtered.annotated.vc_analysis.tsv
# Regex: 'variantcentrifuge_analysis/([^/]+)/'
# Captures: APA12_TvsN.filtered.annotated

# For file: results/sample_APA15/variants.tsv  
# Regex: 'sample_([^/]+)/'
# Captures: APA15
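
Conceptually, the script pairs --input-pattern with --sample-regex like this (a simplified sketch, not the script's actual implementation; the paths follow the Step 1 outputs):

import glob
import re

input_pattern = "output/*_variants.tsv"                # value of --input-pattern
sample_regex = re.compile(r"output/([^/]+)_variants")  # value of --sample-regex

# recursive=True lets '**' patterns descend into subdirectories
for path in glob.glob(input_pattern, recursive=True):
    match = sample_regex.search(path)
    sample_id = match.group(1) if match else None      # the one capturing group
    print(path, "->", sample_id)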

Step 3: Interactive Report Features

The generated cohort report provides:

Dynamic Filtering

  • Text filters for gene names, variant effects, and sample IDs

  • Range sliders for numeric fields (allele frequency, CADD scores, etc.)

  • Multi-select filters for categorical data (impact levels, clinical significance)

  • Real-time updates of statistics and visualizations

Interactive Visualizations

  • Gene frequency bar chart showing most commonly mutated genes

  • Clickable gene bars that filter the main variant table

  • Summary statistics dashboard that updates with applied filters

  • Exportable data in CSV/Excel formats

High-Performance Table

  • Pagination for large datasets

  • Column sorting and visibility controls

  • Global search across all fields

  • Export functionality for filtered results

Advanced Cohort Analysis

Gene Burden Testing

For statistical gene burden analysis across the cohort:

# Run gene burden analysis for each sample
variantcentrifuge \
    --gene-file target_genes.txt \
    --vcf-file cohort_merged.vcf.gz \
    --perform-gene-burden \
    --phenotype-file sample_phenotypes.tsv \
    --phenotype-sample-column "sample_id" \
    --phenotype-value-column "case_control_status" \
    --output-file cohort_gene_burden.tsv
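
To rank genes after the run, the burden TSV can be loaded with pandas. A minimal sketch; the column name used below (corrected_p_value) is illustrative, so inspect your file's header first:

import pandas as pd

burden = pd.read_csv("cohort_gene_burden.tsv", sep="\t")
print(burden.columns.tolist())  # check the actual column names

# Hypothetical column name; adjust to match your header.
if "corrected_p_value" in burden.columns:
    print(burden.sort_values("corrected_p_value").head(20))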

Custom Cohort Configurations

Create cohort-specific configuration files:

{
  "reference": "GRCh38.99",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_gnomAD_exomes_AF ClinVar_CLNSIG GEN[*].GT",
  "presets": {
    "cohort_rare": "(((dbNSFP_gnomAD_exomes_AF < 0.001) | (na dbNSFP_gnomAD_exomes_AF)) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))",
    "cohort_quality": "(GEN[*].DP >= 20) & (QUAL >= 30)"
  },
  "perform_gene_burden": true,
  "html_report_default_hidden_columns": [
    "QUAL", "FEATUREID", "AA_POS", "AA_LEN"
  ]
}
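
Pass this file to every per-sample run via --config (as in the workflow examples below) so that filters, extracted fields, and gene burden settings stay identical across the cohort.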

Batch Processing with Snakemake

For large cohorts, use the included Snakemake workflow:

# Configure analysis in snakemake/config_vc.yaml
# Then run the workflow
cd snakemake
snakemake --cores 8 --use-conda
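
The bundled workflow's actual rules live in the snakemake/ directory; purely as an illustration of the shape of such a workflow, a minimal hypothetical Snakefile might look like this (sample names and paths follow Step 1):

SAMPLES = ["APA12", "APA13", "APA15", "APA18"]

rule all:
    input:
        expand("output/{sample}_variants.tsv", sample=SAMPLES)

rule variantcentrifuge:
    input:
        vcf="input/{sample}_TvsN.filtered.annotated.vcf.gz"
    output:
        "output/{sample}_variants.tsv"
    shell:
        "variantcentrifuge"
        " --gene-file cancer_genes.txt"
        " --vcf-file {input.vcf}"
        " --output-file {output}"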

Example Workflows

Workflow 1: Cancer Cohort Analysis

# Step 1: Process tumor-normal pairs
for vcf in TCGA_*_somatic.vcf.gz; do
    sample=${vcf%_somatic.vcf.gz}
    variantcentrifuge \
        --config cancer_config.json \
        --gene-file oncogenes_tsg.txt \
        --vcf-file "$vcf" \
        --preset mutect2_TvsN,coding \
        --output-file "results/${sample}_somatic_variants.tsv" \
        --html-report \
        --igv \
        --bam-mapping-file bam_files.tsv
done

# Step 2: Generate cohort report
python scripts/create_cohort_report.py \
    --input-pattern 'results/*_somatic_variants.tsv' \
    --output-dir 'tcga_cohort_report' \
    --report-name 'TCGA Somatic Variants - Oncogenes & TSGs' \
    --sample-regex 'results/([^/]+)_somatic'

Workflow 2: Rare Disease Cohort

# Step 1: Process affected individuals
for vcf in patient_*_filtered.vcf.gz; do
    sample=${vcf%_filtered.vcf.gz}
    variantcentrifuge \
        --config rare_disease_config.json \
        --gene-file disease_genes.txt \
        --vcf-file "$vcf" \
        --preset rare,coding,not_benign \
        --phenotype-file patient_phenotypes.tsv \
        --phenotype-sample-column "sample_id" \
        --phenotype-value-column "affected_status" \
        --output-file "results/${sample}_rare_variants.tsv" \
        --perform-gene-burden \
        --xlsx
done

# Step 2: Aggregate and analyze
python scripts/create_cohort_report.py \
    --input-pattern 'results/*_rare_variants.tsv' \
    --output-dir 'rare_disease_cohort' \
    --report-name 'Rare Disease Cohort Analysis' \
    --sample-regex 'results/(patient_[^_]+)_rare'

Workflow 3: Population Genetics Study

# Step 1: Process population samples
for pop in EUR AFR EAS SAS; do
    for vcf in "${pop}"_*.vcf.gz; do
        sample=${vcf%.vcf.gz}
        variantcentrifuge \
            --config population_config.json \
            --gene-file population_genes.txt \
            --vcf-file "$vcf" \
            --preset 5percent,coding \
            --output-file "results/${pop}/${sample}_variants.tsv"
    done
done

# Step 2: Generate population-specific reports
for pop in EUR AFR EAS SAS; do
    python scripts/create_cohort_report.py \
        --input-pattern "results/${pop}/*_variants.tsv" \
        --output-dir "cohort_reports/${pop}" \
        --report-name "${pop} Population Variant Analysis" \
        --sample-regex "results/${pop}/([^_]+)_variants"
done

Output Structure

The cohort analysis generates the following output structure:

cohort_analysis/
├── cohort_report.html          # Interactive HTML report
└── data/
    ├── variants.json          # Complete variant dataset
    ├── summary_stats.json     # Cohort-wide statistics
    └── gene_frequencies.json  # Gene mutation frequencies
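
A minimal sketch for pulling these files into pandas for downstream analysis (the exact JSON shape is not specified here, so inspect the files first):

import json
import pandas as pd

with open("cohort_analysis/data/variants.json") as fh:
    variants = json.load(fh)

# json_normalize handles both a list of records and nested dicts;
# check the resulting columns against your data.
df = pd.json_normalize(variants)
print(df.shape)
print(df.head())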

Cohort Report Features

Summary Dashboard

  • Total variants and samples

  • Most frequently mutated genes

  • Impact distribution

  • Quality metrics

Interactive Variant Table

  • All variants from all samples

  • Sample-specific annotations

  • Filterable by any column

  • Exportable in multiple formats

Gene-Level Analysis

  • Mutation frequency per gene (see the pandas sketch after this list)

  • Sample distribution per gene

  • Statistical significance testing

  • Visual gene ranking
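
Outside the report, the gene ranking can be reproduced from the per-sample TSVs with pandas. A sketch assuming one TSV per sample and a GENE column; the real column name depends on your fields_to_extract (ANN[0].GENE may be renamed in the TSV header):

import glob
import pandas as pd

frames = []
for path in glob.glob("results/*_variants.tsv"):
    df = pd.read_csv(path, sep="\t")
    df["sample"] = path  # substitute your extracted sample ID
    frames.append(df)

cohort = pd.concat(frames, ignore_index=True)

# "GENE" is an assumed column name; verify against your header.
gene_ranking = (
    cohort.groupby("GENE")["sample"]
    .nunique()                          # samples with >=1 variant in the gene
    .sort_values(ascending=False)
)
print(gene_ranking.head(10))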

Best Practices

Data Consistency

  1. Use identical configurations across all samples in a cohort

  2. Apply consistent quality filters before VariantCentrifuge analysis

  3. Ensure sample naming consistency for proper aggregation

  4. Validate annotation completeness across all input files

Performance Optimization

  1. Process samples in parallel when possible (see the sketch after this list)

  2. Use appropriate hardware resources for large cohorts

  3. Implement checkpointing for long-running analyses

  4. Monitor disk space for intermediate files
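
For item 1, a minimal sketch of launching per-sample runs in parallel from Python (GNU parallel or a cluster scheduler works just as well; sample names and paths follow Step 1):

import subprocess
from concurrent.futures import ThreadPoolExecutor

samples = ["APA12", "APA13", "APA15", "APA18"]

def run_sample(sample):
    # One variantcentrifuge process per sample; threads suffice because
    # the heavy lifting happens in the subprocess, not in Python.
    subprocess.run(
        [
            "variantcentrifuge",
            "--gene-file", "cancer_genes.txt",
            "--vcf-file", f"{sample}_TvsN.filtered.annotated.vcf.gz",
            "--output-file", f"output/{sample}_variants.tsv",
        ],
        check=True,
    )

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_sample, samples))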

Quality Control

  1. Validate sample identities before aggregation

  2. Check for batch effects across processing groups

  3. Compare individual vs cohort statistics for consistency

  4. Review outlier samples that may indicate processing issues

Report Customization

  1. Customize report names to reflect study context

  2. Hide irrelevant columns in the configuration

  3. Adjust filtering ranges based on your data distribution

  4. Document analysis parameters for reproducibility

Troubleshooting

Common Issues

Empty cohort report

  • Check file path patterns and regex

  • Verify TSV file formats and headers

  • Ensure sample ID extraction is working

Performance issues with large cohorts

  • Consider splitting into smaller batches

  • Increase system memory allocation

  • Use pagination in the HTML report

Inconsistent results across samples

  • Verify identical analysis configurations

  • Check for annotation version differences

  • Validate reference genome consistency

Debugging Commands

# Test sample ID extraction
python -c "
import re
pattern = 'results/([^/]+)_variants'
test_path = 'results/sample_001_variants.tsv'
match = re.search(pattern, test_path)
print('Sample ID:', match.group(1) if match else 'No match')
"

# Validate TSV file structure: every header line should contain a CHROM column
for file in results/*_variants.tsv; do
    head -1 "$file" | grep -q "CHROM" || echo "Missing CHROM header: $file"
done

# Check variant counts per sample
for file in results/*_variants.tsv; do
    echo "$(basename $file): $(tail -n +2 $file | wc -l) variants"
done

By following this guide, you can effectively perform cohort-level variant analysis and generate comprehensive, interactive reports for your research or clinical studies.