Cohort Analysis Guide

This guide explains how to perform cohort-level variant analysis using VariantCentrifuge, including aggregating results from multiple samples and generating comprehensive cohort reports.

Overview

VariantCentrifuge supports both single-sample and cohort-level analysis workflows. While the main CLI processes individual VCF files, the cohort analysis tools allow you to:

  • Aggregate results from multiple single-sample analyses

  • Identify recurrently mutated genes across the cohort

  • Generate interactive HTML reports for cohort-wide visualization

  • Apply dynamic filtering and explore variant patterns

  • Compute cohort-level statistics and gene burden analyses

Cohort Analysis Workflow

Step 1: Single-Sample Analysis

First, run VariantCentrifuge on each sample in your cohort:

# Example: Process multiple samples
for sample in APA12 APA13 APA15 APA18; do
    variantcentrifuge \
        --gene-file cancer_genes.txt \
        --vcf-file "${sample}_TvsN.filtered.annotated.vcf.gz" \
        --output-file "output/${sample}_variants.tsv" \
        --xlsx \
        --html-report
done

This creates individual analysis results for each sample that can later be aggregated.
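
Before moving on, it can help to confirm that every sample actually produced an output file. A minimal Python sketch, using the sample names and output paths from the loop above:

from pathlib import Path

samples = ["APA12", "APA13", "APA15", "APA18"]
missing = [s for s in samples if not Path(f"output/{s}_variants.tsv").exists()]
if missing:
    raise SystemExit(f"Missing outputs for: {', '.join(missing)}")
print("All per-sample TSVs present.")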

Step 2: Cohort Report Generation

Use the create_cohort_report.py script to aggregate individual sample results into a comprehensive cohort report.

Prerequisites

  • Python 3 with pandas and Jinja2 installed

  • Collection of VariantCentrifuge TSV output files

  • Consistent analysis parameters across all samples

Usage

python scripts/create_cohort_report.py \
  --input-pattern 'output/*_variants.tsv' \
  --output-dir 'cohort_analysis' \
  --report-name 'Cancer Gene Cohort Analysis' \
  --sample-regex 'output/([^/]+)_variants'

Command-Line Arguments

  • --input-pattern (Required): Glob pattern to find input TSV files

  • --output-dir (Required): Directory for output files (created if needed)

  • --report-name (Optional): Custom report title (default: "Variant Cohort Analysis")

  • --sample-regex (Required): Regex to extract sample IDs from file paths

Sample ID Extraction

The --sample-regex parameter uses a Python regular expression with one capturing group to extract sample identifiers:

# For file: output/variantcentrifuge_analysis/APA12_TvsN.filtered.annotated/APA12_TvsN.filtered.annotated.vc_analysis.tsv
# Regex: 'variantcentrifuge_analysis/([^/]+)/'
# Captures: APA12_TvsN.filtered.annotated

# For file: results/sample_APA15/variants.tsv  
# Regex: 'sample_([^/]+)/'
# Captures: APA15
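
Conceptually, the script pairs --input-pattern with --sample-regex like this (a simplified sketch, not the script's actual implementation; the paths follow the Step 1 outputs):

import glob
import re

input_pattern = "output/*_variants.tsv"                # value of --input-pattern
sample_regex = re.compile(r"output/([^/]+)_variants")  # value of --sample-regex

# recursive=True lets '**' patterns descend into subdirectories
for path in glob.glob(input_pattern, recursive=True):
    match = sample_regex.search(path)
    sample_id = match.group(1) if match else None      # the one capturing group
    print(path, "->", sample_id)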

Step 3: Interactive Report Features

The generated cohort report provides:

Dynamic Filtering

  • Text filters for gene names, variant effects, and sample IDs

  • Range sliders for numeric fields (allele frequency, CADD scores, etc.)

  • Multi-select filters for categorical data (impact levels, clinical significance)

  • Real-time updates of statistics and visualizations

Interactive Visualizations

  • Gene frequency bar chart showing most commonly mutated genes

  • Clickable gene bars that filter the main variant table

  • Summary statistics dashboard that updates with applied filters

  • Exportable data in CSV/Excel formats

High-Performance Table

  • Pagination for large datasets

  • Column sorting and visibility controls

  • Global search across all fields

  • Export functionality for filtered results

Advanced Cohort Analysis

Gene Burden Testing

For statistical gene burden analysis across the cohort:

# Run gene burden analysis for each sample
variantcentrifuge \
    --gene-file target_genes.txt \
    --vcf-file cohort_merged.vcf.gz \
    --perform-gene-burden \
    --phenotype-file sample_phenotypes.tsv \
    --phenotype-sample-column "sample_id" \
    --phenotype-value-column "case_control_status" \
    --output-file cohort_gene_burden.tsv
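
To rank genes after the run, the burden TSV can be loaded with pandas. A minimal sketch; the column name used below (corrected_p_value) is illustrative, so inspect your file's header first:

import pandas as pd

burden = pd.read_csv("cohort_gene_burden.tsv", sep="\t")
print(burden.columns.tolist())  # check the actual column names

# Hypothetical column name; adjust to match your header.
if "corrected_p_value" in burden.columns:
    print(burden.sort_values("corrected_p_value").head(20))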

Custom Cohort Configurations

Create cohort-specific configuration files:

{
  "reference": "GRCh38.99",
  "filters": "",
  "fields_to_extract": "CHROM POS REF ALT ANN[0].GENE ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_gnomAD_exomes_AF ClinVar_CLNSIG GEN[*].GT",
  "presets": {
    "cohort_rare": "(((dbNSFP_gnomAD_exomes_AF < 0.001) | (na dbNSFP_gnomAD_exomes_AF)) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')))",
    "cohort_quality": "(GEN[*].DP >= 20) & (QUAL >= 30)"
  },
  "perform_gene_burden": true,
  "html_report_default_hidden_columns": [
    "QUAL", "FEATUREID", "AA_POS", "AA_LEN"
  ]
}
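
Pass this file to every per-sample run via --config (as in the workflow examples below) so that filters, extracted fields, and gene burden settings stay identical across the cohort.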

Batch Processing with Snakemake

For large cohorts, use the included Snakemake workflow:

# Configure analysis in snakemake/config_vc.yaml
# Then run the workflow
cd snakemake
snakemake --cores 8 --use-conda
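
The bundled workflow's actual rules live in the snakemake/ directory; purely as an illustration of the shape of such a workflow, a minimal hypothetical Snakefile might look like this (sample names and paths follow Step 1):

SAMPLES = ["APA12", "APA13", "APA15", "APA18"]

rule all:
    input:
        expand("output/{sample}_variants.tsv", sample=SAMPLES)

rule variantcentrifuge:
    input:
        vcf="input/{sample}_TvsN.filtered.annotated.vcf.gz"
    output:
        "output/{sample}_variants.tsv"
    shell:
        "variantcentrifuge"
        " --gene-file cancer_genes.txt"
        " --vcf-file {input.vcf}"
        " --output-file {output}"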

Example Workflows

Workflow 1: Cancer Cohort Analysis

# Step 1: Process tumor-normal pairs
for vcf in TCGA_*_somatic.vcf.gz; do
    sample=${vcf%_somatic.vcf.gz}
    variantcentrifuge \
        --config cancer_config.json \
        --gene-file oncogenes_tsg.txt \
        --vcf-file "$vcf" \
        --preset mutect2_TvsN,coding \
        --output-file "results/${sample}_somatic_variants.tsv" \
        --html-report \
        --igv \
        --bam-mapping-file bam_files.tsv
done

# Step 2: Generate cohort report
python scripts/create_cohort_report.py \
    --input-pattern 'results/*_somatic_variants.tsv' \
    --output-dir 'tcga_cohort_report' \
    --report-name 'TCGA Somatic Variants - Oncogenes & TSGs' \
    --sample-regex 'results/([^/]+)_somatic'

Workflow 2: Rare Disease Cohort

# Step 1: Process affected individuals
for vcf in patient_*_filtered.vcf.gz; do
    sample=${vcf%_filtered.vcf.gz}
    variantcentrifuge \
        --config rare_disease_config.json \
        --gene-file disease_genes.txt \
        --vcf-file "$vcf" \
        --preset rare,coding,not_benign \
        --phenotype-file patient_phenotypes.tsv \
        --phenotype-sample-column "sample_id" \
        --phenotype-value-column "affected_status" \
        --output-file "results/${sample}_rare_variants.tsv" \
        --perform-gene-burden \
        --xlsx
done

# Step 2: Aggregate and analyze
python scripts/create_cohort_report.py \
    --input-pattern 'results/*_rare_variants.tsv' \
    --output-dir 'rare_disease_cohort' \
    --report-name 'Rare Disease Cohort Analysis' \
    --sample-regex 'results/(patient_[^_]+)_rare'

Workflow 3: Population Genetics Study

# Step 1: Process population samples
for pop in EUR AFR EAS SAS; do
    for vcf in "${pop}"_*.vcf.gz; do
        sample=${vcf%.vcf.gz}
        variantcentrifuge \
            --config population_config.json \
            --gene-file population_genes.txt \
            --vcf-file "$vcf" \
            --preset 5percent,coding \
            --output-file "results/${pop}/${sample}_variants.tsv"
    done
done

# Step 2: Generate population-specific reports
for pop in EUR AFR EAS SAS; do
    python scripts/create_cohort_report.py \
        --input-pattern "results/${pop}/*_variants.tsv" \
        --output-dir "cohort_reports/${pop}" \
        --report-name "${pop} Population Variant Analysis" \
        --sample-regex "results/${pop}/([^_]+)_variants"
done

Output Structure

The cohort analysis generates the following output structure:

cohort_analysis/
├── cohort_report.html          # Interactive HTML report
└── data/
    ├── variants.json          # Complete variant dataset
    ├── summary_stats.json     # Cohort-wide statistics
    └── gene_frequencies.json  # Gene mutation frequencies
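
A minimal sketch for pulling these files into pandas for downstream analysis (the exact JSON shape is not specified here, so inspect the files first):

import json
import pandas as pd

with open("cohort_analysis/data/variants.json") as fh:
    variants = json.load(fh)

# json_normalize handles both a list of records and nested dicts;
# check the resulting columns against your data.
df = pd.json_normalize(variants)
print(df.shape)
print(df.head())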

Cohort Report Features

Summary Dashboard

  • Total variants and samples

  • Most frequently mutated genes

  • Impact distribution

  • Quality metrics

Interactive Variant Table

  • All variants from all samples

  • Sample-specific annotations

  • Filterable by any column

  • Exportable in multiple formats

Gene-Level Analysis

  • Mutation frequency per gene (see the pandas sketch after this list)

  • Sample distribution per gene

  • Statistical significance testing

  • Visual gene ranking
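
Outside the report, the gene ranking can be reproduced from the per-sample TSVs with pandas. A sketch assuming one TSV per sample and a GENE column; the real column name depends on your fields_to_extract (ANN[0].GENE may be renamed in the TSV header):

import glob
import pandas as pd

frames = []
for path in glob.glob("results/*_variants.tsv"):
    df = pd.read_csv(path, sep="\t")
    df["sample"] = path  # substitute your extracted sample ID
    frames.append(df)

cohort = pd.concat(frames, ignore_index=True)

# "GENE" is an assumed column name; verify against your header.
gene_ranking = (
    cohort.groupby("GENE")["sample"]
    .nunique()                          # samples with >=1 variant in the gene
    .sort_values(ascending=False)
)
print(gene_ranking.head(10))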

Best Practices

Data Consistency

  1. Use identical configurations across all samples in a cohort

  2. Apply consistent quality filters before VariantCentrifuge analysis

  3. Ensure sample naming consistency for proper aggregation

  4. Validate annotation completeness across all input files

Performance Optimization

  1. Process samples in parallel when possible (see the sketch after this list)

  2. Use appropriate hardware resources for large cohorts

  3. Implement checkpointing for long-running analyses

  4. Monitor disk space for intermediate files
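
For item 1, a minimal sketch of launching per-sample runs in parallel from Python (GNU parallel or a cluster scheduler works just as well; sample names and paths follow Step 1):

import subprocess
from concurrent.futures import ThreadPoolExecutor

samples = ["APA12", "APA13", "APA15", "APA18"]

def run_sample(sample):
    # One variantcentrifuge process per sample; threads suffice because
    # the heavy lifting happens in the subprocess, not in Python.
    subprocess.run(
        [
            "variantcentrifuge",
            "--gene-file", "cancer_genes.txt",
            "--vcf-file", f"{sample}_TvsN.filtered.annotated.vcf.gz",
            "--output-file", f"output/{sample}_variants.tsv",
        ],
        check=True,
    )

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_sample, samples))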

Quality Control

  1. Validate sample identities before aggregation

  2. Check for batch effects across processing groups

  3. Compare individual vs cohort statistics for consistency

  4. Review outlier samples that may indicate processing issues

Report Customization

  1. Customize report names to reflect study context

  2. Hide irrelevant columns in the configuration

  3. Adjust filtering ranges based on your data distribution

  4. Document analysis parameters for reproducibility

Troubleshooting

Common Issues

Empty cohort report

  • Check file path patterns and regex

  • Verify TSV file formats and headers

  • Ensure sample ID extraction is working

Performance issues with large cohorts

  • Consider splitting into smaller batches

  • Increase system memory allocation

  • Use pagination in the HTML report

Inconsistent results across samples

  • Verify identical analysis configurations

  • Check for annotation version differences

  • Validate reference genome consistency

Debugging Commands

# Test sample ID extraction
python -c "
import re
pattern = 'results/([^/]+)_variants'
test_path = 'results/sample_001_variants.tsv'
match = re.search(pattern, test_path)
print('Sample ID:', match.group(1) if match else 'No match')
"

# Validate TSV file structure: every header line should contain a CHROM column
for file in results/*_variants.tsv; do
    head -1 "$file" | grep -q "CHROM" || echo "Missing CHROM header: $file"
done

# Check variant counts per sample
for file in results/*_variants.tsv; do
    echo "$(basename $file): $(tail -n +2 $file | wc -l) variants"
done

By following this guide, you can effectively perform cohort-level variant analysis and generate comprehensive, interactive reports for your research or clinical studies.