Statistics Module

Summary statistics and data aggregation

Statistics module for variantcentrifuge.

Provides functions for:
  • Computing basic variant-level statistics.
  • Computing comprehensive gene-level statistics.
  • Summarizing impacts and variant types.
  • Merging and formatting these stats.

All functions return DataFrames suitable for further processing.

variantcentrifuge.stats.compute_basic_stats(df, all_samples)

Compute basic statistics about the dataset.

The statistics include:
  • Number of variants
  • Number of samples
  • Number of genes
  • Het/Hom genotype counts
  • Variant type and impact counts (if columns are present)

Parameters:
  • df (pd.DataFrame) – Input variants DataFrame. Expected to have columns “GENE”, “GT” and optionally “EFFECT”, “IMPACT”.

  • all_samples (set of str) – Set of all sample names.

Returns:

A DataFrame with ‘metric’ and ‘value’ columns listing basic stats.

Return type:

pd.DataFrame
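The long-format 'metric'/'value' layout described above can be illustrated with a minimal pandas sketch. This is not the module's implementation: the helper name basic_stats_sketch and the metric labels are assumptions made for illustration, and genotype (Het/Hom) counting is left out because the GT string format is not specified here.

```python
import pandas as pd


def basic_stats_sketch(df: pd.DataFrame, all_samples: set) -> pd.DataFrame:
    """Hypothetical helper: build a long-format metric/value table."""
    rows = [
        ("Number of variants", len(df)),
        ("Number of samples", len(all_samples)),
        ("Number of genes", int(df["GENE"].nunique())),
    ]
    # Impact counts are appended only when the optional column is present.
    if "IMPACT" in df.columns:
        for impact, count in df["IMPACT"].value_counts().sort_index().items():
            rows.append((f"Impact: {impact}", int(count)))
    return pd.DataFrame(rows, columns=["metric", "value"])


# Toy input; the GT values are placeholders and are not parsed by this sketch.
variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "TP53"],
        "GT": ["0/1", "1/1", "0/1"],
        "IMPACT": ["HIGH", "MODERATE", "HIGH"],
    }
)
stats = basic_stats_sketch(variants, {"S1", "S2"})
print(stats)
```

Keeping one row per metric means new metrics can be appended without changing the table's columns.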

variantcentrifuge.stats.compute_gene_stats(df)

Compute gene-level aggregated stats: per-gene sums of the proband/control counts and allele counts.

Parameters:

df (pd.DataFrame) – Input variants DataFrame with assigned case/control counts. Expected columns: “GENE”, “proband_count”, “control_count”, “proband_allele_count”, “control_allele_count”.

Returns:

Gene-level summary DataFrame with columns: GENE, proband_count, control_count, proband_allele_count, control_allele_count.

Return type:

pd.DataFrame
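The per-gene summation described above can be sketched with a plain pandas groupby. The column names come from the description; the toy data and the exact aggregation call are assumptions, so read this as an illustration of the output shape only.

```python
import pandas as pd

COUNT_COLUMNS = [
    "proband_count",
    "control_count",
    "proband_allele_count",
    "control_allele_count",
]

# Toy variant-level table with one row per variant.
variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "TP53"],
        "proband_count": [2, 1, 3],
        "control_count": [0, 1, 1],
        "proband_allele_count": [3, 1, 4],
        "control_allele_count": [0, 1, 2],
    }
)

# Sum every count column within each gene; GENE stays a regular column.
gene_stats = variants.groupby("GENE", as_index=False)[COUNT_COLUMNS].sum()
print(gene_stats)
```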

variantcentrifuge.stats.compute_impact_summary(df)

Compute a per-gene impact summary if the “IMPACT” column exists.

Parameters:

df (pd.DataFrame) – Input DataFrame with “GENE” and “IMPACT” columns.

Returns:

A pivoted table of per-gene impact counts, with one column for each impact type. If the required input columns are missing, an empty DataFrame is returned.

Return type:

pd.DataFrame
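A gene-by-impact pivot with an empty-DataFrame fallback might look like the sketch below. The helper name impact_summary_sketch is hypothetical, and filling absent gene/impact combinations with 0 is an assumption of this sketch.

```python
import pandas as pd


def impact_summary_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count variants per gene and impact type."""
    # Mirror the documented fallback: no required columns, no summary.
    if "GENE" not in df.columns or "IMPACT" not in df.columns:
        return pd.DataFrame()
    counts = df.groupby(["GENE", "IMPACT"]).size().reset_index(name="count")
    pivot = counts.pivot(index="GENE", columns="IMPACT", values="count")
    # Gene/impact pairs that never occur become 0 instead of NaN.
    pivot = pivot.fillna(0).astype(int).reset_index()
    pivot.columns.name = None
    return pivot


variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "BRCA1", "TP53"],
        "IMPACT": ["HIGH", "HIGH", "MODERATE", "HIGH"],
    }
)
summary = impact_summary_sketch(variants)
print(summary)
```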

variantcentrifuge.stats.compute_variant_type_summary(df)

Compute a per-gene variant type summary if the “EFFECT” column exists.

Parameters:

df (pd.DataFrame) – Input DataFrame with “GENE” and “EFFECT” columns.

Returns:

A pivoted table of per-gene variant type counts, with one column for each variant type. If the required input columns are missing, an empty DataFrame is returned.

Return type:

pd.DataFrame
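The same gene-by-category table can also be produced with pd.crosstab, shown here for the "EFFECT" column. Using crosstab is a choice made for this sketch and is not confirmed to be what the module does; the effect labels are toy values.

```python
import pandas as pd

variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "TP53"],
        "EFFECT": ["missense_variant", "stop_gained", "missense_variant"],
    }
)

if {"GENE", "EFFECT"}.issubset(variants.columns):
    # crosstab counts co-occurrences and fills absent pairs with 0.
    variant_types = pd.crosstab(variants["GENE"], variants["EFFECT"]).reset_index()
    variant_types.columns.name = None
else:
    # Documented fallback when the required columns are missing.
    variant_types = pd.DataFrame()

print(variant_types)
```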

variantcentrifuge.stats.merge_and_format_stats(gene_stats, impact_summary, variant_type_summary)

Merge gene_stats with impact_summary and variant_type_summary into a single DataFrame.

Parameters:
  • gene_stats (pd.DataFrame) – DataFrame of gene-level aggregated stats.

  • impact_summary (pd.DataFrame) – DataFrame of gene vs. impact counts.

  • variant_type_summary (pd.DataFrame) – DataFrame of gene vs. variant type counts.

Returns:

Merged DataFrame with all gene-level stats, with missing values filled with 0.

Return type:

pd.DataFrame
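The merge-and-fill behavior can be sketched as a chain of joins followed by fillna(0). The join key ("GENE"), the left-join direction, and the skipping of empty summaries are assumptions of this sketch, as is the toy data.

```python
import pandas as pd

gene_stats = pd.DataFrame(
    {"GENE": ["BRCA1", "TP53", "EGFR"], "proband_count": [3, 3, 1]}
)
impact_summary = pd.DataFrame({"GENE": ["BRCA1", "TP53"], "HIGH": [2, 1]})
variant_type_summary = pd.DataFrame({"GENE": ["BRCA1"], "stop_gained": [1]})

merged = gene_stats
for summary in (impact_summary, variant_type_summary):
    # An empty summary (source column absent) has nothing to join on.
    if not summary.empty:
        merged = merged.merge(summary, on="GENE", how="left")

# Genes missing from a summary get 0 rather than NaN.
# Note: columns that held NaN are float after the fill.
merged = merged.fillna(0)
print(merged)
```

A left join keeps every gene from gene_stats even when a summary has no row for it, which is what makes the zero fill necessary.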