Statistics Module
Summary statistics and data aggregation
Statistics module for variantcentrifuge.
Provides functions to compute:
- Basic variant-level statistics.
- Comprehensive gene-level statistics.
- Impact and variant type summaries.
- Merging and formatting of these stats.
All functions return DataFrames suitable for further processing.
- variantcentrifuge.stats.compute_basic_stats(df, all_samples)
Compute basic statistics about the dataset, including:
- Number of variants
- Number of samples
- Number of genes
- Het/Hom genotype counts
- Variant type and impact counts (if columns are present)
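The metrics listed above can be approximated with plain pandas. The sketch below is illustrative only and is not the variantcentrifuge implementation: the function name `basic_stats_sketch`, the `metric`/`value` output layout, and the `GT` column format (`sample(0/1);sample(1/1)`) are all assumptions made for the example.

```python
import pandas as pd


def basic_stats_sketch(df: pd.DataFrame, all_samples: set) -> pd.DataFrame:
    """Hypothetical stand-in for a basic-statistics computation."""
    rows = [
        ("Number of variants", len(df)),
        ("Number of samples", len(all_samples)),
        ("Number of genes", df["GENE"].nunique() if "GENE" in df.columns else 0),
    ]
    # ASSUMPTION: genotypes are stored in a "GT" column as
    # "sample(0/1);sample(1/1)". Adjust the patterns to your data.
    if "GT" in df.columns:
        gt = df["GT"].fillna("")
        rows.append(("Het genotypes", int(gt.str.count(r"\((?:0/1|1/0)\)").sum())))
        rows.append(("Hom genotypes", int(gt.str.count(r"\(1/1\)").sum())))
    # Variant type and impact counts are added only when the columns exist.
    for col in ("EFFECT", "IMPACT"):
        if col in df.columns:
            for value, n in df[col].value_counts().items():
                rows.append((f"{col}: {value}", int(n)))
    return pd.DataFrame(rows, columns=["metric", "value"])


variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "TP53"],
        "GT": ["S1(0/1);S2(1/1)", "S1(0/1)", "S2(1/1)"],
        "IMPACT": ["HIGH", "LOW", "HIGH"],
    }
)
basic_stats = basic_stats_sketch(variants, {"S1", "S2"})
```

Returning one row per metric keeps the result a DataFrame, in line with the module's statement that all functions return DataFrames.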
- variantcentrifuge.stats.compute_gene_stats(df)
Compute gene-level aggregated statistics by summing the proband and control variant counts and allele counts for each gene.
- Parameters:
df (pd.DataFrame) – Input variants DataFrame with assigned case/control counts. Expected columns: “GENE”, “proband_count”, “control_count”, “proband_allele_count”, “control_allele_count”.
- Returns:
Gene-level summary DataFrame with columns: GENE, proband_count, control_count, proband_allele_count, control_allele_count.
- Return type:
pd.DataFrame
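The aggregation described here amounts to a group-by-and-sum over the count columns. A minimal sketch in plain pandas, assuming the input columns listed under Parameters; `gene_stats_sketch` is a hypothetical name, not part of the library:

```python
import pandas as pd

COUNT_COLUMNS = [
    "proband_count",
    "control_count",
    "proband_allele_count",
    "control_allele_count",
]


def gene_stats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical equivalent: sum the count columns per gene."""
    return df.groupby("GENE", as_index=False)[COUNT_COLUMNS].sum()


variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "TP53"],
        "proband_count": [2, 1, 4],
        "control_count": [0, 1, 2],
        "proband_allele_count": [3, 1, 5],
        "control_allele_count": [0, 2, 2],
    }
)
gene_stats = gene_stats_sketch(variants)
```

With `as_index=False`, `GENE` stays an ordinary column, which matches the documented output columns.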
- variantcentrifuge.stats.compute_impact_summary(df)
Compute a per-gene impact summary if the “IMPACT” column exists.
- Parameters:
df (pd.DataFrame) – Input DataFrame with “GENE” and “IMPACT” columns.
- Returns:
A pivoted table of per-gene impact counts, with one column for each impact type. If the required columns are missing, an empty DataFrame is returned.
- Return type:
pd.DataFrame
- variantcentrifuge.stats.compute_variant_type_summary(df)
Compute a per-gene variant type summary if the “EFFECT” column exists.
- Parameters:
df (pd.DataFrame) – Input DataFrame with “GENE” and “EFFECT” columns.
- Returns:
A pivoted table of per-gene variant type counts, with one column for each variant type. If the required columns are missing, an empty DataFrame is returned.
- Return type:
pd.DataFrame
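The impact summary and the variant type summary share the same shape: count the values of one column per gene and pivot them into columns. The sketch below shows that pattern with `pd.crosstab`. The helper `per_gene_counts_sketch` is hypothetical, and keeping `GENE` as a regular column (rather than the index) is an assumption; the library's actual layout may differ.

```python
import pandas as pd


def per_gene_counts_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: count the values of `column` per gene."""
    # Mirror the documented behaviour: missing columns give an empty DataFrame.
    if "GENE" not in df.columns or column not in df.columns:
        return pd.DataFrame()
    # crosstab counts each (gene, value) pair and fills absent pairs with 0.
    table = pd.crosstab(df["GENE"], df[column])
    table.columns.name = None
    return table.reset_index()


variants = pd.DataFrame(
    {
        "GENE": ["BRCA1", "BRCA1", "TP53"],
        "IMPACT": ["HIGH", "MODERATE", "HIGH"],
        "EFFECT": ["stop_gained", "missense_variant", "missense_variant"],
    }
)
impact_summary = per_gene_counts_sketch(variants, "IMPACT")
variant_type_summary = per_gene_counts_sketch(variants, "EFFECT")
```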
- variantcentrifuge.stats.merge_and_format_stats(gene_stats, impact_summary, variant_type_summary)
Merge gene_stats with impact_summary and variant_type_summary into a single DataFrame.
- Parameters:
gene_stats (pd.DataFrame) – DataFrame of gene-level aggregated stats.
impact_summary (pd.DataFrame) – DataFrame of gene vs. impact counts.
variant_type_summary (pd.DataFrame) – DataFrame of gene vs. variant type counts.
- Returns:
Merged DataFrame with all gene-level stats, filling missing values with 0.
- Return type:
pd.DataFrame
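One way to picture the merge is as a series of joins on the gene column followed by a zero fill. The sketch below assumes a left join keyed on a `GENE` column and skips empty summaries; both choices, and the name `merge_stats_sketch`, are assumptions for illustration rather than the library's confirmed behaviour.

```python
import pandas as pd


def merge_stats_sketch(
    gene_stats: pd.DataFrame,
    impact_summary: pd.DataFrame,
    variant_type_summary: pd.DataFrame,
) -> pd.DataFrame:
    """Hypothetical merge: join the summaries onto gene_stats by GENE."""
    merged = gene_stats.copy()
    for summary in (impact_summary, variant_type_summary):
        # An empty summary has no GENE column to join on, so skip it.
        if not summary.empty:
            merged = merged.merge(summary, on="GENE", how="left")
    # Genes absent from a summary get 0 instead of NaN.
    return merged.fillna(0)


gene_stats = pd.DataFrame({"GENE": ["BRCA1", "TP53"], "proband_count": [3, 4]})
impact_summary = pd.DataFrame({"GENE": ["BRCA1"], "HIGH": [2]})
variant_type_summary = pd.DataFrame()  # e.g. the "EFFECT" column was missing

merged = merge_stats_sketch(gene_stats, impact_summary, variant_type_summary)
```

Note that columns which received filled values come back as floats after the join; cast them with `astype(int)` if integer counts are required.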