Helpers Module

Data manipulation and processing helpers

Helper functions for analyze_variants and related processes.

Provides:
  • Case/control assignment

  • Phenotype map building

  • Genotype parsing and allele count conversion

  • Sample and phenotype classification logic

variantcentrifuge.helpers.check_file(file_path, exit_on_error=True)[source]

Check if a file exists and is readable.

Parameters:
  • file_path (str) – Path to the file to check

  • exit_on_error (bool, optional) – Whether to exit the program if the file does not exist, by default True

Returns:

True if the file exists and is readable, False otherwise

Return type:

bool

Raises:

SystemExit – If exit_on_error is True and the file does not exist
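The documented behaviour can be approximated with the standard library alone. This is a minimal sketch, not the library's implementation: the name check_file_sketch and the exit message are hypothetical, and only the return values and the SystemExit condition are taken from the description above.

```python
import os
import sys


def check_file_sketch(file_path, exit_on_error=True):
    """Return True if file_path exists and is readable, else False or exit."""
    ok = os.path.isfile(file_path) and os.access(file_path, os.R_OK)
    if not ok and exit_on_error:
        # sys.exit raises SystemExit, matching the documented "Raises" entry.
        sys.exit(f"File not found or not readable: {file_path}")
    return ok
```

Passing exit_on_error=False turns the check into a plain boolean test, which is convenient when the caller wants to report several missing files at once.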

variantcentrifuge.helpers.determine_case_control_sets(all_samples, cfg, df)[source]

Determine case/control sample sets based on configuration.

Logic:
  • If explicit case/control samples are provided, use them directly.

  • Else, if phenotype terms are given, classify samples based on those terms using phenotypes from the phenotype file.

  • Else, all samples become controls.

Parameters:
  • all_samples (set of str) – The full set of sample names.

  • cfg (dict) – Configuration dictionary that may include “case_samples”, “control_samples”, “case_phenotypes”, “control_phenotypes”, and “phenotypes” from the phenotype file.

  • df (pd.DataFrame) – DataFrame with variant information (not used for phenotype assignment anymore).

Returns:

(set_of_case_samples, set_of_control_samples)

Return type:

Tuple[set, set]

Raises:

SystemExit – If no valid configuration for assigning case/control sets can be determined.
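The three-step precedence described under "Logic" can be sketched as follows. This is an illustration under stated assumptions, not the library's code: case_control_sketch is a hypothetical name, cfg["phenotypes"] is assumed to map each sample to a collection of phenotype terms, and a side that is not specified explicitly is assumed to default to the remaining samples.

```python
def case_control_sketch(all_samples, cfg):
    """Return (case_samples, control_samples) following the documented precedence."""
    case = set(cfg.get("case_samples") or [])
    control = set(cfg.get("control_samples") or [])

    # 1. Explicit sample lists take priority.
    if case or control:
        # Assumption: an unspecified side defaults to the remaining samples.
        if not control:
            control = set(all_samples) - case
        if not case:
            case = set(all_samples) - control
        return case, control

    # 2. Otherwise classify by phenotype terms.
    case_terms = set(cfg.get("case_phenotypes") or [])
    control_terms = set(cfg.get("control_phenotypes") or [])
    if case_terms or control_terms:
        # Assumption: cfg["phenotypes"] maps sample -> collection of terms.
        phenotypes = cfg.get("phenotypes") or {}
        for sample in all_samples:
            terms = set(phenotypes.get(sample, ()))
            if terms & case_terms:
                case.add(sample)
            elif not control_terms or terms & control_terms:
                control.add(sample)
        return case, control

    # 3. Nothing configured: every sample is a control.
    return set(), set(all_samples)
```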

variantcentrifuge.helpers.build_sample_phenotype_map(df)[source]

Build a map of sample -> set_of_phenotypes from the ‘phenotypes’ column if present.

If multiple phenotype groups appear and multiple samples in GT are present:
  • If counts match, assign phenotypes group-wise.

  • Otherwise, skip phenotype assignment for that row to prevent incorrect inflation.

Parameters:

df (DataFrame)

Return type:

Dict[str, Set[str]]
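The group-wise rule can be illustrated without committing to a column format. The sketch below is hypothetical throughout: phenotype_map_sketch is not a library function, each row is assumed to be already split into a list of sample names and a list of phenotype sets, and a single phenotype group is assumed to apply to every sample in the row.

```python
from collections import defaultdict


def phenotype_map_sketch(rows):
    """rows: iterable of (samples, phenotype_groups) pairs, already split."""
    result = defaultdict(set)
    for samples, groups in rows:
        if len(groups) == 1:
            # Assumption: one group applies to all samples in the row.
            for sample in samples:
                result[sample].update(groups[0])
        elif len(groups) == len(samples):
            # Counts match: pair samples and groups positionally.
            for sample, group in zip(samples, groups):
                result[sample].update(group)
        # Otherwise the counts disagree and the row is skipped, so no
        # sample is credited with phenotypes that belong to another.
    return dict(result)
```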

variantcentrifuge.helpers.assign_case_control_counts(df, case_samples, control_samples, all_samples)[source]

Assign case/control counts, allele counts, and homozygous variant counts per variant.

Creates columns:
  • proband_count/control_count: total number of case/control samples

  • proband_variant_count/control_variant_count: number of case/control samples with a variant allele

  • proband_allele_count/control_allele_count: sum of variant alleles in case/control samples

  • proband_homozygous_count/control_homozygous_count: number of case/control samples with a homozygous variant (1/1)

Parameters:
  • df (pd.DataFrame) – DataFrame of variants with a “GT” column listing variants per sample.

  • case_samples (set of str) – Set of samples classified as cases.

  • control_samples (set of str) – Set of samples classified as controls.

  • all_samples (set of str) – All samples present in the VCF.

Returns:

DataFrame with assigned case/control counts and alleles, including homozygous counts.

Return type:

pd.DataFrame
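To make the four count types concrete, here is a small sketch that computes them for one variant and one sample group. It is not the library's implementation: count_group_sketch is a hypothetical helper, the genotypes dict stands in for an already parsed "GT" value, samples missing from it are assumed to be 0/0, and any genotype other than 0/1, 1/0, or 1/1 is assumed to contribute no alleles.

```python
def count_group_sketch(genotypes, group):
    """genotypes: dict sample -> genotype string for a single variant."""
    alleles_per_genotype = {"1/1": 2, "0/1": 1, "1/0": 1}
    variant_count = allele_count = homozygous_count = 0
    for sample in group:
        genotype = genotypes.get(sample, "0/0")  # assumption: absent means 0/0
        alleles = alleles_per_genotype.get(genotype, 0)
        if alleles > 0:
            variant_count += 1
        allele_count += alleles
        if genotype == "1/1":
            homozygous_count += 1
    return {
        "count": len(group),
        "variant_count": variant_count,
        "allele_count": allele_count,
        "homozygous_count": homozygous_count,
    }
```

Calling the helper once with the case set and once with the control set yields the values that the proband_* and control_* columns are described as holding.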

variantcentrifuge.helpers.extract_sample_and_genotype(sample_field)[source]

Extract sample name and genotype from a field like ‘sample(0/1:172:110,62)’ or ‘sample(0/1)’.

If parentheses are missing, assume no genotype is specified -> no variant (0/0).

The genotype portion may include extra fields after a colon (e.g. coverage), so we split on the first colon to isolate the actual genotype string (e.g., ‘0/1’).

Parameters:

sample_field (str) – A string like ‘sample(0/1)’, ‘sample(0/1:172:110,62)’, or ‘sample’.

Returns:

(sample_name, genotype) where genotype is typically ‘0/1’, ‘1/1’, ‘0/0’, etc.

Return type:

(str, str)
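The parsing rule above (take the text inside the parentheses, then keep everything before the first colon) can be sketched as follows. split_sample_field is a hypothetical name, and returning '0/0' for a field without parentheses follows the description rather than the library's source.

```python
def split_sample_field(sample_field):
    """Split 'sample(0/1:172:110,62)' into ('sample', '0/1')."""
    field = sample_field.strip()
    if "(" in field and field.endswith(")"):
        name, _, rest = field.partition("(")
        # Drop the closing parenthesis, then keep only the part before the
        # first colon so extra fields such as coverage are discarded.
        genotype = rest[:-1].split(":", 1)[0]
        return name.strip(), genotype
    # No parentheses: treated as "no variant", per the description.
    return field, "0/0"
```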

variantcentrifuge.helpers.genotype_to_allele_count(genotype)[source]

Convert genotype string to allele count.

  • ‘1/1’ -> 2

  • ‘0/1’ or ‘1/0’ -> 1

  • ‘0/0’ or ‘’ -> 0

Parameters:

genotype (str) – Genotype string, expected to be one of ‘0/0’, ‘0/1’, ‘1/0’, ‘1/1’, or ‘’.

Returns:

The allele count for the given genotype.

Return type:

int

variantcentrifuge.helpers.load_gene_list(file_path)[source]

Load genes from a file (one gene per line) into a set of uppercase gene names.

Parameters:

file_path (str) – Path to a file containing gene names, one per line

Returns:

A set of gene names in uppercase for case-insensitive matching

Return type:

Set[str]
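A minimal sketch of the described loading step, using only the standard library. The name load_gene_list_sketch is hypothetical, and skipping blank lines is an assumption that the description does not state.

```python
import os
import tempfile


def load_gene_list_sketch(file_path):
    """Read one gene per line and return the names as an uppercase set."""
    genes = set()
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            name = line.strip()
            if name:  # assumption: blank lines are ignored
                genes.add(name.upper())
    return genes


# Demo with a throwaway file containing a duplicate and a blank line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("brca1\nTP53\n\nbrca1\n")
demo_genes = load_gene_list_sketch(tmp.name)
os.remove(tmp.name)
```

Uppercasing at load time means later membership tests only need to uppercase the query gene, which is what makes the matching case-insensitive.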

variantcentrifuge.helpers.annotate_variants_with_gene_lists(lines, gene_list_files)[source]

Add new columns to variant data lines indicating membership in provided gene lists.

For each gene list file, a new column is added to the TSV file. The column will contain ‘yes’ if any of the genes in the GENE column (which may be comma-separated) is present in the gene list, and ‘no’ otherwise.

Parameters:
  • lines (List[str]) – Lines of the TSV file to annotate

  • gene_list_files (List[str]) – List of file paths containing gene names (one per line)

Returns:

The input lines with additional columns for each gene list

Return type:

List[str]

Notes

  • If the GENE column is missing, an error is logged and the input lines are returned unchanged

  • Multiple genes in the GENE column can be separated by commas, semicolons, or spaces

  • If a gene list file fails to load, it is skipped with an error message

  • Column names are derived from the sanitized file basename; duplicates are suffixed with numbers
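The annotation step can be sketched on plain TSV lines. This is an illustration, not the library's code: annotate_lines_sketch is a hypothetical name, the gene lists are assumed to be already loaded into a dict that maps the new column name to a set of uppercase gene names, and column-name sanitizing and error logging are left out.

```python
import re


def annotate_lines_sketch(lines, gene_sets):
    """Append one yes/no column per gene set to tab-separated lines."""
    if not lines:
        return lines
    header = lines[0].rstrip("\n").split("\t")
    if "GENE" not in header:
        # Documented behaviour: input is returned unchanged.
        return lines
    gene_idx = header.index("GENE")
    out = ["\t".join(header + list(gene_sets))]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        gene_field = fields[gene_idx] if gene_idx < len(fields) else ""
        # Genes may be separated by commas, semicolons, or whitespace.
        genes = {g.upper() for g in re.split(r"[,;\s]+", gene_field) if g}
        flags = ["yes" if genes & members else "no" for members in gene_sets.values()]
        out.append("\t".join(fields + flags))
    return out
```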

variantcentrifuge.helpers.dump_df_to_xlsx(df, output_path, sheet_name='Sheet1', index=False)[source]

Export a pandas DataFrame to an Excel (xlsx) file.

Parameters:
  • df (pd.DataFrame) – The DataFrame to export

  • output_path (str or Path) – Path where the Excel file will be saved

  • sheet_name (str, optional) – Name of the worksheet in the Excel file, by default “Sheet1”

  • index (bool, optional) – Whether to include the DataFrame’s index in the Excel file, by default False

Returns:

None. The DataFrame is written to output_path and the action is logged.

Return type:

None

variantcentrifuge.helpers.extract_gencode_id(gene_name)[source]

Extract the GENCODE ID (ENSG) from a gene name string if present.

Parameters:

gene_name (str) – Gene name, possibly containing an ENSG ID

Returns:

The ENSG ID if found, otherwise an empty string

Return type:

str
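One plausible way to do this is a regular-expression search. The sketch below is an assumption about the matching rule, not the library's implementation: extract_ensg_sketch is a hypothetical name, and the pattern matches "ENSG" followed by digits, so any version suffix such as ".5" is not included in the result.

```python
import re

# Assumption: an ENSG ID is the literal "ENSG" followed by digits.
ENSG_PATTERN = re.compile(r"ENSG\d+")


def extract_ensg_sketch(gene_name):
    """Return the first ENSG ID found in gene_name, or an empty string."""
    match = ENSG_PATTERN.search(gene_name or "")
    return match.group(0) if match else ""
```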

variantcentrifuge.helpers.get_vcf_names(vcf_file)[source]

Get sample names from a VCF file header.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

List of sample names from the VCF file

Return type:

List[str]
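In the VCF format, sample names are the columns that follow FORMAT on the "#CHROM" header line. The sketch below reads them from an iterable of lines; sample_names_from_vcf_lines is a hypothetical name, and how the library itself opens the file (for example, whether it handles compressed input) is not covered here.

```python
import io


def sample_names_from_vcf_lines(lines):
    """Return the sample columns from the #CHROM header line, in file order."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Eight fixed columns plus FORMAT precede the sample columns.
            return columns[9:]
        if not line.startswith("#"):
            break  # data lines reached without finding a #CHROM line
    return []


# Demo on an in-memory VCF with two samples.
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n"
)
demo_names = sample_names_from_vcf_lines(demo_vcf)
```

Wrapping the result in set() gives the unordered form that a set-returning variant would produce.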

variantcentrifuge.helpers.get_vcf_regions(vcf_file)[source]

Get the chromosomal regions covered in a VCF file.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

List of regions in format “chr:start-end”

Return type:

List[str]

variantcentrifuge.helpers.get_vcf_samples(vcf_file)[source]

Get a set of sample names from a VCF file.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

Set of sample names

Return type:

Set[str]

variantcentrifuge.helpers.get_vcf_size(vcf_file)[source]

Get the number of variants in a VCF file.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

Number of variants in the VCF file

Return type:

int

Find columns in a TSV header that should contain IGV links.

Parameters:

header_fields (List[str]) – List of header field names

Returns:

Mapping of column name to column index for IGV link columns

Return type:

Dict[str, int]

variantcentrifuge.helpers.read_sequencing_manifest(file_path)[source]

Read a sequencing manifest file containing sample metadata.

Expected format: TSV with header row and sample IDs in first column.

Parameters:

file_path (str) – Path to the manifest file

Returns:

Mapping of sample IDs to metadata dictionaries

Return type:

Dict[str, Dict[str, str]]
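The expected format (TSV, header row, sample IDs in the first column) maps naturally onto csv.DictReader. This is a sketch under assumptions, not the library's code: read_manifest_sketch is a hypothetical name, it takes an open text handle rather than a path, and the metadata dictionary is assumed to hold every column except the first.

```python
import csv
import io


def read_manifest_sketch(handle):
    """Map each sample ID (first column) to a dict of the remaining columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_column = reader.fieldnames[0] if reader.fieldnames else None
    manifest = {}
    for row in reader:
        sample_id = row.pop(id_column)
        if sample_id:  # assumption: rows without a sample ID are skipped
            manifest[sample_id] = dict(row)
    return manifest


# Demo on an in-memory manifest with two metadata columns.
demo_text = "sample_id\tplatform\tlane\nS1\tNovaSeq\t1\nS2\tHiSeq\t2\n"
demo_manifest = read_manifest_sketch(io.StringIO(demo_text))
```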