Helpers Module

Data manipulation and processing helpers

Helper functions for analyze_variants and related processes.

Provides:

- Case/control assignment
- Phenotype map building
- Genotype parsing and allele count conversion
- Sample and phenotype classification logic

variantcentrifuge.helpers.check_file(file_path, exit_on_error=True)

Check if a file exists and is readable.

Parameters:

file_path (str) – Path to the file to check
exit_on_error (bool, optional) – Whether to exit the program if the file does not exist, by default True

Returns:

True if the file exists and is readable, False otherwise

Return type:

bool

Raises:

SystemExit – If exit_on_error is True and the file does not exist
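
The behavior described above can be sketched with the standard library. This is a minimal illustration, not the library's implementation: the name check_file_sketch and the error message are hypothetical, and treating "readable" as os.access(..., os.R_OK) is an assumption.

```python
import os
import sys


def check_file_sketch(file_path, exit_on_error=True):
    """Return True if file_path is an existing, readable regular file."""
    # Assumption: "readable" means the current process has read permission.
    ok = os.path.isfile(file_path) and os.access(file_path, os.R_OK)
    if not ok and exit_on_error:
        # sys.exit raises SystemExit, matching the documented "Raises" entry.
        sys.exit(f"File not found or not readable: {file_path}")
    return ok
```

Passing exit_on_error=False turns the check into a plain boolean test, which is convenient when the caller wants to handle a missing file itself.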

variantcentrifuge.helpers.determine_case_control_sets(all_samples, cfg, df)

Determine case/control sample sets based on configuration.

Logic:

- If explicit case/control samples are provided, use them directly.
- Else if phenotype terms are given, classify samples based on those using phenotypes from the phenotype file.
- Else, all samples become controls.

Parameters:

all_samples (set of str) – The full set of sample names.
cfg (dict) – Configuration dictionary that may include “case_samples”, “control_samples”, “case_phenotypes”, “control_phenotypes”, and “phenotypes” from the phenotype file.
df (pd.DataFrame) – DataFrame with variant information (no longer used for phenotype assignment).

Returns:

(set_of_case_samples, set_of_control_samples)

Return type:

Tuple[set, set]

Raises:

SystemExit – If no valid configuration for assigning case/control sets can be determined.
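
The priority order in the logic above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name split_cases_controls is hypothetical, cfg["phenotypes"] is assumed to map each sample to a collection of phenotype terms, and filling an unspecified side with the remaining samples is also an assumption. The SystemExit path is not modeled.

```python
def split_cases_controls(all_samples, cfg):
    """Sketch of the documented priority order; the details are assumptions."""
    all_samples = set(all_samples)
    case_samples = set(cfg.get("case_samples") or [])
    control_samples = set(cfg.get("control_samples") or [])

    # 1. Explicit sample lists take priority.
    if case_samples or control_samples:
        # Assumption: a side left unspecified receives the remaining samples.
        if not control_samples:
            control_samples = all_samples - case_samples
        if not case_samples:
            case_samples = all_samples - control_samples
        return case_samples, control_samples

    # 2. Otherwise classify by phenotype terms.
    case_terms = set(cfg.get("case_phenotypes") or [])
    control_terms = set(cfg.get("control_phenotypes") or [])
    if case_terms or control_terms:
        # Assumption: cfg["phenotypes"] maps sample -> collection of terms.
        phenotypes = cfg.get("phenotypes") or {}
        cases, controls = set(), set()
        for sample in all_samples:
            terms = set(phenotypes.get(sample, ()))
            if terms & case_terms:
                cases.add(sample)
            elif not control_terms or terms & control_terms:
                controls.add(sample)
        return cases, controls

    # 3. No usable configuration: every sample becomes a control.
    return set(), all_samples
```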

variantcentrifuge.helpers.build_sample_phenotype_map(df)

Build a map of sample -> set_of_phenotypes from the ‘phenotypes’ column if present.

If multiple phenotype groups appear and multiple samples in GT are present:

- If counts match, assign phenotypes group-wise.
- Otherwise, skip phenotype assignment for that row to prevent incorrect inflation.

Parameters:

df (DataFrame)

Return type:

Dict[str, Set[str]]

variantcentrifuge.helpers.assign_case_control_counts(df, case_samples, control_samples, all_samples)

Assign case/control counts, allele counts, and homozygous variant counts per variant.

Creates columns:

- proband_count/control_count: total number of case/control samples
- proband_variant_count/control_variant_count: number of case/control samples with a variant allele
- proband_allele_count/control_allele_count: sum of variant alleles in case/control samples
- proband_homozygous_count/control_homozygous_count: how many case/control samples have a homozygous variant (1/1)

Parameters:

df (pd.DataFrame) – DataFrame of variants with a “GT” column listing variants per sample.
case_samples (set of str) – Set of samples classified as cases.
control_samples (set of str) – Set of samples classified as controls.
all_samples (set of str) – All samples present in the VCF.

Returns:

DataFrame with assigned case/control counts and alleles, including homozygous counts.

Return type:

pd.DataFrame
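
To make the per-variant counts concrete, the sketch below computes the variant, allele, and homozygous counts for a single “GT” value and one sample group. It is an illustration only: the helper name count_group is hypothetical, and the assumption that entries look like ‘sample(0/1)’ and are separated by semicolons is not stated in this reference.

```python
def count_group(gt_value, group):
    """Return (variant_count, allele_count, homozygous_count) for one GT cell."""
    allele_counts = {"0/1": 1, "1/0": 1, "1/1": 2}
    variant = alleles = homozygous = 0
    # Assumption: entries look like 'sample(0/1)' and are separated by ';'.
    for entry in gt_value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, _, rest = entry.partition("(")
        # Drop the closing parenthesis and anything after the first colon.
        genotype = rest.rstrip(")").split(":", 1)[0]
        if name not in group:
            continue
        n = allele_counts.get(genotype, 0)
        if n:
            variant += 1
            alleles += n
            if n == 2:
                homozygous += 1
    return variant, alleles, homozygous
```

Calling the helper once with the case set and once with the control set yields the values behind the proband_* and control_* columns for that row; the total sample counts are simply the sizes of the two sets.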

variantcentrifuge.helpers.extract_sample_and_genotype(sample_field)

Extract sample name and genotype from a field like ‘sample(0/1:172:110,62)’ or ‘sample(0/1)’.

If parentheses are missing, no genotype is specified, so the sample is treated as having no variant (0/0).

The genotype portion may include extra fields after a colon (e.g. coverage), so we split on the first colon to isolate the actual genotype string (e.g., ‘0/1’).

Parameters:

sample_field (str) – A string like ‘sample(0/1)’, ‘sample(0/1:172:110,62)’, or ‘sample’.

Returns:

(sample_name, genotype) where genotype is typically ‘0/1’, ‘1/1’, ‘0/0’, etc.

Return type:

(str, str)
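
The parsing rules above can be sketched in a few lines. The function name parse_sample_field is hypothetical and the code is an illustration of the described behavior, not the library's implementation.

```python
def parse_sample_field(sample_field):
    """Split a field such as 'sample(0/1:172:110,62)' into (name, genotype)."""
    field = sample_field.strip()
    open_idx = field.find("(")
    if open_idx == -1 or not field.endswith(")"):
        # No parentheses: no genotype given, so report no variant.
        return field, "0/0"
    name = field[:open_idx]
    inner = field[open_idx + 1:-1]
    # Keep only the text before the first colon (drops extras such as coverage).
    genotype = inner.split(":", 1)[0]
    return name, genotype
```

Splitting with a maximum of one split keeps the genotype intact even when the extra fields themselves contain colons.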

variantcentrifuge.helpers.genotype_to_allele_count(genotype)

Convert genotype string to allele count.

‘1/1’ -> 2
‘0/1’ or ‘1/0’ -> 1
‘0/0’ or ‘’ -> 0

Parameters:

genotype (str) – Genotype string, expected to be one of ‘0/0’, ‘0/1’, ‘1/0’, ‘1/1’, or ‘’.

Returns:

The allele count for the given genotype.

Return type:

int
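
The mapping above amounts to a small lookup table. In this sketch the name allele_count is hypothetical, and returning 0 for any genotype outside the documented set is an assumption; the reference does not say how unexpected strings are handled.

```python
# Lookup table taken directly from the documented mapping.
_ALLELE_COUNTS = {"1/1": 2, "0/1": 1, "1/0": 1, "0/0": 0, "": 0}


def allele_count(genotype):
    """Return the number of variant alleles encoded by a genotype string."""
    # Assumption: genotypes outside the documented set count as 0.
    return _ALLELE_COUNTS.get(genotype, 0)
```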

variantcentrifuge.helpers.load_gene_list(file_path)

Load genes from a file (one gene per line) into a set of uppercase gene names.

Parameters:

file_path (str) – Path to a file containing gene names, one per line

Returns:

A set of gene names in uppercase for case-insensitive matching

Return type:

Set[str]
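
A loader with this behavior might look like the sketch below. The name read_gene_set is hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions.

```python
def read_gene_set(file_path):
    """Read one gene name per line and return them as an uppercase set."""
    genes = set()
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            gene = line.strip()
            # Assumption: blank lines are ignored.
            if gene:
                genes.add(gene.upper())
    return genes
```

Uppercasing at load time means a later membership test only needs to uppercase the query gene, which is what makes the matching case-insensitive.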

variantcentrifuge.helpers.annotate_variants_with_gene_lists(lines, gene_list_files)

Add new columns to variant data lines indicating membership in provided gene lists.

For each gene list file, a new column is added to the TSV file. The column will contain ‘yes’ if any of the genes in the GENE column (which may be comma-separated) is present in the gene list, and ‘no’ otherwise.

Parameters:

lines (List[str]) – Lines of the TSV file to annotate
gene_list_files (List[str]) – List of file paths containing gene names (one per line)

Returns:

The input lines with additional columns for each gene list

Return type:

List[str]

Notes

If the GENE column is missing, an error is logged and the input lines are returned unchanged.
Multiple genes in the GENE column can be separated by commas, semicolons, or spaces.
If a gene list file fails to load, it is skipped with an error message.
Column names are sanitized from the file basename; duplicates are suffixed with numbers.
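
The annotation step can be sketched on in-memory data. This simplified illustration takes already-loaded gene sets instead of file paths, so file loading, logging, and column-name sanitizing are left out; the name annotate_lines and the gene_sets argument are hypothetical.

```python
import re


def annotate_lines(lines, gene_sets):
    """Append one yes/no column per gene set to tab-separated lines.

    gene_sets maps a new column name to a set of uppercase gene names.
    """
    if not lines:
        return lines
    header = lines[0].rstrip("\n").split("\t")
    if "GENE" not in header:
        # Mirrors the documented fallback: return the input unchanged.
        return lines
    gene_idx = header.index("GENE")
    out = ["\t".join(header + list(gene_sets))]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        cell = fields[gene_idx] if gene_idx < len(fields) else ""
        # Split on commas, semicolons, or whitespace, as the notes describe.
        genes = {g.upper() for g in re.split(r"[,;\s]+", cell) if g}
        flags = ["yes" if genes & members else "no" for members in gene_sets.values()]
        out.append("\t".join(fields + flags))
    return out
```

Note that this sketch returns lines without trailing newlines; whether the library preserves them is not stated here.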

variantcentrifuge.helpers.dump_df_to_xlsx(df, output_path, sheet_name='Sheet1', index=False)

Export a pandas DataFrame to an Excel (xlsx) file.

Parameters:

df (pd.DataFrame) – The DataFrame to export
output_path (str or Path) – Path where the Excel file will be saved
sheet_name (str, optional) – Name of the worksheet in the Excel file, by default “Sheet1”
index (bool, optional) – Whether to include the DataFrame’s index in the Excel file, by default False

Returns:

The function writes the DataFrame to an Excel file and logs the action

Return type:

None

variantcentrifuge.helpers.extract_gencode_id(gene_name)

Extract the GENCODE ID (ENSG) from a gene name string if present.

Parameters:

gene_name (str) – Gene name, possibly containing an ENSG ID

Returns:

The ENSG ID if found, otherwise an empty string

Return type:

str
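
One way to implement such an extraction is a regular expression search. In this sketch the name find_ensg is hypothetical, and the pattern (the literal ENSG followed by digits, without any version suffix) is an assumption about what counts as an ID.

```python
import re

# Assumption: an ID is 'ENSG' followed by one or more digits.
_ENSG_PATTERN = re.compile(r"ENSG\d+")


def find_ensg(gene_name):
    """Return the first ENSG identifier in gene_name, or '' if there is none."""
    match = _ENSG_PATTERN.search(gene_name)
    return match.group(0) if match else ""
```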

variantcentrifuge.helpers.get_vcf_names(vcf_file)

Get sample names from a VCF file header.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

List of sample names from the VCF file

Return type:

List[str]
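
In a VCF file the sample names are the columns that follow FORMAT on the #CHROM header line. The sketch below reads them with the standard library only; the name vcf_sample_names is hypothetical, and how the library itself reads the header (for example, whether it calls an external tool) is not stated in this reference.

```python
import gzip


def vcf_sample_names(vcf_file):
    """Return the sample names from the #CHROM header line of a VCF file."""
    # Assumption: a '.gz' suffix means the file is gzip/bgzip compressed.
    opener = gzip.open if str(vcf_file).endswith(".gz") else open
    with opener(vcf_file, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns (CHROM through FORMAT) are fixed;
                # everything after them is a sample name.
                return line.rstrip("\r\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the records without seeing a header line.
                break
    return []
```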

variantcentrifuge.helpers.get_vcf_regions(vcf_file)

Get the chromosomal regions covered in a VCF file.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

List of regions in format “chr:start-end”

Return type:

List[str]

variantcentrifuge.helpers.get_vcf_samples(vcf_file)

Get a set of sample names from a VCF file.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

Set of sample names

Return type:

Set[str]

variantcentrifuge.helpers.get_vcf_size(vcf_file)

Get the number of variants in a VCF file.

Parameters:

vcf_file (str) – Path to the VCF file

Returns:

Number of variants in the VCF file

Return type:

int

variantcentrifuge.helpers.match_IGV_link_columns(header_fields)

Find columns in a TSV header that should contain IGV links.

Parameters:

header_fields (List[str]) – List of header field names

Returns:

Mapping of column name to column index for IGV link columns

Return type:

Dict[str, int]

variantcentrifuge.helpers.read_sequencing_manifest(file_path)

Read a sequencing manifest file containing sample metadata.

Expected format: TSV with header row and sample IDs in first column.

Parameters:

file_path (str) – Path to the manifest file

Returns:

Mapping of sample IDs to metadata dictionaries

Return type:

Dict[str, Dict[str, str]]
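
A manifest in the expected format can be read with the csv module, as in the sketch below. The name read_manifest is hypothetical, and leaving the ID column out of each metadata dictionary is an assumption.

```python
import csv


def read_manifest(file_path):
    """Map each sample ID (first column) to a dict of its other columns."""
    manifest = {}
    with open(file_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # The first header field names the sample ID column.
        id_column = reader.fieldnames[0] if reader.fieldnames else None
        for row in reader:
            sample_id = row[id_column]
            # Assumption: the ID column is excluded from the metadata.
            manifest[sample_id] = {k: v for k, v in row.items() if k != id_column}
    return manifest
```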