Helpers Module
Data manipulation and processing helpers
Helper functions for analyze_variants and related processes.
Provides:
- Case/control assignment
- Phenotype map building
- Genotype parsing and allele count conversion
- Sample and phenotype classification logic
- variantcentrifuge.helpers.check_file(file_path, exit_on_error=True)
Check if a file exists and is readable.
- Parameters:
- Returns:
True if the file exists and is readable, False otherwise
- Return type:
bool
- Raises:
SystemExit – If exit_on_error is True and the file does not exist
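The documented behaviour can be approximated with the standard library. The following is a minimal sketch, not the module's actual implementation: the name check_file_sketch is hypothetical, and treating only regular files as valid is an assumption.

```python
import os
import sys


def check_file_sketch(file_path, exit_on_error=True):
    """Return True if file_path is an existing, readable regular file.

    Hypothetical stand-in for check_file; the real function may differ.
    """
    ok = os.path.isfile(file_path) and os.access(file_path, os.R_OK)
    if not ok and exit_on_error:
        # sys.exit raises SystemExit, matching the documented "Raises" entry.
        sys.exit(f"File not found or not readable: {file_path}")
    return ok
```

Passing exit_on_error=False turns the check into a plain boolean test, which is convenient when the caller wants to handle a missing file itself.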
- variantcentrifuge.helpers.determine_case_control_sets(all_samples, cfg, df)
Determine case/control sample sets based on configuration.
Logic:
- If explicit case/control samples are provided, use them directly.
- Else if phenotype terms are given, classify samples by matching those terms against the phenotypes loaded from the phenotype file.
- Else, all samples become controls.
- Parameters:
cfg (dict) – Configuration dictionary that may include “case_samples”, “control_samples”, “case_phenotypes”, “control_phenotypes”, and “phenotypes” from the phenotype file.
df (pd.DataFrame) – DataFrame with variant information (no longer used for phenotype assignment).
- Returns:
(set_of_case_samples, set_of_control_samples)
- Return type:
- Raises:
SystemExit – If no valid configuration for assigning case/control sets can be determined.
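The three-step fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: split_case_control is a hypothetical name, cfg["phenotypes"] is assumed to map each sample to a collection of phenotype terms, and samples that match no case term are assumed to become controls when no control terms are given.

```python
def split_case_control(all_samples, cfg):
    """Return (case_samples, control_samples) following the documented fallback order."""
    all_samples = set(all_samples)

    # 1. Explicit sample lists win and are used directly.
    case = set(cfg.get("case_samples") or [])
    control = set(cfg.get("control_samples") or [])
    if case or control:
        return case, control

    # 2. Otherwise classify by phenotype terms (assumed sample -> terms mapping).
    case_terms = set(cfg.get("case_phenotypes") or [])
    control_terms = set(cfg.get("control_phenotypes") or [])
    if case_terms or control_terms:
        phenos = cfg.get("phenotypes") or {}
        case = {s for s in all_samples if set(phenos.get(s, ())) & case_terms}
        if control_terms:
            control = {
                s for s in all_samples - case
                if set(phenos.get(s, ())) & control_terms
            }
        else:
            # Assumption: without control terms, every non-case sample is a control.
            control = all_samples - case
        return case, control

    # 3. No configuration at all: every sample is a control.
    return set(), all_samples


cfg = {"case_phenotypes": ["HP:1"], "phenotypes": {"A": {"HP:1"}, "B": {"HP:2"}}}
cases, controls = split_case_control(["A", "B", "C"], cfg)
```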
- variantcentrifuge.helpers.build_sample_phenotype_map(df)
Build a map of sample -> set_of_phenotypes from the ‘phenotypes’ column if present.
If multiple phenotype groups appear and multiple samples are present in GT:
- If the counts match, assign phenotypes group-wise.
- Otherwise, skip phenotype assignment for that row to prevent incorrect inflation.
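The group-wise rule is easier to see on a single row. The sketch below is hypothetical throughout: the function name, the ";" separator between samples and between phenotype groups, and the "," separator between terms are assumptions made for illustration, as is the handling of rows with a single group or a single sample.

```python
def row_phenotypes(gt_field, phenotypes_field):
    """Map sample -> set of phenotype terms for one row (assumed separators)."""
    samples = [s.split("(")[0].strip() for s in gt_field.split(";") if s.strip()]
    groups = [g.strip() for g in phenotypes_field.split(";") if g.strip()]

    if len(groups) > 1 and len(samples) > 1:
        if len(groups) != len(samples):
            # Counts differ: skip the row rather than guess the pairing.
            return {}
        # Counts match: pair the i-th sample with the i-th phenotype group.
        return {
            sample: {t.strip() for t in group.split(",") if t.strip()}
            for sample, group in zip(samples, groups)
        }

    # Assumption: with one group or one sample, each sample receives all terms.
    terms = {t.strip() for g in groups for t in g.split(",") if t.strip()}
    return {sample: set(terms) for sample in samples}


matched = row_phenotypes("s1(0/1);s2(1/1)", "HP:1,HP:2;HP:3")
skipped = row_phenotypes("s1(0/1);s2(1/1);s3(0/1)", "HP:1;HP:3")
```

Skipping mismatched rows loses some information, but it avoids attaching one sample's phenotypes to another sample.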
- variantcentrifuge.helpers.assign_case_control_counts(df, case_samples, control_samples, all_samples)
Assign case/control counts, allele counts, and homozygous variant counts per variant.
Creates columns:
- proband_count / control_count: total number of case/control samples
- proband_variant_count / control_variant_count: number of case/control samples with a variant allele
- proband_allele_count / control_allele_count: sum of variant alleles in case/control samples
- proband_homozygous_count / control_homozygous_count: number of case/control samples with a homozygous variant (1/1)
- Parameters:
- Returns:
DataFrame with assigned case/control counts and alleles, including homozygous counts.
- Return type:
pd.DataFrame
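For a single variant, the four counts per group reduce to simple sums. The sketch below works on a plain dictionary instead of a DataFrame; count_group and its allele_counts argument (sample -> number of variant alleles, 0 when absent) are hypothetical names used only for illustration.

```python
def count_group(allele_counts, group, prefix):
    """Compute the four per-group columns for one variant.

    allele_counts: sample -> variant allele count (0, 1 or 2); samples that
    do not carry the variant may simply be missing from the mapping.
    """
    per_sample = [allele_counts.get(sample, 0) for sample in group]
    return {
        f"{prefix}_count": len(group),
        f"{prefix}_variant_count": sum(1 for n in per_sample if n > 0),
        f"{prefix}_allele_count": sum(per_sample),
        f"{prefix}_homozygous_count": sum(1 for n in per_sample if n == 2),
    }


allele_counts = {"A": 1, "B": 2}  # A is heterozygous, B is homozygous
row = {
    **count_group(allele_counts, {"A", "B", "C"}, "proband"),
    **count_group(allele_counts, {"D"}, "control"),
}
```

Here the proband group has three samples, two carriers, three variant alleles and one homozygous carrier, while the control group has one sample and no carriers.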
- variantcentrifuge.helpers.extract_sample_and_genotype(sample_field)
Extract sample name and genotype from a field like ‘sample(0/1:172:110,62)’ or ‘sample(0/1)’.
If parentheses are missing, assume no genotype is specified -> no variant (0/0).
The genotype portion may include extra fields after a colon (e.g. coverage), so we split on the first colon to isolate the actual genotype string (e.g., ‘0/1’).
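The parsing rules above can be sketched with plain string operations. parse_sample_field is a hypothetical name, and returning the literal string "0/0" when parentheses are missing is an assumption; the real function may represent "no genotype" differently.

```python
def parse_sample_field(sample_field):
    """Split 'sample(0/1:172:110,62)' into ('sample', '0/1')."""
    field = sample_field.strip()
    if "(" in field and field.endswith(")"):
        name, _, rest = field.partition("(")
        # Drop the closing parenthesis, then keep only the part before the
        # first colon so extra fields such as coverage are ignored.
        genotype = rest[:-1].split(":", 1)[0]
        return name.strip(), genotype
    # No parentheses: treat as no variant (assumed to be reported as 0/0).
    return field, "0/0"
```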
- variantcentrifuge.helpers.genotype_to_allele_count(genotype)
Convert genotype string to allele count.
‘1/1’ -> 2
‘0/1’ or ‘1/0’ -> 1
‘0/0’ or ‘’ -> 0
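The mapping listed above amounts to a small lookup. The sketch below uses a hypothetical name and assumes that any genotype not listed (for example a missing call) counts as 0; the documentation only specifies the cases shown.

```python
def allele_count_sketch(genotype):
    """Convert an unphased diploid genotype string to a variant allele count."""
    if genotype == "1/1":
        return 2
    if genotype in ("0/1", "1/0"):
        return 1
    # '0/0', the empty string, and (by assumption) anything else.
    return 0
```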
- variantcentrifuge.helpers.load_gene_list(file_path)
Load genes from a file (one gene per line) into a set of uppercase gene names.
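A minimal sketch of such a loader, assuming UTF-8 text and that blank lines are ignored (neither is stated in the documentation); load_gene_list_sketch is a hypothetical name. The demonstration writes a small temporary file so the example is self-contained.

```python
import os
import tempfile


def load_gene_list_sketch(file_path):
    """Read one gene per line and return a set of uppercase gene names."""
    genes = set()
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            gene = line.strip()
            if gene:  # assumption: blank lines are skipped
                genes.add(gene.upper())
    return genes


# Demonstration with a throwaway file containing mixed case and a blank line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("brca1\nTP53\n\n tp53 \n")
    path = tmp.name
genes = load_gene_list_sketch(path)
os.remove(path)
```

Uppercasing on load means later membership checks only need to uppercase the value being looked up.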
- variantcentrifuge.helpers.annotate_variants_with_gene_lists(lines, gene_list_files)
Add new columns to variant data lines indicating membership in provided gene lists.
For each gene list file, a new column is added to the TSV data. The column contains ‘yes’ if any gene in the GENE column (which may list multiple genes) is present in the gene list, and ‘no’ otherwise.
- Parameters:
- Returns:
The input lines with additional columns for each gene list
- Return type:
List[str]
Notes
- If the GENE column is missing, an error is logged and the input lines are returned unchanged.
- Multiple genes in the GENE column can be separated by commas, semicolons, or spaces.
- If a gene list file fails to load, it is skipped with an error message.
- Column names are sanitized from the file basename; duplicates are suffixed with numbers.
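The annotation step can be illustrated on in-memory data. This sketch is not the library's implementation: annotate_lines is a hypothetical name, it takes already-loaded gene sets keyed by column name instead of file paths (so file loading and column-name sanitizing are left out), and the case-insensitive comparison is an assumption.

```python
import re


def annotate_lines(lines, gene_sets):
    """Append one yes/no column per gene set to tab-separated lines.

    gene_sets: dict mapping new column name -> set of uppercase gene names.
    """
    if not lines:
        return lines
    header = lines[0].rstrip("\n").split("\t")
    if "GENE" not in header:
        # Documented behaviour: return the input unchanged (error logging omitted).
        return lines
    gene_idx = header.index("GENE")

    out = ["\t".join(header + list(gene_sets))]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        raw = fields[gene_idx] if gene_idx < len(fields) else ""
        # Genes may be separated by commas, semicolons, or whitespace.
        genes = {g.upper() for g in re.split(r"[,;\s]+", raw) if g}
        flags = ["yes" if genes & members else "no" for members in gene_sets.values()]
        out.append("\t".join(fields + flags))
    return out


annotated = annotate_lines(
    ["CHROM\tGENE", "chr1\tBRCA1,TP53", "chr2\tabc"],
    {"panel": {"TP53"}},
)
```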
- variantcentrifuge.helpers.dump_df_to_xlsx(df, output_path, sheet_name='Sheet1', index=False)
Export a pandas DataFrame to an Excel (xlsx) file.
- Parameters:
df (pd.DataFrame) – The DataFrame to export
output_path (str or Path) – Path where the Excel file will be saved
sheet_name (str, optional) – Name of the worksheet in the Excel file, by default “Sheet1”
index (bool, optional) – Whether to include the DataFrame’s index in the Excel file, by default False
- Returns:
The function writes the DataFrame to an Excel file and logs the action
- Return type:
None
- variantcentrifuge.helpers.extract_gencode_id(gene_name)
Extract the GENCODE ID (ENSG) from a gene name string if present.
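One way to pull an Ensembl gene identifier out of a string is a regular expression. The sketch below is an assumption-laden illustration: find_ensg is a hypothetical name, keeping an optional version suffix such as ".23" is a choice made here, and returning None when no ID is found may differ from what extract_gencode_id actually returns.

```python
import re

# "ENSG" followed by digits, with an optional ".version" suffix.
_ENSG_PATTERN = re.compile(r"ENSG\d+(?:\.\d+)?")


def find_ensg(gene_name):
    """Return the first ENSG identifier in gene_name, or None if absent."""
    match = _ENSG_PATTERN.search(gene_name)
    return match.group(0) if match else None
```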
- variantcentrifuge.helpers.get_vcf_regions(vcf_file)
Get the chromosomal regions covered in a VCF file.
- variantcentrifuge.helpers.get_vcf_samples(vcf_file)
Get a set of sample names from a VCF file.
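In the VCF format, sample names are the columns that follow FORMAT on the "#CHROM" header line. The sketch below reads them from an iterable of text lines; samples_from_vcf_lines is a hypothetical name, and the real function may rely on an external tool or library instead.

```python
def samples_from_vcf_lines(lines):
    """Return the set of sample names found on the #CHROM header line."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n").split("\t")
            # The first nine columns are fixed (CHROM .. FORMAT); the rest are samples.
            return set(columns[9:])
        if not line.startswith("#"):
            # Reached the data records without seeing a column header.
            break
    return set()


vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n",
]
samples = samples_from_vcf_lines(vcf_lines)
```

For a file on disk, the same function can be fed an open text handle; compressed files would need to be opened in text mode with the gzip module first.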
- variantcentrifuge.helpers.match_IGV_link_columns(header_fields)
Find columns in a TSV header that should contain IGV links.