Utils Module

The utils module contains common utility functions and external tool integration.

Utility functions module.

Provides helper functions for logging, running commands, checking tool availability, and retrieving tool versions.

variantcentrifuge.utils.run_command(cmd, output_file=None)

Run a shell command and write stdout to output_file if provided, else return stdout.

Parameters:
  • cmd (list of str) – Command and its arguments.

  • output_file (str, optional) – Path to a file where stdout should be written. If None, returns stdout as a string.

Returns:

If output_file is None, returns the command stdout as a string. If output_file is provided, returns output_file after completion.

Return type:

str

Raises:

subprocess.CalledProcessError – If the command returns a non-zero exit code.
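
The documented behavior can be approximated with the standard library. The following is a minimal sketch, not the module's actual implementation: run_command_sketch is a hypothetical stand-in, and it assumes the command is run without a shell and that stdout is decoded as text.

```python
import subprocess
import sys


def run_command_sketch(cmd, output_file=None):
    """Illustrative stand-in; not the variantcentrifuge implementation."""
    if output_file is not None:
        # Stream stdout straight into the file; check=True raises on failure.
        with open(output_file, "w", encoding="utf-8") as handle:
            subprocess.run(cmd, stdout=handle, check=True, text=True)
        return output_file
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True, text=True)
    return result.stdout


# Use the current Python interpreter so the example does not depend on
# any external tool being installed.
captured = run_command_sketch([sys.executable, "-c", "print('hello')"])

try:
    run_command_sketch([sys.executable, "-c", "import sys; sys.exit(3)"])
    failed_code = None
except subprocess.CalledProcessError as exc:
    # A non-zero exit code surfaces as CalledProcessError.
    failed_code = exc.returncode
```

Callers that pass output_file get the path back, which makes it convenient to chain steps that read the file produced by the previous one.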

variantcentrifuge.utils.normalize_vcf_headers(lines)

Normalize header lines from tools like SnpEff and SnpSift.

The normalization consists of two steps:

  1. Removing known prefixes (e.g., “ANN[*].”, “ANN[0].”)

  2. Converting indexed genotype fields from the format GEN[index].FIELD to FIELD_index (e.g., “GEN[0].AF” -> “AF_0”, “GEN[1].DP” -> “DP_1”)

Parameters:

lines (List[str]) – A list of lines (e.g., lines from a file) whose first line may contain SnpEff/SnpSift-generated prefixes in column headers.

Returns:

The updated list of lines where the first line has had matching prefixes removed or replaced and indexed fields normalized.

Return type:

List[str]
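
The two normalization steps can be sketched with a regular expression and simple string replacement. This is an illustration under assumptions, not the module's code: normalize_header_sketch is a hypothetical name, the prefix list contains only the two examples given above, and the header is assumed to be tab-separated.

```python
import re

# Hypothetical prefix list for illustration; the module's actual list may differ.
PREFIXES = ("ANN[*].", "ANN[0].")
GEN_FIELD = re.compile(r"GEN\[(\d+)\]\.([A-Za-z0-9_]+)")


def normalize_header_sketch(lines):
    """Normalize only the first line; data lines are returned unchanged."""
    if not lines:
        return lines
    header = lines[0]
    # Rewrite indexed genotype fields: GEN[0].AF -> AF_0.
    header = GEN_FIELD.sub(r"\g<2>_\g<1>", header)
    # Drop the known annotation prefixes.
    for prefix in PREFIXES:
        header = header.replace(prefix, "")
    return [header] + list(lines[1:])


rows = ["CHROM\tANN[0].GENE\tGEN[0].AF\tGEN[1].DP", "chr1\tBRCA1\t0.5\t30"]
normalized = normalize_header_sketch(rows)
```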

variantcentrifuge.utils.normalize_snpeff_headers(lines)

Alias for normalize_vcf_headers for backward compatibility.

This function is deprecated; use normalize_vcf_headers instead.

Parameters:

lines (List[str]) – A list of lines (e.g., lines from a file) whose first line may contain SnpEff-generated prefixes in column headers.

Returns:

The updated list of lines with normalized headers.

Return type:

List[str]

variantcentrifuge.utils.check_external_tools()

Check if required external tools are installed and in the PATH.

Tools checked:

  • bcftools

  • snpEff

  • SnpSift

  • bedtools

If any are missing, log an error and exit.

Raises:

SystemExit – If any required tool is missing.

Return type:

None
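
A check of this kind is commonly built on shutil.which. The sketch below shows the general pattern only; find_missing_tools and check_tools_sketch are hypothetical names, and the exit code of 1 and the log message wording are assumptions.

```python
import logging
import shutil
import sys

logger = logging.getLogger("tool_check_sketch")


def find_missing_tools(tools):
    """Return the tools that shutil.which cannot locate in PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def check_tools_sketch(tools=("bcftools", "snpEff", "SnpSift", "bedtools")):
    """Log an error and exit if any tool is missing (assumed exit code: 1)."""
    missing = find_missing_tools(tools)
    if missing:
        logger.error("Missing required tools: %s", ", ".join(missing))
        sys.exit(1)
```

Separating the lookup from the exit makes the lookup easy to test without terminating the interpreter.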

variantcentrifuge.utils.get_tool_version(tool_name)

Retrieve the version of a given tool.

Supported tools:

  • snpEff

  • bcftools

  • SnpSift

  • bedtools

Parameters:

tool_name (str) – Name of the tool to retrieve version for.

Returns:

Version string, or ‘N/A’ if the tool is not found or its version cannot be retrieved.

Return type:

str
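
The ‘N/A’ fallback can be illustrated with a generic helper. This sketch differs from the documented function in that it takes a full version command rather than a tool name, because the per-tool commands and output parsing used by the module are not shown here; get_version_sketch is a hypothetical name.

```python
import subprocess
import sys


def get_version_sketch(version_cmd):
    """Return the first line of the command's output, or 'N/A' on any failure."""
    try:
        result = subprocess.run(
            version_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some tools print their version to stderr
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Tool not installed, not executable, or exited with an error.
        return "N/A"
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else "N/A"


python_version = get_version_sketch([sys.executable, "--version"])
missing_version = get_version_sketch(["no-such-tool-xyz-123"])
```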

variantcentrifuge.utils.sanitize_metadata_field(value)

Sanitize a metadata field for TSV output by replacing tabs and newlines with spaces.

Parameters:

value (str) – The value to sanitize.

Returns:

Sanitized string value with no tabs or newlines.

Return type:

str
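
As a rough illustration of the described behavior, assuming each tab or newline is replaced by a single space and nothing else is changed (sanitize_sketch is a hypothetical name):

```python
def sanitize_sketch(value):
    """Replace tabs and newlines with spaces so the value fits in one TSV cell."""
    return value.replace("\t", " ").replace("\n", " ")


cleaned = sanitize_sketch("sample\tA\nnote")
```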

variantcentrifuge.utils.ensure_fields_in_extract(base_fields_str, extra_fields)

Ensure each item in extra_fields is present in the space-delimited base_fields_str.

Notes

We no longer normalize extra_fields here, so that raw columns like “GEN[*].DP” remain unmodified.

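Parameters:
  • base_fields_str (str) – Space-delimited string of field names.

  • extra_fields – Items that must be present in base_fields_str.

Return type: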

str
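
The intended effect can be sketched as follows, assuming that missing items are appended to the end in their original order and that extra_fields is an iterable of strings; ensure_fields_sketch is a hypothetical name, not the module's function.

```python
def ensure_fields_sketch(base_fields_str, extra_fields):
    """Append any extra field that is not already in the space-delimited string."""
    fields = base_fields_str.split()
    for field in extra_fields:
        # Fields are compared verbatim, so raw names like GEN[*].DP stay unmodified.
        if field not in fields:
            fields.append(field)
    return " ".join(fields)


merged = ensure_fields_sketch("CHROM POS REF", ["POS", "GEN[*].DP"])
```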

variantcentrifuge.utils.generate_igv_safe_filename_base(sample_id, chrom, pos, ref, alt, max_allele_len=10, hash_len=6, max_variant_part_len=50)

Generate a safe, shortened filename base for IGV reports to prevent “File name too long” errors.

Long REF/ALT alleles are truncated, and a hash of the original allele is appended to maintain uniqueness. The returned base filename (without extension) is filesystem-safe.

Parameters:
  • sample_id (str) – Sample identifier

  • chrom (str) – Chromosome name/identifier

  • pos (str or int) – Genomic position

  • ref (str) – Reference allele

  • alt (str) – Alternate allele

  • max_allele_len (int, default=10) – Maximum length for each allele in the filename

  • hash_len (int, default=6) – Length of hash to append when truncating an allele

  • max_variant_part_len (int, default=50) – Maximum length for the variant part of the filename (chr_pos_ref_alt)

Returns:

A safe, shortened filename base

Return type:

str
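
The truncate-and-hash idea can be sketched as below. The details are assumptions made for illustration: the hash function (SHA-256), the underscore separators, the character whitelist, and the names shorten_allele and igv_filename_sketch are not taken from the module.

```python
import hashlib
import re


def shorten_allele(allele, max_allele_len=10, hash_len=6):
    """Keep short alleles as-is; otherwise keep a prefix plus a hash suffix."""
    if len(allele) <= max_allele_len:
        return allele
    digest = hashlib.sha256(allele.encode("utf-8")).hexdigest()[:hash_len]
    keep = max(0, max_allele_len - hash_len - 1)
    return f"{allele[:keep]}_{digest}"


def igv_filename_sketch(sample_id, chrom, pos, ref, alt,
                        max_allele_len=10, hash_len=6, max_variant_part_len=50):
    ref_part = shorten_allele(ref, max_allele_len, hash_len)
    alt_part = shorten_allele(alt, max_allele_len, hash_len)
    variant = f"{chrom}_{pos}_{ref_part}_{alt_part}"
    if len(variant) > max_variant_part_len:
        # Same idea applied to the whole variant part.
        digest = hashlib.sha256(variant.encode("utf-8")).hexdigest()[:hash_len]
        keep = max(0, max_variant_part_len - hash_len - 1)
        variant = f"{variant[:keep]}_{digest}"
    # Replace characters that are awkward in filenames.
    return re.sub(r"[^A-Za-z0-9._-]", "_", f"{sample_id}_{variant}")


short_name = igv_filename_sketch("S1", "chr1", 12345, "A", "T")
long_name_a = igv_filename_sketch("S1", "chr1", 12345, "A" * 500, "T")
long_name_b = igv_filename_sketch("S1", "chr1", 12345, "A" * 499 + "C", "T")
```

Because the hash is computed from the full original allele, two long alleles that share the same prefix still map to different filenames.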

variantcentrifuge.utils.split_bed_file(input_bed, num_chunks, output_dir)

Split a BED file into a specified number of chunks with roughly equal total base pairs.

Parameters:
  • input_bed (str) – Path to the sorted input BED file.

  • num_chunks (int) – The number of smaller BED files to create.

  • output_dir (str) – The directory where the chunked BED files will be saved.

Returns:

A list of file paths to the created chunked BED files.

Return type:

List[str]
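
One way to balance chunks by base pairs is a single greedy pass over the sorted intervals. The sketch below is an assumption about the approach, not the module's algorithm; split_bed_sketch and the chunk_N.bed naming scheme are hypothetical.

```python
import os
import tempfile


def split_bed_sketch(input_bed, num_chunks, output_dir):
    """Greedily split a sorted BED file into chunks of similar total length."""
    intervals = []
    with open(input_bed, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split("\t")
            # Skip header/comment lines and malformed records.
            if len(parts) < 3 or line.startswith(("#", "track", "browser")):
                continue
            intervals.append((line.rstrip("\n"), int(parts[2]) - int(parts[1])))

    total = sum(size for _, size in intervals)
    target = total / num_chunks if num_chunks else 0
    chunks, current, current_size = [], [], 0
    for line, size in intervals:
        current.append(line)
        current_size += size
        # Close the chunk once it reaches the target, except for the last one.
        if current_size >= target and len(chunks) < num_chunks - 1:
            chunks.append(current)
            current, current_size = [], 0
    if current:
        chunks.append(current)

    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunks):
        path = os.path.join(output_dir, f"chunk_{index}.bed")
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


# Four 100 bp intervals split into two chunks of 200 bp each.
with tempfile.TemporaryDirectory() as tmp:
    bed_path = os.path.join(tmp, "regions.bed")
    with open(bed_path, "w", encoding="utf-8") as fh:
        fh.write("chr1\t0\t100\nchr1\t100\t200\nchr1\t200\t300\nchr1\t300\t400\n")
    chunk_paths = split_bed_sketch(bed_path, 2, os.path.join(tmp, "chunks"))
    chunk_line_counts = []
    for path in chunk_paths:
        with open(path, encoding="utf-8") as fh:
            chunk_line_counts.append(len(fh.read().splitlines()))
```

A greedy pass keeps neighboring intervals together, which preserves the sort order inside each chunk; it does not guarantee perfectly equal chunk sizes when a few intervals are much longer than the rest.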