Utils Module

The utils module contains common utility functions and external tool integration.

It provides helper functions for logging, running commands, checking tool availability, and retrieving tool versions.

variantcentrifuge.utils.check_external_tools(tools)[source]

Check if external tools are available in PATH.

Parameters:

tools (List[str]) – List of tool names to check for availability

Returns:

True if all tools are available, False otherwise

Return type:

bool
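
To illustrate the kind of check described above, here is a minimal sketch built on the standard library's shutil.which. The name check_tools_sketch is hypothetical and the module's actual implementation may differ; only the documented behavior (True when every tool is found on PATH) is taken from the text.

```python
import shutil
from typing import List


def check_tools_sketch(tools: List[str]) -> bool:
    """Return True only if every tool name resolves to an executable on PATH."""
    # shutil.which returns None when a name cannot be found on PATH.
    missing = [tool for tool in tools if shutil.which(tool) is None]
    return not missing
```

A caller would typically run such a check once at startup and abort early if it returns False.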

variantcentrifuge.utils.smart_open(filename, mode='r', encoding='utf-8')[source]

Open a file with automatic gzip support based on file extension.

Parameters:
  • filename (str) – Path to the file

  • mode (str) – File opening mode (‘r’, ‘w’, ‘rt’, ‘wt’, etc.)

  • encoding (str) – Text encoding (for text modes)

Returns:

Opened file handle

Return type:

file object
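
The extension-based dispatch can be sketched as follows. This is an illustration, not the module's code: smart_open_sketch is a hypothetical name, and treating a ".gz" suffix as the only trigger for gzip handling is an assumption.

```python
import gzip


def smart_open_sketch(filename: str, mode: str = "r", encoding: str = "utf-8"):
    """Open a plain or gzip-compressed file, chosen by a '.gz' suffix (assumed rule)."""
    binary = "b" in mode
    # Only text modes accept an encoding argument.
    kwargs = {} if binary else {"encoding": encoding}
    if filename.endswith(".gz"):
        # gzip.open defaults to binary, so make text mode explicit ('r' -> 'rt').
        if not binary and "t" not in mode:
            mode += "t"
        return gzip.open(filename, mode, **kwargs)
    return open(filename, mode, **kwargs)
```

With a helper like this, the same reading and writing code can handle both "variants.tsv" and "variants.tsv.gz".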

variantcentrifuge.utils.run_command(cmd, output_file=None)[source]

Run a shell command and write stdout to output_file if provided, else return stdout.

Parameters:
  • cmd (list of str) – Command and its arguments.

  • output_file (str, optional) – Path to a file where stdout should be written. If None, returns stdout as a string.

Returns:

If output_file is None, returns the command stdout as a string. If output_file is provided, returns output_file after completion.

Return type:

str

Raises:

subprocess.CalledProcessError – If the command returns a non-zero exit code.
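
The documented contract (return stdout, or write it to a file and return the path, raising on a non-zero exit code) can be sketched with subprocess.run. The name run_command_sketch is hypothetical, and passing the argument list without a shell is an assumption about how the command is executed.

```python
import subprocess
from typing import List, Optional


def run_command_sketch(cmd: List[str], output_file: Optional[str] = None) -> str:
    """Run cmd; return stdout, or write stdout to output_file and return its path."""
    if output_file is None:
        # check=True raises subprocess.CalledProcessError on a non-zero exit code.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        return result.stdout
    with open(output_file, "w", encoding="utf-8") as out:
        # The child process writes directly to the open file.
        subprocess.run(cmd, check=True, stdout=out)
    return output_file
```

Because the return value is a string in both cases, callers need to remember whether they passed output_file to know if it holds the output itself or a path.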

variantcentrifuge.utils.normalize_vcf_headers(lines)[source]

Normalize header lines from tools like SnpEff and SnpSift by:

  1. Removing known prefixes (e.g., “ANN[*].”, “ANN[0].”)

  2. Converting indexed genotype fields from format GEN[index].FIELD to FIELD_index (e.g., “GEN[0].AF” -> “AF_0”, “GEN[1].DP” -> “DP_1”)

Parameters:

lines (List[str]) – A list of lines (e.g., lines from a file) whose first line may contain SnpEff/SnpSift-generated prefixes in column headers.

Returns:

The updated list of lines where the first line has had matching prefixes removed or replaced and indexed fields normalized.

Return type:

List[str]
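
The two normalization steps can be sketched with a string replacement and one regular expression. This is an illustration only: normalize_headers_sketch is a hypothetical name, and the prefix list contains just the two examples named above, whereas the module may handle more.

```python
import re
from typing import List

# Assumed prefix list, limited to the examples given in the documentation.
PREFIXES = ("ANN[*].", "ANN[0].")
# Matches indexed genotype fields such as GEN[0].AF.
INDEXED_GENOTYPE = re.compile(r"GEN\[(\d+)\]\.([A-Za-z0-9_]+)")


def normalize_headers_sketch(lines: List[str]) -> List[str]:
    """Return a copy of lines whose first line has normalized column names."""
    if not lines:
        return lines
    header = lines[0]
    for prefix in PREFIXES:
        header = header.replace(prefix, "")
    # Rewrite GEN[index].FIELD as FIELD_index.
    header = INDEXED_GENOTYPE.sub(r"\2_\1", header)
    return [header] + lines[1:]
```

Only the first line is rewritten; data rows pass through unchanged.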

variantcentrifuge.utils.get_tool_version(tool_name)[source]

Retrieve the version of a given tool.

Supported tools:

  • snpEff

  • bcftools

  • SnpSift

  • bedtools

Parameters:

tool_name (str) – Name of the tool to retrieve version for.

Returns:

Version string or ‘N/A’ if not found or cannot be retrieved.

Return type:

str

variantcentrifuge.utils.sanitize_metadata_field(value)[source]

Sanitize a metadata field for TSV output by replacing tabs and newlines with spaces.

Parameters:

value (str) – The value to sanitize.

Returns:

Sanitized string value with no tabs or newlines.

Return type:

str
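
A minimal sketch of this kind of sanitization follows. The name sanitize_sketch is hypothetical, and also replacing carriage returns is an assumption that goes beyond the tabs and newlines mentioned above.

```python
def sanitize_sketch(value: str) -> str:
    """Replace characters that would break a TSV cell with single spaces."""
    # Tabs would start a new column and newlines a new row in a TSV file.
    # Carriage returns are included here as an assumption.
    for char in ("\t", "\n", "\r"):
        value = value.replace(char, " ")
    return value
```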

variantcentrifuge.utils.ensure_fields_in_extract(base_fields_str, extra_fields)[source]

Ensure each item in extra_fields is present in the space-delimited base_fields_str.

Notes

We no longer normalize extra_fields here, so that raw columns like “GEN[*].DP” remain unmodified.

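Parameters:
  • base_fields_str (str) – Space-delimited string of base fields.

  • extra_fields – Items that must be present in base_fields_str.

Return type: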

str
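
The behavior described for this function can be sketched as below. The name ensure_fields_sketch is hypothetical, and appending missing fields at the end in their given order is an assumption; the documentation only states that each extra field must end up present.

```python
from typing import Iterable


def ensure_fields_sketch(base_fields_str: str, extra_fields: Iterable[str]) -> str:
    """Return base_fields_str with any missing extra fields added."""
    fields = base_fields_str.split()
    for field in extra_fields:
        # Compare raw names, so a column like "GEN[*].DP" is kept exactly as given.
        if field not in fields:
            fields.append(field)
    return " ".join(fields)
```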

variantcentrifuge.utils.generate_igv_safe_filename_base(sample_id, chrom, pos, ref, alt, max_allele_len=10, hash_len=6, max_variant_part_len=50)[source]

Generate a safe, shortened filename base for IGV reports to prevent “File name too long” errors.

This function handles long REF/ALT alleles by truncating them and appending a hash of the original allele to maintain uniqueness. The returned base filename (without extension) is filesystem-safe.

Parameters:
  • sample_id (str) – Sample identifier

  • chrom (str) – Chromosome name/identifier

  • pos (str or int) – Genomic position

  • ref (str) – Reference allele

  • alt (str) – Alternate allele

  • max_allele_len (int, default=10) – Maximum length for each allele in the filename

  • hash_len (int, default=6) – Length of hash to append when truncating an allele

  • max_variant_part_len (int, default=50) – Maximum length for the variant part of the filename (chr_pos_ref_alt)

Returns:

A safe, shortened filename base

Return type:

str
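
The truncate-and-hash idea can be sketched as follows. Everything here beyond that idea is an assumption: the helper names shorten_allele and variant_part are hypothetical, SHA-256 is a stand-in for whatever hash the module uses, and the underscore separator and length budget are illustrative. Enforcing max_variant_part_len and prepending the sample ID are left out.

```python
import hashlib


def shorten_allele(allele: str, max_allele_len: int = 10, hash_len: int = 6) -> str:
    """Truncate a long allele and append a short hash of the full sequence."""
    if len(allele) <= max_allele_len:
        return allele
    # Hash the complete allele so different long alleles stay distinguishable.
    digest = hashlib.sha256(allele.encode("utf-8")).hexdigest()[:hash_len]
    # Reserve room for the separator and the hash within max_allele_len.
    keep = max(1, max_allele_len - hash_len - 1)
    return f"{allele[:keep]}_{digest}"


def variant_part(chrom: str, pos, ref: str, alt: str) -> str:
    """Build the chr_pos_ref_alt part of the filename from shortened alleles."""
    return f"{chrom}_{pos}_{shorten_allele(ref)}_{shorten_allele(alt)}"
```

Short alleles pass through unchanged, so typical SNV filenames remain readable while long indels are bounded in length.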

variantcentrifuge.utils.split_bed_file(input_bed, num_chunks, output_dir)[source]

Split a BED file into a specified number of chunks with roughly equal total base pairs.

Parameters:
  • input_bed (str) – Path to the sorted input BED file.

  • num_chunks (int) – The number of smaller BED files to create.

  • output_dir (str) – The directory where the chunked BED files will be saved.

Returns:

A list of file paths to the created chunked BED files.

Return type:

List[str]
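
One way to balance chunks by total base pairs is a greedy pass over the sorted intervals, sketched below on in-memory tuples. This is an assumption about the strategy, not the module's code: chunk_intervals is a hypothetical name, intervals are never split, and reading the BED file and writing the chunk files are omitted.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def chunk_intervals(intervals: List[Interval], num_chunks: int) -> List[List[Interval]]:
    """Greedily group intervals into at most num_chunks chunks of similar size."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    total_bp = sum(end - start for _, start, end in intervals)
    target = total_bp / num_chunks
    chunks: List[List[Interval]] = [[]]
    current_bp = 0
    for interval in intervals:
        # Open a new chunk once the current one has reached the target size,
        # unless the last allowed chunk is already open.
        if chunks[-1] and current_bp >= target and len(chunks) < num_chunks:
            chunks.append([])
            current_bp = 0
        chunks[-1].append(interval)
        current_bp += interval[2] - interval[1]
    return chunks
```

Because whole intervals are kept together, chunk sizes are only roughly equal, and very uneven interval lengths can yield fewer chunks than requested.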

variantcentrifuge.utils.remove_vcf_extensions(filename)[source]

Remove common VCF-related extensions from a filename.

Parameters:

filename (str) – The input filename, possibly ending in .vcf, .vcf.gz, or .gz.

Returns:

The filename base without VCF-related extensions.

Return type:

str
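
A minimal sketch of the suffix stripping, assuming only the three extensions listed above are handled. The name remove_vcf_extensions_sketch is hypothetical.

```python
def remove_vcf_extensions_sketch(filename: str) -> str:
    """Strip a trailing .vcf.gz, .vcf, or .gz from filename, if present."""
    # Check the longest suffix first so "sample.vcf.gz" loses both parts.
    for suffix in (".vcf.gz", ".vcf", ".gz"):
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return filename
```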

variantcentrifuge.utils.compute_base_name(vcf_path, gene_name)[source]

Compute a base name for output files based on the VCF filename and genes.

If multiple genes are specified, create a hash to represent them. If ‘all’ is specified, append ‘.all’. Otherwise, append the gene name if it’s not already in the VCF base name.

Parameters:
  • vcf_path (str) – Path to the VCF file.

  • gene_name (str) – The normalized gene name string.

Returns:

A base name for output files.

Return type:

str
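
The three naming rules above can be sketched as follows. Several details are assumptions made purely for illustration: the name compute_base_name_sketch, splitting multiple genes on whitespace or commas, the SHA-256 hash truncated to 8 characters, the ".genes_" label, and the case-insensitive comparisons.

```python
import hashlib
import os
import re


def compute_base_name_sketch(vcf_path: str, gene_name: str) -> str:
    """Derive an output base name from the VCF filename and the gene selection."""
    base = os.path.basename(vcf_path)
    for suffix in (".vcf.gz", ".vcf", ".gz"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    # Assumed delimiter: whitespace or commas between gene names.
    genes = [g for g in re.split(r"[\s,]+", gene_name.strip()) if g]
    if len(genes) > 1:
        # Hash the sorted gene list so the name stays short and order-independent.
        joined = " ".join(sorted(genes))
        digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()[:8]
        return f"{base}.genes_{digest}"
    if gene_name.strip().lower() == "all":
        return f"{base}.all"
    # Append a single gene only if the base name does not already contain it.
    if genes and genes[0].lower() not in base.lower():
        return f"{base}.{genes[0]}"
    return base
```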