Extractor Module

VCF field extraction utilities

Field extraction module.

This module provides two field extraction backends:

  1. extract_fields_bcftools() - Fast C-based extraction using bcftools query (default)

  2. extract_fields_snpsift() - Legacy Java-based extraction using SnpSift (fallback)

variantcentrifuge.extractor.build_bcftools_format_string(fields, vcf_samples=None)

Build bcftools query format string from field list.

Parameters:
  • fields (list[str]) – List of field names to extract (e.g., [“CHROM”, “POS”, “ANN[0].GENE”, “GEN[*].GT”])

  • vcf_samples (list[str], optional) – List of sample names from VCF for per-sample column naming

Returns:

Format string for bcftools query and ordered column name list

Return type:

tuple[str, list[str]]
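How a field list might map onto a bcftools query format string can be sketched as follows. This is an illustration only, not the module's implementation: the helper name sketch_format_string, the rule that unknown names are treated as INFO tags, and the "<sample>_<tag>" column naming are all assumptions, and ANN subfields are left out on purpose. The only parts taken from bcftools itself are the `%TAG` syntax, `%INFO/TAG`, and the `[...]` per-sample loop.

```python
# Fixed VCF columns that bcftools query exposes directly as %NAME.
FIXED_COLUMNS = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"}


def sketch_format_string(fields, vcf_samples=None):
    """Hypothetical sketch: build a bcftools query format string and column names."""
    pieces = []
    columns = []
    for field in fields:
        if field in FIXED_COLUMNS:
            pieces.append(f"\t%{field}")
            columns.append(field)
        elif field.startswith("GEN[*]."):
            tag = field.split(".", 1)[1]
            # bcftools repeats everything inside [...] once per sample.
            pieces.append(f"[\t%{tag}]")
            if vcf_samples:
                # Assumed naming scheme: one column per sample.
                columns.extend(f"{sample}_{tag}" for sample in vcf_samples)
            else:
                columns.append(field)
        else:
            # Assumption: any other plain name is an INFO tag.
            pieces.append(f"\t%INFO/{field}")
            columns.append(field)
    # Drop the leading tab and terminate each record with a newline.
    return "".join(pieces).lstrip("\t") + "\n", columns


fmt, cols = sketch_format_string(["CHROM", "POS", "DP", "GEN[*].GT"], ["S1", "S2"])
```

Returning the column names alongside the format string matters because bcftools query emits one value per sample for bracketed tags, so the caller needs the expanded names to label the TSV.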

variantcentrifuge.extractor.parse_ann_subfields(df, fields)

Parse ANN subfields from raw ANN column.

Parameters:
  • df (pd.DataFrame) – DataFrame with raw “ANN” column containing pipe-delimited annotations

  • fields (list[str]) – Original field list to determine which ANN subfields to extract

Returns:

DataFrame with ANN subfields added and raw ANN column removed

Return type:

pd.DataFrame
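The idea behind splitting a pipe-delimited ANN value can be shown on a single string in plain Python. The module itself works on a DataFrame column, so this is only a per-value sketch: the helper name sketch_parse_ann, the subfield labels in ANN_ORDER (SnpSift-style names in SnpEff's documented ANN order), and the example annotation values are assumptions made up for illustration.

```python
import re

# SnpEff ANN subfields in their documented order; labels are illustrative.
ANN_ORDER = [
    "ALLELE", "EFFECT", "IMPACT", "GENE", "GENEID", "FEATURE", "FEATUREID",
    "BIOTYPE", "RANK", "HGVS_C", "HGVS_P", "CDNA_POS", "CDS_POS", "AA_POS",
    "DISTANCE", "ERRORS",
]


def sketch_parse_ann(raw_ann, fields):
    """Hypothetical sketch: pick requested ANN[0] subfields from a raw ANN value."""
    # Multiple annotations are comma-separated; ANN[0] is the first one.
    first_annotation = raw_ann.split(",")[0]
    values = first_annotation.split("|")
    result = {}
    for field in fields:
        match = re.fullmatch(r"ANN\[0\]\.(\w+)", field)
        if not match or match.group(1) not in ANN_ORDER:
            continue  # not an ANN subfield request
        index = ANN_ORDER.index(match.group(1))
        # Guard against truncated annotations.
        result[field] = values[index] if index < len(values) else ""
    return result


# Made-up example value, not taken from a real VCF.
raw = (
    "A|missense_variant|MODERATE|BRCA1|ENSG00000012048|transcript|"
    "ENST00000357654|protein_coding|10/23|c.123G>A|p.Met41Ile|"
    "123/7000|123/5592|41/1863||,A|upstream_gene_variant|MODIFIER|NBR2"
)
parsed = sketch_parse_ann(raw, ["CHROM", "ANN[0].GENE", "ANN[0].IMPACT"])
```

Fields that are not ANN subfields (such as CHROM above) are skipped, which mirrors the documented use of the original field list to decide which subfields to extract.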

variantcentrifuge.extractor.parse_nmd_subfields(df, fields)

Parse NMD subfields from raw NMD column.

Parameters:
  • df (pd.DataFrame) – DataFrame with raw “NMD” column containing pipe-delimited data

  • fields (list[str]) – Original field list to determine which NMD subfields to extract

Returns:

DataFrame with NMD subfields added and raw NMD column removed

Return type:

pd.DataFrame

variantcentrifuge.extractor.extract_fields_bcftools(variant_file, fields, cfg, output_file, vcf_samples=None)

Extract specified fields from variant records using bcftools query.

This is 19x faster than SnpSift extractFields for large cohorts.

Parameters:
  • variant_file (str) – Path to the VCF file from which fields should be extracted.

  • fields (str) – A space-separated list of fields to extract.

  • cfg (dict) – Configuration dictionary.

  • output_file (str) – Path to the final TSV file where extracted fields will be written.

  • vcf_samples (list[str], optional) – List of sample names from VCF for per-sample column naming.

Returns:

The output_file path that now contains the extracted fields (TSV).

Return type:

str

Raises:

RuntimeError – If the bcftools query command fails.
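The documented contract (run bcftools query, write a TSV, raise RuntimeError on failure, return the output path) can be sketched with subprocess. The helper names build_query_command and run_query are hypothetical and this is not the module's code; it assumes bcftools is on PATH and uses only the `-f` (format) and `-o` (output) options of bcftools query.

```python
import subprocess


def build_query_command(variant_file, format_string, output_file):
    """Hypothetical helper: assemble the bcftools query argument list."""
    return ["bcftools", "query", "-f", format_string, "-o", output_file, variant_file]


def run_query(variant_file, format_string, output_file):
    """Hypothetical helper: run the command and surface failures as RuntimeError."""
    cmd = build_query_command(variant_file, format_string, output_file)
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"bcftools query failed: {exc.stderr}") from exc
    return output_file


cmd = build_query_command("in.vcf.gz", "%CHROM\t%POS\n", "out.tsv")
```

Passing the arguments as a list avoids shell quoting problems with the tabs and brackets that format strings typically contain.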

variantcentrifuge.extractor.extract_fields_snpsift(variant_file, fields, cfg, output_file)

Extract specified fields from variant records using SnpSift extractFields (legacy).

This is the original Java-based extraction method, kept as a fallback.

The extracted fields are written directly to output_file; the SnpSift field separator can be set through cfg if needed.

Parameters:
  • variant_file (str) – Path to the VCF file from which fields should be extracted.

  • fields (str) – A space-separated list of fields to extract (e.g. “CHROM POS REF ALT DP AD”).

  • cfg (dict) –

    Configuration dictionary that may include tool paths, parameters, etc.

    • “extract_fields_separator”: str

      The separator for multi-sample fields when using SnpSift -s …. Often a comma “,”. Defaults to “,” if not present.

    • “debug_level”: str or None

      Optional debug level that controls logging verbosity.

  • output_file (str) – Path to the final TSV file where extracted fields will be written.

Returns:

The same output_file path that now contains the extracted fields (TSV).

Return type:

str

Raises:

RuntimeError – If the field extraction command fails.
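How the cfg keys and the space-separated fields string could feed a SnpSift call is sketched below. The helper name sketch_snpsift_args is hypothetical, the `SnpSift` executable name assumes a wrapper script on PATH (as opposed to `java -jar SnpSift.jar`), and the real function may pass additional options; only the `-s` separator option and its "," default come from the description above.

```python
def sketch_snpsift_args(variant_file, fields, cfg):
    """Hypothetical sketch: build SnpSift extractFields arguments from cfg."""
    # Fall back to "," when the key is absent, as documented.
    separator = cfg.get("extract_fields_separator", ",")
    # fields arrives as one space-separated string, e.g. "CHROM POS REF ALT".
    return ["SnpSift", "extractFields", "-s", separator, variant_file, *fields.split()]


default_args = sketch_snpsift_args("in.vcf", "CHROM POS REF ALT DP AD", {})
custom_args = sketch_snpsift_args(
    "in.vcf", "CHROM POS", {"extract_fields_separator": ";", "debug_level": None}
)
```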

variantcentrifuge.extractor.extract_fields(variant_file, fields, cfg, output_file)

Extract specified fields from variant records using SnpSift extractFields (legacy).

This is the original Java-based extraction method, kept as a fallback.

The extracted fields are written directly to output_file; the SnpSift field separator can be set through cfg if needed.

Parameters:
  • variant_file (str) – Path to the VCF file from which fields should be extracted.

  • fields (str) – A space-separated list of fields to extract (e.g. “CHROM POS REF ALT DP AD”).

  • cfg (dict) –

    Configuration dictionary that may include tool paths, parameters, etc.

    • “extract_fields_separator”: str

      The separator for multi-sample fields when using SnpSift -s …. Often a comma “,”. Defaults to “,” if not present.

    • “debug_level”: str or None

      Optional debug level that controls logging verbosity.

  • output_file (str) – Path to the final TSV file where extracted fields will be written.

Returns:

The same output_file path that now contains the extracted fields (TSV).

Return type:

str

Raises:

RuntimeError – If the field extraction command fails.