Extractor Module
VCF field extraction utilities
Field extraction module.
This module provides two field extraction backends:
1. extract_fields_bcftools() - Fast C-based extraction using bcftools query (default)
2. extract_fields_snpsift() - Legacy Java-based extraction using SnpSift (fallback)
- variantcentrifuge.extractor.build_bcftools_format_string(fields, vcf_samples=None)
Build bcftools query format string from field list.
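- Parameters:
fields – Fields to include in the format string.
vcf_samples (optional) – Sample names from the VCF; defaults to None.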
- Returns:
Format string for bcftools query and ordered column name list
- Return type:
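To illustrate what a bcftools query format string looks like, here is a minimal sketch, not the module's actual implementation. The helper name `sketch_format_string`, the `FIXED_COLUMNS` set, the rule that any other field is an INFO tag, and the `<sample>_<tag>` column naming are all assumptions made for this example. Only the `%FIELD`, `%INFO/TAG`, and `[...]` per-sample loop syntax comes from bcftools itself.

```python
# Hypothetical sketch; the real build_bcftools_format_string() may differ.

# Fixed VCF columns that bcftools query addresses directly as %NAME.
FIXED_COLUMNS = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"}


def sketch_format_string(fields, per_sample_tags=(), vcf_samples=()):
    """Return (format_string, column_names) for `bcftools query -f`.

    Assumption: every field outside FIXED_COLUMNS is an INFO tag.
    """
    tokens = []
    columns = []
    for name in fields:
        tokens.append(f"%{name}" if name in FIXED_COLUMNS else f"%INFO/{name}")
        columns.append(name)

    # bcftools interprets the two-character escapes \t and \n itself,
    # so the Python string must contain a literal backslash.
    fmt = "\\t".join(tokens)

    if per_sample_tags:
        # Square brackets make bcftools repeat the enclosed block per sample.
        fmt += "[" + "".join(f"\\t%{tag}" for tag in per_sample_tags) + "]"
        for sample in vcf_samples:
            for tag in per_sample_tags:
                columns.append(f"{sample}_{tag}")

    return fmt + "\\n", columns


fmt, cols = sketch_format_string(["CHROM", "POS", "DP"], ["GT"], ["S1", "S2"])
print(fmt)
print(cols)
```

Returning the column names alongside the format string keeps the header in the same order as the tokens, which matters because bcftools query does not emit a header line unless asked to.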
- variantcentrifuge.extractor.parse_ann_subfields(df, fields)
Parse ANN subfields from raw ANN column.
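As a rough illustration of what parsing ANN subfields involves, the sketch below splits a single raw ANN value using SnpEff's standard pipe-delimited layout. It works on a plain string rather than a DataFrame, and the helper name `split_first_annotation` and the choice to keep only the first comma-separated annotation are assumptions for this example. How parse_ann_subfields() selects annotations or names its columns is not stated here. The sample value is made up.

```python
# Hypothetical sketch; parse_ann_subfields() itself operates on a DataFrame.

# Subfield order as declared in the SnpEff ANN header (spaces removed).
ANN_SUBFIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def split_first_annotation(raw_ann):
    """Map the first annotation in a raw ANN value to named subfields."""
    # Multiple annotations for one variant are comma-separated.
    first = raw_ann.split(",", 1)[0]
    parts = first.split("|")
    # Pad short records so every subfield name gets a value.
    parts += [""] * (len(ANN_SUBFIELDS) - len(parts))
    return dict(zip(ANN_SUBFIELDS, parts))


raw = (
    "A|missense_variant|MODERATE|GENE1|G1|transcript|TX1|protein_coding|"
    "2/5|c.10G>A|p.Gly4Arg|10/900|10/600|4/200||,"
    "A|upstream_gene_variant|MODIFIER|GENE2|G2|transcript|TX2|protein_coding|"
    "|c.-50G>A|||||50|"
)
record = split_first_annotation(raw)
print(record["Gene_Name"], record["Annotation_Impact"])
```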
- variantcentrifuge.extractor.parse_nmd_subfields(df, fields)
Parse NMD subfields from raw NMD column.
- variantcentrifuge.extractor.extract_fields_bcftools(variant_file, fields, cfg, output_file, vcf_samples=None)
Extract specified fields from variant records using bcftools query.
This is 19x faster than SnpSift extractFields for large cohorts.
- Parameters:
variant_file (str) – Path to the VCF file from which fields should be extracted.
fields (str) – A space-separated list of fields to extract.
cfg (dict) – Configuration dictionary.
output_file (str) – Path to the final TSV file where extracted fields will be written.
vcf_samples (list[str], optional) – List of sample names from VCF for per-sample column naming.
- Returns:
The output_file path that now contains the extracted fields (TSV).
- Return type:
str
- Raises:
RuntimeError – If the bcftools query command fails.
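The extract-then-raise pattern described above can be sketched with the standard library as follows. This is an illustration only: `build_query_command` and `run_query` are hypothetical helpers, and the real function may pass other options or post-process the output. The sketch assumes a `bcftools` executable on PATH; if it is missing, `subprocess.run` raises FileNotFoundError rather than RuntimeError.

```python
# Hypothetical sketch of running bcftools query and surfacing failures.
import subprocess


def build_query_command(variant_file, format_string):
    """Assemble the argument list for `bcftools query -f FORMAT FILE`."""
    return ["bcftools", "query", "-f", format_string, variant_file]


def run_query(variant_file, format_string, output_file):
    """Write query output to output_file; raise RuntimeError on failure."""
    cmd = build_query_command(variant_file, format_string)
    with open(output_file, "w", encoding="utf-8") as out:
        # Stream stdout straight to the TSV; capture stderr for the message.
        result = subprocess.run(
            cmd, stdout=out, stderr=subprocess.PIPE, text=True
        )
    if result.returncode != 0:
        raise RuntimeError(f"bcftools query failed: {result.stderr.strip()}")
    return output_file


print(build_query_command("cohort.vcf.gz", "%CHROM\\t%POS\\n"))
```

Passing the command as a list avoids shell quoting problems with the tabs and brackets that appear in format strings.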
- variantcentrifuge.extractor.extract_fields_snpsift(variant_file, fields, cfg, output_file)
Extract specified fields from variant records using SnpSift extractFields (legacy).
This is the original Java-based extraction method, kept as a fallback.
The extracted fields are written directly to output_file, and the SnpSift field separator is set if needed.
- Parameters:
variant_file (str) – Path to the VCF file from which fields should be extracted.
fields (str) – A space-separated list of fields to extract (e.g. “CHROM POS REF ALT DP AD”).
cfg (dict) – Configuration dictionary that may include tool paths, parameters, etc.
- “extract_fields_separator”: str
The separator for multi-sample fields when using SnpSift -s …. Often a comma (“,”), which is also the default if the key is not present.
- “debug_level”: str or None
Optional debug level that controls how much is logged.
output_file (str) – Path to the final TSV file where extracted fields will be written.
- Returns:
The same output_file path that now contains the extracted fields (TSV).
- Return type:
str
- Raises:
RuntimeError – If the field extraction command fails.
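To show how the “extract_fields_separator” setting and the space-separated fields string could translate into a SnpSift call, here is a small sketch. The helper `build_snpsift_command` is hypothetical, and the executable name `SnpSift` is an assumption (installations that ship only the jar would use `java -jar SnpSift.jar` instead). The `-s` option is SnpSift extractFields' same-field separator. The real function may add further options.

```python
# Hypothetical sketch; the command extract_fields_snpsift() runs may differ.


def build_snpsift_command(variant_file, fields, cfg):
    """Assemble an argument list for SnpSift extractFields."""
    # Fall back to "," when the key is absent, as documented above.
    separator = cfg.get("extract_fields_separator", ",")
    # The fields argument is a single space-separated string.
    field_list = fields.split()
    return ["SnpSift", "extractFields", "-s", separator, variant_file, *field_list]


print(build_snpsift_command("input.vcf", "CHROM POS REF ALT DP AD", {}))
print(build_snpsift_command("input.vcf", "CHROM POS", {"extract_fields_separator": ";"}))
```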
- variantcentrifuge.extractor.extract_fields(variant_file, fields, cfg, output_file)
Documented identically to extract_fields_snpsift() above: the same description, parameters, return value, and RuntimeError on failure apply.