Phenotype Module

Phenotype data loading and integration

This module loads a phenotype file containing sample-to-phenotype mappings and provides a function to aggregate phenotypes for a given list of samples.

  • The phenotype file must be .csv or .tsv (detected by extension).

  • The specified sample and phenotype columns must be present in the file.

  • Phenotypes are stored in a dictionary (sample -> set of phenotypes).

  • Given a list of samples, phenotypes are aggregated as follows:

      - For each sample, join multiple phenotypes with “,”.

      - For multiple samples, join each sample’s phenotype string with “;”.

variantcentrifuge.phenotype.load_phenotypes(phenotype_file, sample_column, phenotype_column)

Load phenotypes from a .csv or .tsv file into a dictionary.

Parameters:
  • phenotype_file (str) – Path to the phenotype file (must be .csv or .tsv).

  • sample_column (str) – Name of the column containing sample IDs.

  • phenotype_column (str) – Name of the column containing phenotype values.

Returns:

A dictionary mapping each sample to a set of associated phenotypes.

Return type:

dict of {str: set of str}

Raises:

ValueError – If the file is not .csv or .tsv, or if the required columns are not found.
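The loading behavior described above can be sketched in a few lines of standard-library Python. This is an illustrative re-implementation, not the module's actual code: the name load_phenotypes_sketch is hypothetical, and skipping blank cells and stripping whitespace are assumptions that the documentation does not confirm.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_phenotypes_sketch(phenotype_file, sample_column, phenotype_column):
    """Hypothetical stand-in for load_phenotypes, for illustration only."""
    # Pick the delimiter from the file extension, as the docs describe.
    suffix = Path(phenotype_file).suffix.lower()
    if suffix == ".csv":
        delimiter = ","
    elif suffix == ".tsv":
        delimiter = "\t"
    else:
        raise ValueError(f"Unsupported file type: {phenotype_file}")

    phenotypes = defaultdict(set)
    with open(phenotype_file, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        fieldnames = reader.fieldnames or []
        # Both requested columns must exist in the header row.
        for column in (sample_column, phenotype_column):
            if column not in fieldnames:
                raise ValueError(f"Column {column!r} not found in {phenotype_file}")
        for row in reader:
            sample = (row[sample_column] or "").strip()
            phenotype = (row[phenotype_column] or "").strip()
            # Assumption: rows with a blank sample or phenotype are skipped.
            if sample and phenotype:
                phenotypes[sample].add(phenotype)
    return dict(phenotypes)
```

Because each sample maps to a set, a phenotype listed twice for the same sample is stored once.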

variantcentrifuge.phenotype.aggregate_phenotypes_for_samples(samples, phenotypes)

Aggregate phenotypes for a given list of samples into a single string.

  • For each sample, join multiple phenotypes with “,”.

  • For multiple samples, join each sample’s phenotype string with “;”.

Parameters:
  • samples (list of str) – List of sample IDs.

  • phenotypes (dict of {str: set of str}) – Dictionary mapping sample IDs to a set of phenotypes.

Returns:

A string aggregating all phenotypes for the given samples, with phenotypes comma-separated per sample, and samples separated by “;”.

Return type:

str
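The two-level join can be illustrated with a short standalone sketch. The name aggregate_sketch is hypothetical; sorting each sample's phenotypes (sets are unordered) and emitting an empty string for samples absent from the dictionary are assumptions, not documented behavior.

```python
def aggregate_sketch(samples, phenotypes):
    """Hypothetical stand-in for aggregate_phenotypes_for_samples."""
    per_sample = []
    for sample in samples:
        # Assumption: sort for a stable order; unknown samples yield "".
        per_sample.append(",".join(sorted(phenotypes.get(sample, set()))))
    # Samples are separated by ";", phenotypes within a sample by ",".
    return ";".join(per_sample)


phenotypes = {"S1": {"pheno1", "pheno2"}, "S2": {"pheno3"}}
print(aggregate_sketch(["S1", "S2"], phenotypes))
# prints: pheno1,pheno2;pheno3
```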

variantcentrifuge.phenotype.format_phenotypes_like_gt_column(samples, phenotypes)

Format phenotypes in the same style as the GT column with sample IDs.

Creates a string similar to the GT column format: “SampleID(phenotype1,phenotype2);SampleID(phenotype3);…”

This matches the format used in genotype replacement, where each sample’s data follows its sample ID and is enclosed in parentheses.

Parameters:
  • samples (list of str) – List of sample IDs in the order they appear in the VCF.

  • phenotypes (dict of {str: set of str}) – Dictionary mapping sample IDs to a set of phenotypes.

Returns:

A string with phenotypes formatted like the GT column: “Sample1(pheno1,pheno2);Sample2(pheno3);…”. Samples without phenotypes get empty parentheses: “Sample3()”.

Return type:

str
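A minimal sketch of this formatting, assuming phenotypes are sorted within each sample (the documentation does not state an order). The name format_like_gt_sketch is hypothetical and the code is not taken from the module.

```python
def format_like_gt_sketch(samples, phenotypes):
    """Hypothetical stand-in for format_phenotypes_like_gt_column."""
    parts = []
    for sample in samples:
        # Assumption: sort phenotypes so the output is deterministic.
        joined = ",".join(sorted(phenotypes.get(sample, set())))
        # Samples without phenotypes keep their slot as "SampleID()".
        parts.append(f"{sample}({joined})")
    return ";".join(parts)


phenotypes = {"Sample1": {"pheno1", "pheno2"}, "Sample2": {"pheno3"}}
print(format_like_gt_sketch(["Sample1", "Sample2", "Sample3"], phenotypes))
# prints: Sample1(pheno1,pheno2);Sample2(pheno3);Sample3()
```

Keeping every sample in the output, including those with empty parentheses, preserves the VCF sample order in the result.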

variantcentrifuge.phenotype.extract_phenotypes_for_gt_row(gt_value, phenotypes)

Extract phenotypes for samples that have variants in a specific GT row.

Parses the GT column value to find which samples have variants, then returns their phenotypes in the same format as the GT column.

Parameters:
  • gt_value (str) – GT column value like “Sample1(0/1);Sample2(1/1);Sample3(./.)”

  • phenotypes (dict of {str: set of str}) – Dictionary mapping sample IDs to a set of phenotypes.

Returns:

Phenotypes for samples with variants: “Sample1(pheno1,pheno2);Sample2(pheno3)” Samples with no variants (./. or 0/0) are excluded.

Return type:

str
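The parsing and filtering steps can be sketched as follows. The name extract_for_gt_row_sketch is hypothetical and this is not the module's code. Several choices are assumptions: phased no-call and reference genotypes (.|. and 0|0) are excluded alongside ./. and 0/0, malformed entries are skipped, phenotypes are sorted, and a sample with a variant but no recorded phenotypes is emitted with empty parentheses.

```python
import re

# One GT entry looks like "SampleID(genotype)".
_ENTRY = re.compile(r"^(?P<sample>[^()]+)\((?P<genotype>[^()]*)\)$")
# Assumption: phased forms are treated like ./. and 0/0.
_NO_VARIANT = {"./.", "0/0", ".|.", "0|0"}


def extract_for_gt_row_sketch(gt_value, phenotypes):
    """Hypothetical stand-in for extract_phenotypes_for_gt_row."""
    parts = []
    for entry in gt_value.split(";"):
        match = _ENTRY.match(entry.strip())
        if match is None:
            continue  # assumption: silently skip malformed entries
        if match["genotype"] in _NO_VARIANT:
            continue  # sample carries no variant in this row
        sample = match["sample"]
        joined = ",".join(sorted(phenotypes.get(sample, set())))
        parts.append(f"{sample}({joined})")
    return ";".join(parts)


phenotypes = {
    "Sample1": {"pheno1", "pheno2"},
    "Sample2": {"pheno3"},
    "Sample3": {"pheno4"},
}
print(extract_for_gt_row_sketch("Sample1(0/1);Sample2(1/1);Sample3(./.)", phenotypes))
# prints: Sample1(pheno1,pheno2);Sample2(pheno3)
```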