Privacy and Pseudonymization

VariantCentrifuge includes comprehensive pseudonymization features to help you share genomic data while protecting participant privacy.

Overview

The pseudonymization feature replaces original sample identifiers with consistent, non-identifiable pseudonyms throughout all output files. This enables data sharing for publication or collaboration while maintaining participant privacy.

Quick Start

Basic pseudonymization with sequential IDs:

variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file cohort.vcf \
  --output-file results.tsv \
  --pseudonymize \
  --pseudonymize-table mapping.tsv

Naming Schemas

Sequential Schema (Default)

Simple numbered identifiers:

--pseudonymize-schema sequential \
--pseudonymize-prefix STUDY_A
# Generates: STUDY_A_001, STUDY_A_002, etc.
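
As a rough illustration of what the sequential schema produces, the sketch below sorts the original IDs and numbers them with a zero-padded counter. The function name `sequential_pseudonyms` and the three-digit padding are assumptions inferred from the example output above, not VariantCentrifuge's actual implementation.

```python
def sequential_pseudonyms(sample_ids, prefix="STUDY_A", width=3):
    """Map each original ID to PREFIX_NNN (hypothetical helper, not the tool's API)."""
    # Sort first so the assignment does not depend on input order.
    unique_ids = sorted(set(sample_ids))
    return {
        original: f"{prefix}_{index:0{width}d}"
        for index, original in enumerate(unique_ids, start=1)
    }


mapping = sequential_pseudonyms(["patient_B", "patient_A", "patient_C"])
print(mapping)
```

Sorting before numbering means the same set of samples receives the same pseudonyms regardless of their order in the input.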

Categorical Schema

Groups samples by phenotype or other metadata:

--pseudonymize-schema categorical \
--pseudonymize-category-field phenotype
# Generates: CASE_001, CASE_002, CONTROL_001, etc.
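
The grouping idea can be sketched with one counter per category. This is a minimal illustration under assumptions: the helper name, the upper-casing of category values, and the input dictionary are hypothetical and only meant to reproduce the example output above.

```python
from collections import defaultdict


def categorical_pseudonyms(sample_categories, width=3):
    """Number samples separately within each category (hypothetical helper)."""
    counters = defaultdict(int)
    mapping = {}
    # Iterate in sorted order so numbering is reproducible.
    for original in sorted(sample_categories):
        category = sample_categories[original].upper()
        counters[category] += 1
        mapping[original] = f"{category}_{counters[category]:0{width}d}"
    return mapping


phenotypes = {"s3": "control", "s1": "case", "s2": "case"}
mapping = categorical_pseudonyms(phenotypes)
print(mapping)
```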

Anonymous Schema

Letter-number combinations or hash-based IDs:

--pseudonymize-schema anonymous
# Generates: A001, A002, B001, etc.
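
This page does not specify how hash-based IDs are derived. As one possible reading, the sketch below derives a stable ID from a keyed hash (HMAC-SHA256). The function name, the key handling, and the ten-character truncation are all assumptions for illustration; the anonymous schema itself may work differently.

```python
import hashlib
import hmac


def hashed_pseudonym(sample_id, secret_key, length=10):
    """Derive a stable ID from a keyed hash (illustrative, not the tool's method)."""
    digest = hmac.new(secret_key, sample_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation shortens the ID but raises the (small) chance of collisions.
    return digest[:length].upper()


key = b"replace-with-a-private-random-key"  # placeholder value
first = hashed_pseudonym("patient_A", key)
second = hashed_pseudonym("patient_A", key)
other = hashed_pseudonym("patient_B", key)
```

A keyed hash is used here rather than a plain hash because sample IDs are often guessable, and an unkeyed hash of a guessable ID can be reversed by trying candidates.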

Custom Schema

Define your own pattern:

--pseudonymize-schema custom \
--pseudonymize-pattern "{study}_{phenotype}_{index:04d}"
# Generates: STUDY1_CASE_0001, etc.
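
The pattern above uses the same placeholder syntax as Python format strings, so its expansion can be previewed with `str.format`. The sketch below assumes that equivalence and uses a hypothetical `missing_placeholders` helper to spot placeholders without metadata before formatting.

```python
from string import Formatter

pattern = "{study}_{phenotype}_{index:04d}"


def missing_placeholders(pattern, metadata):
    """Return placeholder names in the pattern that have no metadata value."""
    names = {field for _, field, _, _ in Formatter().parse(pattern) if field}
    return sorted(names - set(metadata))


metadata = {"study": "STUDY1", "phenotype": "CASE", "index": 1}
unresolved = missing_placeholders(pattern, metadata)
pseudonym = pattern.format(**metadata)
print(pseudonym)  # STUDY1_CASE_0001
```

Here `{index:04d}` pads the index to four digits, which is why the first sample becomes `0001`.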

Security Best Practices

⚠️ Critical Security Information

1. Mapping Table Security

The pseudonymization table contains the key to re-identify your data:

- Store it separately from pseudonymized outputs
- Never share it together with the pseudonymized data unless explicitly authorized
- Consider encrypting the mapping file
- Keep it in a secure location with restricted access
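
One simple way to restrict access is to make the mapping file readable and writable by its owner only. The sketch below shows this with the standard library on a throwaway file; the helper name is hypothetical, and the permission bits have this meaning on POSIX systems only.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Set owner-only read/write permissions and return the resulting mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary file standing in for the mapping table.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "mapping.tsv")
    with open(demo_path, "w") as handle:
        handle.write("original_id\tpseudonym_id\n")
    mode = restrict_to_owner(demo_path)
    print(oct(mode))  # 0o600 on POSIX systems
```

File permissions do not replace encryption or separate storage; they only limit casual access on a shared machine.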

2. Archive Behavior

The archive created by --archive-results does NOT include the mapping table:

- The mapping table is saved in the parent directory of your output by design
- This prevents accidental sharing of identifying information
- Always verify archive contents before sharing
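
Verifying archive contents can be scripted with the standard `tarfile` module. The sketch below builds a small in-memory archive purely for demonstration and lists members whose file name matches a sensitive name; the helper name and the demo file names are assumptions.

```python
import io
import tarfile


def find_sensitive_members(archive, sensitive_names):
    """Return member paths whose base name is one of the sensitive file names."""
    sensitive = set(sensitive_names)
    return [
        member.name
        for member in archive.getmembers()
        if member.name.rsplit("/", 1)[-1] in sensitive
    ]


# Build a demo archive in memory that (wrongly) contains a mapping table.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name in ("results/results.tsv", "results/mapping.tsv"):
        data = b"demo\n"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    leaked = find_sensitive_members(tar, ["mapping.tsv"])
print(leaked)
```

To check a real archive, open it with `tarfile.open(path)` instead of the in-memory buffer and pass the name of your own mapping file.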

3. PED File Handling

Use --pseudonymize-ped to create shareable pedigree files:

- Original PED files should never be shared
- Family relationships are preserved with pseudonymized IDs
- The pseudonymized PED is saved alongside the mapping table
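
To illustrate why relationships survive, the sketch below rewrites the individual, father, and mother columns of standard six-column PED lines with the same mapping, leaving `0` (missing parent) untouched. It is an illustration only: the helper name is hypothetical, and leaving the family ID column unchanged is an assumption rather than documented behavior.

```python
def pseudonymize_ped_line(line, mapping):
    """Replace individual and parent IDs in one tab-separated PED line."""
    fields = line.rstrip("\n").split("\t")
    # Columns 2-4 hold individual, father, and mother IDs; "0" means unknown.
    for column in (1, 2, 3):
        if fields[column] != "0":
            # A KeyError here signals an ID missing from the mapping.
            fields[column] = mapping[fields[column]]
    return "\t".join(fields)


mapping = {"child": "S_001", "dad": "S_002", "mom": "S_003"}
ped_lines = [
    "FAM1\tchild\tdad\tmom\t1\t2",
    "FAM1\tdad\t0\t0\t1\t1",
    "FAM1\tmom\t0\t0\t2\t1",
]
pseudonymized = [pseudonymize_ped_line(line, mapping) for line in ped_lines]
print("\n".join(pseudonymized))
```

Because every column is translated through one mapping, the child's parent columns still point at the pseudonyms of the correct parents.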

Integration with Other Features

With Inheritance Analysis

Pseudonymization automatically handles inheritance pattern outputs:

- Sample IDs in the Inheritance_Samples column are replaced
- Compound heterozygous partner information is preserved
- Family relationships remain interpretable

With HTML Reports

The HTML report uses pseudonymized IDs throughout:

- Sample columns in variant tables show pseudonymized IDs
- IGV integration shows pseudonymized labels
- Download links use pseudonymized filenames

With Excel Output

Excel files automatically use pseudonymized data:

- All sheets contain pseudonymized IDs
- Links and references remain functional

Example Workflows

Publication-Ready Analysis

For a case-control study ready for publication:

variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file cohort.vcf \
  --output-file publication_results.tsv \
  --ped cohort.ped \
  --pseudonymize \
  --pseudonymize-schema categorical \
  --pseudonymize-table ../SECURE/id_mapping.tsv \
  --pseudonymize-ped \
  --html-report \
  --archive-results

# Results:
# 1. ./publication_results.tsv - pseudonymized variant data
# 2. ./report/ - HTML report with pseudonymized IDs
# 3. ../SECURE/id_mapping.tsv - mapping table (NOT in archive)
# 4. ../SECURE/id_mapping_pedigree.ped - pseudonymized PED
# 5. ../variantcentrifuge_results_*.tar.gz - shareable archive

Family Study with Custom IDs

For trio or family studies:

variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file families.vcf \
  --ped families.ped \
  --output-file results.tsv \
  --pseudonymize \
  --pseudonymize-schema custom \
  --pseudonymize-pattern "FAM{family}_{role}_{index:02d}" \
  --pseudonymize-table family_mapping.tsv

Multi-Site Collaboration

For multi-site studies with complete de-identification:

variantcentrifuge \
  --gene-file shared_genes.txt \
  --vcf-file site_data.vcf \
  --output-file site_results.tsv \
  --pseudonymize \
  --pseudonymize-schema anonymous \
  --pseudonymize-table ../restricted/site_mapping.tsv \
  --xlsx \
  --html-report

Reversing Pseudonymization

For internal use, you can reverse the process using the mapping table:

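import re

import pandas as pd

# Load the mapping table as strings so numeric-looking IDs (e.g. "001")
# and IDs such as "NA" are kept exactly as written
mapping = pd.read_csv('mapping.tsv', sep='\t', dtype=str, keep_default_na=False)
reverse_map = dict(zip(mapping['pseudonym_id'], mapping['original_id']))

# Load pseudonymized results
results = pd.read_csv('pseudonymized_results.tsv', sep='\t')

# Match any pseudonym directly followed by '(' in a single pass, so a
# restored original ID is never re-matched as another pseudonym. The
# lookbehind stops a pseudonym from matching inside a longer token.
pattern = re.compile(
    r'(?<!\w)(' + '|'.join(re.escape(p) for p in reverse_map) + r')\('
)

# Function to reverse pseudonymization in GT column
def reverse_gt(gt_value):
    if pd.isna(gt_value):
        return gt_value
    return pattern.sub(lambda m: reverse_map[m.group(1)] + '(', gt_value)

# Apply to GT column
results['GT'] = results['GT'].apply(reverse_gt)

# Save de-pseudonymized results
results.to_csv('original_results.tsv', sep='\t', index=False)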

Technical Details

What Gets Pseudonymized

- Sample IDs in the GT (genotype) column
- Sample IDs in inheritance analysis columns
- Sample IDs in PED files (if --pseudonymize-ped is used)
- Any future columns that contain sample identifiers

Consistency Guarantees

- The same original ID always maps to the same pseudonym within a run
- Pseudonyms are deterministic (original IDs are sorted before assignment)
- Duplicate handling ensures all pseudonyms are unique
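
These guarantees imply that the mapping table should be one-to-one. The sketch below checks that property on a small inline table; it assumes the `original_id` and `pseudonym_id` column names used in the reversal script above, and the helper name is hypothetical.

```python
import csv
import io


def check_one_to_one(rows):
    """Return True if no original ID and no pseudonym appears twice."""
    originals = [row["original_id"] for row in rows]
    pseudonyms = [row["pseudonym_id"] for row in rows]
    return len(set(originals)) == len(originals) and len(set(pseudonyms)) == len(pseudonyms)


# Inline demo data standing in for a real mapping table.
table = "original_id\tpseudonym_id\npatient_A\tS_001\npatient_B\tS_002\n"
rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
is_valid = check_one_to_one(rows)
print(is_valid)
```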

Performance Considerations

- Pseudonymization adds minimal overhead
- Processing happens in-memory before final output
- No impact on analysis accuracy or completeness

Troubleshooting

Common Issues

1. Missing mapping table path

```
Error: --pseudonymize-table is required when using --pseudonymize
```

Solution: Always specify where to save the mapping table

2. Custom schema pattern errors

```
Warning: Missing placeholder in pattern: 'study'
```

Solution: Ensure all placeholders in your pattern have corresponding metadata

3. Mapping table in wrong location

- The mapping table is saved in the parent directory of your output
- This is intentional for security
- Check one directory level up from your results

Validation Checklist

Before sharing pseudonymized data:

- Verify that the mapping table is NOT in the results directory
- Check that all sample IDs are replaced in output files
- Confirm that the PED file is pseudonymized (if using pedigrees)
- Confirm that the results archive does not contain the mapping table
- Document which schema was used for reproducibility
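
The second check can be partly automated by searching the output text for any original ID from the mapping table. The sketch below is a minimal version on inline demo data; the helper name is hypothetical, and the `ID(genotype)` entries separated by `;` are an assumption about the GT column layout.

```python
import re


def find_leaked_ids(text, original_ids):
    """Return original IDs that still appear as whole tokens in the text."""
    leaked = []
    for original in original_ids:
        # Require non-word characters (or text boundaries) on both sides.
        if re.search(rf"(?<!\w){re.escape(original)}(?!\w)", text):
            leaked.append(original)
    return sorted(leaked)


# Demo line in which one sample was not pseudonymized.
output_text = "chr1\t100\tA\tG\tCASE_001(0/1);patient_B(1/1)\n"
leaked = find_leaked_ids(output_text, ["patient_A", "patient_B"])
print(leaked)
```

An empty result does not prove the output is safe (IDs may appear in altered forms), so treat this as a first pass alongside manual review.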