# Privacy and Pseudonymization

VariantCentrifuge includes comprehensive pseudonymization features to help you share genomic data while protecting participant privacy.

## Overview

The pseudonymization feature replaces original sample identifiers with consistent, non-identifiable pseudonyms throughout all output files. This enables data sharing for publication or collaboration while maintaining participant privacy.

## Quick Start

Basic pseudonymization with sequential IDs:

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file cohort.vcf \
  --output-file results.tsv \
  --pseudonymize \
  --pseudonymize-table mapping.tsv
```

## Naming Schemas

### Sequential Schema (Default)

Simple numbered identifiers:

```bash
--pseudonymize-schema sequential \
--pseudonymize-prefix STUDY_A
# Generates: STUDY_A_001, STUDY_A_002, etc.
```

### Categorical Schema

Groups samples by phenotype or other metadata:

```bash
--pseudonymize-schema categorical \
--pseudonymize-category-field phenotype
# Generates: CASE_001, CASE_002, CONTROL_001, etc.
```

### Anonymous Schema

Letter-number combinations or hash-based IDs:

```bash
--pseudonymize-schema anonymous
# Generates: A001, A002, B001, etc.
```

### Custom Schema

Define your own pattern:

```bash
--pseudonymize-schema custom \
--pseudonymize-pattern "{study}_{phenotype}_{index:04d}"
# Generates: STUDY1_CASE_0001, etc.
```

## Security Best Practices

⚠️ **Critical Security Information**

### 1. Mapping Table Security

The pseudonymization table contains the key to re-identify your data:

- Store it separately from pseudonymized outputs
- Never share it with the pseudonymized data unless authorized
- Consider encrypting the mapping file
- Keep it in a secure location with restricted access

### 2. Archive Behavior

The `--archive-results` flag does NOT include the mapping table:

- Mapping tables are saved in the parent directory by design
- This prevents accidental sharing of identifying information
- Always verify archive contents before sharing

### 3. PED File Handling

Use `--pseudonymize-ped` to create shareable pedigree files:

- Original PED files should never be shared
- Family relationships are preserved with pseudonymized IDs
- The pseudonymized PED is saved alongside the mapping table

## Integration with Other Features

### With Inheritance Analysis

Pseudonymization automatically handles inheritance pattern outputs:

- Sample IDs in `Inheritance_Samples` column are replaced
- Compound heterozygous partner information is preserved
- Family relationships remain interpretable

### With HTML Reports

The HTML report will use pseudonymized IDs throughout:

- Sample columns in variant tables
- IGV integration will show pseudonymized labels
- Download links will use pseudonymized filenames

### With Excel Output

Excel files automatically use pseudonymized data:

- All sheets will contain pseudonymized IDs
- Links and references remain functional

## Example Workflows

### Publication-Ready Analysis

For a case-control study ready for publication:

```bash
variantcentrifuge \
  --gene-file cancer_genes.txt \
  --vcf-file cohort.vcf \
  --output-file publication_results.tsv \
  --ped cohort.ped \
  --pseudonymize \
  --pseudonymize-schema categorical \
  --pseudonymize-table ../SECURE/id_mapping.tsv \
  --pseudonymize-ped \
  --html-report \
  --archive-results

# Results:
# 1. ./publication_results.tsv - pseudonymized variant data
# 2. ./report/ - HTML report with pseudonymized IDs
# 3. ../SECURE/id_mapping.tsv - mapping table (NOT in archive)
# 4. ../SECURE/id_mapping_pedigree.ped - pseudonymized PED
# 5. ../variantcentrifuge_results_*.tar.gz - shareable archive
```

### Family Study with Custom IDs

For trio or family studies:

```bash
variantcentrifuge \
  --gene-file genes.txt \
  --vcf-file families.vcf \
  --ped families.ped \
  --output-file results.tsv \
  --pseudonymize \
  --pseudonymize-schema custom \
  --pseudonymize-pattern "FAM{family}_{role}_{index:02d}" \
  --pseudonymize-table family_mapping.tsv
```

### Multi-Site Collaboration

For multi-site studies with complete de-identification:

```bash
variantcentrifuge \
  --gene-file shared_genes.txt \
  --vcf-file site_data.vcf \
  --output-file site_results.tsv \
  --pseudonymize \
  --pseudonymize-schema anonymous \
  --pseudonymize-table ../restricted/site_mapping.tsv \
  --xlsx \
  --html-report
```

## Reversing Pseudonymization

For internal use, you can reverse the process using the mapping table:

```python
import pandas as pd

# Load the mapping table
mapping = pd.read_csv('mapping.tsv', sep='\t')
reverse_map = dict(zip(mapping['pseudonym_id'], mapping['original_id']))

# Load pseudonymized results
results = pd.read_csv('pseudonymized_results.tsv', sep='\t')

# Function to reverse pseudonymization in GT column
def reverse_gt(gt_value):
    if pd.isna(gt_value):
        return gt_value
    for pseudo, orig in reverse_map.items():
        gt_value = gt_value.replace(pseudo + '(', orig + '(')
    return gt_value

# Apply to GT column
results['GT'] = results['GT'].apply(reverse_gt)

# Save de-pseudonymized results
results.to_csv('original_results.tsv', sep='\t', index=False)
```

## Technical Details

### What Gets Pseudonymized

- Sample IDs in the GT (genotype) column
- Sample IDs in inheritance analysis columns
- Sample IDs in PED files (if `--pseudonymize-ped` is used)
- Any future columns that contain sample identifiers

### Consistency Guarantees

- The same original ID always maps to the same pseudonym within a run
- Pseudonyms are deterministic (sorted before assignment)
- Duplicate handling ensures all pseudonyms are unique

### Performance Considerations

- Pseudonymization adds minimal overhead
- Processing happens in-memory before final output
- No impact on analysis accuracy or completeness

## Troubleshooting

### Common Issues

1. **Missing mapping table path**

   ```
   Error: --pseudonymize-table is required when using --pseudonymize
   ```

   Solution: Always specify where to save the mapping table

2. **Custom schema pattern errors**

   ```
   Warning: Missing placeholder in pattern: 'study'
   ```

   Solution: Ensure all placeholders in your pattern have corresponding metadata

3. **Mapping table in wrong location**

   - The mapping table is saved in the parent directory of your output
   - This is intentional for security
   - Check one directory level up from your results

### Validation Checklist

Before sharing pseudonymized data:

- [ ] Verify mapping table is NOT in the results directory
- [ ] Check that all sample IDs are replaced in output files
- [ ] Confirm PED file is pseudonymized (if using pedigrees)
- [ ] Test that results archive doesn't contain mapping table
- [ ] Document which schema was used for reproducibility

## See Also

- [Security Best Practices](../admin-guide/security.md)
- [Data Sharing Guidelines](../admin-guide/data-sharing.md)
- [CLI Reference](../reference/cli.md#data-privacy-options)
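
## Example: Automating Parts of the Validation Checklist

Two items on the validation checklist (no original sample IDs left in output files, and no mapping table inside the results archive) lend themselves to a scripted check. The following is a minimal sketch, not part of VariantCentrifuge. It assumes the mapping table is a tab-separated file with an `original_id` column, as used in the reversal example in this guide, and that the archive is a tar file. The function names `load_original_ids`, `find_leaked_ids`, and `archive_has_file` are hypothetical helpers introduced for illustration.

```python
import csv
import re
import tarfile
from pathlib import Path


def load_original_ids(mapping_path):
    """Read original sample IDs from a TSV mapping table.

    Assumes a header row with an 'original_id' column (an assumption
    based on the reversal example in this guide).
    """
    with open(mapping_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["original_id"] for row in reader]


def find_leaked_ids(text, original_ids):
    """Return the original IDs that still appear in `text` as whole tokens."""
    leaked = set()
    for sample_id in original_ids:
        # Match the ID only when it is not embedded in a longer
        # alphanumeric identifier (e.g. 'NA1287' inside 'NA12878').
        pattern = (
            r"(?<![A-Za-z0-9_])" + re.escape(sample_id) + r"(?![A-Za-z0-9_])"
        )
        if re.search(pattern, text):
            leaked.add(sample_id)
    return leaked


def archive_has_file(archive_path, file_name):
    """Return True if any member of the tar archive has this base name."""
    with tarfile.open(archive_path) as archive:
        return any(
            Path(member).name == file_name for member in archive.getnames()
        )
```

A typical use would be to call `load_original_ids` on the mapping table, pass the contents of each output file to `find_leaked_ids`, and treat any non-empty result as a reason to stop before sharing. Likewise, `archive_has_file(archive_path, "id_mapping.tsv")` should return `False` for an archive that is safe to distribute. A scripted check like this supplements manual review and does not replace it, since identifiers can also appear in forms the mapping table does not list.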