Data Process

Data processing stands as a natural and effective strategy to mitigate privacy leakage during the preparation of pretraining corpora and finetuning instruction datasets. This process involves masking or filtering personally identifiable information (PII) and other sensitive data. However, such measures can inadvertently reduce diversity and lead to information loss, potentially weakening the capabilities of Large Language Models (LLMs).

Moreover, while data anonymization techniques like k-anonymity, l-diversity, and t-closeness can be employed to eliminate privacy information from data, they may not fully protect against membership inference attacks, as signatures other than PII may still identify data providers.

Additionally, deduplication, although a simple and effective method, can improve model quality while mitigating privacy risks associated with training data extraction and membership inference attacks .

Last updated