Data Providers

Current LLMs (e.g., GPT-4, Gemini, Llama) are trained on trillions of tokens, and the supply of high-quality public data is nearing exhaustion. Enhancing model capability therefore requires incorporating more high-quality private data. From the perspective of data providers, however, sharing data raises privacy concerns, since the data is consumed by training tasks running on external compute. Importantly, models trained on their data should not leak private information, such as personally identifiable information (PII). Furthermore, incentive mechanisms are crucial to encourage data providers to contribute more high-quality data. Yet the current model training process lacks such incentives: providers of high-quality data receive no reward when the resulting models are monetized.
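As a minimal illustration of the PII concern, a data provider might run a redaction pass before contributing text to a training corpus. The Python sketch below is illustrative only: the function name `redact_pii` and the pattern set are assumptions, and simple regexes are far from sufficient for production-grade PII detection.

```python
import re

# Hypothetical pre-sharing step: replace common PII patterns with typed
# placeholders. Real deployments would use dedicated PII-detection tooling;
# these regexes only sketch the idea.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Substitute each matched PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or +1 (555) 123-4567."
    print(redact_pii(sample))
    # -> Contact Jane at [EMAIL] or [PHONE].
```

Redaction of this kind addresses only the data-sharing side; preventing a trained model from memorizing and regurgitating residual PII typically requires additional training-time safeguards (e.g., deduplication or differentially private training).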
