With the rise of large language models (LLMs), artificial intelligence programs such as ChatGPT that crunch massive amounts of text to produce natural language responses to queries, the demand for high-quality training data has surged. There is growing concern, however, that these models may soon exhaust the available sources. That concern has fueled frantic efforts by tech companies to secure data, sometimes through controversial means, sparking legal battles over copyright. The quality of training data is crucial, as it directly impacts the performance and biases of AI systems.
Shadi Rezapour, PhD, an assistant professor in the College of Computing & Informatics (CCI), studies natural language processing and computational social science. Rezapour highlights the challenges of acquiring diverse and representative datasets and emphasizes the importance of ethical data sourcing to avoid perpetuating biases and inaccuracies. Efforts to ensure data quality and transparency, such as dataset documentation and certification initiatives, are critical for the responsible development of AI.