Friday, September 15 @ 2:35pm
Wege Auditorium
The data in data science: measuring the impact of data curation on large language model pretraining
Large language models like BERT and ChatGPT are fundamentally a reflection of the data used to train them. Assembling millions of documents from diverse sources requires innumerable choices. But because of the time and expense of the initial, general-purpose “pretraining” phase, many of these choices are made heuristically, without systematic, evidence-based justification. We train models to measure the effects of three common curation decisions: document age, quality and toxicity filtering, and data sources. We find that these choices have significant, measurable effects that cannot be fully overcome by additional training.
David Mimno is an associate professor in the Information Science department at Cornell University. His work centers on the use of machine learning and natural language processing in humanities and social science applications. He holds a Ph.D. from UMass Amherst and was previously the head programmer at the Perseus Project at Tufts University and a postdoctoral fellow at Princeton University. His work in machine learning has been supported by the Sloan Foundation, the NEH, and the NSF.