Recently, EleutherAI, a small group of researchers devoted to open-source AI research, created The Pile, a massive dataset designed for training NLP models such as GPT-2 and GPT-3. The dataset is open-source, contains over 800GB of English text, and is still growing.
The Methods
EleutherAI compiled a series of other popular language modeling datasets to create a diverse, thorough, and generalized one-stop shop for NLP tasks. The 22 component datasets include Pile-CC, Wikipedia, PubMed Central, GitHub, Stack Exchange, YouTube subtitles, US Patent and Trademark Office backgrounds, and more, spanning academic writing, fiction, code, and mathematics. The Pile also introduces OpenWebText2 and BookCorpus2, extended versions of the original OpenWebText and BookCorpus datasets.
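To get a feel for the data, you can inspect a shard directly. The Pile is distributed as zstd-compressed JSON Lines, where each line is a document with a text field and a meta field naming its component dataset. The sketch below assumes a shard has been downloaded locally (the filename is an example) and tallies documents per component using the zstandard Python package.

import io
import json
from collections import Counter

import zstandard  # pip install zstandard

counts = Counter()
with open("00.jsonl.zst", "rb") as fh:  # example filename for a locally downloaded shard
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        doc = json.loads(line)
        # each document records which component it came from, e.g. "Pile-CC"
        counts[doc["meta"]["pile_set_name"]] += 1

print(counts.most_common(5))

Streaming the decompressed bytes line by line keeps memory use flat, since a full shard is far larger than typically fits in memory.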
The Goals
Most large language models are trained on private datasets built from Common Crawl data, which limits their downstream generalization. Dataset diversity, a core feature of The Pile, is intended to improve the downstream generalization of models trained on it.
While The Pile was initially conceived as a training dataset for large-scale models, its diversity has also proved useful for evaluation.
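In practice, evaluating against The Pile means scoring held-out Pile text with a language model and reporting perplexity. Below is a minimal sketch assuming the Hugging Face transformers library and the public GPT-2 checkpoint; the sample string merely stands in for a held-out Pile document.

import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# stand-in for a held-out document drawn from a Pile component
text = "The Pile is a large, diverse, open-source dataset of English text."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    # passing labels makes the model return the mean token-level cross-entropy
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")

Lower perplexity on a given component indicates the model predicts that kind of text better, so per-component scores can reveal where a model's training data left gaps.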
The researchers hope that with all of this data they can replicate GPT-3, but trained on more diverse data and made freely available. They also hope to create similar datasets in languages other than English in the future.
The Future of NLP
NLP jobs are on the rise, and standing out requires a broad set of skills, including familiarity with models like the aforementioned GPT-3.
The ODSC on-demand training platform, Ai+ Training, offers a number of videos that will help you get up to date on the latest NLP skills, tricks, tools, platforms, libraries, and research advancements. Here are a few standout talks:
An Introduction to Transfer Learning in NLP and HuggingFace Tools: Thomas Wolf, PhD | Chief Science Officer | Hugging Face
Natural Language Processing Case-studies for Healthcare Models: Veysel Kocaman | Lead Data Scientist and ML Engineer | John Snow Labs
Transform your NLP Skills Using BERT (and Transformers) in Real Life: Niels Kasch, PhD | Data Scientist and Founding Partner | Miner & Kasch
A Gentle Intro to Transformer Neural Networks: Jay Alammar | Machine Learning Research Engineer | jalammar.github.io
Level Up: Fancy NLP with Straightforward Tools: Kimberly Fessel, PhD | Senior Data Scientist, Instructor | Metis
Build an ML pipeline for BERT models with TensorFlow Extended – An end-to-end Tutorial: Hannes Hapke | Senior Machine Learning Engineer | SAP Concur
Natural Language Processing: Feature Engineering in the Context of Stock Investing: Frank Zhao | Senior Director, Quantamental Research | S&P Global
Transfer Learning in NLP: Joan Xiao, PhD | Principal Data Scientist | Linc Global
Developing Natural Language Processing Pipelines for Industry: Michael Luk, PhD | Chief Technology Officer | SFL Scientific
Deep Learning-Driven Text Summarization & Explainability: Nina Hristozova | Junior Data Scientist | Thomson Reuters