What is a Data Set?
by Stephen M. Walker II, Co-Founder / CEO
What is a data set?
A data set is a collection of related, structured information, typically composed of numbers, text, or images. These data are used to train, test, and validate machine learning models or algorithms. Training data is also one of the primary differentiators between frontier models.
More specifically, a data set is typically divided into three parts: the training set, used to fit the model; the validation set, used to tune hyperparameters and guide model selection; and the test set, used to provide an unbiased evaluation of the final model.
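As a rough illustration of this split, the sketch below uses scikit-learn's train_test_split twice, first carving out a test set and then a validation set. The split ratios and the X, y inputs are illustrative assumptions rather than part of any standard recipe.

```python
# Minimal sketch of a train/validation/test split, assuming scikit-learn is
# installed and X, y are array-like features and labels (hypothetical inputs).
from sklearn.model_selection import train_test_split

def split_dataset(X, y, test_size=0.15, val_size=0.15, seed=42):
    """Return (X_train, y_train), (X_val, y_val), (X_test, y_test)."""
    # Hold out the test set first; it stays untouched until final evaluation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Split the remainder into training and validation sets. Rescale val_size
    # so it remains a fraction of the original dataset.
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=relative_val, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```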
A well-structured and diverse data set is crucial in AI as its quality and comprehensiveness significantly impact the performance and accuracy of the AI system. A data set can encompass a wide range of data types, from images to text, and is instrumental in teaching AI models to recognize patterns and make accurate predictions.
What is the purpose of a data set?
A data set is a crucial resource for training an AI model, particularly Large Language Models (LLMs). It is a diverse collection of data that teaches the model to recognize patterns and make accurate predictions. A data set can encompass a wide range of data types, from images for tasks like facial recognition to vast amounts of text from various sources for training LLMs. These sources can include books, articles, websites, and specific datasets like Wikipedia, Common Crawl, and other large-scale web scraping datasets.
The size, quality, and format of the data set are critical factors that influence the performance and fairness of the AI model. Ensuring high-quality data and appropriate data preparation, including format consistency and preventing data leakage, is therefore a crucial step in the machine learning process.
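As a hedged sketch of what that preparation can look like, the snippet below enforces a simple record schema and drops exact duplicates, one basic way leakage between splits can arise. The {"text", "label"} record format is a hypothetical example schema, not a standard.

```python
# Minimal sketch of basic data preparation: schema checks plus exact-duplicate
# removal. The record format {"text": ..., "label": ...} is a hypothetical
# example schema, not a requirement of any particular framework.
def prepare(records):
    # Format consistency: keep only records with the expected fields and types.
    clean = [r for r in records if isinstance(r.get("text"), str) and "label" in r]

    # Exact-duplicate removal: the same example landing in both the training
    # and test splits is a simple form of data leakage.
    seen, deduped = set(), []
    for r in clean:
        key = r["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            deduped.append(r)
    return deduped
```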
What is the mean, median, mode, and range of a data set?
The mean, median, mode, and range are statistical measures that provide different types of insights about a data set.
- Mean — The mean, also known as the average, is calculated by adding all the numbers in a data set and then dividing by the count of numbers in the set. For example, the mean of the data set {2, 5, 9, 3, 5, 4, 7} is (2 + 5 + 9 + 3 + 5 + 4 + 7) / 7 = 5.
- Median — The median is the middle number in a sorted (smallest to largest) data set. If the data set has an odd number of observations, the median is the middle number; if it has an even number, the median is the average of the two middle numbers. For example, the median of {2, 3, 4, 5, 5, 7, 9} is 5, and the median of {2, 3, 4, 5, 5, 9} is (4 + 5) / 2 = 4.5.
- Mode — The mode is the number that appears most frequently in a data set. A data set may have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal). For example, in the data set {2, 5, 9, 5, 4, 7}, the mode is 5 because it appears twice, more than any other number.
- Range — The range is a measure of variability and represents the difference between the highest and lowest values in a data set. For example, the range of {2, 5, 9, 3, 5, 4, 7} is 9 - 2 = 7.
These measures are commonly used in statistics to provide a summary of the data set and to understand the distribution and spread of the data.
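For a quick check of these numbers, Python's standard statistics module reproduces the examples above.

```python
# The four summary measures above, computed with Python's standard library.
from statistics import mean, median, multimode

data = [2, 5, 9, 3, 5, 4, 7]

print(mean(data))             # 5   (35 / 7)
print(median(data))           # 5   (median sorts the data internally)
print(multimode(data))        # [5] (the most frequent value)
print(max(data) - min(data))  # 7   (range: 9 - 2)
```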
What is the size of a data set?
The size of the data set required depends on the specific application and the complexity of the task. For instance, a simple image recognition task might require a few hundred images, while a complex task like facial recognition might require millions of images. Training LLMs typically requires vast amounts of text data, on the order of billions to trillions of tokens, drawn from various sources. These sources can range from books, articles, and websites to more specific datasets like Wikipedia, Common Crawl, and other large-scale web scraping datasets. Generally, the more complex the task and the larger the model, the larger the data set required.
What formats are used for data sets?
The common format of a dataset for training in machine learning depends on the type of data and the specific requirements of the machine learning framework being used. However, some of the most commonly used formats include:
- Tabular Data — This is the most common and familiar format for machine learning. It consists of rows and columns, where each row represents an observation or sample and each column represents a feature or variable. This format is easy to store, manipulate, and visualize using tools like spreadsheets, databases, or pandas, and it suits many machine learning tasks such as regression, classification, or clustering.
- CSV Files — CSV (Comma-Separated Values) files are popular for machine learning because they are easy to view and debug and easy to read and write from programs. They are text-based files containing comma-separated values.
- Petastorm — A more modern file format that is feature-complete, language-independent, and scalable. It supports high-dimensional data, has native readers in TensorFlow and PyTorch, scales to parallel workers, and can store many terabytes of data.
- NumPy Arrays (.npy) — NumPy arrays are commonly used in Python-based machine learning frameworks like TensorFlow and PyTorch. They are efficient and allow for complex numerical computations.
- Parquet — Parquet is a binary, columnar format that is becoming more popular. It works well for single-table data but may not be ideal for multi-table datasets.
- HDF5 — HDF5 is a widely accepted format in many scientific domains and the deep learning community. It provides built-in compression and is very flexible, allowing an entire machine learning dataset to be stored as a single file.
Remember, the choice of data format depends on the nature and source of your data, as well as the goal and scope of your machine learning project. It's also important to note that data preparation, which includes ensuring format consistency of records, is a crucial step in the machine learning process.
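As a small illustration, the sketch below writes the same toy table in several of these formats using pandas and NumPy. It assumes pandas, NumPy, pyarrow (for Parquet), and PyTables (for HDF5) are installed, and the file names are arbitrary.

```python
# Minimal sketch: one small tabular dataset saved in several common formats.
# Assumes pandas, numpy, pyarrow (Parquet), and PyTables (HDF5) are installed.
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": [0.1, 0.2, 0.3], "label": [0, 1, 0]})

df.to_csv("train.csv", index=False)                 # text-based, easy to inspect
df.to_parquet("train.parquet")                      # columnar binary, single table
df.to_hdf("train.h5", key="train")                  # HDF5 container with compression support
np.save("features.npy", df["feature"].to_numpy())   # raw NumPy array for framework pipelines
```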
How does data set quality influence training?
Quality is crucial. A high-quality data set provides the information an AI system needs to learn from and make accurate predictions. Quality is determined largely by size, diversity, and balance: a high-quality data set is large, covers diverse cases, and has a balanced distribution of classes.
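One quick way to inspect balance is to count the label distribution; the snippet below is a minimal sketch with made-up labels.

```python
# Minimal sketch: checking class balance from a list of labels (made-up data).
from collections import Counter

labels = ["cat", "dog", "cat", "cat", "bird", "dog", "cat"]
counts = Counter(labels)
total = sum(counts.values())

for cls, n in counts.most_common():
    print(f"{cls}: {n} ({n / total:.0%})")
# A heavily skewed distribution is a signal to rebalance, resample,
# or collect more data before training.
```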
The quality of a dataset significantly influences the training and performance of machine learning models, including large language models (LLMs). This influence is multifaceted and can be understood in the following ways:
- Model Performance — The quality and relevance of training data often determine the performance of a model. Techniques such as feature engineering and ensemble modeling can partially compensate for inadequate or insufficient training data, but the quality of input data typically sets an upper limit on a model's potential performance.
- Data Relevance — The relevance of the data to the specific task or problem at hand is crucial. Even if a dataset is comprehensive and well-structured, it might not be useful if it doesn't contain the information necessary for the specific task or problem the model is designed to solve.
- Bias and Fairness — The quality of the data also influences the fairness of the model. If the training data reflects societal biases, the model can perpetuate and amplify these biases in its predictions.
- Alignment and Pre-training — High-quality data impacts every aspect of the LLM training pipeline, including alignment and pre-training. Alignment refers to training LLMs to output text that satisfies the goals of a human, while pre-training involves learning general-purpose representations from raw text. The quality and diversity of data have a massive impact on both processes.
- Data Accuracy — The accuracy of the data is critical to the success of AI and ML models. Qualitatively rich data yields better model outputs and more consistent processing and decision-making. If the data is not accurate, the model may not function as intended.
- Data Selection Bias — This refers to the systematic error that arises from the choice of texts used to train language models. This bias can occur in the sampling stage, when the texts are identified, or when the data is filtered and cleaned. Data selection bias can cause other biases to manifest in a cascading fashion.
The quality of a dataset is a critical factor in the successful training and performance of machine learning models. It influences model performance, data relevance, bias and fairness, alignment and pre-training, data accuracy, and data selection bias. Therefore, ensuring high-quality data is a crucial step in any machine learning project.
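To make the filtering and cleaning step concrete, here is a minimal sketch of heuristic quality filters for raw text documents; the thresholds are illustrative assumptions, not values from any published pipeline.

```python
# Minimal sketch: heuristic quality filters for raw text documents.
# Thresholds are illustrative assumptions, not values from a real pipeline.
def passes_quality_filters(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                          # too short to be informative
        return False
    if len(set(words)) / len(words) < 0.3:       # highly repetitive content
        return False
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                        # mostly markup, numbers, or noise
        return False
    return True

# filtered = [doc for doc in corpus if passes_quality_filters(doc)]
```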
What are common data sets for training LLMs?
Common data sets for training Large Language Models (LLMs) consist of vast amounts of text data from various sources, ranging from books, articles, and websites to more specific data sets like Wikipedia, Common Crawl, and other large-scale web scraping data sets. The choice of data set depends on the specific requirements of the LLM, such as the language, domain, and level of complexity required for the task at hand. Widely used datasets include:
- Common Crawl — This dataset comprises terabytes of raw web data extracted from billions of web pages, with new data files released from the crawler each month. Several large language models, including GPT-3, LLaMA, OpenLLaMA, and T5, were trained on Common Crawl.
- RefinedWeb — A large-scale dataset of filtered and deduplicated web data derived from Common Crawl, used to train the Falcon models.
- The Pile — An 800 GB corpus curated from 22 diverse datasets, mostly from academic or professional sources. It was instrumental in training various LLMs, including GPT-Neo, LLaMA, and OPT.
- C4 (Colossal Clean Crawled Corpus) — Derived from the Common Crawl dataset, C4 is preprocessed and filtered, making it a cleaner and more useful resource for training LLMs.
- StarCoder Data — The dataset used by Hugging Face's BigCode project to train StarCoder, a code-focused model, on code in different programming languages gathered from GitHub.
- BookCorpus — This dataset turned scraped text from roughly 11,000 unpublished books into a 985 million-word dataset. It was used for training LLMs like RoBERTa, XLNet, and T5.
- ROOTS — A multilingual corpus assembled by the BigScience project and used to train the BLOOM model.
- Wikipedia — Many LLMs rely on Wikipedia in their training process to ensure a general knowledge base and improve their ability to generate relevant and coherent outputs across different domains.
- RedPajama — An open-source effort to replicate the LLaMA training dataset. It comprises 1.2 trillion tokens extracted from Common Crawl, C4, GitHub, books, and other sources. RedPajama's transparent approach helped train MPT-7B and OpenLLaMA.
- LIMA — A high-quality, human-generated dataset used for instruction fine-tuning. Researchers carefully selected 1,000 instruction pairs to fine-tune a 65-billion-parameter LLaMA-v1 model via supervised fine-tuning; the resulting model is known as LIMA.
It's important to note that the quality and diversity of the training dataset significantly impact the performance of the LLM. Therefore, data preprocessing, including cleaning, normalization, and preventing data leakage, is crucial when using these datasets.
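For illustration, corpora at this scale are often consumed in streaming mode rather than downloaded in full. The sketch below uses the Hugging Face datasets library; the allenai/c4 dataset id and en configuration are assumptions based on the Hub and may change over time.

```python
# Minimal sketch: streaming a large public pre-training corpus instead of
# downloading it. The `allenai/c4` id and `en` config are assumptions based
# on the Hugging Face Hub and may change over time.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Iterate lazily over examples; each record exposes a "text" field.
for i, example in enumerate(c4):
    print(example["text"][:200])
    if i == 2:
        break
```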
What is the leading open source data set?
RedPajama-Data-v2 is an open dataset developed by Together AI for training large language models (LLMs). It is a significant upgrade from the previous version, RedPajama-Data-1T, with a massive increase in size and scope. The dataset includes over 100 billion text documents derived from 84 CommonCrawl snapshots.
The dataset contains 30 trillion filtered and deduplicated tokens, which is a substantial increase from the 1.2 trillion tokens in the first version. The raw data before filtering and deduplication was over 100 trillion tokens. The dataset covers five languages: English, German, French, Spanish, and Italian.
RedPajama-Data-v2 is designed to be a resource for the community, providing a base from which high-quality datasets for LLM training can be extracted and researched. It includes 40+ pre-computed data quality annotations that can be used for further filtering and weighting. The dataset is available on GitHub and HuggingFace.
The creators of RedPajama-Data-v2 envision it as a growing resource that will be enriched with additional domains, new snapshots, and more quality annotations over time. They also plan to add more quality annotations such as contamination annotations against popular LLM benchmarks, topic modeling and classification annotations for each document, and other annotations that the community finds useful.
Together AI is also building open models based on the RedPajama-Dataset-V2 and helping companies and organizations build custom models with principled mixes of open and proprietary datasets.
How does RedPajama stand out compared to previous open data sets?
The RedPajama dataset is an open-source dataset designed for training large language models (LLMs). It was created as part of the RedPajama project, a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. The project aims to create leading open-source models by reproducing the LLaMA training dataset.
The RedPajama dataset has undergone several iterations. The first version, RedPajama-Data-1T, was a 1.2 trillion token dataset modeled on the training data described in the original LLaMA paper. The data was split across 2,084 different files and the full dataset was 2.67TB.
The latest version, RedPajama-Data-v2, contains 30 trillion filtered and deduplicated tokens, 30 times larger than the first version. It provides the most complete coverage of Common Crawl to date, with over 100 billion text documents drawn from 84 processed snapshots. The dataset is available in five languages: English, French, Spanish, German, and Italian.
The RedPajama dataset is designed to serve as a base from which high-quality datasets for LLM training can be extracted, and on which LLM training data can be thoroughly researched. It includes quality signals computed for every document in the head+middle partition, which lets LLM developers easily slice and filter the data and combine these signals into their own data quality pipeline to create a custom pre-training dataset.
The RedPajama dataset is available on GitHub and HuggingFace, and all data processing scripts are open source. The raw data contained over 100 trillion tokens, which cleaning and deduplication reduced to 30 trillion tokens. The cleaning pipelines used quality signals such as naturalness, repetition, content-based measures, and text similarity, along with deduplication.
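As a sketch of how precomputed quality signals can drive that slicing, the snippet below filters documents by a few annotation values. The record schema and signal names are hypothetical illustrations, not RedPajama-Data-v2's actual field names.

```python
# Minimal sketch: selecting documents by precomputed quality annotations.
# The schema and signal names are hypothetical, not the dataset's actual fields.
def keep(doc: dict) -> bool:
    signals = doc.get("quality_signals", {})
    return (
        signals.get("word_count", 0) >= 100              # drop very short documents
        and signals.get("repetition_ratio", 1.0) <= 0.2  # drop highly repetitive text
        and not signals.get("is_duplicate", False)       # drop flagged duplicates
    )

# documents = ...  # iterable of {"text": ..., "quality_signals": {...}} records
# pretraining_subset = [d for d in documents if keep(d)]
```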