A diamond needs to be processed before it can exhibit its brilliant qualities.

Raw data is like a rough diamond that you need to refine before it can be useful. Also called “atomic data” or “primary data,” raw data is data that has just been collected from various sources and is still disorganized, so it may not provide any clear insight. It has yet to undergo manual or computer processing to serve any useful purpose.


People use software to extract, process, and analyze raw data so that they can draw conclusions and make projections that fit their current requirements.

Difference between Data and Raw Data

The data that we commonly see is already processed and converted into a format that people or machines can easily understand. Such data is derived from raw data.

Different types of data can be derived from raw data, depending on how it is going to be used. It’s quite similar to a tray of eggs. If you want to make meringues, you have to extract the egg whites only. In the same way, you can extract only the points that are most valuable in a particular use case from raw data.

Therefore, the main difference between data and raw data is that raw data is a jumbled mixture of different information. Data, or processed data, on the other hand, consists of the relevant and valuable information that has already been extracted from raw data.
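
As a minimal sketch of this kind of extraction, the Python snippet below takes a few raw records and keeps only the fields that one use case needs. The record layout, field names, values, and the helper `extract_fields` are all hypothetical and serve only as an illustration.

```python
# Hypothetical raw records: a jumbled mix of fields, not all relevant to every use case.
raw_records = [
    {"name": "Ana", "grade": 88, "locker": "B12", "bus_route": 4},
    {"name": "Ben", "grade": 73, "locker": "A03", "bus_route": 2},
    {"name": "Cho", "grade": 95, "locker": "C27", "bus_route": 4},
]


def extract_fields(records, fields):
    """Return new records that contain only the requested fields."""
    return [{field: record[field] for field in fields} for record in records]


# A grade report only needs names and grades; the raw records stay untouched.
grade_report = extract_fields(raw_records, ["name", "grade"])
print(grade_report)
```

Because the function builds new records instead of modifying the originals, the same raw records can later feed a different use case that needs other fields.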

The Sushi Principle

Some may think that processed data is much easier to use because it is already structured. However, in the context of data analysis and machine learning (ML), it is often better to keep data in its raw format.

According to the Sushi Principle, raw data is better than “cooked” or “processed” data because it lets you carry out data analysis quickly, securely, and in an easily understandable manner. Data can be processed efficiently when you can iterate on your queries continuously rather than having to plan them beforehand.

3 Ways to Keep Data in Raw Format

When you want to use your data in its raw format, here are the different ways to do it:

1. Using a Simple and Well-Tested Pipeline

With numerous and complicated pipelines, it may be easy to collect data, but it becomes hard to verify that the machines are doing accurate calculations. A simple pipeline is easier to test and also gives you the flexibility to do different tasks with your data.
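
One way to picture a simple, well-tested pipeline step is a single small function with checks that a person can verify by hand. The sketch below is illustrative only; the function name `parse_grade`, the 0 to 100 grade scale, and the text input format are assumptions, not part of any particular tool.

```python
def parse_grade(raw_value):
    """Convert one raw text entry into an integer grade, or None if it is unusable."""
    text = raw_value.strip()
    if not text.isdecimal():
        return None  # keep the step simple: flag bad input instead of guessing
    grade = int(text)
    # Assumed grade scale of 0 to 100; anything outside it is treated as unusable.
    return grade if 0 <= grade <= 100 else None


# Small checks that can be verified by hand.
assert parse_grade(" 87 ") == 87
assert parse_grade("abc") is None
assert parse_grade("140") is None
```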

2. Collecting and Keeping Original Data

Keeping a copy of all your original data gives you the ability to trace the source of any statistic and allows you to run as many queries as you need. In short, when you have all of your original raw data, you don’t have to waste time backtracking, and you can proceed with tasks that add value to your work.

3. Summarizing and Sampling During Queries

Summarizing and sampling at query time, rather than beforehand, helps you check that your summary statistics are free of flaws. Plus, it is easier to take samples of your data once you have an idea of what information you need. That way, you are more likely to get meaningful and statistically significant results.

In contrast, summarizing and sampling early on can make your data prone to inaccuracies. Shrinking your data may leave you with too little to derive statistically significant answers to your queries.
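
The difference can be sketched in a few lines of Python. The response times below are made-up values, and the variable names are hypothetical. If only an early summary (here, the mean) had been stored, a later question about the median or the slowest request could not be answered.

```python
import statistics

# Hypothetical raw measurements, kept in full (values are made up).
response_times_ms = [120, 95, 110, 2300, 105, 98, 101]

# Summarizing early: only this one number would survive.
early_summary = statistics.mean(response_times_ms)

# Summarizing during queries: each new question goes back to the raw data.
median_ms = statistics.median(response_times_ms)
slowest_ms = max(response_times_ms)

print(f"mean={early_summary:.1f} median={median_ms} slowest={slowest_ms}")
```

In this made-up sample, a single outlier pulls the mean far above the median, which is only visible because the raw values are still available at query time.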

Users typically need raw data for analysis and often obtain it by exporting a data file, such as a comma-separated values (CSV) file.
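
As an illustration, a CSV export can be read with Python's standard `csv` module. The file contents, column names, and values here are hypothetical, and an in-memory string stands in for a real file so that the sketch runs as-is.

```python
import csv
import io

# Hypothetical CSV export; a real file would be opened with open(path, newline="").
csv_text = """student,grade
Ana,88
Ben,73
Cho,95
"""

# DictReader uses the header row as keys and returns every value as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Raw CSV values are text, so convert them before doing arithmetic.
grades = [int(row["grade"]) for row in rows]
print(grades)
```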

Raw Data Examples

Raw data examples can be as simple as students’ grades in a class, such as the data in the video below.

Given the grades of 30 students (raw data), you can create a processed data set in the form of a frequency distribution table. This table shows how many students got grades that belong to a particular range, helping professors understand how the whole class is performing.
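
A frequency distribution like this can be sketched in Python as follows. The 30 grades and the grade ranges are made-up values chosen for illustration; they are not the data from the original example.

```python
# Hypothetical grades for 30 students (made-up values for illustration).
grades = [
    55, 62, 67, 71, 74, 78, 81, 83, 85, 88,
    90, 92, 95, 58, 64, 69, 73, 76, 79, 82,
    84, 86, 89, 91, 94, 97, 66, 75, 87, 100,
]

# Grade ranges (inclusive bounds) chosen for this sketch.
ranges = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 100)]

# Count how many grades fall into each range.
frequency = {
    f"{low}-{high}": sum(low <= grade <= high for grade in grades)
    for low, high in ranges
}

# Print the processed data as a simple frequency distribution table.
print(f"{'Range':<8}{'Students':>8}")
for label, count in frequency.items():
    print(f"{label:<8}{count:>8}")
```

The list of grades is the raw data; the `frequency` dictionary is the processed data derived from it.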

A more complex raw data example is the data transferred between different computer components. If you transfer a video file to an external hard drive, the data can be picked up and processed by tools, such as busTRACE. The raw data captured by the tool is boxed in red below.

[Image: raw data captured by busTRACE, boxed in red]

As you can see, the raw data may not make sense on its own. To process it, the tool provides buttons for functions, such as identifying patterns and determining matches, that help users understand what is happening.

We may not encounter raw data as frequently as data scientists, statisticians, and computer programmers do. Still, it’s important to learn what raw data is and appreciate where the data we are familiar with now actually came from.

Key Takeaways

  • Raw data refers to information that needs to be refined before it can become useful. It is also called “atomic data” or “primary data.”
  • Raw data is a jumbled mixture of different information, while processed data consists of the relevant and valuable information that has been extracted from raw data.
  • According to the Sushi Principle, raw data is more useful than processed data in data analysis and ML.
  • There are at least three ways to keep data in raw format, namely, using a simple and well-tested pipeline, collecting and keeping original data, and summarizing and sampling during queries.
