Data Glossary
What is Analytics?
The process of systematically analyzing data and statistics is called analytics. It helps you draw meaningful conclusions and understand what is happening behind the numbers. The insights you get can then lead you to take the appropriate action.
For instance, you can use analytics to examine data generated by your website’s traffic, your customers’ buying patterns, your day-to-day business activities, and whatever areas are relevant to your business. It’s an insightful tool when formulating a company’s strategic directions.
What is Big Data?
A massive volume of both structured and unstructured data is referred to as "Big Data". You can compare it to a gigantic tsunami. But instead of water, you have a tidal wave of data.
Big data is characterized by volume, velocity, and variety. The amount is so large that it cannot be processed using conventional database techniques. The speed with which data comes in and is gathered is very fast and can exceed available computer processing capacity. Data also comes in different formats, such as text files, audio and video files, email, and spreadsheets.
What is Big Data Architecture?
Big data architecture is a system for managing huge volumes of data, which may be too complicated for traditional database management systems (DBMSs) to handle. It dictates how data must be processed, stored, accessed, and consumed.
Big data architecture directs how enterprises use big data analytics. As such, it serves as a blueprint for big data networks and solutions. It shows how solutions would work, what components must be put in place, how information processes would proceed, and what security measures must be taken.
Think of big data architecture as a water dam built to control water flow, making it easier for everyone to receive the right amount of water supply.
What is Business Intelligence (BI)?
The king needs to make an important decree and wants to make sure that his decision will be a wise one. He summons his esteemed council of elders, and everyone contributes their piece of knowledge. Everything is then analyzed and translated into a decision that can change the course of the realm.
In modern corporate setups, that’s called business intelligence: a collection of practices, tools, and technologies that are used to gather and analyze raw data and transform it into meaningful business information. This helps executives and managers improve the business decisions they make.
What is Clickstream Data?
Clickstream data refers to any information obtained from an Internet user’s online activities, including the search terms they used, web pages they visited, and links they clicked. Clickstream data helps websites keep track of a buyer’s journey from the first time they searched for an item and the landing page they visited to actual purchase or cart abandonment.
E-commerce companies are just some of those that collect clickstream data. Advertising companies, Internet service providers (ISPs), social networking sites, and telecommunications companies also collect this data.
What is Computational Learning Theory?
Computational learning theory (CoLT) refers to applying formal mathematical methods to learning systems using theoretical computer science tools to quantify learning problems. This task includes discerning how hard it is for an artificial intelligence (AI) system to learn specific tasks.
Simply put, CoLT is an AI subfield devoted to studying the design and analysis of machine learning (ML) algorithms. It analyzes how difficult it will be for an AI system to learn a task.
What is Data Aggregation?
Data aggregation is the process of summarizing or grouping the records in a dataset. If you have, say, a list of 10,000 students in a university, data aggregation entails that you group them according to specific categories. Depending on what’s required, you may aggregate the list of students according to their programs and courses, for instance. You may also group them according to year level, gender, or age.
Beyond the university example, data aggregation is useful across all industries, including finance and banking, technology, production, marketing, and advertising.
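To make the idea concrete, here is a minimal sketch in Python using the pandas library; the student names, programs, and column labels are made up for illustration.

```python
import pandas as pd

# Hypothetical student records; names and columns are illustrative only.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev", "Elle", "Finn"],
    "program": ["Engineering", "Business", "Engineering", "Arts", "Business", "Arts"],
    "year_level": [1, 2, 1, 3, 2, 1],
})

# Aggregate: count the students in each program and average their year level.
summary = students.groupby("program").agg(
    student_count=("name", "count"),
    avg_year_level=("year_level", "mean"),
)
print(summary)
```

Swapping "program" for "year_level", "gender", or "age" changes the grouping without changing the rest of the code.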
What is Data Annotation?
Data annotation is simply the process of labeling information so that machines can use it. It is especially useful for supervised machine learning (ML), where the system relies on labeled datasets to process, understand, and learn from input patterns to arrive at desired outputs.
In ML, data annotation occurs before the information gets fed to a system. The process can be likened to using flashcards to teach children. A flashcard with the picture of an apple and the word “apple” would tell the children how an apple looks and how the word is spelled. In that example, the word “apple” is the label.
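As a rough sketch of what labeled data can look like in code, the snippet below pairs hypothetical image file names with their labels, the way the flashcard pairs a picture with the word "apple." Real projects use dedicated annotation tools and storage formats.

```python
# Hypothetical annotations: each record pairs an input (an image file) with a label,
# just as a flashcard pairs a picture with the word "apple".
annotations = [
    {"image": "img_001.jpg", "label": "apple"},
    {"image": "img_002.jpg", "label": "orange"},
    {"image": "img_003.jpg", "label": "apple"},
]

# A supervised ML model trains on (input, label) pairs built from these annotations.
training_pairs = [(record["image"], record["label"]) for record in annotations]
print(training_pairs)
```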
What is a Data Center?
A data center could be a corner in the office where the server sits. Or it can be an entire building that houses an array of computers, telecommunication devices and storage facilities. Whatever the case, it is the powerhouse of an organization's IT operations.
The data center is where critical information is kept, managed, and distributed. It is the key to the continuity of the business, and so must be heavily secured and protected from natural calamities.
What is Data Center Consolidation?
Data center consolidation is the process of merging information technology (IT) systems to form a more powerful and efficient system that runs on fewer resources. The process covers data center hardware, such as storage systems and servers, as well as technologies like applications and cloud computing. Companies can also consolidate locations as part of the process.
The concept is similar to downsizing your kitchen by getting rid of small appliances and replacing them with multipurpose equipment. For instance, some multipurpose cookers can air-fry, grill, pressure-cook, and slow-cook, so you don’t need four different appliances. Not only will you save money, but you can also maximize your space.
What is a Data Custodian?
A data custodian oversees the storage, transfer, and transport of data, ensuring that these processes adhere to the rules specified by the organization. Data custodians take care of data and the databases where it is stored.
A data custodian can also be referred to as a “database administrator.” It is one of the job titles critical to data governance or the effective and efficient management of information. One of the critical roles of data custodians in data management is restricting access to authorized users only.
While the role is often confused with that of data stewards, the two differ. Data stewards are subject matter experts responsible for selecting high-quality data and creating policies for managing it. On the other hand, data custodians see to it that the security policies and requirements specified by the data stewards are in place.
What is a Data Feed?
If you had a sweetheart who sent you love letters daily, you'd be constantly updated with what's going on in his or her life. A data feed works like that. It is a stream of data updates delivered to users regularly.
Examples of data feeds include your social media news feed, the RSS feed in blogs, and automated product updates sent to price comparison sites, search engines, and more.
What is Data Gravity?
Data gravity refers to the power of a data set to attract other information, applications, and services. The idea is similar to Newton’s Law of Gravitation, which states that the force of attraction between two objects is directly proportional to the product of their masses. The greater mass a piece of matter has, the greater its gravitational pull, and the more objects it can draw to itself.
In information technology (IT), this universal law translates to: The larger a data set is, the more it attracts other information and applications. Because of data gravity, applications, services, and even other information would naturally fall into the most massive data set.
What is Data in Motion?
Data in motion is one of three states of digital information flow or transfer from one place to another. It can occur within the same computer, such as when you copy a file from one folder to another. It can also describe the process of transferring files from one computer to another, like sending an email or obtaining a file from a portable flash drive.
Data in motion is also called “data in flight” and “data in transit.” The other states of data are data at rest (i.e., digital information that isn't accessed or used, such as backup and archived files) and data in use (i.e., data in word processors, spreadsheets, and other Office or database applications).
What is Data Independence?
Data independence is a database management system (DBMS) characteristic that lets programmers modify information definitions and organization without affecting the programs or applications that use it. Such property allows various users to access and process the same data for different purposes, regardless of changes made to it.
A database containing patient information, for example, could serve various purposes. A hospital’s billing department can use the data to obtain patients’ charges, discounts, and insurance details. On the other hand, the food services department would need the same data to see the patients’ nutritional requirements. Changes to the stored information, such as where the patient details are kept or how they are labeled, should not affect how each department uses the data.
What is Data Management?
Data management is the opposite of hoarding, or keeping things just for the sake of keeping them, without any logical order or purpose. It is a process that involves gathering data and making sure that it is correct, storing it, organizing and protecting it, and making it available to those who need it.
Some applications of data management include ensuring efficient storage, selecting the right file format to structure data, and preventing unauthorized access.
What is a Data Management Platform (DMP)?
A data management platform (DMP) allows users to gather, organize, and manage customer data. Its primary use is for data-driven marketing. It serves as the backbone that lets marketers gain usable and unique insights into customers and their respective buying journeys.
Using a data management platform helps optimize most marketing and advertising campaigns. In general, it stores customer data, such as mobile identifiers, cookie identifiers, and campaign data. It allows marketers to categorize customer segments better based on demographic information, past browsing and buying behaviors, location, and device, among others.
What is a Data Mesh?
A data mesh is a new approach to designing and developing data architectures. It lets users do away with the challenges involved in accessing data. It does so by creating a connectivity layer to control, manage, and support data access.
A data mesh stitches data stored in various devices and even organizations together. At its core, it makes data highly available, easily discoverable, secure, and interoperable with the applications that need access to it. It is not centralized and monolithic.
What is Data Mining?
Data mining is an activity that involves examining large amounts of data to find new information. It's very similar to how metals and minerals are mined. The miners sift through rock and dirt until they hit the valuable stuff.
But while mineral miners use excavation equipment, the data miner's tools include statistical methods, mathematics, databases, and spreadsheets. Instead of precious metals and minerals, they look for patterns and correlations between clusters of data.
What is Data Modeling?
Data modeling is the process of translating a complicated software system design into a simple diagram that uses text and symbols to show how data flows through it. This data structure ensures the optimal use of information to guide business intelligence processes, such as the creation of new software or the design and implementation of a database.
Since data modeling highlights what information is necessary and the processes it needs to go through, it serves the same purpose as an architectural building plan. In this analogy, it tells workers how each step is related to others to ensure a smooth construction process.
What is a Data Packet?
Data packets are units of information collected into one set for transmission through the Internet. Any bit of data that needs to be sent from one system to another must first be broken into smaller pieces to ease communication. Upon reaching the endpoint, these pieces get reassembled to become readable.
Data packets are used in Internet Protocol (IP)-based systems that communicate with one another over the Web. A data packet is also called a “block,” a “datagram,” or a “frame,” depending on the protocol used for its transmission.
To better understand what a data packet is, think of an image that you would like to send to a friend via iMessage. The image would be divided into small pieces before it gets sent, which happens in the background, of course. Your friend sees only the reassembled image afterward.
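The sketch below imitates that background process in Python: a message is chopped into fixed-size pieces and reassembled at the other end. It is only an illustration; real IP packets also carry headers with addresses, sequence numbers, and checksums, and the packet size here is arbitrary.

```python
# Illustrative sketch: split a message into fixed-size "packets" and reassemble them,
# mimicking what happens transparently at the network layer.
message = b"A picture sent over iMessage is split into pieces behind the scenes."
PACKET_SIZE = 16  # bytes per packet (hypothetical size for illustration)

packets = [message[i:i + PACKET_SIZE] for i in range(0, len(message), PACKET_SIZE)]
reassembled = b"".join(packets)

assert reassembled == message
print(f"{len(packets)} packets, reassembled correctly: {reassembled == message}")
```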
What is Data Perturbation?
Data perturbation protects information by adding “noise” to a database to render individual records unreadable to unauthorized users. Noise, in this case, could be anything that interrupts data transmission or communication by corrupting signal quality. Only authorized users can do away with the noise to understand the information sent.
Data perturbation is typically applied to electronic health records (EHRs) to protect sensitive information from prying eyes. Its use, however, is not limited to the healthcare industry.
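A minimal sketch of numeric perturbation is shown below, assuming a small column of ages and an arbitrary noise scale; real schemes, such as differential privacy, choose and calibrate the noise much more carefully.

```python
import numpy as np

# Add random noise to ages so individual records are masked while aggregate
# statistics stay roughly intact. Values and noise scale are made up.
rng = np.random.default_rng(seed=42)
true_ages = np.array([34, 51, 29, 62, 45])

noise = rng.normal(loc=0.0, scale=3.0, size=true_ages.shape)
perturbed_ages = true_ages + noise

print("Perturbed:", np.round(perturbed_ages, 1))
print("True mean:", true_ages.mean(), "Perturbed mean:", round(perturbed_ages.mean(), 1))
```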
What is Data Resilience?
Data resilience refers to the ability of any data storage facility and system to bounce back despite service disruptions, such as power outages, data corruption, natural disasters, and equipment failure. It is often part of an organization’s disaster recovery plan.
Data resilience commonly involves storing data in multiple locations so users and applications can still access it even if the primary location gets compromised. Various data resilience strategies, such as backup creation, replication, redundancy, and cloud storage, are available.
Data resiliency is part of the overall business resiliency program, which details everything about its business continuity strategies and can be determined via a business impact analysis.
What is Data Sanitization?
Data sanitization is the process of completely removing the data stored on a memory device to make it unusable. The process is deliberate, permanent, and irreversible. As such, any device that undergoes data sanitization will no longer have usable residual data. Even if you use the most advanced forensic tools on the sanitized devices, you can no longer recover the data.
As the name implies, data sanitization can be likened to cleaning your home and getting rid of all the things you no longer use. You can’t get back the stuff you threw or gave away, no matter what you do.
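The Python sketch below illustrates the overwrite-before-delete idea. It is not a certified sanitization method; real-world sanitization follows standards such as NIST SP 800-88 and must account for SSD wear leveling, hidden storage areas, and backup copies.

```python
import os

# Illustrative sketch only: overwrite a file with random bytes before deleting it.
def overwrite_and_delete(path: str, passes: int = 3) -> None:
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))  # replace the contents with random bytes
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)

# Example (destructive!): overwrite_and_delete("old_report.xlsx")
```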
What is Data Science?
Data science is the study of data. It involves finding better ways to record and store data, and also how to extract knowledge and insights from it. Data science uses a blend of programming, algorithm development, and data inference to solve problems that require complex analysis.
It is one of the most important fields of study today because all industries generate and use data.
What is a Data Set?
A data set is a collection of data that relates to a subject. In database terms, it corresponds to the contents of a single table. Each row is a member of the data set, and each column represents a specific variable that applies to the members.
For example, a list of students in a class will be the members of the data set. Columns can represent each of the tests they have taken during the semester, with each cell holding the corresponding test score for each student. The resulting table is a data set representing the students' test scores for the semester.
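Here is that same example as a small pandas DataFrame; the student names and scores are invented.

```python
import pandas as pd

# Each row is a member (a student), each column a variable (a test).
scores = pd.DataFrame(
    {
        "test_1": [88, 74, 91],
        "test_2": [79, 82, 95],
        "test_3": [93, 70, 89],
    },
    index=["Alice", "Bob", "Carol"],
)
print(scores)
print("Semester average per student:")
print(scores.mean(axis=1))
```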
What is Data Vaulting?
Data vaulting refers to the process of sending organizational data offsite. It aims to protect the data from theft, hardware failures, and other threats.
Data vaulting is commonly practiced by organizations that handle highly sensitive data that is prone to misuse and abuse. Third-party storage providers that offer data vaulting services compress, encrypt, and transmit data to a remote vault that uses remote backup services (RBSs).
Data vaulting is like keeping your most important personal files like passports, stock certificates, and deeds in a safety deposit box instead of your safe at home. That way, even if thieves break into your house and safe, your documents will remain secure.
What is Data Visualization?
Data visualization is a method of graphically rendering data. Graphs and charts are very good examples of this. Representing the information this way allows us to get a visual perspective of how data relate to each other and helps us better understand the situation presented. It is an engaging way to communicate information.
For example, knowing that your company's market share is a measly 10% might make you anxious. But a pie chart showing that there are hundreds of competitors might help you appreciate that you have the largest share of the market and that your company is in a highly competitive industry.
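A minimal matplotlib sketch of that pie chart follows; the market-share figures are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical market-share figures: a 10% slice looks small on its own,
# but the chart of the whole market puts it in context.
shares = {"Your company": 10, "Competitor A": 7, "Competitor B": 6, "Hundreds of others": 77}

plt.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.0f%%")
plt.title("Market share (illustrative data)")
plt.show()
```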
What is Data Wrangling?
Data wrangling refers to collecting, choosing, and formatting data to answer an analytical question. It is also called “data cleaning” or “munging,” and data experts often spend more time on it than on actual exploration and modeling.
Data wrangling may include further munging, visualization, aggregation, and statistical model training. It follows general steps that start with extracting raw data from a source. The data is then sorted or parsed into a predefined structure. After that, the resulting content is stored in a data sink for storage and future use.
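The pandas sketch below walks through those general steps on a tiny, made-up extract: parse the raw text into a structure, clean it, and store the result in a data sink (here, a CSV file).

```python
import pandas as pd
from io import StringIO

# Hypothetical raw extract with one messy row.
raw = StringIO("name,signup_date,amount\nAna,2023-01-05,19.99\nBen,not a date,\nCara,2023-02-11,35.50\n")

df = pd.read_csv(raw)                                   # parse into a predefined structure
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df = df.dropna(subset=["signup_date", "amount"])        # drop rows that failed cleaning
df.to_csv("clean_signups.csv", index=False)             # store in the data sink
print(df)
```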
What is Data-Driven Programming?
Data-driven programming is a programming model characterized by program statements that describe the data instead of a sequence of actions. For example, an email filtering system may be programmed to block emails from malicious email addresses.
The code would only consist of a few lines, including a command that calls the database containing the list of malicious email addresses. The database will need to be updated regularly with new dangerous emails, but the email filtering program’s code will remain the same.
In contrast, some programming paradigms require developers to update the code every time data is added or updated.
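A minimal sketch of that email-filter example follows; the addresses are fictitious, and in practice the blocklist would live in a database or file rather than a hard-coded set.

```python
# The program logic stays the same while the blocklist data changes.
blocked_senders = {"spam@badmail.example", "phish@fraud.example"}  # the "data"

def should_block(sender: str) -> bool:
    # The code never changes when new addresses are added to the data.
    return sender.lower() in blocked_senders

print(should_block("spam@badmail.example"))  # True
print(should_block("friend@mail.example"))   # False
```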
What is a Database?
A database is a collection of data. Nowadays, it has come to mean an electronic file of data organized for easy retrieval. It can be arranged in a tabular format and contain multiple tables, each with different fields.
Imagine a daily planner or the bank statements you have been consistently receiving, your file of recipes or a doctor's filing cabinet of his patients' medical records. These are all examples of databases.
What is Datafication?
Datafication is a current technological trend that aims to transform most aspects of a business into quantifiable data that can be tracked, monitored, and analyzed. It refers to the use of tools and processes to turn an organization into a data-driven enterprise.
The term “datafication” was introduced by Kenneth Cukier and Viktor Mayer-Schönberger in 2013 to refer to transforming invisible processes into data that companies can use to optimize their business.
What is Deep Data?
Deep data refers to big data that is of high quality, relevant, and actionable. Data experts have processed it so that all employees in an organization can use it.
Deep data is typically broken down into smaller sections for more efficient handling. All unnecessary or unusable information is removed, ensuring relevance.
What is Disk Mirroring?
Disk mirroring is a preservation technique that duplicates data to other hard drives as it is written. For example, when you save a picture into a computer, the data is automatically duplicated to other hard drives connected to the same computer or network. If one disk gets corrupted, you can still retrieve the file from the mirrored hard drives. The technique helps ensure there is always a backup of any data written to a disk.
In disk mirroring, the hard drives are connected to one another through a disk controller card, an electrical computer component that enables hard drives to communicate with each other. Disk mirroring is also called “Redundant Array of Independent Disks 1 (RAID 1).”
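The Python sketch below only mimics the idea at the file level, writing the same bytes to two hypothetical locations; actual RAID 1 mirroring is done by the disk controller or operating system, not by application code.

```python
import os

# Illustration of the mirroring idea: every write lands in two places at once.
def mirrored_write(data: bytes, primary_path: str, mirror_path: str) -> None:
    for path in (primary_path, mirror_path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as disk:
            disk.write(data)  # the same bytes land on both "disks"

mirrored_write(b"family photo bytes", "disk_a/photo.jpg", "disk_b/photo.jpg")
```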
What is Quantile Normalization?
Quantile normalization, in the field of statistics, is a technique that makes two distributions identical in statistical properties. The two distributions in this case are called the test and reference distributions. To make them statistically identical, the highest entry in the test distribution is aligned with the highest entry in the reference distribution, followed by the next highest, and so on.
While it sounds complex, you can think of it as two lines of five students arranged by height (i.e., shortest to tallest). The first line could have Ross, Chandler, Joey, Gunther, and Frank, and the second could have Phoebe, Monica, Rachel, Ursula, and Janice. To quantile-normalize the lines, Ross and Phoebe (the shortest male and female, making them identical in the statistical property height) will be the first test and reference subjects, respectively, followed by Chandler and Monica, and so on.
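In code, the rank-matching works as in the NumPy sketch below; the test and reference values are made up.

```python
import numpy as np

# Force the test distribution to take on the reference distribution's values,
# matched by rank (smallest to smallest, largest to largest).
reference = np.array([160, 165, 170, 175, 180])  # e.g., heights in cm
test      = np.array([150, 178, 162, 171, 169])

ranks = test.argsort().argsort()          # rank of each test entry (0 = smallest)
normalized = np.sort(reference)[ranks]    # assign the reference value of the same rank

print(normalized)  # [160 180 165 175 170]
```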
What is Metadata?
Metadata, simply put, is data about other data.
Imagine an invoice for a purchase your company has made. Think of the line items in the invoice as the data. This tells you what particular item was purchased, the quantity, the unit cost, etc. The metadata will be the stub, or the header portion, of the invoice. It describes what the line items refer to, who was involved in the transaction and when it happened.
Examples of metadata are file size, date created, who created it, and description of contents. They give you information on what to expect before accessing the data itself.
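A quick Python sketch of reading file metadata follows; the file is created on the spot so the example is self-contained.

```python
import os
import datetime

# Create a small file so the example is self-contained.
with open("example.txt", "w") as f:
    f.write("hello")

# Read the file's metadata (size, last-modified time) without touching its contents.
info = os.stat("example.txt")
print("Size in bytes:", info.st_size)
print("Last modified:", datetime.datetime.fromtimestamp(info.st_mtime))
```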
What is MPEG-21?
MPEG-21, which stands for “Moving Picture Experts Group 21,” is a standard for the use and delivery of multimedia. It was created to have a common scheme for putting different multimedia elements and resources together. MPEG-21 is also known as “International Organization for Standardization (ISO) 21000” or the “Multimedia Framework.”
Think of MPEG-21 as a guidebook for developers and users of multimedia content. It describes how multimedia files should be created, distributed, and used. It also defines the characteristics that multimedia files should have.
What is MPEG-4?
MPEG-4 stands for “Moving Picture Experts Group 4,” a standard for audio and video coding compression. The method reduces the size of an audio or a video file while retaining its fidelity or quality.
MPEG-4 content is commonly stored in the “MP4” container format, which can be read by computer applications such as QuickTime and VLC. MPEG-4 video compression is also widely used on Blu-ray discs and in online streaming.
Think of MPEG-4 as a user manual that tells audio and video recorders and editors how to compress files without making their quality suffer for faster online streaming or CD distribution.
What is Packet Switching?
Packet switching refers to the method of dividing data into packets to make its transmission over a network faster and more efficient. The concept was first developed in the early 1960s, but the term wasn’t coined until a few years later. Since then, packet switching has become a fundamental concept behind how the Internet works.
The concept is similar to sending a giant puzzle to a friend via the postal service. Since the puzzle is too large and may cause congestion, the post office staff decides to mail its pieces separately so transport would become easier and faster. In the same way, anything sent over a network is divided into data packets to manage the traffic better.
What is Raw Data?
A diamond needs to be processed before it can exhibit its brilliant qualities.
Raw data is like a rough diamond you need to refine before it can be useful. Also called “atomic data” or “primary data,” raw data is data that has just been collected from various sources and is still disorganized, so it may not provide any clear insight. It has yet to undergo manual or computer processing to serve any useful purpose.
What is Responsible Data Science?
Responsible Data Science is a consortium in which leading Dutch research groups from across disciplines join forces to address urgent and challenging problems in how data is used, namely the misuse of data and mistrust of data science results. If left unaddressed, these issues are bound to lead to a “big data winter.”
The Responsible Data Science consortium provides experts with ideas for realizing scientific breakthroughs that lead to the positive use of data.
What is a Sample?
A sample is a small portion of the entire data set. It is used to represent the larger group so that analyzing the data can be done much more quickly and easily. A sample is selected according to a defined procedure and is used in statistical methodology and quantitative research.
Let's say you want to know how the inhabitants of a city feel about tattoos. You could talk to each and every one of them, but this would take too much time and effort. Instead, you can take a small representative group of people and work on this sample of the population.
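A simple random sample can be drawn in a few lines of Python, as in the sketch below; the population size and sample size are arbitrary.

```python
import random

# The "population" here is just ID numbers standing in for city residents.
random.seed(7)  # for a reproducible illustration
population = list(range(1, 100_001))       # 100,000 residents
sample = random.sample(population, k=500)  # survey only 500 of them

print(sample[:10])
```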
What is Space Shifting?
Space shifting, also commonly referred to as “place shifting,” is the process of moving digital media, such as music or movies, from one storage device or format to another. The practice is often frowned upon because it can be used for copyright infringement.
For example, space shifting can take the form of viewing a TV series from a Wi-Fi-connected tablet or converting someone else’s CD tracks into MP3s so you can listen to them for free.
Related processes include time shifting, where a radio broadcast is recorded and listened to later, and format shifting, where a media file is converted into a different format.
What is Statistical Data Analysis?
Statistical data analysis is the process of scrutinizing and interpreting the logical implications of a data sample. It uses statistical tools and mathematical computations to study the relationships between the sets of information being analyzed. It also looks for hidden patterns that can be used to predict certain outcomes.
In simple terms, without statistical data analysis, data is merely a jumble of numbers, words, and text that make no sense or serve no purpose. The process can also be likened to a person who is carefully piecing together the different parts of a puzzle, trying different combinations before finally succeeding in creating a meaningful picture.
What is Storage Tiering?
Storage tiering is the process of assigning data to different classes of storage media in order to optimize the use of available storage resources. It also enables effective ways to back up data, save on costs, and employ the best storage technology for each kind of data.
Users apply storage tiering to the devices they keep data in so they can handle information volume growth while incurring as little additional cost as possible. Various kinds of storage, including cloud-based, object, and distributed storage, can benefit from the strategy.
What is Storage Visualization?
Storage visualization is the process of visually displaying computer storage systems for more straightforward process navigation. It allows users to browse storage systems based on a graphical timeline of events. The representation helps them see what happens to a storage system every time it is presented with a specific event. That way, it is easy to display and compare structures, depending on time and predefined templates.
In very simple terms, storage visualization helps users view events that happen to a storage system to see how it relates to a particular configuration.
What is Synthetic Data?
Synthetic data refers to any production data applicable to a given situation that is not obtained by direct measurement. Production data, in this definition, is information that professionals persistently store and use for business processes.
Simply put, synthetic data is any kind of artificially generated data rather than raw information gathered from an event.
Computer-generated data is a common example of synthetic data, and anonymized data is considered one of its subsets.
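As a minimal sketch, the NumPy snippet below generates fake "purchase amounts" from a chosen distribution instead of measuring real transactions; the distribution and its parameters are arbitrary.

```python
import numpy as np

# Computer-generated (synthetic) purchase amounts drawn from a chosen distribution.
rng = np.random.default_rng(seed=0)
synthetic_purchases = rng.lognormal(mean=3.0, sigma=0.5, size=5)

print(np.round(synthetic_purchases, 2))
```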
What is Transcoding?
Transcoding refers to the process of converting a digital audio or video file into a different file format. The goal is to make the file accessible to a wide range of devices and users. For example, the WMV video file format is mostly compatible with Windows applications only since Microsoft designed it. Converting or transcoding the file into MP4 format will make it accessible to more devices, applications, and even browsers.
Transcoding allows users with fast Internet connections to view high-quality videos while making the same video available to people with slower connections, albeit at a lower quality.