Data perturbation protects information by adding “noise” to a database so that individual records are unreadable to unauthorized users. Noise is a term borrowed from signal processing, where it means anything that corrupts signal quality and interrupts transmission or communication; here it refers to deliberate distortions of the data. Only authorized users can remove the noise to understand the information sent.
Data perturbation is typically applied to electronic health records (EHRs) to protect sensitive information from prying eyes. Its use, however, is not limited to the healthcare industry.
Apart from hospitals and other healthcare service providers, any organization that uses industrial control systems (ICS) can apply data perturbation to protect the data those systems transmit. Examples of ICS include the machines that run industrial processes in nuclear power plants, energy plants, and similar facilities. A cyber attack on such systems could, for instance, shut down an entire electric grid and cause a massive blackout.
What Are the Types of Data Perturbation?
Data perturbation comes in two basic types: the probability distribution approach and the value distortion approach. How do these differ?
The probability distribution approach
The probability distribution approach replaces the original data with values drawn from a sample of the same distribution or from the distribution itself. In a database that contains patients’ names, addresses, phone numbers, and medical histories, for example, the sender can scramble the patients’ names so they no longer match the other details. The sender then gives the code to unscramble the names only to the intended receiver. That way, even if the database gets stolen or lost, nobody but authorized users can decipher its actual content.
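The name-scrambling example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample names, and the integer `secret_key` are hypothetical, and Python's `random` module is not cryptographically secure, so a real deployment would need a proper keyed permutation.

```python
import random

def scramble_column(values, key):
    """Shuffle one column with a permutation derived from the key."""
    order = list(range(len(values)))
    random.Random(key).shuffle(order)          # same key -> same permutation
    return [values[i] for i in order]

def unscramble_column(scrambled, key):
    """Regenerate the same permutation and put every value back in place."""
    order = list(range(len(scrambled)))
    random.Random(key).shuffle(order)
    original = [None] * len(scrambled)
    for new_pos, old_pos in enumerate(order):
        original[old_pos] = scrambled[new_pos]
    return original

# Hypothetical patient names and shared secret, for illustration only.
names = ["Alice Reyes", "Ben Cho", "Carla Diaz", "Dev Patel"]
secret_key = 2024

scrambled = scramble_column(names, secret_key)
restored = unscramble_column(scrambled, secret_key)
```

Because the names are only rearranged within their own column, the column keeps the same set of values, while each row no longer pairs a name with the right details. Only someone holding `secret_key` can rebuild the original pairing.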
The value distortion approach
The value distortion approach, meanwhile, perturbs values using additive noise or other randomization processes. It uses decision tree classifiers to assign a noise type to each database element that meets specific criteria. Each data point can thus have several kinds of noise added to it.
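A minimal sketch of value distortion might look like the following. It makes several assumptions that are not in the text above: hand-written rules in `noise_for` stand in for a trained decision tree classifier, the field names and noise ranges are invented, and the noise is kept to whole numbers so that it can be removed exactly by replaying the same keyed random sequence.

```python
import random

def noise_for(field, rng):
    """Hand-written rules standing in for a decision tree that picks the noise type."""
    if field == "age":
        return rng.randint(-3, 3)                           # uniform integer noise
    if field == "systolic_bp":
        return round(rng.gauss(0, 8)) + rng.randint(-2, 2)  # two noises stacked
    return 0                                                # other fields untouched

def apply_noise(records, key, sign=1):
    """Add (sign=1) or remove (sign=-1) the keyed noise on every integer field."""
    rng = random.Random(key)
    result = []
    for record in records:
        changed = dict(record)
        for field in sorted(record):        # fixed order so the noise replays identically
            if isinstance(record[field], int):
                changed[field] = record[field] + sign * noise_for(field, rng)
        result.append(changed)
    return result

# Hypothetical records and key, for illustration only.
records = [
    {"name": "Alice Reyes", "age": 71, "systolic_bp": 142},
    {"name": "Ben Cho", "age": 34, "systolic_bp": 118},
]
perturbed = apply_noise(records, key=99)
restored = apply_noise(perturbed, key=99, sign=-1)
```

The rules depend only on the field name, not on the value being perturbed, which is what lets an authorized user regenerate exactly the same noise from the key and subtract it.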
What Do You Need to Decipher Perturbed Data?
Companies that use data perturbation need to apply data mining to return perturbed data to its original form. Data mining typically requires specialized software that finds patterns in massive data sets to make them readable or understandable.
Data Perturbation Tool #1: Weka 3.8
In 2019, researchers Ajmeera Kiran and Dr. D. Vasumathi presented to their peers a data perturbation technique that uses Weka 3.8, an open-source machine learning workbench from the University of Waikato. The technique preserves the confidentiality of data through a random swapping method. Here is how Weka 3.8 can help users scramble confidential information:
With Weka 3.8, sensitive values in the original data set are swapped with one another to preserve privacy. In the last step, the algorithm determines how the data set was randomized so that the recipient knows how to return it to its original state.
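The general idea of random swapping can be illustrated with plain Python. This sketch does not use Weka or reproduce the researchers' algorithm; `swap_perturb`, `swap_restore`, and the swap log are hypothetical names chosen to show how a record of the randomization lets the recipient undo it.

```python
import random

def swap_perturb(values, n_swaps, rng=None):
    """Randomly swap pairs of values and log each swap so it can be undone."""
    rng = rng or random.Random()
    data = list(values)
    log = []
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(data)), 2)   # two distinct positions
        data[i], data[j] = data[j], data[i]
        log.append((i, j))
    return data, log

def swap_restore(perturbed, log):
    """Undo the swaps in reverse order to recover the original data."""
    data = list(perturbed)
    for i, j in reversed(log):
        data[i], data[j] = data[j], data[i]
    return data

# Hypothetical sensitive column, for illustration only.
diagnoses = ["asthma", "diabetes", "hypertension", "migraine", "anemia"]
swapped, swap_log = swap_perturb(diagnoses, n_swaps=10, rng=random.Random(5))
recovered = swap_restore(swapped, swap_log)
```

In this sketch the swap log plays the role of the information the recipient needs. Anyone who obtains the swapped column without the log sees valid values attached to the wrong rows.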
Data Perturbation Tool #2: iHiMod-Perturb
Another data perturbation tool that healthcare organizations can use is iHiMod-Perturb, also publicized in 2019. The algorithm works much like Weka 3.8 in that it keeps sensitive information from landing in the hands of malicious users. It is more complicated than Weka 3.8, though, because it uses a decision tree.
While proponents of data perturbation vouch for its effectiveness and ease of use, critics say that random additive noise can be filtered out. When that happens, data breaches and privacy compromises can still occur. But given a choice between adding another layer of protection to confidential information through data perturbation and doing nothing, we’re more likely to tell organizations to go for it.
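The critics' point about filtering can be seen in a toy example. Assume, purely for illustration, that the same sensitive value is released many times with independent zero-mean Gaussian noise; the value, noise level, and number of releases below are made up.

```python
import random
import statistics

rng = random.Random(7)
true_value = 120.0          # hypothetical sensitive measurement

# Each release adds fresh zero-mean noise with a standard deviation of 10.
releases = [true_value + rng.gauss(0, 10) for _ in range(10_000)]

# Averaging cancels most of the noise, so the estimate lands near the true value.
estimate = statistics.mean(releases)
```

A single release can be off by a wide margin, but the average of many releases is a close estimate of the original value. This is one reason additive noise alone may not be enough when an attacker can observe repeated perturbed copies of the same data.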