Data is everywhere these days, and training machine learning (ML) models can unlock tremendous value. However, companies must be more aware of where their data sits, what would happen if it were compromised, and who has access to it.
The data used to train ML models may contain sensitive information, which can be valuable to hackers and other nefarious entities. Not knowing where your training data is means you don't know how it's stored, who has access to it, and whether it's secure. All of that puts your training data at risk of a potentially serious breach.
The Hidden Risks of Poor ML Data Management
The challenge with data management is that it’s not always clear what data experts can and cannot do with training data.
It's an even bigger problem if the data is publicly available. After all, there are plenty of reasons why data experts might want to use other people's datasets for projects, such as benchmarking. However, this doesn't mean that all public datasets are created equal. And it also doesn't mean that all data can be used in ML research.
For example, an expert may use one dataset to test how well a model performs, but train their own models from scratch on a different dataset entirely. These two uses constitute different degrees of "reuse" (in ML parlance), and each has different implications for an organization's or individual's ability to control access to the sensitive information contained within that data.
How to Track a Dataset’s Lineage
A data scientist must always know where their training data resides and where it came from. This knowledge is critical to understanding the data's context and how to use it.
It's important to track provenance from the original data source all the way to the point where the data is used for training. Provenance aspects to track include metadata such as the dataset's name, description, version information, and storage location.
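The metadata fields above can be captured in something as simple as a small provenance record. The sketch below is a minimal, hypothetical example (the class name, fields, and sample values are illustrative, not part of any particular tool):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

@dataclass
class DatasetProvenance:
    """Minimal provenance record for a training dataset."""
    name: str
    description: str
    version: str
    storage_location: str
    source: str
    # Timestamp the record at creation so audits can order lineage events.
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Stable hash of the identifying fields, handy for detecting
        when a dataset's name, version, or location changes."""
        key = f"{self.name}:{self.version}:{self.storage_location}"
        return hashlib.sha256(key.encode()).hexdigest()

# Example record for a hypothetical dataset
record = DatasetProvenance(
    name="customer-churn-train",
    description="Labeled churn records, Q3 export",
    version="1.2.0",
    storage_location="s3://ml-data/churn/v1.2.0/",
    source="internal CRM export",
)
print(record.fingerprint()[:12])
```

In practice, records like this would be appended to a lineage log every time the dataset is copied, transformed, or handed to another team, so each training run can point back to an exact version.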
Over time, datasets often get reused and expanded, sometimes by other teams or even outside entities. Having end-to-end lineage of this evolution makes it easy to trace the data back when needed for regulatory compliance.
Tracking lineage may also be useful if there is a need to perform an audit on a dataset due to business strategy or policy changes. If your entire dataset is self-contained, meaning it’s all within your organization’s control, then tracking its lineage is easy. But if your dataset is gathered, processed, stored, queried, or modified outside your organization, manual lineage tracking just isn’t feasible.
There are many automated digital tools designed to continuously monitor, tag, and validate your dataset so you can always be sure where it comes from, where it goes, and where it’s stored.
How to Make Sure Datasets Stay Secure
A perfect example is training an artificial intelligence (AI) system: as training progresses, the system gets smarter and better at performing its task.
But in doing so, the AI system is learning more and more about what an ideal outcome looks like. Successful training can be good for the task yet bad for the data experts, because the resulting model may memorize and eventually leak sensitive information they weren't expecting to expose.
According to a recent study by IBM, the average cost of a data breach was $4.24 million in 2021. The report also found that financially motivated attacks are increasing in frequency. But what can data experts do to secure their training data?
Understand What’s at Hand
Data scientists need to take stock of their available datasets before starting any analysis or ML project. They should look at each dataset's location and how the data was generated. They should also be aware of the dataset's intended use and whether it contains sensitive information.
Identify All Stakeholders Impacted by the Dataset
It’s important to know what business unit or team created the dataset and who owns it. Are there any legal ramifications if it’s shared outside the organization?
Understand the Security Risks for Each Dataset
Data experts must understand the security risks for each dataset before doing any work on it. They should know what information could be compromised if lost and determine how that could affect the company.
Enable Data Masking
Data masking is the process of hiding sensitive data by transforming it into another value. The masked data remains usable for applications and analysis, but all personally identifiable information (PII) is protected.
Data masking doesn't require changes to existing workflows or systems and is relatively easy to implement. And since the original values are stored in a separate location, there's no risk of losing data.
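A common masking approach is to replace identifying values with tokens while preserving enough structure for analysis. The sketch below is a minimal, hand-rolled illustration (the function names, salt, and sample record are hypothetical; production systems typically use a dedicated masking tool with proper key management):

```python
import hashlib
import re

def mask_email(email: str, salt: str = "demo-salt") -> str:
    """Replace the local part of an email with a salted hash token,
    keeping the domain so domain-level analysis still works."""
    local, _, domain = email.partition("@")
    token = hashlib.sha256((salt + local).encode()).hexdigest()[:10]
    return f"user_{token}@{domain}"

def mask_ssn(ssn: str) -> str:
    """Keep only the last four digits of a US SSN-style identifier."""
    digits = re.sub(r"\D", "", ssn)
    return f"***-**-{digits[-4:]}"

# Hypothetical training record with PII fields
row = {"email": "jane.doe@example.com", "ssn": "123-45-6789", "age": 42}
masked = {
    "email": mask_email(row["email"]),
    "ssn": mask_ssn(row["ssn"]),
    "age": row["age"],  # non-sensitive fields pass through unchanged
}
print(masked)
```

Because the same input always yields the same token, masked values remain usable as join keys and grouping columns, which is what keeps the data useful for model training and analysis.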
Data Architecture Best Practices
There are certain best practices that every data scientist should follow to make sure the information stays secure within their organization.
For example, data scientists should only use approved storage solutions for datasets. They should be careful about who has access to those datasets.
In addition, data experts should make sure the IT department has approved every piece of software they use.
All employees should practice proper password hygiene. Stringent access control helps reduce the risk of employees leaving the organization open to cyber attacks.
Finally, if you're outsourcing data labeling and annotation services, make sure the company you're hiring has adequate security measures in place. That could include screening the annotators involved in the project and providing a secure work environment.
To Recap: Security Starts by Understanding Data Lineage
It’s incredibly important for anyone involved in data modeling to take stock of where their ML data is stored.
Anyone working with data should be able to see where it resides, who has access to it, which model training or inference processes use it, and how recently those processes were run.
That way, it is possible to avoid security threats while also boosting compliance with industry regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).