The rise of machine learning algorithms has created a massive demand for data, but acquiring large quantities of data can be time-consuming, expensive, or impossible.
One of the most pressing issues in data acquisition is the stringent regulatory landscape created by governments around the world. Maintaining compliance while handling sensitive data has become more challenging, and as a result, some companies are looking to synthetic data as a potential solution.
What is Synthetic Data?
Synthetic data is artificial data generated from the statistical properties of an original dataset containing sensitive elements. A synthetic dataset is designed to retain the important statistical properties of the original, but without any identifying information that could be traced back to actual people.
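To make the idea concrete, here is a deliberately minimal sketch: fit a normal distribution to a single numeric column and sample new values from it. Production generators model joint distributions across many columns; the column name, values, and the normality assumption here are purely illustrative.

```python
import random
import statistics

def synthesize_column(values, n, rng):
    """Sample n synthetic values from a normal distribution fitted to the
    original column's mean and standard deviation (a strong assumption)."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(42)
original_ages = [34, 29, 41, 52, 38, 45, 31, 47]  # hypothetical sensitive column
synthetic_ages = synthesize_column(original_ages, 1000, rng)

# The synthetic column tracks the original's mean and spread
# without reproducing any individual record.
print(statistics.mean(original_ages), statistics.mean(synthetic_ages))
```

The synthetic values follow the same distribution as the originals, so aggregate analysis still works, but no row corresponds to a real person.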
Opting for Synthetic Data Over De-identification
Privacy regulations are a major barrier to accessing sensitive data, and many regulations require that personal data be de-identified before it can be commercialized. De-identification requires all personally identifiable information (PII) to be completely removed before a record can be deemed anonymous for use.
Unfortunately, stripping records of all PII often results in the loss of valuable context. This loss of context reduces utility in several ways: it makes it more difficult to match two datasets using common identifiers, reduces the accuracy of computational tasks, and eliminates potentially correlative features.
De-identified data is therefore less than optimal in certain situations. It is also not fully secure, as it remains susceptible to re-identification attacks. Rather than dealing with vulnerable, lower-quality de-identified data, many data scientists working with machine learning models have opted to train on synthetic data instead.
Typical Synthetic Data Generation
While there are many approaches to generating synthetic data, neural networks are commonly used, and one of the most popular approaches is the Generative Adversarial Network (GAN).
This method pits two neural networks against one another. During training, a generator network creates synthetic data from random noise. Real training data and the generated synthetic data are then passed to a discriminator network that attempts to distinguish the real data from the synthetic data.
The discriminator's results are fed back to the generator, which adjusts its parameters to produce more convincing synthetic data. Over time, the generator becomes more adept at producing data that closely resembles the original training data. Once the discriminator can no longer reliably distinguish the synthetic data from the real data, the generator is considered trained and can be used to produce the synthetic dataset.
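The adversarial loop above can be sketched in miniature. This toy example, which is not how production GANs are built, uses a one-parameter-pair generator and a logistic discriminator on a single numeric feature, with hand-derived gradient updates; the "real" data, learning rate, and step count are all arbitrary assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)

def sigmoid(t):
    # Clamp to avoid overflow in math.exp for extreme inputs.
    t = max(-60.0, min(60.0, t))
    return 1.0 / (1.0 + math.exp(-t))

# "Real" sensitive data: a single numeric feature centered at 5.0 (toy assumption).
def sample_real():
    return rng.gauss(5.0, 1.0)

# Generator: two learnable parameters mapping noise z to a sample.
g_mu, g_sigma = 0.0, 1.0
# Discriminator: logistic classifier D(x) = sigmoid(w*x + b).
w, b = 0.1, 0.0
lr = 0.05

for _ in range(2000):
    z = rng.gauss(0.0, 1.0)
    fake = g_mu + g_sigma * z
    x = sample_real()

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * x + b), sigmoid(w * fake + b)
    w += lr * ((1.0 - d_real) * x - d_fake * fake)
    b += lr * ((1.0 - d_real) - d_fake)

    # Generator step: adjust parameters so the updated discriminator
    # scores the fake sample higher (gradient of log D(fake)).
    d_fake = sigmoid(w * fake + b)
    g_mu += lr * (1.0 - d_fake) * w
    g_sigma += lr * (1.0 - d_fake) * w * z

print(f"learned generator mean: {g_mu:.2f}")  # drifts toward the real mean
```

The key structural point is the alternation: the discriminator improves at telling real from fake, and the generator improves at fooling the current discriminator, until the two reach a rough equilibrium.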
While this can be a very effective way to produce synthetic data, GANs are not without challenges. For one, it is difficult to know when a generator has been sufficiently trained to produce high-quality synthetic data. The approach also struggles to capture outliers and unusual data points, and a GAN must be specially calibrated for the type of data at hand, such as tabular data or images.
The Benefits of Synthetic Data Generation
One of the greatest benefits of using synthetic data is having the ability to commercialize or leverage sensitive data while significantly reducing privacy and security concerns.
The General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and other regulations make the use of sensitive data extremely difficult. Many companies want to commercialize or repackage their sensitive data without running afoul of these rules.
Synthetic data generation offers one potential way to unlock more value from sensitive data, especially for machine learning model training. Because a synthetic dataset approximates the real dataset but contains no PII itself, it retains some of the original data's utility without exposing sensitive information.
Additionally, the decreased liability associated with synthetic data means more opportunities to use cloud storage tools and services. The ability to compute in the cloud removes a significant number of pain points from the process and increases the efficiency of data partnerships.
Limitations of Synthetic Data
While the ability to generate massive amounts of useful data seems promising, synthetic data is hardly a silver bullet. Ideally, synthetic data would preserve privacy while being statistically indistinguishable from the dataset on which it is based. In practice, however, private information can pass from the original dataset into the synthetic one. If the original dataset contains outliers that carry over into the synthetic dataset, these unusual data points can easily be identified as original data. Some generation methods remove outliers altogether, but this reduces the utility of datasets in which the outliers carry meaningful information. In fact, there are many cases, especially in healthcare, where the outliers are the most interesting examples.
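A simple way to spot the most blatant form of this leakage is to check whether any original record reappears verbatim in the synthetic set. A minimal sketch, with entirely hypothetical records:

```python
# Hypothetical original and synthetic records (tuples of column values).
original = [(34, 52000), (29, 48000), (41, 61000), (97, 900000)]  # last row is an outlier
synthetic = [(33, 51000), (30, 47500), (97, 900000), (40, 60200)]

# Exact-match leakage: any synthetic record identical to an original one.
leaked = set(original) & set(synthetic)
print(leaked)  # → {(97, 900000)}
```

Real-world leakage checks also need to catch near-matches, since a record perturbed by a trivial amount is just as identifying, but even this exact-match test catches the outlier that survived generation unchanged.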
The systems used to generate synthetic data are also vulnerable to attack. If attackers gain access to a GAN, they can mount a model inversion attack, which seeks to reconstruct training data from a trained model's parameters. Some systems tightly restrict access to the model or to the synthetic data, but even this approach remains vulnerable to model inversion.
Even putting privacy concerns aside, synthetic data has significant applicability and efficacy issues. One major issue is the extent to which data scientists can influence the dataset during generation. Producing synthetic data requires a number of preprocessing steps, and any assumptions made during these manual steps can strongly influence how the original data is transformed. The resulting biased data would then fail to accurately represent the original dataset, undermining the purpose of synthetic data.
Synthetic data also presents quality control challenges. Because each original dataset can be unique, a proper quality check procedure must be developed for each new dataset. If a third party generates the synthetic data, creating that quality check would likely require sharing information about how the dataset will be used, which may mean passing intellectual property from the client to the synthetic data provider.
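One basic building block for such a quality check is comparing summary statistics column by column between the original and synthetic datasets. A minimal sketch, where the function name, sample columns, and tolerance are all illustrative choices:

```python
import statistics

def summary_gap(original_cols, synthetic_cols):
    """Return the largest relative gap in per-column mean and standard
    deviation between original and synthetic columns (a crude quality score)."""
    gaps = []
    for o_col, s_col in zip(original_cols, synthetic_cols):
        for stat in (statistics.mean, statistics.stdev):
            o, s = stat(o_col), stat(s_col)
            gaps.append(abs(o - s) / (abs(o) or 1.0))
    return max(gaps)

orig_cols = [[34, 29, 41, 52], [50, 48, 61, 55]]    # hypothetical original columns
synth_cols = [[33, 31, 40, 50], [49, 47, 63, 54]]   # hypothetical synthetic columns
print(summary_gap(orig_cols, synth_cols) < 0.3)     # passes a loose tolerance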
TripleBlind Offers a Better Approach
While synthetic data can be a useful tool for training machine learning algorithms, TripleBlind's privacy enhancing technology allows these systems to be trained on real, original data without fidelity or compliance concerns.
Our Blind Compute technology offers irreversible encryption that ensures the secure, safe sharing of real data. Our technology also provides several key advantages:
- Retains data quality. With our solution, the data being used is original data and so there are no concerns related to the loss of key data points or unique outliers.
- Supports better modeling and analysis. We provide access to a larger number of more diverse datasets. This enables superior analysis and training of machine learning models than would be possible using synthetic data.
- Supports the creation of auditable digital rights for data usage. Privacy enhancing computation prevents unauthorized use by a third party; by comparison, once a third party obtains synthetic or real data, unauthorized use is difficult to prevent.
In addition to facilitating machine learning, our Blind Compute technology can be applied to a wide range of analytics use cases in industries like finance and healthcare. Users of TripleBlind’s technology can leverage sensitive data to unlock new insights while remaining compliant.
If your company is currently in the market for a next-generation privacy-enhancing solution, please contact us today to schedule a personalized demo of our revolutionary technology.
TripleBlind’s innovations build on well-understood principles of data protection. They radically improve the practical use of privacy-preserving technologies by adding true scalability and faster processing, with support for all data and algorithm types. We support all cloud platforms and unlock the intellectual property value of data while preserving privacy and enforcing compliance with HIPAA and GDPR.