Tokenization
Data tokenization is a cybersecurity method that replaces specific, sensitive data with an algorithmically generated, non-sensitive token that obscures the content of the original data. Rather than representing the original data, a token serves as a distinct identifier that can be used to retrieve that data. This differs from encryption, in which information is encoded and decoded using an encryption key.
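To make the distinction concrete, here is a minimal Python sketch (the third-party `cryptography` package and all names are illustrative assumptions, not part of any particular product): an encrypted value is mathematically derived from the plaintext and can be decoded by anyone holding the key, while a token is a random surrogate that reveals nothing on its own and must be exchanged with whoever holds the mapping.

```python
import secrets
from cryptography.fernet import Fernet  # third-party package: cryptography

card_number = "4111 1111 1111 1111"

# Encryption: the ciphertext is derived from the plaintext; anyone who
# obtains the key can decode it.
key = Fernet.generate_key()
cipher = Fernet(key)
ciphertext = cipher.encrypt(card_number.encode())
assert cipher.decrypt(ciphertext).decode() == card_number

# Tokenization: the token is random and carries no information about the
# original value; recovering the value requires asking whoever holds the
# token-to-value mapping.
token = secrets.token_hex(16)
token_store = {token: card_number}  # kept by the tokenization provider
assert token_store[token] == card_number
```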
How Tokenization Works
Companies looking to boost security or privacy through data tokenization typically use third-party tokenization solutions. The provider stores the data in a separate, secured location and issues the company tokens that stand in for that data. Usually, only sensitive data elements are tokenized, while fields that do not contain sensitive data are left in their raw format. The idea is that if an outside attacker or a malicious employee gains access to a company’s dataset containing tokens, they will be unable to access the original data behind the tokens.
When a company wants to access its sensitive data, it passes the relevant token to its security provider, which uses its data tokenization tools to fetch the original data and pass it back. The provider is the only party capable of resolving a token back to the underlying data, and each token is unique to the client; a provider will not reuse the same token across multiple clients.
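Here is a rough sketch of that round trip, with a stand-in for the provider’s service (the function names and record fields are assumptions for illustration only): only the sensitive fields are swapped for tokens before the record circulates internally, and the originals come back only by way of the provider.

```python
import secrets

# Stand-in for the third-party provider's service; in practice these calls
# would go over an API, and the mapping would live on the provider's side.
_provider_store = {}

def provider_tokenize(value: str) -> str:
    token = "tok_" + secrets.token_hex(8)
    _provider_store[token] = value
    return token

def provider_detokenize(token: str) -> str:
    return _provider_store[token]

record = {"name": "Jane Doe", "ssn": "123-45-6789", "visit_date": "2023-04-01"}
SENSITIVE_FIELDS = {"name", "ssn"}  # non-sensitive fields stay in raw form

tokenized = {
    field: provider_tokenize(value) if field in SENSITIVE_FIELDS else value
    for field, value in record.items()
}
# `tokenized` can now move through internal systems without exposing PII.

# When the original value is needed, the token goes back to the provider.
assert provider_detokenize(tokenized["ssn"]) == "123-45-6789"
```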
Practical Applications
Commonly used in e-commerce, data tokenization provides an additional layer of security for sensitive information. It largely eliminates the unnecessary and risky passing of sensitive information through a company’s internal systems, which would otherwise create security exposure.
Tokenization came to prominence as a security technology in e-commerce, and organizations in healthcare and other industries are now giving the technology a look. These organizations are driven by a desire to embrace analytics and AI, both of which require massive amounts of data. The necessary data collection can raise serious compliance concerns. In the United States, Health Insurance Portability and Accountability Act (HIPAA) regulations state that personal healthcare data must remain private. To address this mandate, healthcare organizations have been experimenting with a number of different privacy-enhancing methods.
Many national and international regulators accept data tokenization as a means of complying with their data privacy rules, and properly implemented tokenization of patient data can be considered compliant under HIPAA. However, tokenization involves a major tradeoff between the degree of privacy achieved and the level of utility retained.
A typical data tokenization system for patient data anonymizes records by removing identifiable information such as names and five-digit zip codes. The de-identification process is typically configured manually for each client, based on that client’s input.
During this process, the system creates one or more tokens for each designated record so that de-identified patient records can be placed into one or more datasets. Tokens are often derived from combinations of the identifying information, depending on the system configuration, and can be constructed to support deterministic or probabilistic matching methods that link records across datasets.
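As a sketch of the deterministic case (the normalization rules, field combination, and key handling here are assumptions): a keyed hash over a normalized combination of identifiers produces the same token for the same patient in every dataset, so de-identified records can be linked without exposing the identifiers themselves.

```python
import hmac
import hashlib

# Secret key held by the tokenization system; without it, tokens cannot be
# regenerated from guessed identifiers. Illustrative value only.
LINKAGE_KEY = b"example-secret-key"

def normalize(value: str) -> str:
    """Strip case and whitespace so equivalent inputs hash identically."""
    return "".join(value.lower().split())

def match_token(first_name: str, last_name: str, birth_date: str) -> str:
    """Deterministic token built from a configured combination of identifiers."""
    message = "|".join(normalize(v) for v in (first_name, last_name, birth_date))
    return hmac.new(LINKAGE_KEY, message.encode(), hashlib.sha256).hexdigest()

# The same patient appearing in two different datasets receives the same
# token, enabling deterministic linkage across the de-identified records.
token_a = match_token("Jane", "Doe", "1980-05-17")
token_b = match_token(" JANE ", "Doe", "1980-05-17")
assert token_a == token_b
```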
Security Challenges of Tokenization
There are two common data tokenization solutions: a tokenization vault and vaultless tokenization. Let’s take a closer look at each to understand its advantages and disadvantages.
A tokenization vault stores the original plaintext information in a file or database after generating a token. When the original value must be retrieved, a call is made to the vault using the token, and the tokenization system serves up the requested data.
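Here is a minimal sketch of a vault, using an in-process SQLite database as a stand-in for the provider’s secured store (the table layout and method names are illustrative): note that the vault necessarily holds a full copy of the original values.

```python
import secrets
import sqlite3

class TokenVault:
    """Toy tokenization vault: tokens are random; originals live in the vault."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vault (token TEXT PRIMARY KEY, value TEXT)"
        )

    def tokenize(self, value: str) -> str:
        token = secrets.token_urlsafe(16)
        # The vault keeps the original plaintext -- a second copy of the data.
        self.db.execute("INSERT INTO vault VALUES (?, ?)", (token, value))
        return token

    def detokenize(self, token: str) -> str:
        row = self.db.execute(
            "SELECT value FROM vault WHERE token = ?", (token,)
        ).fetchone()
        if row is None:
            raise KeyError("unknown token")
        return row[0]

vault = TokenVault()
token = vault.tokenize("123-45-6789")
assert vault.detokenize(token) == "123-45-6789"
```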
The most obvious issue with this approach is that it places a copy of the original, valuable data in another location. This practice is commonly referred to as “moving the problem,” and it creates another attack point for those with malicious intent. The use of a vault also presents inherent scalability issues: a tokenization vault does not function well in distributed data ecosystems. For instance, tokenizing datasets from multiple parties could require significant coordination among data partners.
Another security issue is the fact that the provider has access to its clients’ sensitive and often valuable data. Although it is standard practice to put firewalls and other security measures in place, a company using a provider’s tokenization services largely depends on trust and legal agreements. Companies in this situation could strip personally identifiable information before passing data to the provider’s system to avoid privacy disasters, but the remaining data could still have value for unauthorized users.
Vaultless data tokenization, on the other hand, does not store the original plaintext data in a secondary location. Instead, it uses a secure cryptographic device that maps tokens to plaintext values. Although this approach avoids creating a secondary point of attack, it is vulnerable to chosen-plaintext and chosen-ciphertext attacks, in which an attacker submits crafted tokenization requests in an attempt to recover the device’s key or mapping. These attacks can be resource-intensive, but they have proven effective.
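Here is a rough sketch of the vaultless idea, using the third-party `cryptography` package’s Fernet cipher as a stand-in for the format-preserving encryption a production system would typically use: the token is derived from the plaintext with key material held inside the cryptographic device, no second copy of the data is stored anywhere, and detokenization is simply decryption.

```python
from cryptography.fernet import Fernet  # third-party package: cryptography

# Key material held inside the secure cryptographic device; nothing else is
# stored -- there is no vault holding a copy of the plaintext.
device_key = Fernet.generate_key()
device = Fernet(device_key)

def tokenize(value: str) -> str:
    # Real vaultless systems typically use format-preserving encryption so the
    # token looks like the original field (e.g., 16 digits); Fernet is used
    # here only to keep the sketch self-contained and runnable.
    return device.encrypt(value.encode()).decode()

def detokenize(token: str) -> str:
    return device.decrypt(token.encode()).decode()

token = tokenize("123-45-6789")
assert detokenize(token) == "123-45-6789"
```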
Performance Challenges of Tokenization
In addition to having privacy and security vulnerabilities, data tokenization also presents challenges related to use and performance.
One of the most important disadvantages is the cumbersome configuration process required to de-identify records. This process must be repeated whenever a new data partner is added, and sometimes when an existing partner adds a new dataset. Adding steps between the generation and use of data slows time-to-insight and can cause some datasets to lose their usefulness before they are ever leveraged.
Furthermore, an aggressive de-identification configuration can strip a dataset of critical information. Tokenization, like other anonymization techniques, inherently degrades data: datasets lose precision when information is stripped away and replaced with tokens.
Other performance challenges include:
- Lack of digital rights management. Tokenization clients have no assurances that their data won’t be used for unauthorized purposes.
- Lack of token security. Tokenization is not impervious to re-identification. Sophisticated malicious actors are still able to use context and computational techniques to re-identify anonymized patients.
- Data residency concerns. Because tokenization requires the movement of data to a third party, compliance with data residency laws is a concern that the technology does not address.
A Better Approach: Privacy Enhancing Computation
TripleBlind’s groundbreaking encrypted-in-use approach, built on practical breakthroughs grounded in decades of trusted, verified research in cryptography and mathematics, avoids many of the issues associated with tokenization.
Our Blind Compute technology does not involve hiding, removing, or replacing data. This solution maintains full data fidelity, resulting in more accurate computational outcomes. Its key features include:
- A superior ability to process highly complex data. In addition to processing text, images, voice recordings, and video, our privacy-enhancing technology can process complex genomic and unstructured data without loss of fidelity.
- Security and privacy protections for all parties. Our technology allows all data partners to safely provide and process sensitive data in encrypted space, protecting both the data and processing algorithms.
- Strict preservation of data ownership. Our system is based on the use of one-way, one-time encryption keys, with a new key generated for each algorithm access. This provides a cryptographic guarantee that data can only be used for authorized purposes. With companies keeping their data in-house, it also addresses any data residency issues.
TripleBlind’s software-only API addresses a wide range of use cases, allowing for the safe and secure commercialization of sensitive data. If you would like to learn more about how Blind Compute offers a number of advantages over tokenization, contact us today to schedule a demo.