This article explains the tokenization, salting and hashing process in Privacy-Preserving Matching. If you are unfamiliar with Privacy-Preserving Matching, please review An introduction to Data Republic's Privacy-Preserving Matching feature.

In this article, you will learn about:

A high-level overview of the end-to-end Privacy-Preserving Matching process

When it comes to PII management, Privacy-Preserving Matching means that PII never leaves your organization. Tokenization and hash slicing are performed before datasets are available for exchange on Data Republic.

Watch this video for insight into the tokenization, salting and hashing process.

Tokenization of personally identifiable information

Contributors upload customer data into a Contributor Node behind their firewall. The node assigns a randomised token and returns the token to the contributor. These tokens are downloaded as a file to be used for confirming a match on the Data Republic platform. 

The token is completely random and does not reference the PII fields. 
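
To illustrate what "completely random" means here, the following is a minimal sketch, not the Contributor Node's actual implementation. The function name assign_tokens, the record IDs, and the 128-bit token length are all assumptions made for the example. The point it shows is that the token is drawn from a secure random source and is never derived from the PII itself.

```python
import secrets

def assign_tokens(records):
    """Give every record a fresh random token that does not depend on its PII."""
    tokens = {}
    for record_id in records:
        # 16 random bytes rendered as 32 hex characters (length is an assumption)
        tokens[record_id] = secrets.token_hex(16)
    return tokens

# Hypothetical customer records held behind the contributor's firewall
customers = {
    "cust-001": {"email": "alice@example.com"},
    "cust-002": {"email": "bob@example.com"},
}

token_map = assign_tokens(customers)
for record_id, token in token_map.items():
    print(record_id, token)
```

Because the token is generated independently of the record contents, seeing a token reveals nothing about the email address it was assigned to.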

The salting and hashing process

After the tokenization process, PII fields are salted. A salt is a random string, different for each field, that is appended to the end of the plain-text field before hashing. The salt value is distributed by the Privacy-Preserving Matching network and is known only to Contributor Nodes.
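
The salting step described above can be sketched as follows. This is an illustration under stated assumptions: the FIELD_SALTS dictionary and the salt_field function are hypothetical names, and the salts are generated locally only so the example runs on its own. In the real system the salt values are distributed by the Privacy-Preserving Matching network.

```python
import secrets

# Hypothetical per-field salts. Generated locally here for illustration only;
# in practice they would be supplied by the matching network.
FIELD_SALTS = {
    "email": secrets.token_hex(16),
    "phone": secrets.token_hex(16),
}

def salt_field(field_name, value, salts=FIELD_SALTS):
    """Append the field-specific salt to the end of the plain-text value."""
    return value + salts[field_name]

salted_email = salt_field("email", "alice@example.com")
print(salted_email)
```

Using a different salt per field means the same string appearing in two different fields produces two different salted values.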

The salted PII fields are then hashed (transformed in a way that cannot be reversed) and shredded into small parcels of characters within the Data Custodian’s IT environment (via a Contributor Node). Each slice of the hash is then sent to a different Matcher Node for later matching. A network of Matcher Nodes within trusted organizations in Data Republic’s ecosystem then allows for decentralised matching calls and token retrieval.

Slicing and distributing hash fragments

Keeping an entire hash in one place creates a risk of re-identification. Privacy-Preserving Matching addresses this by slicing each hash and distributing the fragments across the node network. The approach relies on a high chance of ‘collisions’: false positive matches where multiple email addresses have the same hash slice inside a given Matcher Node.

With Privacy-Preserving Matching, hashes are split into 16 slices of 32 bits each. Slices are sent to pre-allocated Matcher Nodes. For example, all the “part 1 of email address” slices from all Contributors go to “matcher node 1”.
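
The arithmetic behind this split is that a SHA-512 digest is 512 bits (64 bytes), so 16 equal slices come to 32 bits (4 bytes) each. The sketch below shows only that split. The function name hash_and_slice, the example salt, and the node labels in the routing dictionary are assumptions for illustration, and the sketch does not model the selection of which slices are actually sent.

```python
import hashlib

def hash_and_slice(salted_value, slice_count=16):
    """Hash a salted value with SHA-512 and cut the digest into equal slices."""
    digest = hashlib.sha512(salted_value.encode("utf-8")).digest()  # 64 bytes
    size = len(digest) // slice_count  # 64 // 16 = 4 bytes = 32 bits
    return [digest[i * size:(i + 1) * size] for i in range(slice_count)]

# Hypothetical salted input: plain-text value with a made-up salt appended
slices = hash_and_slice("alice@example.com" + "hypothetical-salt")

# Slice i is destined for its pre-allocated node (labels are illustrative)
routing = {f"matcher-node-{i + 1}": s.hex() for i, s in enumerate(slices)}
for node, fragment in routing.items():
    print(node, fragment)
```

Each node label maps to a single 8-character hex fragment, so no single node in this sketch holds enough of the digest to reconstruct it.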

Not all the slices are used. This is a security feature – it means that if the hash slices ever leak it is impossible to reconstruct the entire hash. 

Privacy-Preserving Matching uses variable-length slicing, meaning different contributors use different-sized slices depending on how many unique customers they have. This is done to guarantee false positive matches inside a Matcher Node and so control re-identification risk.
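
To see why slice size would depend on customer count, consider a simplified model in which slice values are uniformly distributed. With n unique customers and a slice of b bits, each slice value is shared by about n / 2^b customers on average. The sketch below uses that model to pick the largest slice size that still keeps a chosen average number of customers per slice value. The function names and the target of 4 are assumptions for illustration; the source does not describe the actual sizing rule.

```python
import math

def expected_matches_per_slice(unique_customers, slice_bits):
    """Average number of customers sharing one slice value (uniform model)."""
    return unique_customers / (2 ** slice_bits)

def max_slice_bits(unique_customers, min_expected_matches):
    """Largest slice size (in bits) that keeps the average at or above the target."""
    # n / 2^b >= k  is equivalent to  b <= log2(n / k)
    bits = math.floor(math.log2(unique_customers / min_expected_matches))
    return max(0, bits)

# Hypothetical contributors of different sizes, with an assumed target of 4
for n in (1_000, 1_000_000):
    bits = max_slice_bits(n, 4)
    print(n, bits, expected_matches_per_slice(n, bits))
```

Under this model a contributor with fewer customers ends up with a shorter slice, which is consistent with the idea that smaller datasets need coarser slices to keep false positives likely.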

What is hashing and what method is used to tokenize PII?

Hashing is a one-way cryptographic function that turns PII into a single string of seemingly random letters and numbers. SHA-512 is the secure hashing method used by the Privacy-Preserving Matching feature.

SHA-512 is fast to compute, well understood, and supported by a wide variety of programming languages. SHA-512 is preferred over SHA-256 because of its greater resistance to certain kinds of advanced attacks.

We also add salt (random data) before hashing to defeat a common kind of attack against hashing called a rainbow table attack, where an attacker pre-calculates millions of hashed values to use as a lookup table if they ever come across a hash value and want to reverse it. Note that this attack is already greatly weakened by the fact that the Privacy-Preserving Matching feature splits hashes into slices and sends them to different Matcher Nodes across the Data Republic Matcher Node Network.
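
The effect of the salt on a precomputed lookup can be shown with a small sketch. It uses a plain dictionary of precomputed hashes rather than a true chain-based rainbow table, which is enough to demonstrate the principle. The candidate email addresses and the salt string are made up for the example.

```python
import hashlib

def sha512_hex(text):
    """Return the SHA-512 digest of a string as hex."""
    return hashlib.sha512(text.encode("utf-8")).hexdigest()

# The attacker precomputes hashes of likely values (a tiny stand-in for millions)
candidates = ["alice@example.com", "bob@example.com"]
lookup_table = {sha512_hex(c): c for c in candidates}

# Without a salt, a leaked hash is reversed by a simple lookup
unsalted_hash = sha512_hex("alice@example.com")
print(lookup_table.get(unsalted_hash))

# With a salt the attacker does not know, the same lookup finds nothing
secret_salt = "hypothetical-secret-salt"  # made-up value for illustration
salted_hash = sha512_hex("alice@example.com" + secret_salt)
print(lookup_table.get(salted_hash))
```

The first lookup recovers the email address and the second returns nothing, because the precomputed table was built without knowledge of the salt.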

Download Privacy-Preserving Matching Whitepaper
Download Privacy-Preserving Matching Technical Whitepaper
Privacy-Preserving Matching User Guide
An introduction to Privacy-Preserving Matching
