What is Privacy-Preserving Matching?

Privacy-Preserving Matching is Data Republic’s feature for matching datasets without exposing personally identifiable information (PII). Using this service, Data Custodians generate randomized tokens to replace PII in datasets uploaded to Data Republic. Matching of these randomized tokens only occurs through authorized match requests approved by Data Custodians on the platform.

With Privacy-Preserving Matching, PII never leaves your organization. Tokenization and hash slicing are performed before datasets are available for exchange on Data Republic. Privacy-Preserving Matching also helps organizations ensure that token matching is never centralized. Rather than storing tokens, hashes and keys centrally, Privacy-Preserving Matching decentralizes the information to prevent inference attacks.

What is Privacy-Preserving Matching?

  • Privacy-Preserving Matching is the process of identifying common individuals across datasets (belonging to one or more organizations).

  • Privacy-Preserving Matching projects can reveal valuable insights or attributes about consenting customers that organizations may not otherwise have access to.

  • Privacy-Preserving Matching projects can help an organization to enrich their customer data, unlock behavioral patterns for targeted demographics or reveal a targeted audience.

How does Privacy-Preserving Matching work?

No personally identifiable information (PII) is allowed onto the Data Republic Platform. Any PII belonging to an individual for data matching will be replaced with a random token generated by each Data Custodian's (data owner's) Contributor Node.

  • Data Custodians prepare data by uploading it into the Contributor Node which strips personal information fields and replaces identifiers with randomly generated tokens.

  • The PII then undergoes a process of non-reversible hashing, salting and slicing before being distributed to nodes on the Privacy-Preserving Matching Network for future match requests.

  • Using Data Republic's Privacy-Preserving Matching feature, data matching on individuals can then be performed against tokens uploaded by Data Custodians.

  • Once matching is complete, a token-pair match table is available. The table displays which two tokens have matched, and which fields they have matched on. For example, token A and token B matched on full name and mobile number, whereas token C and token D matched only on full name. Data matching can be performed on any of the available PII fields.
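
As an illustration of how a token-pair match table might be used, the sketch below joins two tokenized attribute tables through it. The table layout, column names, token values and the `join_on_matches` helper are all hypothetical; they are not Data Republic's actual output format.

```python
# Hypothetical token-pair match table: one row per matched pair of tokens,
# plus the PII fields on which the pair matched.
match_table = [
    {"token_a": "A1", "token_b": "B7", "matched_fields": ["full_name", "mobile"]},
    {"token_a": "A2", "token_b": "B9", "matched_fields": ["full_name"]},
]

# De-identified attribute tables held by each party, keyed by their own tokens.
retailer = {"A1": {"spend": 120}, "A2": {"spend": 45}, "A3": {"spend": 80}}
airline = {"B7": {"flights": 4}, "B9": {"flights": 1}}


def join_on_matches(matches, left, right, min_fields=1):
    """Join two attribute tables using only token pairs, never PII."""
    rows = []
    for m in matches:
        if len(m["matched_fields"]) < min_fields:
            continue  # skip weaker matches if a stricter rule is wanted
        if m["token_a"] in left and m["token_b"] in right:
            rows.append({**left[m["token_a"]], **right[m["token_b"]],
                         "matched_on": m["matched_fields"]})
    return rows


joined = join_on_matches(match_table, retailer, airline)
strict = join_on_matches(match_table, retailer, airline, min_fields=2)
```

Because the table shows which fields matched, an analyst could require more than one matching field (as `strict` does) when higher confidence is needed.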

Using this feature, customer personal information never leaves the data owner organization in its complete form. Analysts in workspaces never receive PII, only tables which indicate where a match has occurred against one of their own tokens. The decentralized matching methodology also makes it virtually impossible to re-identify a customer.

Privacy-Preserving Matching allows Data Republic users to realize the true potential of collaborative customer analytics while protecting customer PII and privacy.

What are the legal prerequisites for Privacy-Preserving Matching?

  • Customer consent and Data Custodian consent are required for any data matching activity on Data Republic. Data Custodians must have applicable privacy policies and collection statements in place before commencing a data match.

  • A legal framework for data sharing must be in place between organizations intending to complete a data share. To enable time-to-value, Data Republic can provide organizations with a Common Legal Framework to adopt and agree to (as opposed to creating bespoke legal frameworks for each data share). The framework outlines each organization's role and responsibility in a data share depending on whether they choose to act as a data recipient, data contributor or data product developer.

  • A data sharing agreement (i.e. Data License) can then be executed by organizations once a legal framework for data sharing is agreed. Organizations will usually negotiate and draft Data License terms together. The Data License on platform specifies important terms such as those related to access, data use, and what kind of outputs can be taken off platform.

Benefits of Privacy-Preserving Matching

De-identification, PII protection and safe matching

Match datasets without PII ever leaving your organization. Ensure privacy compliance and protection when sharing data.

Decentralized architecture

Tokens and hashes are never centrally stored, removing the risk of re-identification.

Quarantined analytics environments

Secure collaborative development environment for data collaboration. Provision and analyze matched data in secure, encrypted, cloud analytics environments on Data Republic.

Transparent multi-field matching

Analysts are shown which fields are matched, leading to increased match rates and improved confidence in match results.

Privacy-Preserving Matching FAQs

What is Privacy-Preserving Matching?

Privacy-Preserving Matching is Data Republic’s feature for Privacy-Preserving Record Linkage (PPRL). Using this feature, Data Custodians generate randomized tokens to replace Personally Identifiable Information (PII) in data uploaded to Data Republic. This ensures that data in Data Republic is not directly tied to an individual’s identity.

Using Privacy-Preserving Matching, organizations can de-identify data, while still preserving the capability to match records between de-identified records.

Matching activities are governed as part of projects on Data Republic and require both the consent of the Data Custodian as well as an approved license.

What sets Data Republic's Privacy-Preserving Matching apart?

  1. Private by Design - Unique privacy-preserving technology allows organizations to link or match records without customer personal information needing to leave secure organizational environments.

  2. Control over data assets - Data Republic’s governed access workflows ensure that Data Custodians retain full control and enforceable rights, through the data license, over who can access the data and how it is used.

  3. Control over uses of data - All match requests and proposed matching applications are subject to Data Custodian consent and an approved license. Analysts in workspaces never receive PII, only tables which indicate where a match has occurred against one of their own tokens.

  4. Decentralization of Data - Data Republic’s Privacy-Preserving Matching feature uses a decentralized solution to ensure that there is no 'token honeypot' or single point of vulnerability. Even in the unlikely event of a matcher node breach, only small token slices would be recoverable, with no means for these tokens to be reconstituted in order to pose a re-identification risk.

Where does my data go? Who can access my PII?

This decentralized process ensures that PII does not leave a contributor’s environment when preparing data for Data Republic projects. The original PII remains within the contributor’s internal environment (i.e. behind their firewall). No one, not even Data Republic, can access your PII, because PII never goes onto Data Republic.

Authorized analysts in workspaces, who have received approval to conduct a matching project, never receive PII, only tables which indicate where a match has occurred against one of their own tokens.

How does Privacy-Preserving Matching Work?

Privacy-Preserving Matching Process

Privacy-Preserving Matching software called a Contributor Node is installed behind your organization’s firewall.

Data Custodians prepare data for Data Republic projects by loading it into the Contributor Node. The node assigns a randomized token to each customer record and returns the token to the contributor. The Contributor Node cleanses the data by normalizing the PII (e.g. email addresses are converted to lowercase and spaces are removed) and fixing common formatting differences. The node then salts and hashes the PII fields, divides each hash value into slices, discards bits from those slices, then distributes these small slices and their corresponding tokens amongst matcher nodes. Tokens are encrypted with keys unique to each Contributor, with a different key being used for each slice.
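
The normalization step can be pictured with a minimal sketch. The email rule follows the example above; the phone rule and both function names are assumptions made for illustration, not the Contributor Node's actual rules.

```python
import re


def normalize_email(value: str) -> str:
    """Lowercase the address and remove all whitespace."""
    return re.sub(r"\s+", "", value).lower()


def normalize_phone(value: str) -> str:
    """Keep digits only (a hypothetical rule for a common formatting difference)."""
    return re.sub(r"\D", "", value)
```

Normalizing before hashing matters because a hash changes completely if even one character differs, so " Jane.Doe@Example.COM" and "jane.doe@example.com" would otherwise never match.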

Matching activities are governed as part of projects on Data Republic and require both the consent of the Data Custodian as well as an approved license. Matches are performed through a series of calls across the matching network where the hashed slices are verified.

Even if a matcher node is compromised, PII is not recoverable. Only a fragment of a hash could be extracted, with no way to tie that fragment to an identity. On average, a single hash slice will collide with 30 unrelated tokens. In privacy research, this property is referred to as “k-anonymity”.
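
The collision figure can be related to slice length with a simple estimate. Assuming slice values are uniformly distributed, a slice of b bits has 2^b possible values, so in a database of N tokens each value is shared by roughly N / 2^b tokens. The functions below are a sketch under that assumption; their names are hypothetical and the Contributor Node's actual calculation may differ.

```python
import math


def expected_collisions(num_tokens: int, slice_bits: int) -> float:
    """Average number of tokens sharing one slice value, assuming uniform hashes."""
    return num_tokens / (2 ** slice_bits)


def slice_bits_for(num_tokens: int, target_collisions: int = 30) -> int:
    """Largest slice length (in bits) that still gives at least the target collisions."""
    if num_tokens < target_collisions:
        raise ValueError("database too small to reach the target")
    return math.floor(math.log2(num_tokens / target_collisions))
```

Under this estimate, a database of one million tokens would use 15-bit slices, and larger databases could use longer slices while keeping the same level of ambiguity.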

[Figure: Privacy-Preserving Matching workflow diagram]

What is a Contributor Node?

The Contributor Node is the technical component of the Privacy-Preserving Matching feature that generates tokens, hashes PII, and distributes hash slices to Matcher Nodes.

The Contributor Node is installed within the Data Custodian’s own IT environment so that PII never leaves the Data Custodian. PII data can be uploaded via the Contributor Node to be tokenized. However, no PII data is stored within the Contributor Node.

Do Privacy-Preserving Matching Contributor Nodes support SAML integration?

Yes, Privacy-Preserving Matching currently supports SAML integration as a beta feature.

Do Privacy-Preserving Matching Contributor Nodes support 2FA?

Support for 2FA is intended to be delivered through SAML integration with an organization's own Identity Provider (IdP). As long as the IdP supports multi-factor authentication, you will be able to use it with Privacy-Preserving Matching.

Is it possible to link the slices back to the Contributor Node?

In order to link slices back to a specific Contributor, an attacker would need to know the database UUIDs of the target Contributors, as well as have access to one or more Matcher Node databases. There is currently no known way to do this. In any event:

  • The hash slice values would be too short to identify an individual;

  • The token values are encrypted with a different key for each contributor and matcher node, so hash slice values cannot be joined between matcher databases;

  • A Contributor can delete a token database at any time and purge all data, including the database UUID values, from the matcher network.

How can you stop unauthorized Contributor Nodes from seeding incorrect matching data?

The Contributor Node's TLS certificate authorizes the node to update only the token databases associated with the contributing organization. In addition, token databases are identified by randomly generated UUIDs, which are not known to other contributors.

How do you patch Contributor Nodes after an update? How often do you patch? Do you tell me how to patch because it is in my environment?

Update notifications are distributed via email and the Data Republic Help Centre. Release Notes include instructions for applying the latest update, as well as information about the changes that have been made and if any security-related patches are included.
Data Republic will not automatically apply patches to software running in your environment. The instructions will advise you how you can apply required changes.

What is a Matcher Node?

The Matcher Node is the technical component in the Data Republic Matching network that stores hash slices of PII produced during the tokenization process. No single Matcher Node can contain an entire hashed field value for PII.

When a request for matching is made, each Matcher Node compares hash slices for each token and returns token pairs to an Aggregator Node. The Aggregator Node retains the matched token pairs common to all Matcher Nodes and provides these to the requestor in the form of a token match table, allowing users to perform their own table join for matches.
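
The aggregation step amounts to a set intersection, which the sketch below illustrates. It deliberately ignores token encryption and treats tokens as plain strings; the `aggregate` function and the sample pairs are hypothetical.

```python
def aggregate(per_node_pairs):
    """Keep only the token pairs reported by every matcher node."""
    if not per_node_pairs:
        return set()
    common = set(per_node_pairs[0])
    for pairs in per_node_pairs[1:]:
        common &= set(pairs)
    return common


node_results = [
    {("A1", "B7"), ("A2", "B9"), ("A3", "B4")},  # last pair is a chance collision
    {("A1", "B7"), ("A2", "B9"), ("A5", "B2")},  # a different chance collision
    {("A1", "B7"), ("A2", "B9")},
]
matches = aggregate(node_results)
```

A short slice collides with many unrelated records on any one node, but an unrelated pair is unlikely to collide on every node, so the intersection filters out the false matches.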

What data gets sent to the Matcher Nodes?

Each customer record consists of one or more fields, and each field consists of multiple slices. Slices are distributed amongst the different Matcher Nodes. For each slice, the Contributor Node will send:

  • An encrypted token (token is encrypted with a different key for each contributor and matcher node);

  • UUID of the Contributor's token database (randomly generated - it does not identify the contributor);

  • Hash slice value (a portion of the hash, 6-17 bits long depending on database size);

  • A field identifier for this hash slice (e.g. "field 1, slice 2");

  • The bit length of the hash slice.
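
One way to picture this per-slice payload is as a small record type. The class and attribute names below are hypothetical and only mirror the items listed above; the real wire format is not described here.

```python
from dataclasses import dataclass
import uuid


@dataclass(frozen=True)
class SliceMessage:
    encrypted_token: bytes        # token encrypted with a contributor- and node-specific key
    token_database_id: uuid.UUID  # randomly generated; does not identify the contributor
    field_index: int              # e.g. field 1
    slice_index: int              # e.g. slice 2
    slice_bits: int               # bit length of the hash slice
    slice_value: int              # the shortened hash slice itself

    def __post_init__(self):
        # The value must fit in the declared number of bits.
        if not 0 <= self.slice_value < 2 ** self.slice_bits:
            raise ValueError("slice_value does not fit in slice_bits")


msg = SliceMessage(b"\x00\x01", uuid.uuid4(), 1, 2, 12, 0xABC)
```

Nothing in the record carries the PII itself or the full hash.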

How do the Privacy-Preserving Matcher nodes authenticate with each other?

All connections between Privacy-Preserving Matcher Nodes are authenticated using mutual TLS.

Is there certificate-based authentication between Contributor Nodes and Matcher Nodes?

Yes. Contributor Nodes have their own certificate signed by Data Republic and this is checked by the Matcher Nodes when establishing the SSL/TLS link.
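
For readers unfamiliar with certificate-based authentication, the sketch below shows how a server can be configured to reject clients that do not present a certificate signed by a trusted authority, using Python's standard `ssl` module. It is a generic illustration, not Data Republic's implementation; the function name and file-path parameters are hypothetical.

```python
import ssl


def matcher_server_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS server context that rejects clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # the client must present a certificate
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # authority that signs node certificates
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # this node's own identity
    return ctx


context = matcher_server_context()  # pass real file paths in an actual deployment
```

With `CERT_REQUIRED` set on the server side, the handshake fails unless the connecting node proves its identity, which is what makes the authentication mutual.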

Can we have a private Matcher Network?

Private Matcher networks are being reviewed and scoped, but are not yet available.

In the interim, Contributors may want to consider agreeing on and using their own salt values. Since data can be pre-hashed prior to uploading to a Contributor Node, a Contributor could choose any salt value they wanted. This value would only have to be known by the other Contributors that intend to match.

This would prevent matching with non-participating partners, but that may be a trade-off worth making for some Data Custodians.

How does the tokenization and hashing process work?


After the data is uploaded into the Contributor Node, the node generates a randomized token for each record. A token is a single string of random letters and numbers. This token is not derived from the data itself. Tokens do not hold any PII and can be appended to attribute datasets on Data Republic, allowing approved datasets to be matched at an individual level without the use of PII on Data Republic. Data Custodians may choose to download the tokens if they wish to later re-identify the tokenized individuals (subject to consumer consent and their privacy policies).
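
A token of this kind can be sketched with Python's `secrets` module. The token length, hex format and the `new_token` name are assumptions made for illustration; the actual token format is not specified here.

```python
import secrets


def new_token(num_bytes: int = 16) -> str:
    """Return a random hex token that is not derived from the record."""
    return secrets.token_hex(num_bytes)


records = [{"email": "jane@example.com"}, {"email": "raj@example.com"}]

# The token-to-record mapping stays inside the organization; it is what
# would allow the Data Custodian to re-identify individuals later.
token_map = {new_token(): record for record in records}
```

Because the token comes from a random source rather than from the record, knowing a token reveals nothing about the underlying PII.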

The PII fields are then prepared for hashing. The hashing algorithm used is SHA-512 (part of the SHA-2 family), as it is cryptographically strong and resistant to known attacks against older hashing algorithms. A salt, which is a random string that differs for each field, is appended to the end of the plain text field prior to hashing. The salt value is only distributed to Privacy-Preserving Matching by known Contributor Nodes. Alternatively, contributors can agree on their own salt values.
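
The salting and hashing step can be sketched with Python's standard `hashlib` module. The salt value and the `salted_hash` name are hypothetical, and the UTF-8 encoding is an assumption.

```python
import hashlib


def salted_hash(normalized_value: str, salt: str) -> bytes:
    """SHA-512 of the field value with the field's salt appended."""
    return hashlib.sha512((normalized_value + salt).encode("utf-8")).digest()


EMAIL_SALT = "example-email-salt"  # hypothetical; real salts are random strings
digest = salted_hash("jane.doe@example.com", EMAIL_SALT)
```

Two contributors produce the same digest for the same person only if they use the same salt, which is why contributors who agree on a private salt can match only with each other.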

Next is the process of slicing, shortening and distributing hash fragments across the matching network. Each 512-bit hash is split into 16 slices of 32 bits each. The Contributor Node will then make two calculations:

  1. How short should a slice be to increase the chances that a hash slice will collide with (have the same value as) the slices of a large number of completely unrelated individuals?

  2. How many slices should be sent, to allow the Aggregator Node to filter out all the false matches?

In a typical scenario, an individual hash slice might be 12-14 bits long, with 4-5 slices sent to the matcher nodes. The rest are discarded.

Because the tokens are also encrypted (with a different key for each Contributor and each slice) and because much of the original hash value is discarded, it is not possible to “put the hash back together.” If fragments are ever leaked, the PII cannot be reconstructed, nor can an individual be re-identified.
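
The slicing and shortening steps can be sketched as follows. The 16 x 32-bit split comes from the description above; which bits of each slice are kept, which slices are selected, and the function names are assumptions made for illustration.

```python
import hashlib

SLICE_COUNT = 16   # slices per SHA-512 digest
SLICE_WIDTH = 32   # bits per slice before shortening


def slice_digest(digest: bytes):
    """Split a 64-byte digest into 16 integers of 32 bits each."""
    step = SLICE_WIDTH // 8
    return [int.from_bytes(digest[i:i + step], "big")
            for i in range(0, len(digest), step)]


def shorten(slices, keep_bits: int, keep_count: int):
    """Keep the top `keep_bits` of the first `keep_count` slices; drop the rest."""
    return [s >> (SLICE_WIDTH - keep_bits) for s in slices[:keep_count]]


digest = hashlib.sha512(b"jane.doe@example.com" + b"example-salt").digest()
fragments = shorten(slice_digest(digest), keep_bits=13, keep_count=4)
```

With these example parameters only 4 x 13 = 52 of the 512 hash bits ever leave the Contributor Node, and each fragment goes to a different Matcher Node.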

What is data matching?

Data matching is the process of record linkage across multiple datasets by comparing shared data. Privacy-Preserving Matching is designed to facilitate this kind of matching, but without either party ever having to disclose the contents of the shared data (usually, customer PII). Since Personally Identifiable Information (PII) is not allowed on Data Republic, data matching is performed against tokenized datasets made available by Data Custodians (data owners) on Data Republic. All matching activities are subject to customer consent and Data Custodians must have the applicable privacy policies and collection statements in place.

What is Personally Identifiable Information (PII)?

Personally Identifiable Information (PII) is any information that can be used to reasonably identify a single person. This may include data such as an email address, street address, driver's license number, phone number, and social security number. However, it may also include information that does not directly reference a person but could be used to re-identify someone when combined with other details. For example, IP addresses, location history or employment history.

What is a token?

A token is a single string of seemingly random letters and numbers generated by the Privacy-Preserving Matching feature for each person. Tokens do not hold any PII and can be appended to attribute datasets on Data Republic, allowing approved datasets to be matched at an individual level without the use of PII on Data Republic.

How are tokens used in data matching?

By adding tokens to de-identified datasets on Data Republic, tokens across two datasets can be matched in a secure collaborative development environment without risking PII. The results of a data match can be used to build matched data products on Data Republic that will help organizations to enrich customer data views, unlock behavioral patterns for targeted demographics or reveal a targeted audience.
