Data Republic does not allow personally identifiable information (PII) of customers or individuals to enter the platform.
In order to facilitate analysis using individual records, Data Republic has developed privacy-preserving technology. Data Republic’s Privacy-Preserving Matching feature ensures customer PII never leaves your organization’s environment when conducting projects on Data Republic. The Privacy-Preserving Matching feature removes risks of re-identification and misuse, while still allowing accurate insights to be derived from matched datasets.
Before uploading a dataset for exchange, Privacy-Preserving Matching nodes first divide the data into the fields that are not to be shared (the PII) and fields that can be shared (such as attribute data). This division means that the companies you license data to (and Data Republic) never have access to your customer PII. Datasets are anonymized using tokens, and PII is protected using salting and hashing. But Privacy-Preserving Matching goes further by dividing the hashed PII and distributing the slices on a network of highly secured nodes.
Privacy-Preserving Matching then uses a sophisticated technique to accurately match the individual records of the datasets, so data analysts can be confident in the quality of the matching and the data itself. And because the hashed PII is sliced and distributed, it can’t be re-identified.
- Privacy-Preserving Matching Whitepapers
- How does Privacy-Preserving Matching work?
- Managing matching projects on the Data Republic platform
Privacy-Preserving Matching Whitepapers
In today’s digital environment, data matching is the engine which drives everything from loyalty programs and credit scores to fraud prevention and online customer service. When organizations prepare datasets for matching, customer personally identifiable information (PII) can be put at risk of being lost or stolen, because common hashing and encryption methods rely on some form of PII being present or referenced.
Data Republic’s Privacy-Preserving Matching feature revolutionizes customer privacy protections when matching data by providing a more secure, decentralized alternative to common hashing or encryption methods.
Read the Privacy-Preserving Matching Whitepaper to learn more about:
- What sets Data Republic’s Privacy-Preserving Matching apart
- The role of data matching
- Current approaches to PII protection
- How Privacy-Preserving Matching works
How does Privacy-Preserving Matching work?
Privacy-Preserving Matching is Data Republic’s service for privacy-preserving record linkage (PPRL). Using this service, Data Custodians generate randomized tokens to replace Personally Identifiable Information (PII) in data uploaded to the Data Republic platform. This ensures that data in the Data Republic platform is not directly tied to an individual’s identity.
Using Privacy-Preserving Matching, organisations can de-identify data while still preserving the capability to match records between de-identified datasets. Matching of datasets with tokens only occurs through authorized match requests approved by Data Custodians.
Privacy-Preserving Matching features three types of virtual machines in its architecture: the Contributor Node, the Matcher Node and the Aggregator Node.
Tokenization, hashing, slicing and distribution of PII
The Contributor Node is the technical component of the Privacy-Preserving Matching service which generates tokens and hashes and slices PII.
There are two Contributor Nodes for each match: one for your organisation and one for the organisation you intend to match with. Both Data Custodians install a Contributor Node in their organisation’s environment. This ensures that PII never leaves the organisation. The Data Custodians upload datasets into the Contributor Node. All PII is hashed prior to being sent to the node (and is also encrypted in transit via SSL when being sent to the node). The node assigns a randomised token to each customer record and returns the token to the contributor. The PII is cleansed, salted and hashed, then sliced into parts which are distributed amongst the Matcher Nodes.
The Contributor Node is a virtual machine image provided by Data Republic, that the contributor runs inside their own IT environment. The expected size of your customer database will inform the minimum system requirements for your Contributor Node installation.
As the Contributor Node is run entirely on the custodian's system and hardware (or, for example, on a cloud-based storage solution operated by the custodian), the raw dataset is never transferred outside of the custodian’s environment, nor does Data Republic have access to the Contributor Node as it sits behind the custodian’s firewall.
Contributor Node process
The first step in the process for using Privacy-Preserving Matching is to upload data with PII into the Contributor Node. The Data Custodian (the authorized person from the contributing organization) extracts the customer data from a database or CRM and uploads it into their organization’s own Contributor Node.
The raw dataset uploaded to the Contributor Node by the custodian will include PII and may include the following details:
- name (given name, and family name);
- date of birth;
- phone number (mobile, home and/or work);
- email address;
- gender; and
- a “natural key”, which is the customer identifier from the source database and known only to the custodian’s organization.
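As a rough illustration of the division described above, the sketch below separates one raw record into PII fields and shareable attribute fields, and assigns it a random token. The field names, the `split_record` helper and the token format are assumptions made for this example; they are not the Contributor Node's actual schema or API.

```python
import secrets

# Hypothetical field names; the real Contributor Node schema is not shown here.
PII_FIELDS = {"given_name", "family_name", "date_of_birth", "phone", "email", "gender"}


def split_record(record: dict) -> tuple[str, dict, dict]:
    """Return (token, pii, attributes) for one raw customer record."""
    # Assumption: a random 128-bit token, generated independently of the PII.
    token = secrets.token_hex(16)
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    # The natural key stays with the custodian, so it is not a shareable attribute.
    attributes = {
        k: v for k, v in record.items()
        if k not in PII_FIELDS and k != "natural_key"
    }
    return token, pii, attributes


raw = {
    "natural_key": "CRM-000123",
    "given_name": "Jane",
    "family_name": "Doe",
    "email": "jane.doe@example.com",
    "loyalty_tier": "gold",
}

token, pii, attributes = split_record(raw)

# Only the custodian keeps the link between its own identifier and the token.
token_map = {raw["natural_key"]: token}
```

In this sketch only `token` and `attributes` would ever be candidates for upload; `pii` and `token_map` stay inside the custodian's environment.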
Salting and hashing of PII fields
All PII fields are salted and hashed before being sent to the Contributor Node:
- If using the provided Web UI, the hashing is performed in the browser, prior to the browser application calling the Contributor Node API.
- Otherwise, the Contributor performs the hashing step on their own systems, using the specifications provided by the Contributor Node API.
Salt values are distributed by the Consul, the Privacy-Preserving Matching configuration service. Salt values are unique to each field name and are randomly generated 128-bit values. The hash algorithm is SHA-512. SHA-512 is preferred over SHA-256 because of its greater resistance to certain kinds of advanced attacks (see Comparison of SHA functions). Each field is associated with a simple normalization function, which is used to ensure that the same PII value can be matched after hashing, even if represented slightly differently between Custodians (e.g. emails with upper or lower case letters).
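The normalize, salt and hash sequence can be sketched as follows. This is a minimal illustration only: the normalization rule (trim and lower-case), the way the salt is combined with the value (simple prefixing) and the function names are assumptions, and the locally generated salt stands in for a value that the configuration service would distribute.

```python
import hashlib
import secrets


def normalize_email(value: str) -> str:
    # Assumed normalization: strip surrounding whitespace and lower-case.
    return value.strip().lower()


def salted_hash(value: str, salt: bytes) -> bytes:
    """SHA-512 over the salt followed by the value (64 bytes, i.e. 512 bits)."""
    # Assumption: the salt is prepended; the real specification may differ.
    return hashlib.sha512(salt + value.encode("utf-8")).digest()


# Stand-in for the per-field 128-bit salt shared by all Contributors.
email_salt = secrets.token_bytes(16)

h1 = salted_hash(normalize_email("Jane.Doe@Example.com "), email_salt)
h2 = salted_hash(normalize_email("jane.doe@example.com"), email_salt)
```

Because both inputs normalize to the same string and use the same field salt, `h1` and `h2` are identical, which is what allows two Custodians to match the same email address after hashing.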
Variable bit-length slicing
The Contributor Node slices each 512-bit PII hash into 16 pre-allocated slices of 32 bits each. Each slice in a predetermined selection is mapped to a different Matcher Node. Any slice not mapped to a Matcher Node is discarded. The hash slices may be further shortened, in order to ensure that there is a probability that some slices will have “collisions” within an individual Matcher Node. A collision is when two different inputs (e.g. two different email addresses) have the same value for a particular slice.
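A minimal sketch of slicing, allocation and shortening is shown below. It assumes 32-bit slices of a 512-bit SHA-512 digest, a made-up allocation table (`ALLOCATION`) and shortening by keeping the leading bits of a slice; none of these details are confirmed by the product documentation.

```python
import hashlib

SLICE_BITS = 32  # assumed slice width; a 512-bit digest then yields 16 slices

# Hypothetical allocation: slice index -> Matcher Node name.
# Slices whose index is not listed here are simply dropped.
ALLOCATION = {0: "matcher-1", 5: "matcher-2", 11: "matcher-3"}


def slice_hash(digest: bytes) -> list[int]:
    """Cut a digest into consecutive fixed-width slices, returned as integers."""
    step = SLICE_BITS // 8
    return [
        int.from_bytes(digest[i:i + step], "big")
        for i in range(0, len(digest), step)
    ]


def shorten(slice_value: int, keep_bits: int) -> int:
    """Keep only the leading keep_bits bits of a slice (assumed truncation rule)."""
    return slice_value >> (SLICE_BITS - keep_bits)


def distribute(digest: bytes, keep_bits: int) -> dict[str, int]:
    """Return the shortened slice destined for each Matcher Node."""
    slices = slice_hash(digest)
    return {node: shorten(slices[idx], keep_bits) for idx, node in ALLOCATION.items()}


digest = hashlib.sha512(b"example-salt" + b"jane.doe@example.com").digest()
outgoing = distribute(digest, keep_bits=20)
```

Each Matcher Node in `outgoing` receives a single 20-bit value, so no node holds enough of the digest to reconstruct it.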
Decentralized storage of hashed and sliced PII fragments
The Matcher Node is the technical component in the Data Republic Matching network that stores hashed slices of PII during the tokenization process. No single Matcher Node can contain an entire hashed field value for PII. Even if a Matcher Node is compromised, only a fragment of a hash could be extracted, significantly reducing the risk of exposure. When a request for matching is made, the Matcher Node compares hash slices for each token and returns token pairs to an Aggregator Node.
Matcher Node process for managing hashed PII
Once the PII has been prepared using the Contributor Node, the hashed data is sliced into a number of “slices”. Some of these slices are distributed to Matcher Nodes, at which point the original hash values are discarded by the Contributor Node.
- In Privacy-Preserving Matching, each PII field has a hash value 512 bits long. Each of these hashes is individually cut into 16 slices of 32 bits each. A pre-determined subset of these slices is distributed to, and stored across, Data Republic’s Matcher Nodes. The “full” hashes are discarded once the slices have been distributed.
- A Matcher Node will only have an encrypted token which relates to a particular individual within a particular dataset and does not have the relevant key to decrypt the token. The encryption key for the token is unique to each combination of Contributor Node, token database, field and slice.
- The slices selected for distribution are sent to their pre-allocated Matcher Nodes, which means, by way of example, that all of the "part one" slices of email addresses from all Contributors will be sent to "Matcher Node 1". Unallocated slices are permanently discarded.
- A Matcher Node may receive multiple slices, but never from the same field. So “Matcher Node 1” might receive slice 1 of all email addresses, and slice 3 of all phone numbers. These slices are then stored separately; they are not concatenated or combined. The associated tokens are encrypted with different keys for each field slice, so it is not possible for a Matcher Node to know which slices from the different fields belong to the same token.
- The slices sent to Matcher Nodes may be shorter than 32 bits, because Privacy-Preserving Matching is designed to guarantee a false positive error rate inside each Matcher Node. This means that each slice within the Matcher Node does not have enough uniqueness to identify an individual on its own.
- Privacy-Preserving Matching utilizes "variable length slicing" which means that different Contributors will use slices of different sizes, depending on the size of the dataset uploaded by that Custodian (i.e. the number of unique customers). The Matcher Node is still able to match slices of different lengths, guaranteeing false positives and further reducing re-identification risk.
Critically, the steps above are carried out entirely on the Data Custodian’s systems, and at no time does Data Republic receive or process any PII from the raw dataset that the Custodian uploaded. In the case of the Matcher Nodes, no party (not even Data Republic) is able to reconstruct the full hash of the original PII, even in the event that the salt value is known by an attacker.
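To make the idea of variable-length slicing more concrete, the sketch below picks a slice length from the dataset size and compares two slices of different lengths. The sizing rule is purely an assumption for illustration: with n unique values and a b-bit slice, a given record shares its slice value with roughly (n - 1) / 2^b other records, so the sketch chooses the largest b that keeps this figure at or above a target. The prefix comparison also assumes that shortening keeps the leading bits. Neither is stated in the product documentation.

```python
import math


def max_slice_bits(num_records: int,
                   min_expected_collisions: float = 1.0,
                   full_bits: int = 32) -> int:
    """Largest slice length b with (num_records - 1) / 2**b >= the target.

    Hypothetical sizing rule, clamped to the range 1..full_bits.
    """
    if num_records < 2:
        return 1
    b = math.floor(math.log2((num_records - 1) / min_expected_collisions))
    return max(1, min(full_bits, b))


def slices_match(a: int, a_bits: int, b: int, b_bits: int) -> bool:
    """Compare two slices of different lengths on their shared leading bits."""
    common = min(a_bits, b_bits)
    return (a >> (a_bits - common)) == (b >> (b_bits - common))


small_dataset_bits = max_slice_bits(1_025)       # about a thousand customers
large_dataset_bits = max_slice_bits(1_000_001)   # about a million customers
```

Under this assumed rule a larger dataset can use longer slices while still keeping collisions likely, and two Contributors with different slice lengths are compared only on the bits they have in common.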
Executes match requests and filters results from Matcher Nodes
As soon as a match request is authorised in the Data Republic platform, an Aggregator Node communicates with the Matcher Nodes to generate lists of token pairs that may match.
Finally, the aggregator node filters out false positives and provides a final match table to the Data Republic platform. The Data Republic platform loads a masked version of the token pair table into a Workspace for analysis.
- The Aggregator Node is operated by Data Republic in the same secure environment as the Data Republic platform hosting the data project. Therefore, Singapore-based data projects use an Aggregator Node hosted in Singapore, and Australian projects use a node hosted in Sydney, Australia.
- The Aggregator Node has the list of Matcher Nodes that exist in the network and their network addresses, in the form of domain names. The Aggregator Node connects to the Matcher Nodes over encrypted HTTPS and presents a private certificate to prove that it is the authorized Aggregator Node for that region (known as two-way Secure Sockets Layer, or mutual Transport Layer Security).
- The Aggregator Node sends the Matcher Nodes the relevant database account numbers for the two Contributor databases which are to be matched. These are globally unique identifiers (GUIDs), which are not encrypted but are randomly generated and might, by way of example, take the form "2cacf2fe-ffb1-404c-8617-b7e5df12301b".
- The Matcher Nodes will then begin the process of comparing the matching “slices” and reply with a list of potential token matches (potential because the Matcher Nodes do not know which might be false positives). These tokens are encrypted, and the Matcher Nodes do not have the key. The Aggregator Node, which does have the encryption keys, can decrypt these token pairs and filter out any false positives, which it does based on "votes" received from the Matcher Nodes as to whether two tokens should be paired.
- Finally, the Aggregator Node returns the list of token pairs to the Data Republic platform, allowing Data Republic to assemble the full encrypted and matched attribute dataset that the Analyst has requested.
The result of this process is a “token matching table” which consists of a list of Token ID pairs (one token from the first Contributor, paired with one token from the second).
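The vote-based filtering performed by the Aggregator Node can be pictured with the sketch below. It assumes the token pairs have already been decrypted, that each Matcher Node contributes one vote per candidate pair, and that a pair is kept when it reaches a fixed vote threshold; the node names, token values and threshold are invented for the example.

```python
from collections import Counter


def aggregate_votes(candidates_by_node: dict[str, set[tuple[str, str]]],
                    min_votes: int) -> list[tuple[str, str]]:
    """Keep token pairs proposed by at least min_votes Matcher Nodes."""
    votes: Counter = Counter()
    for pairs in candidates_by_node.values():
        votes.update(pairs)  # one vote per node per candidate pair
    return sorted(pair for pair, count in votes.items() if count >= min_votes)


# Hypothetical candidate pairs (token from Contributor A, token from Contributor B).
candidates = {
    "matcher-1": {("A1", "B7"), ("A2", "B9")},
    "matcher-2": {("A1", "B7"), ("A3", "B4")},
    "matcher-3": {("A1", "B7"), ("A2", "B9")},
}

token_matching_table = aggregate_votes(candidates, min_votes=3)
```

Here only the pair ("A1", "B7") is proposed by all three nodes, so it is the only row in `token_matching_table`; pairs that appear on just one or two nodes are treated as likely slice collisions and dropped.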
Managing matching projects on the Data Republic platform
A data match using Privacy-Preserving Matching can be requested via an approved data license in the Data Republic platform.
For more information on how to create a data license please see Creating and Approving a Data License.