What is Senate Matching?
Senate Matching is Data Republic’s service (SaaS) for Privacy Preserving Record Linkage (PPRL). Using this service, Data Custodians generate randomized tokens to replace Personally Identifiable Information (PII) in data uploaded to Senate. This ensures that data in Senate is not directly tied to an individual’s identity.
Using Senate Matching, organizations can de-identify data while still preserving the ability to match records between de-identified datasets.
Matching activities are governed as part of projects on the Senate platform and require both the consent of the Data Custodian as well as an approved license.
For more information, please see What is Senate Matching?
What sets Senate Matching apart? Why should we use Senate Matching?
- Private by Design - Unique privacy-preserving technology allows organizations to link or match records without customer personal information needing to leave secure organizational environments.
- Control over data assets - Senate’s Governance Framework ensures that Data Custodians retain full control and enforceable rights over who can access the data and how it is used through the data licence.
- Control over uses of data - All match requests and proposed matching applications are subject to Data Custodian consent and an approved license. Analysts in workspaces never receive PII, only tables which indicate where a match has occurred against one of their own tokens.
- Decentralization of Data - Data Republic’s Senate Matching Platform uses a decentralized solution to ensure that there is no 'token honeypot' or single point of vulnerability. Even in the unlikely event of a matcher node breach, only small token slices would be recoverable, with no means for these tokens to be reconstituted in order to pose a re-identification risk.
Where does my data go? Who can access my PII?
This decentralized process ensures that PII does not leave a contributor's environment when preparing data for Senate projects. The original PII remains within the contributor's internal environment (that is, behind their firewall). No one can access your PII, not even Data Republic, because PII never goes onto the Senate Platform.
Authorized analysts in workspaces, who have received approval to conduct a matching project, never receive PII, only tables which indicate where a match has occurred against one of their own tokens.
How does Senate Matching Work?
Senate Matching Process
Senate Matching software called a Contributor Node is installed behind your organization’s firewall.
Data Custodians prepare data for Senate projects by loading it to the Contributor Node. The node assigns a randomized token to each customer record and returns the token to the contributor. The Contributor Node cleanses the data by normalising the PII (e.g. email is converted to lowercase and spaces are removed) and fixing common formatting differences. The node then salts and hashes the PII fields, divides each hash value into slices, discards bits from those slices, and distributes these small slices and their corresponding token amongst matcher nodes on the Senate platform. Tokens are encrypted with keys unique to each Contributor, with a different key being used for each slice.
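The normalisation step described above can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, and it covers just the two email rules mentioned in the text (lowercasing and removing spaces), not the full set of cleansing rules the Contributor Node applies.

```python
def normalise_email(raw: str) -> str:
    """Lowercase an email address and remove all whitespace.

    Hypothetical helper: only the two rules named in the text are applied.
    """
    return "".join(raw.split()).lower()


# Two differently formatted inputs normalise to the same value,
# so they would produce the same hash later in the process.
print(normalise_email("  Jane.Doe @Example.COM "))  # jane.doe@example.com
print(normalise_email("jane.doe@example.com"))      # jane.doe@example.com
```

Normalising before hashing matters because a hash function treats "Jane@Example.com" and "jane@example.com" as entirely different inputs.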
Matching activities are governed as part of projects on the Senate platform and require both the consent of the Data Custodian and an approved license. Matches are performed through a series of calls across the matching network in which the hash slices associated with each token are verified.
Even if a matcher node is compromised, PII is not recoverable. Only a fragment of a hash could be extracted, with no way to tie that to an identity. On average, a single hash slice will collide with 30 unrelated tokens. In privacy research, this property is referred to as “k-anonymity”.
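As a rough illustration of why a short hash slice collides with many unrelated records: if slice values are assumed to be uniformly distributed, a database of N records and a slice of b bits gives about N / 2^b records per slice value. The helper below is a hypothetical back-of-the-envelope calculation under that assumption, not the calculation the Contributor Node actually performs.

```python
def expected_records_per_slice_value(n_records: int, slice_bits: int) -> float:
    """Average number of records sharing one slice value.

    Assumes slice values are uniformly distributed over 2**slice_bits
    possibilities (an assumption made for this sketch).
    """
    return n_records / (2 ** slice_bits)


# With 122,880 records and 12-bit slices, each slice value is shared
# by 30 records on average.
print(expected_records_per_slice_value(122_880, 12))  # 30.0
```

Under this assumption, shortening a slice by one bit doubles the number of records that share each value.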
Senate Matching Workflow Diagram:
What is a Contributor Node?
The Contributor Node is the technical component of the Senate Matching service that generates Tokens, hashes PII, and distributes hash slices to Matcher Nodes.
The Contributor Node is installed within the Data Custodian’s own IT environment so that PII never leaves the Data Custodian. PII data can be uploaded via the Contributor Node to be tokenized. However, no PII data is stored within the Contributor Node.
Do Senate Matching Contributor Nodes support SAML integration?
No, Senate Matching does not currently support SAML integration. Support for SAML 2 is being investigated.
Do Senate Matching Contributor Nodes support 2FA?
Support for 2FA is intended to be delivered by supporting SAML integration with an organisation's own Identity Provider (IdP). As long as the IdP supports multi-factor authentication, you will be able to use it with Senate Matching.
What is the possibility to link the slices back to the Contributor Node?
In order to link slices back to a specific Contributor, an attacker would need to know the database UUIDs of the target Contributors and have access to one or more Matcher Node databases. There is currently no known way to do this. In any event:
- The hash slice values would be too short to identify an individual;
- The token values are encrypted with a different key for each contributor and matcher node, so hash slice values cannot be joined between matcher databases;
- A Contributor can delete a token database at any time and purge all data, including the database UUID values, from the matcher network.
How can you stop unauthorized Contributor Nodes seeding incorrect matching data?
The Contributor Node's TLS certificate authorizes the node to update only the token databases associated with the contributing organization. In addition, token databases are identified by randomly generated UUIDs, which are not known to other contributors.
How do you patch Contributor Nodes after an update? How often do you patch? Do you tell me how to patch because it is in my environment?
Update notifications are distributed via email and the Data Republic Help Centre. Release Notes include instructions for applying the latest update, as well as information about the changes that have been made and whether any security-related patches are included.
Data Republic will not automatically apply patches to software running in your environment. The instructions will advise you how you can apply required changes.
What is a Matcher Node?
The Matcher Node is the technical component in the Data Republic Matching network that stores hash slices of PII during the tokenization process. Slices are distributed amongst the different Matcher Nodes, so no single Matcher Node can contain an entire hashed field value for PII.
When a request for matching is made, each Matcher Node compares hash slices for each token and returns Token pairs to an Aggregator Node. The Aggregator Node retains the matched tokens common to all Matcher Nodes and provides them to the requestor in the form of a token match table, allowing users to perform their own table join for matches.
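The aggregation step can be pictured as a set intersection. The sketch below is a simplified illustration with hypothetical names: it represents each Matcher Node's result as a set of (token, token) pairs in plain text and ignores the per-node token encryption described elsewhere in this article.

```python
from typing import List, Set, Tuple

TokenPair = Tuple[str, str]


def aggregate_matches(per_node_pairs: List[Set[TokenPair]]) -> Set[TokenPair]:
    """Keep only the token pairs that every matcher node reported."""
    if not per_node_pairs:
        return set()
    return set.intersection(*per_node_pairs)


# Each node sees only one short slice, so it reports some false matches.
node_a = {("t1", "u1"), ("t2", "u7"), ("t3", "u3")}
node_b = {("t1", "u1"), ("t3", "u3"), ("t4", "u9")}
node_c = {("t1", "u1"), ("t3", "u3"), ("t2", "u7")}

print(sorted(aggregate_matches([node_a, node_b, node_c])))
```

A pair that collides by chance on one slice is unlikely to collide on every slice, so requiring agreement across all nodes filters out the false matches.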
What data gets sent to the Matcher Nodes?
Each customer record consists of one or more fields, and each field consists of multiple slices. Slices are distributed amongst the different matcher nodes. For each slice, the Contributor node will send:
- An encrypted token (token is encrypted with a different key for each contributor and matcher node);
- UUID of the Contributor's token database (randomly generated - it does not identify the contributor);
- Hash slice value (the portion of the hash, may be 6-17 bits long depending on database size);
- A field identifier for this hash slice (e.g. "field 1, slice 2");
- The bit length of the hash slice.
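The items listed above could be modelled as a simple record. The following is an illustrative data structure only: the class and field names are hypothetical and do not describe the actual wire format used between Contributor Nodes and Matcher Nodes.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceRecord:
    encrypted_token: bytes    # token encrypted with a per-contributor, per-matcher key
    database_uuid: uuid.UUID  # random UUID of the Contributor's token database
    hash_slice: int           # the retained portion of the hash
    field_index: int          # which PII field the slice belongs to
    slice_index: int          # which slice of that field's hash
    bit_length: int           # number of bits kept in hash_slice

    def __post_init__(self) -> None:
        # The slice value must fit in the declared number of bits.
        if not 0 <= self.hash_slice < (1 << self.bit_length):
            raise ValueError("hash_slice does not fit in bit_length bits")


record = SliceRecord(
    encrypted_token=b"\x00" * 16,  # placeholder ciphertext
    database_uuid=uuid.uuid4(),
    hash_slice=0b101101110001,     # a 12-bit example value
    field_index=1,
    slice_index=2,
    bit_length=12,
)
print(record.field_index, record.slice_index, record.bit_length)
```

Note that nothing in the record contains PII or identifies the contributing organization by name.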
How do the Senate Matcher nodes authenticate with each other?
All connections between Senate Matcher Nodes use mutual TLS authentication.
Is there certificate-based authentication between Contributor Nodes and Matcher Nodes?
Yes. Contributor Nodes have their own certificate signed by Data Republic and this is checked by the Matcher Nodes when establishing the SSL/TLS link.
Can we have a private Matcher Network?
Private Matcher networks are being reviewed and scoped, but are not yet available.
In the interim, Contributors may want to consider agreeing on and using their own salt values. Since data can be pre-hashed prior to uploading to a Contributor Node, a Contributor could choose any salt value they wanted. This value would only have to be known by the other Contributors that intend to match.
This would rule out matching with partners who do not use the agreed salt, but that may be a trade-off worth making for some Data Custodians.
How does the tokenization and hashing process work?
After the data is uploaded into the Contributor Node, the node generates a randomized token. A token is a single string of random letters and numbers. This token is not derived from the data itself. Tokens do not hold any PII and can be appended to attribute datasets on Senate, allowing approved datasets to be matched at an individual level without the use of PII on Senate. Data Custodians may choose to download the tokens if they wish to later re-identify the tokenized individuals (subject to consumer consent and their privacy policies).
The PII fields are then prepared for hashing. The hashing algorithm used is SHA-512, as it is cryptographically strong and resistant to known attacks against older hashing algorithms. A salt (a random string, different for each field) is appended to the end of the plain-text field prior to hashing. The salt value is only distributed to Senate Matching by known Contributor Nodes.
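A minimal sketch of the salt-and-hash step, using Python's standard hashlib module. The function name and the example salt strings are hypothetical; only the use of SHA-512 and the appending of the salt to the end of the plain-text field come from the description above.

```python
import hashlib


def salted_sha512(field_value: str, salt: str) -> bytes:
    """Append the salt to the plain-text field and return the SHA-512 digest."""
    return hashlib.sha512((field_value + salt).encode("utf-8")).digest()


digest = salted_sha512("jane.doe@example.com", "example-email-salt")
print(len(digest) * 8)  # SHA-512 always yields 512 bits

# The same field hashed with a different salt gives an unrelated digest.
other = salted_sha512("jane.doe@example.com", "another-salt")
print(digest == other)  # False
```

Because the salt is part of the hashed input, two parties only produce the same hash for the same field value if they use the same salt.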
Next, the hash is sliced, shortened and distributed as fragments across the matcher network. Each hash is split into 16 slices of 32 bits each (16 × 32 = 512 bits). The Contributor Node will then make two calculations:
- How short should a slice be, to increase the chances that a hash slice will collide with (have the same value as) the slices of a large number of completely unrelated individuals?
- How many slices should be sent, to allow the Aggregator Node to filter out all the false matches?
In a typical scenario, an individual hash slice might be 12-14 bits long, with 4-5 slices sent to the matcher nodes. The rest are discarded.
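The slicing and shortening steps can be sketched as below. The 16 × 32-bit split follows the description above; which bits of each slice are kept and which slices are sent are assumptions made for this sketch (here, the leading bits of the first few slices), and the function names are hypothetical.

```python
import hashlib
from typing import List

SLICE_COUNT = 16
SLICE_BYTES = 4  # 32 bits per slice


def split_hash(digest: bytes) -> List[int]:
    """Split a 512-bit digest into 16 integers of 32 bits each."""
    if len(digest) != SLICE_COUNT * SLICE_BYTES:
        raise ValueError("expected a 64-byte SHA-512 digest")
    return [
        int.from_bytes(digest[i * SLICE_BYTES:(i + 1) * SLICE_BYTES], "big")
        for i in range(SLICE_COUNT)
    ]


def shorten(slices: List[int], keep_bits: int, keep_count: int) -> List[int]:
    """Keep the leading keep_bits of the first keep_count slices (assumed rule)."""
    shift = SLICE_BYTES * 8 - keep_bits
    return [value >> shift for value in slices[:keep_count]]


digest = hashlib.sha512(b"jane.doe@example.com" + b"example-salt").digest()
slices = split_hash(digest)
sent = shorten(slices, keep_bits=12, keep_count=4)

print(len(slices), len(sent))            # 16 4
print(all(v < 2 ** 12 for v in sent))    # True
```

In this example only 4 × 12 = 48 of the original 512 bits are retained, which is why the original hash cannot be rebuilt from the fragments.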
Because the tokens are also encrypted (with a different key for each Contributor and each slice) and because much of the original hash value is discarded, it is not possible to “put the hash back together.” If fragments are ever leaked, the PII cannot be reconstructed, nor can an individual be re-identified.
What is data matching?
Data matching is the process of record linkage across multiple datasets by comparing shared data. Senate Matching is designed to facilitate this kind of matching, but without either party ever having to disclose the contents of the shared data (usually, customer PII). Since Personally Identifiable Information (PII) is not allowed on the Senate Platform, data matching is performed against tokenized datasets made available by Data Custodians (data owners) on Senate. All matching activities are subject to customer consent and Data Custodians must have the applicable privacy policies and collection statements in place.
What is Personally Identifiable Information (PII)?
Personally Identifiable Information (PII) is any information that can be used to reasonably identify a single person. This may include data such as an email address, street address, driver's license number, phone number, and social security number. However, it may also include information that does not directly reference a person but could be used to re-identify someone when combined with other details, for example IP addresses, location history or employment history.
What is a token?
A token is a single string of seemingly random letters and numbers generated by the Senate Matching service for each person. Tokens do not hold any PII and can be appended to attribute datasets on Senate, allowing approved datasets to be matched at an individual level without the use of PII on Senate.
How are tokens used in data matching?
By adding tokens to de-identified datasets on Senate, tokens across two datasets can be matched in Senate without risking PII. The results of a data match can be used to build matched data products on Senate that will help organizations to enrich customer data views, unlock behavioral patterns for targeted demographics or reveal a targeted audience.
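As an illustration of how a token match table might be used, the sketch below joins two de-identified attribute datasets through a list of matched token pairs. All names, columns and values are hypothetical, and the table layout is an assumption made for this example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def join_on_matches(left: List[Row], right: List[Row],
                    matches: List[Tuple[str, str]]) -> List[Row]:
    """Combine attribute rows whose tokens appear as a pair in the match table."""
    left_by_token = {row["token"]: row for row in left}
    right_by_token = {row["token"]: row for row in right}
    joined = []
    for left_token, right_token in matches:
        l_row = left_by_token.get(left_token)
        r_row = right_by_token.get(right_token)
        if l_row is None or r_row is None:
            continue  # a token in the match table is missing from a dataset
        # Drop the tokens themselves; keep only the attributes.
        merged = {k: v for k, v in l_row.items() if k != "token"}
        merged.update({k: v for k, v in r_row.items() if k != "token"})
        joined.append(merged)
    return joined


retailer = [{"token": "A1", "segment": "frequent"},
            {"token": "A2", "segment": "new"}]
insurer = [{"token": "B7", "policy": "home"}]
match_table = [("A1", "B7")]

print(join_on_matches(retailer, insurer, match_table))
```

The join operates only on tokens and attributes, so no PII is needed to produce the matched result.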