What is Senate Matching?
Senate Matching is Data Republic’s service (SaaS) for Privacy Preserving Record Linkage (PPRL). Using this service, Data Custodians generate randomized tokens to replace Personally Identifiable Information (PII) in data uploaded to Senate. This ensures that data in Senate is not directly tied to an individual’s identity.
Using Senate Matching, organizations can de-identify data while still preserving the capability to match records across de-identified datasets.
Matching activities are governed as part of projects on the Senate platform and require both the consent of the Data Custodian and an approved license.
For more information, please see What is Senate Matching?
What sets Senate Matching apart? Why should we use Senate Matching?
- Private by Design - Unique privacy-preserving technology allows organizations to link/match records without customer personal information needing to leave secure organizational environments.
- Control over data assets - Senate’s Governance Framework ensures that Data Custodians retain full control and enforceable rights over who can access the data and how it is used through the data license.
- Control over uses of data - All match requests and proposed matching applications are subject to Data Custodian consent and an approved license. Analysts in workspaces never receive PII, only tables which indicate where a match has occurred against one of their own tokens.
- Decentralization of Data - Data Republic’s Senate Matching Platform uses a decentralized solution to ensure that there is no 'token honeypot' or single point of vulnerability. Even in the unlikely event of a matcher node breach, only small token slices would be recoverable, with no means for these tokens to be reconstituted in order to pose a re-identification risk.
Where does my data go? Who can access my PII?
This decentralized process ensures that PII does not leave a contributor’s environment when preparing data for Senate projects. The original PII data remains within the contributor’s internal environment (i.e. behind their firewall). No one will be able to access your PII, not even Data Republic, as PII does not go onto the Senate Platform.
Authorized analysts in workspaces, who have received approval to conduct a matching project, never receive PII, only tables which indicate where a match has occurred against one of their own tokens.
How does Senate Matching Work?
Senate Matching Process
Senate Matching software called a Contributor Node is installed behind your organization’s firewall.
Data Custodians prepare data for Senate projects by loading it to the Contributor Node. The node assigns a randomized token to each customer record and returns the token to the contributor. The Contributor Node cleanses the data by normalizing the PII (e.g. email is converted to lowercase and spaces are removed) and fixing common formatting differences. The node then salts and hashes the PII fields, divides each hash value into slices, discards bits from those slices, then distributes these small slices and their corresponding tokens amongst Matcher Nodes on the Senate platform. Tokens are encrypted with keys unique to each Contributor, with a different key being used for each slice.
Matching activities are governed as part of projects on the Senate platform and require both the consent of the Data Custodian and an approved license. Matches are performed through a series of calls across the matching network where the hashed slices are verified.
Even if a Matcher Node is compromised, PII is not recoverable. Only a fragment of a hash could be extracted, with no way to tie that to an identity. On average, a single hash slice will collide with 30 unrelated tokens. In privacy research, this property is referred to as “k-anonymity”.
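The cleansing step can be illustrated with a small sketch. It only covers the email example mentioned above (lowercasing and removing spaces); the function name `normalize_email` is hypothetical, and the Contributor Node's actual normalization rules for other field types are not described here.

```python
def normalize_email(raw: str) -> str:
    """Lowercase an email address and strip all whitespace.

    Illustrative only: mirrors the email example in the text, not the
    Contributor Node's full set of cleansing rules.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes leading, trailing and inner spaces.
    return "".join(raw.split()).lower()


# Two differently formatted inputs end up with the same normalized value,
# which is what allows them to produce the same hash later on.
print(normalize_email("  Jane.Doe @Example.COM "))  # jane.doe@example.com
print(normalize_email("jane.doe@example.com"))      # jane.doe@example.com
```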
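To get a feel for where a figure like "30 unrelated tokens" can come from, here is a rough back-of-the-envelope sketch. It assumes hash slices are uniformly distributed and uses a made-up dataset size of 250,000 records; the function names are hypothetical and this is not the calculation the Contributor Node actually performs.

```python
def expected_collisions(num_records: int, slice_bits: int) -> float:
    """Expected number of OTHER records sharing a given slice value,
    assuming slice values are uniformly distributed."""
    return (num_records - 1) / 2 ** slice_bits


def max_slice_bits(num_records: int, target_k: int) -> int:
    """Longest slice (in bits) that still collides, on average, with at
    least target_k unrelated records. Returns 0 if no length qualifies."""
    bits = 0
    while expected_collisions(num_records, bits + 1) >= target_k:
        bits += 1
    return bits


# Hypothetical network of 250,000 records and a target of 30 collisions.
bits = max_slice_bits(250_000, 30)
print(bits, round(expected_collisions(250_000, bits), 1))
```

Under these assumptions a shorter slice means more collisions (more anonymity per slice), while a longer slice means fewer false matches for the network to filter out.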
Senate Matching Workflow Diagram:
What is a Contributor Node?
The Contributor Node is the technical component of the Senate Matching service that generates Tokens, hashes PII, and distributes hash slices to Matcher Nodes.
The Contributor Node is installed within the Data Custodian’s own IT environment so that PII never leaves the Data Custodian. PII data can be uploaded via the Contributor Node to be tokenized. However, no PII data is stored within the Contributor Node.
What is a Matcher Node?
The Matcher Node is the technical component in the Data Republic Matching network that stores the hash slices of PII produced during the tokenization process. Each Matcher Node holds only some of the slices, so no single Matcher Node can contain an entire hashed field value for PII.
When a request for matching is made, each Matcher Node compares the hash slices for each token and returns Token pairs to an Aggregator Node. The Aggregator Node retains the matched tokens common to all Matcher Nodes and provides them to the requestor in the form of a token match table, allowing users to perform their own table join for matches.
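The aggregation step amounts to a set intersection: a token pair only counts as a match if every Matcher Node reports it. The sketch below is a simplified illustration with made-up token values and a hypothetical function name `aggregate_matches`; it is not the Aggregator Node's actual implementation.

```python
def aggregate_matches(per_node_pairs):
    """Keep only the token pairs reported by every Matcher Node.

    per_node_pairs: a list with one collection of (token_a, token_b)
    pairs per Matcher Node.
    """
    if not per_node_pairs:
        return set()
    return set.intersection(*(set(pairs) for pairs in per_node_pairs))


# Made-up results from three Matcher Nodes. Because slices are short,
# each node reports some false matches, but only the true match
# appears in all three result sets.
node_1 = {("tokA1", "tokB7"), ("tokA2", "tokB9"), ("tokA3", "tokB4")}
node_2 = {("tokA1", "tokB7"), ("tokA3", "tokB2")}
node_3 = {("tokA1", "tokB7"), ("tokA2", "tokB9")}

print(aggregate_matches([node_1, node_2, node_3]))  # {('tokA1', 'tokB7')}
```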
How does the tokenization and hashing process work?
After the data is uploaded into the Contributor Node, the node generates a randomized token for each record. A token is a single string of random letters and numbers. This token is not derived from the data itself. Tokens do not hold any PII and can be appended to attribute datasets on Senate, allowing approved datasets to be matched at an individual level without the use of PII on Senate. Data Custodians may choose to download the tokens if they wish to later re-identify the tokenized individuals (subject to consumer consent and their privacy policies).
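The key property here is that a token is random rather than computed from the record. A minimal sketch of that idea, using Python's standard `secrets` module, is shown below. The token length, the hexadecimal format, and the names `new_token` and `customer_ids` are assumptions for illustration; the actual token format used by Senate Matching is not specified here.

```python
import secrets


def new_token(num_bytes: int = 16) -> str:
    """Return a random hexadecimal string (2 characters per byte).

    The token is drawn from a cryptographically secure random source and
    takes no input from the record, so it cannot leak any PII.
    """
    return secrets.token_hex(num_bytes)


# Hypothetical internal record identifiers kept by the Data Custodian.
customer_ids = ["cust-001", "cust-002", "cust-003"]

# The custodian can keep this mapping privately for later re-identification.
token_by_customer = {cid: new_token() for cid in customer_ids}
print(token_by_customer)
```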
The PII fields are then prepared for hashing. The hashing algorithm used is SHA-512, as it is cryptographically strong and resistant to known attacks against older hashing algorithms. A salt (a random string, different for each field) is appended to the end of the plain text field prior to hashing. The salt value is only distributed to Senate Matching by known Contributor Nodes.
Next is the process of slicing, shortening and distributing hash fragments across the known network. Hashes are split into 16 slices, containing 32 bits each. The Contributor Node will then make two calculations:
- How short should a slice be, to increase the chances that a hash slice will collide with (have the same value as) the slices of a large number of completely unrelated individuals?
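The salting and hashing step can be sketched with Python's standard `hashlib` module. The salt values and the function name `salted_hash` below are made up for illustration, and the sketch assumes the input has already been normalized.

```python
import hashlib


def salted_hash(normalized_value: str, salt: str) -> bytes:
    """Append the salt to the plain-text value and hash with SHA-512.

    Returns the raw 64-byte (512-bit) digest.
    """
    return hashlib.sha512((normalized_value + salt).encode("utf-8")).digest()


# Hypothetical per-field salts (a different random string for each field).
EMAIL_SALT = "example-email-salt"
PHONE_SALT = "example-phone-salt"

digest = salted_hash("jane.doe@example.com", EMAIL_SALT)
print(len(digest) * 8)  # 512 bits

# The same value and salt always give the same digest, which is what makes
# matching possible; a different salt gives an unrelated digest.
print(digest == salted_hash("jane.doe@example.com", EMAIL_SALT))  # True
print(digest == salted_hash("jane.doe@example.com", PHONE_SALT))  # False
```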
- How many slices should be sent, to allow the Aggregator Node to filter out all the false matches?
In a typical scenario, an individual hash slice might be 12-14 bits long, with 4-5 slices sent to the matcher nodes. The rest are discarded.
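The slicing and shortening described above can be sketched as follows. The text does not say which bits of a slice are kept or which slices are chosen for sending, so keeping the top bits and sending the first few slices are assumptions made purely for illustration, as are the function names and the default values of 13 bits and 5 slices.

```python
import hashlib
from typing import List

SLICE_COUNT = 16  # a 512-bit hash split into 16 slices...
SLICE_BITS = 32   # ...of 32 bits each


def split_hash(digest: bytes) -> List[int]:
    """Split a 64-byte SHA-512 digest into 16 integers of 32 bits each."""
    if len(digest) != SLICE_COUNT * SLICE_BITS // 8:
        raise ValueError("expected a 64-byte SHA-512 digest")
    return [int.from_bytes(digest[i:i + 4], "big") for i in range(0, 64, 4)]


def shorten(slice_value: int, keep_bits: int) -> int:
    """Keep only the top keep_bits bits of a 32-bit slice (assumption)."""
    return slice_value >> (SLICE_BITS - keep_bits)


def prepare_slices(digest: bytes, keep_bits: int = 13,
                   slices_to_send: int = 5) -> List[int]:
    """Shorten the first slices_to_send slices; the rest are discarded."""
    return [shorten(s, keep_bits) for s in split_hash(digest)[:slices_to_send]]


digest = hashlib.sha512(b"jane.doe@example.com" + b"example-salt").digest()
sent = prepare_slices(digest)
print(len(sent), "slices sent,", len(sent) * 13, "of 512 hash bits retained")
```

With these example settings only 65 of the 512 hash bits ever leave the Contributor Node, spread across five separate slices.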
Because the tokens are also encrypted (with a different key for each Contributor and each slice) and because much of the original hash value is discarded, it is not possible to “put the hash back together.” If fragments are ever leaked, the PII cannot be reconstructed, nor can an individual be re-identified.
What is data matching?
Data matching is the process of record linkage across multiple datasets by comparing shared data. Senate Matching is designed to facilitate this kind of matching, but without either party ever having to disclose the contents of the shared data (usually, customer PII). Since Personally Identifiable Information (PII) is not allowed on the Senate Platform, data matching is performed against tokenized datasets made available by Data Custodians (data owners) on Senate. All matching activities are subject to customer consent and Data Custodians must have the applicable privacy policies and collection statements in place.
What is Personally Identifiable Information (PII)?
Personally Identifiable Information (PII) is any information that can be used to reasonably identify a single person. This may include data such as an email address, street address, driver's license number, phone number, and social security number. However, it may also include information that does not directly reference a person but could be used to re-identify someone when combined with other details, for example IP addresses, location history or employment history.
What is a token?
A token is a single string of seemingly random letters and numbers generated by the Senate Matching service for each person. Tokens do not hold any PII and can be appended to attribute datasets on Senate, allowing approved datasets to be matched at an individual level without the use of PII on Senate.
How are tokens used in data matching?
By adding tokens to de-identified datasets on Senate, tokens across two datasets can be matched in Senate without risking PII. The results of a data match can be used to build matched data products on Senate that will help organizations to enrich customer data views, unlock behavioral patterns for targeted demographics or reveal a targeted audience.
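As an illustration of how a token match table is used, the sketch below joins two de-identified datasets through a list of matched token pairs. The dataset contents, column names, and the function name `join_on_matches` are all invented for this example; they do not reflect any particular Senate data product.

```python
def join_on_matches(left_rows, right_rows, match_table):
    """Join two tokenized datasets using (left_token, right_token) pairs.

    Each row is a dict with a "token" key plus attribute columns.
    No PII is involved: only tokens and attributes are combined.
    """
    left_by_token = {row["token"]: row for row in left_rows}
    right_by_token = {row["token"]: row for row in right_rows}
    joined = []
    for left_token, right_token in match_table:
        left = left_by_token.get(left_token)
        right = right_by_token.get(right_token)
        if left is None or right is None:
            continue  # pair refers to a token not present in these datasets
        row = {"left_token": left_token, "right_token": right_token}
        row.update({k: v for k, v in left.items() if k != "token"})
        row.update({k: v for k, v in right.items() if k != "token"})
        joined.append(row)
    return joined


# Made-up attribute datasets from two Data Custodians.
retailer = [{"token": "a1", "spend": 120}, {"token": "a2", "spend": 45}]
airline = [{"token": "b7", "flights": 3}, {"token": "b9", "flights": 1}]
matches = [("a1", "b7")]  # made-up token match table

print(join_on_matches(retailer, airline, matches))
```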