This article explains the tokenization, salting and hashing process in Senate Matching. If you are unfamiliar with Senate Matching, please review An introduction to Data Republic's Senate Matching technology.
A high-level overview of the end-to-end Senate Matching process
When it comes to PII management, Senate Matching means that PII never leaves your organization. Tokenization and hash slicing are performed before datasets are available for exchange on Senate.
Watch this video for insight into the tokenization, salting and hashing process.
Tokenization of personally identifiable information
Contributors upload customer data into a Contributor Node behind their firewall. The node assigns a randomised token and returns the token to the contributor. These tokens are downloaded as a file to be used for confirming a match on the Senate platform.
The token is completely random and does not reference the PII fields.
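As a minimal sketch of this idea, the Python below assigns each record a token drawn from a cryptographically secure random source, so the token carries no information about the PII. The function name assign_tokens, the 128-bit token length and the sample records are illustrative assumptions, not the Contributor Node's actual implementation.

```python
import secrets


def assign_tokens(records):
    """Return a {token: record} mapping with one random token per record."""
    token_map = {}
    for record in records:
        # 128 random bits rendered as 32 hex characters; the value is not
        # derived from the record in any way (hypothetical token format).
        token = secrets.token_hex(16)
        token_map[token] = record
    return token_map


# Hypothetical sample data for illustration only.
customers = [
    {"email": "alice@example.com"},
    {"email": "bob@example.com"},
]
token_map = assign_tokens(customers)
tokens = list(token_map)
```

Because the token is random rather than computed from the record, the only link between a token and a customer is the mapping that stays with the contributor.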
The salting and hashing process
After tokenization, PII fields are salted. A salt is a random string, different for each field, that is appended to the end of the plain-text field before hashing. The salt values are distributed by the Senate Matching network and are known only to Contributor Nodes.
The salted PII fields are then hashed (transformed by a one-way function that cannot be reversed) and split into small parcels of characters within the Data Custodian’s IT environment (via a Contributor Node). Each slice of the hash is then sent to a different Matcher Node for later matching. A network of Matcher Nodes within trusted organizations in Data Republic’s ecosystem then allows for decentralised matching calls and token retrieval.
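The salt-then-hash step can be sketched in a few lines of Python using the standard hashlib module. The salt values and the function name salt_and_hash are hypothetical placeholders; in practice the salts are distributed by the Senate Matching network, and any field normalization performed before hashing is not shown here.

```python
import hashlib


def salt_and_hash(field_value: str, salt: str) -> bytes:
    """Append the salt to the end of the plain-text field, then hash with SHA-512."""
    salted = field_value + salt
    return hashlib.sha512(salted.encode("utf-8")).digest()


# Hypothetical per-field salts; real values come from the Senate Matching network.
EMAIL_SALT = "hypothetical-email-salt"
PHONE_SALT = "hypothetical-phone-salt"

email_digest = salt_and_hash("alice@example.com", EMAIL_SALT)
phone_digest = salt_and_hash("+61 400 000 000", PHONE_SALT)
```

Each digest is 64 bytes (512 bits). The same field value and salt always produce the same digest, which is what allows two contributors holding the same salt to match on a shared customer.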
Slicing and distributing hash fragments
Holding an entire hash in one place creates a risk of re-identification. Senate Matching addresses this by slicing each hash and distributing the fragments across the node network. The design relies on a high chance of ‘collisions’: false positive matches where multiple email addresses have the same hash slice inside a given matcher node.
With Senate Matching, each hash is split into 16 slices of 32 bits each. Slices are sent to pre-allocated matcher nodes. For example, all the “part 1 of email address” slices from all Contributors go to “matcher node 1”.
Not all the slices are used. This is a security feature – it means that if the hash slices ever leak it is impossible to reconstruct the entire hash.
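To illustrate the slicing described above, the sketch below cuts a 512-bit SHA-512 digest into 16 slices of 32 bits (4 bytes) and routes slice n to "matcher node n". Which slices are actually used is not stated in this article, so USED_SLICE_NUMBERS is a made-up choice, and the salt and node labels are placeholders.

```python
import hashlib

SLICE_COUNT = 16
SLICE_BYTES = 4  # 32 bits per slice; 16 * 32 = 512 bits in total


def slice_digest(digest: bytes) -> list[bytes]:
    """Split a 64-byte digest into 16 consecutive 4-byte slices."""
    if len(digest) != SLICE_COUNT * SLICE_BYTES:
        raise ValueError("expected a 64-byte SHA-512 digest")
    return [
        digest[i * SLICE_BYTES:(i + 1) * SLICE_BYTES]
        for i in range(SLICE_COUNT)
    ]


# Hypothetical salted input, for illustration only.
digest = hashlib.sha512(b"alice@example.com" + b"hypothetical-salt").digest()
slices = slice_digest(digest)

# Assumption: only some slices are distributed; this subset is invented.
USED_SLICE_NUMBERS = [1, 2, 3, 4]
outbound = {f"matcher node {n}": slices[n - 1] for n in USED_SLICE_NUMBERS}
```

In this sketch 12 of the 16 slices never leave the contributor's environment, so even someone who collected every outbound slice would hold only 128 of the 512 bits.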
Senate Matching uses variable-length slicing, meaning different contributors use different-sized slices depending on how many unique customers they have. This guarantees false positive matches inside a matcher node, which controls re-identification risk.
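The relationship between slice size and false positives can be made concrete with some simple arithmetic. If hash values are spread evenly, a contributor with N unique customers and slices of b bits will see on average N / 2^b customers sharing each slice value. The sketch below picks the widest slice that keeps this average at or above a target. The target, the function names and the selection rule are assumptions for illustration; the article does not state the rule Senate Matching actually applies.

```python
def expected_records_per_slice_value(unique_customers: int, slice_bits: int) -> float:
    """Average number of customers sharing one slice value, assuming uniform hashes."""
    return unique_customers / 2 ** slice_bits


def max_slice_bits(unique_customers: int, min_records_per_value: int) -> int:
    """Widest slice (in bits) whose average sharing stays at or above the target."""
    if min_records_per_value < 1 or unique_customers < min_records_per_value:
        raise ValueError("not enough customers to reach the target")
    # floor(log2(N / k)) computed with integers to avoid rounding error.
    return (unique_customers // min_records_per_value).bit_length() - 1


# Hypothetical contributors and a hypothetical target of 10 customers per value.
large_bits = max_slice_bits(1_000_000, 10)
small_bits = max_slice_bits(10_000, 10)
```

Under these assumptions a contributor with fewer customers needs a narrower slice to reach the same level of collisions, which is one way to read "different sized slices, depending on how many unique customers they have".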
What is hashing and what method is used to tokenize PII?
Hashing is a one-way cryptographic function that turns a PII field into a single string of seemingly random letters and numbers. SHA-512 is the secure hashing method used by the Senate Matching service.
SHA-512 is fast to compute, well understood, and supported by a wide variety of programming languages. SHA-512 is preferred over SHA-256 because of its greater resistance to certain kinds of advanced attacks.
We also add salt (random data) to the input before hashing to defeat a common kind of attack against hashing called a rainbow table attack, where an attacker pre-calculates millions of hashed values to use as a lookup table if they ever come across a hash value and want to reverse it. Note that this attack is already greatly weakened by the fact that the Senate Matching service splits hashes into slices and sends them to different Matcher Nodes across the Data Republic Matcher Node Network.
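The effect of the salt on a precomputed lookup table can be shown with a small Python sketch. The attacker's table here is a three-entry dictionary standing in for a real rainbow table, and the salt value is a hypothetical placeholder.

```python
import hashlib


def sha512_hex(text: str) -> str:
    """Return the SHA-512 digest of the text as a hex string."""
    return hashlib.sha512(text.encode("utf-8")).hexdigest()


# Attacker's precomputed table: unsalted hash -> guessed plain text.
guesses = ["alice@example.com", "bob@example.com", "carol@example.com"]
lookup_table = {sha512_hex(guess): guess for guess in guesses}

# Hypothetical salt, unknown to the attacker.
secret_salt = "hypothetical-secret-salt"

unsalted_hash = sha512_hex("alice@example.com")
salted_hash = sha512_hex("alice@example.com" + secret_salt)

recovered_unsalted = lookup_table.get(unsalted_hash)  # found: "alice@example.com"
recovered_salted = lookup_table.get(salted_hash)      # not found: None
```

Without the salt, the leaked hash is reversed by a single dictionary lookup. With the salt, the attacker's table is useless unless it is rebuilt with the exact salt, which is known only to Contributor Nodes.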