With Data Republic's Privacy-Preserving Matching, raw PII never leaves your organization: tokenization and hash slicing are performed before datasets are made available for exchange on the Data Republic platform.
Each customer record in the uploaded dataset is assigned a Token ID at random by the Contributor Node. The assigned Token ID for each customer is then returned to the custodian via a mapping file, which maps the natural key in the uploaded dataset to the Token ID generated by the Contributor Node. This mapping is stored on the Contributor Node, so that subsequent data record updates retain the same Token ID.
It is important to note that the Token ID is randomly generated; it is not derived in any way from the source data. A Token ID on its own therefore cannot be traced back to the individual for whom it was generated.
The custodian will then upload the attribute data they intend to exchange to the Data Republic platform, having replaced customer PII with the random tokens just generated. Data Republic’s data exchange platform manages the governance and compliance of data projects. No PII (in its raw form) is ever permitted on the Data Republic Platform.
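The token-assignment step described above can be sketched as follows. This is a minimal illustration, not Data Republic's actual API: the function and field names are ours, and `secrets.token_hex` stands in for whatever random generator the Contributor Node uses.

```python
import secrets

def assign_token_ids(natural_keys, existing_mapping=None):
    """Assign a random Token ID to each customer record.

    The Token ID is generated at random -- never derived from the source
    data -- and the natural-key -> Token ID mapping is retained so that
    subsequent updates to the same record reuse the same Token ID.
    """
    mapping = dict(existing_mapping or {})
    for key in natural_keys:
        if key not in mapping:                    # record updates keep their token
            mapping[key] = secrets.token_hex(16)  # random, non-derived ID
    return mapping

mapping = assign_token_ids(["cust-001", "cust-002"])
# Re-running with the stored mapping keeps existing Token IDs stable:
updated = assign_token_ids(["cust-001", "cust-003"], existing_mapping=mapping)
```

The custodian would then replace the natural keys in the attribute dataset with these tokens before upload, while the mapping file stays on the Contributor Node.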
Hashing and salting
The PII fields in the dataset are transformed via two processes known as “salting” and “hashing”. This transformation is non-reversible and permanently obscures the PII.
When a PII field is salted, random alphanumeric characters are added to it. The specific value for the salt varies for each field and is managed by the Privacy-Preserving Matching network. For example, if the salt were “XYZ”, and the value to be hashed were “Claudio”, the salted value might be “ClaudioXYZ”. Each Contributor will be given the same salt value for the same field within their particular Privacy-Preserving Matching network. The salt is treated as a “shared secret” and is used to increase the cryptographic strength of the next step, hashing.
Hashing is a one-way process whereby the raw data (containing the “salt”) is transformed into a string of alphanumeric characters such that the raw data is no longer recognizable. This process cannot be reversed, and it is designed such that a particular piece of data will always produce an identical “hashed” result. However, if there is any change in the data (or the salt), no matter how small, a completely different-looking hash is produced. At this point, the original PII data held in the Contributor Node is destroyed.
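The salt-then-hash step can be sketched as below. This is an illustration only: it assumes simple string concatenation of value and salt (as in the “ClaudioXYZ” example) and uses SHA-512 to match the scheme described later; the function name is ours.

```python
import hashlib

def salted_hash(value: str, salt: str) -> str:
    """Append the shared-secret salt, then apply a one-way SHA-512 hash."""
    return hashlib.sha512((value + salt).encode("utf-8")).hexdigest()

h1 = salted_hash("Claudio", "XYZ")
h2 = salted_hash("Claudio", "XYZ")
h3 = salted_hash("claudio", "XYZ")   # a one-character change in the input

assert h1 == h2   # deterministic: the same input always produces the same hash
assert h1 != h3   # any change, however small, yields a completely different hash
```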
All PII fields are salted and hashed before being sent to the Contributor Node:
If using the provided Web UI, the hashing is performed in the browser, prior to the browser application calling the Contributor Node API.
Otherwise, the Contributor performs the hashing step on their own systems, using the specifications provided by the Contributor Node API.
Salt values are distributed by Consul, the Privacy-Preserving Matching configuration service. Salt values are unique to each field name and are randomly generated 128-bit values. The hash algorithm is SHA-512. SHA-512 is preferred over SHA-256 because of its greater resistance to certain kinds of advanced attacks (see Comparison of SHA functions). Each field is associated with a simple normalization function, which ensures that the same PII value can be matched after hashing, even if it is represented slightly differently between Custodians (e.g. emails containing upper- or lower-case letters).
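Putting the pieces together, a Contributor hashing on its own systems might do something like the following. This is a sketch under stated assumptions: the per-field salts are generated locally here rather than fetched from the configuration service, and the normalization rules (trim whitespace, lower-case emails) are illustrative, not the network's actual specification.

```python
import hashlib
import secrets

# Illustrative per-field salts: random 128-bit values, unique per field name.
# In the real system these are shared secrets distributed to every Contributor.
salts = {field: secrets.token_bytes(16) for field in ("email", "phone")}

def normalize(field: str, value: str) -> str:
    """Simple per-field normalization so the same PII value matches across
    Custodians even when represented slightly differently."""
    value = value.strip()
    return value.lower() if field == "email" else value

def hash_field(field: str, value: str) -> str:
    """Normalize, salt, and SHA-512 hash one PII field value."""
    payload = normalize(field, value).encode("utf-8") + salts[field]
    return hashlib.sha512(payload).hexdigest()

# Differently-represented emails from two Custodians hash identically:
assert hash_field("email", " Alice@Example.COM") == hash_field("email", "alice@example.com")
```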
Variable bit slicing
The Contributor Node slices each PII hash into 32 pre-allocated slices. A predetermined selection of these slices is distributed, each to a different Matcher Node. Any slice not mapped to a Matcher Node is discarded. The hash slices may be further shortened to ensure that some slices are likely to have “collisions” within an individual Matcher Node. A collision is when two different inputs (e.g. two different email addresses) have the same value for a particular slice.
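The slicing step can be sketched as follows. A 512-bit SHA-512 hash splits evenly into 32 slices of 16 bits each; the truncation length used here (10 bits) is an assumption chosen purely to illustrate how shortening slices makes collisions likely, not a documented parameter.

```python
import hashlib

SLICE_COUNT = 32
SLICE_BITS = 512 // SLICE_COUNT   # 16 bits per pre-allocated slice
TRUNCATED_BITS = 10               # illustrative shortening to force collisions

def slice_hash(hash_hex: str, truncate_to: int = TRUNCATED_BITS):
    """Split a SHA-512 hash into 32 fixed slices, then shorten each slice
    so that collisions become probable within an individual Matcher Node."""
    digest = int(hash_hex, 16)
    slices = []
    for i in range(SLICE_COUNT):
        piece = (digest >> (i * SLICE_BITS)) & ((1 << SLICE_BITS) - 1)
        slices.append(piece >> (SLICE_BITS - truncate_to))  # keep the top bits
    return slices

h = hashlib.sha512(b"claudio@example.com").hexdigest()
slices = slice_hash(h)
```

In the real system only a predetermined subset of these 32 slices would be sent out, one slice per Matcher Node, and the rest discarded.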
In order to scale with a manageable number of Matcher Nodes, each individual Matcher Node may contain the hash slices for multiple fields. Because each token represents a customer record from a Custodian, this poses a potential problem: if a Matcher Node is compromised, an attacker could gain too many bits of information associated with a token, increasing the re-identification risk. To manage this, tokens are encrypted before being sent to a Matcher Node. Tokens are encrypted using AES, with a 256-bit key in ECB (electronic codebook) mode. The key is unique for each combination of Custodian, token database, and hash slice index. Matcher Nodes do not have access to these keys (only the Contributor and Aggregator Nodes can access the key store). Because of this approach, tokens cannot be directly compared within a Matcher Node, and an attacker who gained access to a Matcher Node could not determine which hash slices belong to the same customer record.
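The key-uniqueness property above can be sketched as follows. This is an assumption-laden illustration: the master secret and derivation via HMAC-SHA-256 are stand-ins for the real key store (the source does not describe how keys are generated), and the AES-256-ECB encryption itself is omitted since it needs a third-party cryptography library.

```python
import hashlib
import hmac

MASTER_SECRET = b"key-store-master-secret"  # placeholder; real keys live in a
                                            # key store readable only by the
                                            # Contributor and Aggregator Nodes

def token_key(custodian: str, token_db: str, slice_index: int) -> bytes:
    """Derive a distinct 256-bit key for each combination of Custodian,
    token database, and hash slice index (HMAC used here for illustration)."""
    context = f"{custodian}|{token_db}|{slice_index}".encode("utf-8")
    return hmac.new(MASTER_SECRET, context, hashlib.sha256).digest()

# Each hash slice index gets its own key, so the same token encrypts to a
# different ciphertext on every Matcher Node and cannot be compared across them:
k0 = token_key("custodian-a", "db-1", 0)
k1 = token_key("custodian-a", "db-1", 1)
assert k0 != k1 and len(k0) == 32   # 32 bytes = 256 bits
```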