For Privacy-Preserving Matching to be as accurate as possible, PI data formatting needs to be aligned across all organizations involved in a project. This avoids lower match rates in situations where the PI is formatted slightly differently. We recommend that all organizations involved discuss and agree on normalization rules for the PI data.
Additionally, Data Republic has normalization and formatting rules that can be used for all organizations, for all available matching fields.
To see which fields are available for your Matching Project, download the PI template from the Contributor Node.
Below are recommendations for formatting and cleaning PI data before loading it into your Contributor Node (CN), for each field:
- personid
- email
- phone
- dpid
- nationalid
- frequent_flyer_number
- custom_name
- birthdate
- family_name
- given_name
- postcode
Prerequisites:
- You have split your files into a file with your personid and your attributes (if relevant for your project), and a file with your personid and your PII fields (as per the fields allowed with the Contributor Node).
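The split described in this prerequisite could be sketched as follows. This is a minimal illustration, not a Data Republic tool: the function name `split_records` and the attribute columns (`segment`, `lifetime_value`) are hypothetical, and the PII columns shown are just a subset of the fields listed above.

```python
# Hypothetical column choices: adjust to the fields in your own PI template.
PII_FIELDS = ["given_name", "family_name", "birthdate", "postcode"]
ATTRIBUTE_FIELDS = ["segment", "lifetime_value"]  # example attributes only


def split_records(rows):
    """Split combined customer rows into PII rows and attribute rows.

    Both outputs keep personid so the two files can be related later.
    """
    pii_rows, attribute_rows = [], []
    for row in rows:
        key = {"personid": row["personid"]}
        pii_rows.append({**key, **{f: row.get(f, "") for f in PII_FIELDS}})
        attribute_rows.append({**key, **{f: row.get(f, "") for f in ATTRIBUTE_FIELDS}})
    return pii_rows, attribute_rows
```

Each list of rows can then be written to its own file, for example with `csv.DictWriter`.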
personid
Type: String (varchar 100)
Description:
- A unique string that identifies the customer record.
- Should uniquely identify the person within the token database (this ID usually comes from your CRM or data lake).
- Important: Do not use personid to store PII (e.g. an email address). Appropriate values are internal customer numbers, database primary keys, or some other non-identifying value.
- Your Contributor Node will maintain a mapping table between this personid and the randomly generated token.
- Customers typically download this mapping table to store the token values in their CRM or EDW, which can then be used to upload anonymised attribute data to the Data Republic platform.
- Only one value per row allowed
Formatting and Normalization Rules:
- Mandatory, case sensitive, and unique within token database
- Allowed characters are alphanumeric plus underscore (_), colon (:) and dot (.)
- Max 100 characters
- No normalization or hashing is applied – this field does not leave your Contributor Node
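Before loading, the personid rules above can be checked with a short script. The sketch below is an assumption-laden illustration rather than the Contributor Node's own validation: the function name is hypothetical, and "alpha-numeric" is interpreted as ASCII letters and digits.

```python
import re

# Letters, digits, underscore, colon and dot; 1 to 100 characters.
# Assumption: "alpha-numeric" means ASCII only.
PERSONID_PATTERN = re.compile(r"[A-Za-z0-9_:.]{1,100}")


def find_invalid_personids(personids):
    """Return (personid, reason) pairs for values that break the rules."""
    seen = set()
    problems = []
    for pid in personids:
        if not PERSONID_PATTERN.fullmatch(pid):
            problems.append((pid, "invalid characters or length"))
        elif pid in seen:
            # Comparison is case sensitive, so "A1" and "a1" are distinct.
            problems.append((pid, "duplicate"))
        seen.add(pid)
    return problems
```

A value such as an email address fails the character check, which also helps catch PII accidentally stored in personid.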
email
Type: String (varchar)
Description:
- Person's email address
- Multiple values are allowed (2)
Formatting and Normalization Rules:
- Provide email address only (e.g. test@example.com)
- Contributor Node supports UTF-8 in email addresses
- Normalisation (pre-hash): email
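Source systems sometimes store a display name alongside the address. One way to keep the address only is sketched below, using `email.utils.parseaddr` from the Python standard library. The function name is hypothetical, and no lowercasing or other normalisation is attempted here, since that step is applied by the Contributor Node.

```python
from email.utils import parseaddr


def extract_email_address(raw):
    """Return only the address part of a value like 'Jane <jane@example.com>'."""
    _display_name, address = parseaddr(raw.strip())
    return address
```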
phone
Type: Numeric
Description:
- Person's phone number. Use local format (area code + phone number)
- Multiple values are allowed (2)
Formatting and Normalization Rules:
- Provide phone numbers with area codes (where applicable) but strip international dialling country codes.
- Normalisation (pre-hash): phone
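Stripping the international dialling code can be sketched as below. This is an illustration under stated assumptions, not the platform's own logic: the defaults assume Australian numbers (country code 61, trunk prefix 0), the function name is hypothetical, and whether a leading trunk prefix is expected should be agreed with your matching partners.

```python
import re


def to_local_phone(raw, country_code="61", trunk_prefix="0"):
    """Convert a phone number to digits-only local format.

    Assumption: international numbers are written with a leading "+".
    """
    stripped = raw.strip()
    digits = re.sub(r"\D", "", stripped)  # keep digits only
    if stripped.startswith("+") and digits.startswith(country_code):
        national = digits[len(country_code):]
        # Restore the trunk prefix unless it is already there, e.g. "+61 (0)2 ...".
        if trunk_prefix and not national.startswith(trunk_prefix):
            national = trunk_prefix + national
        return national
    return digits
```

For example, both `+61 2 9999 9999` and `(02) 9999 9999` come out as the same local-format value, so they would hash identically.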
dpid
Type: Numeric
Description:
- Delivery Point Identifier. This is an 8-digit number which is allocated to each address maintained in Australia Post’s National Address File.
- Contact DR for a list of suppliers who can convert address data to DPID.
- Multiple values are allowed (2)
Formatting and Normalization Rules:
- Provide complete 8-digit code.
- Do NOT strip leading zeros
- Leave blank if no DPID is available
- Normalisation (pre-hash): numeric
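Leading zeros are easily lost when DPIDs pass through spreadsheets or integer columns. The sketch below treats the DPID as text and pads it back to 8 digits. Padding assumes that a shorter value lost its leading zeros rather than being truncated, which you should confirm against your source data; the function name is hypothetical.

```python
import re


def clean_dpid(value):
    """Return an 8-digit DPID string, or "" when no DPID is available."""
    if value is None:
        return ""
    text = str(value).strip()
    if text == "":
        return ""  # leave blank if no DPID is available
    if not re.fullmatch(r"[0-9]{1,8}", text):
        raise ValueError(f"not a valid DPID: {value!r}")
    # Assumption: values shorter than 8 digits have lost leading zeros.
    return text.zfill(8)
```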
nationalid
Australian users please note: There is no "national id" in Australia. Use of Medicare, Passport, or Tax File Numbers is prohibited.
Type: String (varchar)
Description:
- National Identifier (e.g. a Social Security Number in USA; or NRIC in Singapore)
- This is a unique identifier of each citizen in a country
- Only one value allowed
- Important: There may be regulatory restrictions on using a "national ID" in matching projects. Check with DR or your legal team.
Formatting and Normalization Rules:
- Provide complete code.
- Normalisation (pre-hash): uppercase
frequent_flyer_number
Type: String (varchar)
Description:
- Person's Frequent Flyer Number (if exists)
- Only one value allowed
Formatting and Normalization Rules:
- Provide complete code.
- Normalisation (pre-hash): uppercase
custom_name
Type: String (varchar)
Description:
- Full name of a person, with custom qualifiers as agreed between matching parties (e.g. addition of postcode or birthdate - see below)
- Multiple values are allowed (2)
Formatting and Normalization Rules:
- Concatenate given_name and family_name, separated by a space.
- Remove titles and suffixes (e.g. Mr, Ms, Dr, or Jnr, esq.)
- Add qualifiers as agreed with matching partner (e.g. post code), preceded by a space.
- Normalisation (pre-hash): name
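The custom_name rules above can be sketched as a small helper. The title and suffix lists are illustrative examples only, and the function name is hypothetical; the exact lists and the choice of qualifier need to be agreed with your matching partner so that both sides build identical values.

```python
# Illustrative lists only: extend them as agreed with your matching partner.
TITLES = {"mr", "mrs", "ms", "miss", "dr"}
SUFFIXES = {"jnr", "jr", "snr", "sr", "esq", "ii", "iii"}


def _bare(token):
    """Lowercase a token and drop surrounding dots and commas."""
    return token.strip(".,").lower()


def build_custom_name(given_name, family_name, qualifier=""):
    """Concatenate the names, drop titles and suffixes, append the qualifier."""
    tokens = f"{given_name} {family_name}".split()
    # Titles are only removed from the front, suffixes only from the end,
    # so a genuine name in the middle is never dropped.
    while tokens and _bare(tokens[0]) in TITLES:
        tokens.pop(0)
    while tokens and _bare(tokens[-1]) in SUFFIXES:
        tokens.pop()
    name = " ".join(tokens).rstrip(",")
    return " ".join(part for part in (name, str(qualifier).strip()) if part)
```

Under these assumptions, `build_custom_name("Dr. Jane", "Smith Jnr", "2000")` yields `Jane Smith 2000`.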
birthdate
Type: Date
Description:
- Person's date of birth, in YYYY-MM-DD format
- Note that this field is used as a qualifier – matching cannot happen on birthdate alone
Formatting and Normalization Rules:
- Normalisation (pre-hash): numeric
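If your source system stores dates in another layout, they can be converted to YYYY-MM-DD before loading. The sketch below assumes a day/month/year source format by default; the function name and the default `input_format` are hypothetical and should be replaced with whatever your system actually uses.

```python
from datetime import datetime


def to_iso_birthdate(raw, input_format="%d/%m/%Y"):
    """Parse a date string and return it as YYYY-MM-DD.

    Raises ValueError for values that are not real dates in input_format.
    """
    return datetime.strptime(raw.strip(), input_format).date().isoformat()
```

Ambiguous values such as 03/07/1985 parse without error in either day-first or month-first order, so confirm the source format rather than guessing it.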
family_name
Type: String (varchar)
Description:
- Family name of a person (AKA "surname")
- In some countries, this might be referred to as a "last name", however note that there are regional differences in the ordering of family and given names. This field should always be the family name.
- Multiple values are allowed (2)
Formatting and Normalization Rules:
- Remove or exclude titles (e.g. Mr, Ms, Dr) and suffixes (e.g. Jnr, III, esq.)
- Contributor Nodes support UTF-8; make sure you're using UTF-8 string encoding.
- Normalisation (pre-hash): name
given_name
Type: String (varchar)
Description:
- Given name of a person (AKA "Christian name")
- In some countries, this might be referred to as "first name", however note that there are regional differences in the ordering of family and given names. This field should always be the given name.
- Multiple values are allowed (2)
Formatting and Normalization Rules:
- Same as family_name
- Normalisation (pre-hash): name
postcode
Type: String (varchar)
Description:
- Person's postcode, as part of their residential address
Formatting and Normalization Rules:
- Normalisation (pre-hash): numeric