For Privacy-Preserving Matching to be as accurate as possible, PI data formatting needs to be aligned across all organizations involved in a project. This avoids the lower match rates that occur when the same PI is formatted slightly differently by each organization. We recommend that all organizations involved discuss and agree on normalization rules for the PII data.

Additionally, Data Republic has normalization and formatting rules that all organizations can use for every available matching field.

To see which fields are available for your Matching Project, download the PI template from the Contributor Node.

Below are recommendations, field by field, for formatting and cleaning PI data before loading it into your Contributor Node (CN):

Prerequisites:

  • You have split your data into two files: one with your personid and your attributes (if relevant for your project), and one with your personid and your PII fields (limited to the fields allowed by the Contributor Node).
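
As a rough illustration of this prerequisite, the sketch below splits combined customer records into a PII file and an attributes file that share only the personid. The column names (`email`, `phone`, `loyalty_tier`) and the CSV layout are assumptions made for the example; the PI template from your Contributor Node defines the fields you can actually use.

```python
import csv
import io

# Hypothetical column names; replace them with the fields in your PI template.
PII_FIELDS = ["email", "phone"]
ATTRIBUTE_FIELDS = ["loyalty_tier"]


def split_rows(rows):
    """Split combined records into (pii_rows, attribute_rows), both keyed by personid."""
    pii_rows, attribute_rows = [], []
    for row in rows:
        pii_rows.append({"personid": row["personid"], **{f: row[f] for f in PII_FIELDS}})
        attribute_rows.append({"personid": row["personid"], **{f: row[f] for f in ATTRIBUTE_FIELDS}})
    return pii_rows, attribute_rows


def to_csv(rows, fieldnames):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


combined = [
    {"personid": "cust_001", "email": "test@example.com",
     "phone": "0412345678", "loyalty_tier": "gold"},
]
pii_rows, attribute_rows = split_rows(combined)
pii_csv = to_csv(pii_rows, ["personid"] + PII_FIELDS)
attributes_csv = to_csv(attribute_rows, ["personid"] + ATTRIBUTE_FIELDS)
```

Keeping the personid in both outputs is what lets the two files be linked again later, while the PII columns never appear in the attributes file.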

personid

Type: String (varchar 100)

Description:

  • A unique string that identifies the customer record.
  • Should uniquely identify the person within the token database (this ID usually comes from your CRM or data lake).
  • Important: Do not use personid to store PII (e.g. an email address). Appropriate values are internal customer numbers, database primary keys, or some other non-identifying value.
  • Your Contributor Node will maintain a mapping table between this personid and the randomly generated token.
  • Customers typically download this mapping table to store the token values in their CRM or EDW, which can then be used to upload anonymised attribute data to the Data Republic platform.
  • Only one value per row allowed

Formatting and Normalization Rules:

  • Mandatory, case sensitive, and unique within token database
  • Allowed characters are alphanumeric characters plus underscore (_), colon (:) and dot (.)
  • Max 100 characters
  • No normalization or hashing is applied – this field does not leave your Contributor Node
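
The personid rules above can be checked before loading with something like the following sketch. The function names are hypothetical, and treating "alphanumeric" as ASCII letters and digits only is an assumption of this example, not a documented behaviour of the Contributor Node.

```python
import re

# Letters, digits, underscore, colon and dot; between 1 and 100 characters.
# Restricting letters and digits to ASCII is an assumption of this sketch.
PERSONID_PATTERN = re.compile(r"[A-Za-z0-9_:.]{1,100}")


def is_valid_personid(value: str) -> bool:
    """Return True if the value uses only allowed characters and fits the length limit."""
    return PERSONID_PATTERN.fullmatch(value) is not None


def find_duplicates(personids):
    """Return the personids that appear more than once (comparison is case sensitive)."""
    seen, duplicates = set(), set()
    for pid in personids:
        if pid in seen:
            duplicates.add(pid)
        seen.add(pid)
    return duplicates
```

A value such as `jane@example.com` fails the check because `@` is not an allowed character, which also helps catch email addresses being used as a personid by mistake.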

email

Type: String (varchar)

Description:

  • Person's email address
  • Multiple values are allowed (2)

Formatting and Normalization Rules:

  • Provide email address only (e.g. test@example.com)
  • Contributor Node supports UTF-8 in email addresses
  • Normalisation (pre-hash): email
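
If your source system stores addresses with a display name, a small pre-load step can reduce each value to the address only. The sketch below uses Python's standard `email.utils.parseaddr`; the function name `extract_email` is hypothetical, and this is only a cleaning step on your side, not a reproduction of the Contributor Node's own "email" normalisation.

```python
from email.utils import parseaddr


def extract_email(raw: str) -> str:
    """Return only the address part of a value such as 'Jane Doe <jane@example.com>'."""
    _display_name, address = parseaddr(raw.strip())
    return address
```

For example, `extract_email("Jane Doe <jane@example.com>")` keeps just `jane@example.com`.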

phone

Type: Numeric

Description:

  • Person's phone number. Use local format (area code + phone number)
  • Multiple values are allowed (2)

Formatting and Normalization Rules:

  • Provide phone numbers with area codes (where applicable) but strip international dialling country codes.
  • Normalisation (pre-hash): phone
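
One possible way to bring phone numbers into local format before loading is sketched below. The function name, the default country code (`61`) and the trunk prefix (`0`) are illustrative assumptions based on Australian numbering; agree on the actual values with your matching partners. Numbers that use an international access code instead of `+` (for example `0011`) are not handled here.

```python
import re


def to_local_phone(raw: str, country_code: str = "61", trunk_prefix: str = "0") -> str:
    """Strip a '+<country code>' prefix and return the digits in local format.

    The defaults are Australian values used for illustration only.
    """
    stripped = raw.strip()
    digits = re.sub(r"\D", "", stripped)  # drop spaces, brackets, dashes, '+'
    if stripped.startswith("+") and digits.startswith(country_code):
        national = digits[len(country_code):]
        # Restore the leading trunk digit unless it is already present,
        # as in '+61 (0)2 9999 9999'.
        if not national.startswith(trunk_prefix):
            national = trunk_prefix + national
        return national
    return digits
```

The result is kept as a string so that a leading zero in the area code is not lost.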

dpid

Type: Numeric

Description:

  • Delivery Point Identifier. This is an 8-digit number which is allocated to each address maintained in Australia Post’s National Address File.
  • Contact Data Republic (DR) for a list of suppliers who can convert address data to DPID.
  • Multiple values are allowed (2)

Formatting and Normalization Rules:

  • Provide complete 8-digit code.
  • Do NOT strip leading zeros
  • Leave blank if no DPID is available
  • Normalisation (pre-hash): numeric
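
Leading zeros are easily lost when a DPID passes through a spreadsheet or an integer column. The sketch below shows one way to guard against that; the function name is hypothetical, and padding a shorter value back to 8 digits assumes the missing digits were leading zeros dropped by the source system.

```python
def format_dpid(value) -> str:
    """Return the DPID as an 8-digit string, or '' when no DPID is available."""
    if value is None:
        return ""
    text = str(value).strip()
    if text == "":
        return ""  # leave blank if no DPID is available
    if not (text.isascii() and text.isdigit()) or len(text) > 8:
        raise ValueError(f"not a valid DPID: {value!r}")
    # Assumption: a value shorter than 8 digits lost its leading zeros upstream.
    return text.zfill(8)
```

Handling the DPID as a string from end to end avoids the problem in the first place.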

nationalid

Australian users please note: There is no "national id" in Australia. Use of Medicare, Passport, or Tax File Numbers is prohibited.

Type: String (varchar)

Description:

  • National Identifier (e.g. a Social Security Number in the USA, or an NRIC in Singapore)
  • This is a unique identifier of each citizen in a country
  • Only one value allowed
  • Important: There may be regulatory restrictions on using a "national ID" in matching projects. Check with DR or your legal team.

Formatting and Normalization Rules:

  • Provide complete code.
  • Normalisation (pre-hash): uppercase

frequent_flyer_number

Type: String (varchar)

Description:

  • Person's Frequent Flyer Number (if exists)
  • Only one value allowed

Formatting and Normalization Rules:

  • Provide complete code.
  • Normalisation (pre-hash): uppercase

custom_name

Type: String (varchar)

Description:

  • Full name of a person, with custom qualifiers as agreed between matching parties (e.g. addition of post code or birthdate; see the rules below)
  • Multiple values are allowed (2)

Formatting and Normalization Rules:

  • Concatenate given_name and family_name, separated by a space.
  • Remove titles and suffixes (e.g. Mr, Ms, Dr, or Jnr, esq.)
  • Add qualifiers as agreed with matching partner (e.g. post code), preceded by a space.
  • Normalisation (pre-hash): name
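
The custom_name rules above could be applied along the lines of the following sketch. The title and suffix lists are short illustrative examples, and the function names are hypothetical; the full lists and the choice of qualifier are for the matching parties to agree on.

```python
# Illustrative lists only; agree on the full set with your matching partners.
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jnr", "esq"}


def strip_titles_and_suffixes(name: str) -> str:
    """Remove leading titles and trailing suffixes from a name."""
    words = name.split()
    while words and words[0].rstrip(".,").lower() in TITLES:
        words.pop(0)
    while words and words[-1].rstrip(".,").lower() in SUFFIXES:
        words.pop()
    return " ".join(words).rstrip(" ,")


def build_custom_name(given_name: str, family_name: str, qualifier: str = "") -> str:
    """Join given and family name with a space, then append the agreed qualifier."""
    full_name = strip_titles_and_suffixes(f"{given_name} {family_name}")
    return f"{full_name} {qualifier}".strip()
```

With a post code as the qualifier, `build_custom_name("Dr Jane", "Citizen Jnr", "2000")` yields `Jane Citizen 2000`.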

birthdate

Type: Date

Description:

  • Person's date of birth, in YYYY-MM-DD format
  • Note that this field is used as a qualifier – matching cannot happen on birthdate alone

Formatting and Normalization Rules:

  • Normalisation (pre-hash): numeric
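
Source systems often store dates in other layouts, so a conversion step to YYYY-MM-DD may be needed before loading. The sketch below tries a few candidate formats; the function name and the list of source formats are assumptions for illustration, and day/month order in particular must match what your systems actually export.

```python
from datetime import datetime

# Hypothetical source formats; list the ones your systems actually export.
# Note that '%d/%m/%Y' assumes day-first dates.
SOURCE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_birthdate(raw: str) -> str:
    """Convert a date string in one of SOURCE_FORMATS to YYYY-MM-DD."""
    text = raw.strip()
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised birthdate: {raw!r}")
```

Values that match none of the formats raise an error, so they can be reviewed instead of being loaded in an inconsistent format.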

family_name

Type: String (varchar)

Description:

  • Family name of a person (AKA "surname")
  • In some countries, this might be referred to as a "last name", however note that there are regional differences in the ordering of family and given names. This field should always be the family name.
  • Multiple values are allowed (2)

Formatting and Normalization Rules:

  • Remove or exclude titles (e.g. Mr, Ms, Dr) and suffixes (e.g. Jnr, III, esq.)
  • Contributor Nodes support UTF-8; make sure you're using UTF-8 string encoding.
  • Normalisation (pre-hash): name

given_name

Type: String (varchar)

Description:

  • Given name of a person (AKA "Christian name")
  • In some countries, this might be referred to as "first name", however note that there are regional differences in the ordering of family and given names. This field should always be the given name.
  • Multiple values are allowed (2)

Formatting and Normalization Rules:

  • Same rules as 'family_name'
  • Normalisation (pre-hash): name

postcode

Type: String (varchar)

Description:

  • Person's postcode, as it appears in their residential address

Formatting and Normalization Rules:

  • Normalisation (pre-hash): numeric
