A variety of PII fields and matching combinations are supported by our Privacy-Preserving Matching by default. This article outlines our default schema and matching combinations, how to work with fields that have multi values and some general guidelines when preparing your data for upload to the Contributor Node.

General Notes

These notes apply to all Fields

Field type or scenario

Considerations for formatting

Missing / No data

When a field does not have any data, it should be left empty. Do not use "space", hyphen, NULL character, or other placeholder values.

Numeric fields

Take particular care with numeric fields that they do not default to "0" when an empty value should be used (e.g. due to missing data)

Large integers

Large integers should not be written in scientific notation, as this often loses precision. This can be particularly problematic for phone numbers – take care to export these as string.

Custom fields

Custom fields may be added – contact DR.

Tokens (as strings)

They are random integers written in hexadecimal notation. You should process them as "opaque strings" to avoid any conversion to scientific notation, stripping of leading zeros, or loss of precision caused by conversion to floating point types.

When updating a CSV

Privacy-Preserving Matching takes the whole record to be an "upsert" operation.

  • If the personid (natural key) is new, a new token is generated and the fields inserted.

  • If the personid exists in the database, the previously issued token is used and the PII fields are updated.

  • If a PII field is blank, it is taken to mean that you want any previous values deleted.

  • If a PII field is non-blank, it is taken to mean that any previous values are to be overwritten.

  • In other words, the "unit of operation" is the record. To update "just" the email, you must send values for all of the fields, including the new email.

Data upload format

  • The data format for uploading PII to a Contributor Node is a simple, single table format.

  • Upload is typically via CSV however a JSON format is also available via the API.

  • When uploading using CSV files, please use consistent line endings – i.e. all lines should end in DOS/Windows line endings (CR+LF) or Unix line endings (CR). Mixing line ending styles in the same file yields an error.

Default Schema

NOTE: CN template headers are case sensitive - should all be in lower case

Field Name

Type

Description

Formatting & Normalization Rules

personid

String (varchar 100)

  • A unique string that identifies the customer record.

  • Should uniquely identify the person within the token database (this ID usually comes from your CRM or data lake).

  • Important: Do not use personid to store PII (e.g. an email address). Appropriate values are internal customer numbers, database primary keys, or some other non-identifying value.

  • Your Contributor Node will maintain a mapping table between this personid and the randomly generated token.

  • Customers typically download this mapping table to store the token values in their CRM or EDW, which can then be used to upload anonymised attribute data to the Data Republic platform.

  • Only one value per row allowed

  • Mandatory, case sensitive, and unique within token database

  • Allowed characters are alpha-numeric plus underscore (_), semi-column (:) and dot (.)

  • Max 100 characters

  • No normalization or hashing is applied – this field does not leave your Contributor Node

email

String (varchar)

  • Person's email address

  • Multiple values are allowed (2)

  • Provide email address only (e.g. test@example.com)

  • Contributor Node supports UTF-8 in email addresses

  • Normalisation (pre-hash): email

phone

Numeric

  • Person's phone number. Use local format (area code + phone number)

  • Multiple values are allowed (2)

  • Provide phone numbers with area codes (where applicable) but strip international dialling country codes.

  • Normalisation (pre-hash): phone

dpid

Numeric

  • Delivery Point Identifier. This is an 8-digit number which is allocated to each address maintained in Australia Post’s National Address File.

  • Contact DR for list of suppliers who can convert address data to DPID.

  • Multiple values are allowed (2)

  • Provide complete 8-digit code.

  • Do NOT strip leading zeros

  • Leave blank if no DPID is available

  • Normalisation (pre-hash): numeric

nationalid

String (varchar)

  • National Identifier (e.g. a Social Security Number in USA; or NRIC in Singapore)

  • This is a unique identifier of each citizen in a country

  • Only one value allowed

  • Important: There may be regulatory restrictions on using a "national ID" in matching projects. Check with DR or legal team.

  • Provide complete code.

  • Normalisation (pre-hash): uppercase

  • Australian users please note: There is no "national id" in Australia. Use of Medicare, Passport, or Tax File Numbers is prohibited.

frequent_flyer_number

String (varchar)

  • Person's Frequent Flyer Number (if exists)

  • Only one value allowed

  • Provide complete code.

  • Normalisation (pre-hash): uppercase

custom_name

String (varchar)

  • Full name of a person, with customer qualifiers as agreed between matching parties (e.g. addition of post code or birthdate - see below)

  • Multiple values are allowed (2)

  • Concatenate given_name and family_name, separated by a space.

  • Remove titles and suffixes (e.g. Mr, Ms, Dr, or Jnr, esq.)

  • Add qualifiers as agreed with matching partner (e.g. post code), preceded by a space.

  • Normalisation (pre-hash):

birthdate

Date

  • Person's date of birth, in YYYY-MM-DD format

  • Note that this field is used as a qualifier – matching cannot happen on birthdate alone

  • Normalisation (pre-hash): numeric

family_name

String (varchar)

  • Family name of a person (AKA "surname")

  • In some countries, this might be referred to as a "last name", however note that there are regional differences in the ordering of family and given names. This field should always be the family name.

  • Multiple values are allowed (2)

  • Databank Users: Unlike Databank, there is no special handling of suffixes that might have been included (e.g. "Jnr" or "the third"). DR recommends you remove these.

  • Remove or exclude titles (e.g. Mr, Ms, Dr) and suffixes (e.g. Jnr, III, esq.)

  • Contributor Nodes support UTF-8, make sure you're using UTF-8 string encoding.

  • Normalisation (pre-hash): name

given_name

String (varchar)

  • Given name of a person (AKA "Christian name")

  • In some countries, this might be referred to as "first name", however note that there are regional differences in the ordering of family and given names. This field should always be the given name.

  • Multiple values are allowed (2)

  • Same as 'family_name' → name

postcode

String (varchar)

  • Person's postcode, that is part of their residential address

  • Normalisation (pre-hash): numeric

Default Matching Combinations

The following fields are always matched because they are generally "unique enough" in a data set to be useful to match on.

  • email

  • phone

  • dpid

  • nationalid

  • frequent_flyer_number

  • custom_name

These next fields however are used as "qualifiers" – they are not matched on their own but instead are matched in combination. This is because alone, they are not useful fields to match on (e.g. it is not useful to match all the first names together, as it would include many false positives).

  • email + given_name

  • custom_name + postcode

  • phone + given_name

  • custom_name + birthdate

  • email + family_name

  • given_name + family_name (available in US & SG only)

  • email + family_name

  • phone + family_name

  • dpid + family_name

  • dpid + given_name

Other combinations can be configured for you. Talk to Data Republic.

Working with Multi-Value Fields

The following fields can accept multiple values for a field (a list of values):

Field name

Max values

email

2

phone

3

dpid

2

nationalid

1

frequent_flyer_number

1

custom_name

2

When using multi-valued fields in a CSV file, please follow these rules / examples:

  • In your CSV file, use the same header name for the field but append a digit to separate into multiple columns, one for each value you want to provide.

  • For example, to provide up to 2 values for "email", use the field names "email:0" and "email:1".

Related Articles:

Did this answer your question?