Supported types

For simple comma-separated-value (without double-quote)

Simple CSV files are serialised with LazySimpleSerDe.

Expected type

Example Value

Detected as

Manual update required

bigint

10000000

bigint

-

binary

1

bigint

  • Update schema

boolean

TRUE

boolean

-

char(255)

a

string

  • Update schema

date

2020-01-15

string

  • Update schema

decimal

1.0

double

  • Update schema

double

1000.0

double

-

float

100.0

double

  • Update schema

int

32768

bigint

  • Update schema

interval

?

string

?

smallint

10

bigint

  • Update schema

string

Test 1

string

-

See Limitation > Comma within string

timestamp

2020-01-15 03:44:40

string

  • Update schema

tinyint

1

bigint

  • Update schema

varchar(256)

test 1

string

  • Update schema

For comma-separated-value with double-quote

CSV files with double-quote are serialised with OpenCSVSerDe.

Expected type

Example Value

Detected as

Manual update required

bigint

10000000

bigint

-

binary

1

bigint

  • Update schema

boolean

TRUE

boolean

-

char(255)

a

string

  • Update schema

date

18276

string

  • Update schema

decimal

1.0

double

  • Update schema

double

1000.0

double

-

float

100.0

double

  • Update schema

int

32768

bigint

  • Update schema

interval

?

string

?

smallint

10

bigint

  • Update schema

string

Test 1

string

-

timestamp

1579059880000

string

  • Update schema

tinyint

1

bigint

  • Update schema

varchar(256)

test 1

string

  • Update schema

Limitation

Comma within string

Since CSV files use the comma character "," to separate columns, values that contain commas must be handled as a special case. These fields are wrapped within double quotation marks. The first double quote signifies the beginning of the column data, and the last double quote marks the end. If the value contains a string with double quotes, these are replaced by two double quotes, or "".

Due to the double quote, please refer to the next section “For comma-separated-value with double-quote“ for value detection.

Date & Timestamp in CSV

Depending on the serialisation of the CSV file, the date and timestamp formats need to be under different formats:

CSV file

Date

Timestamp

Without double-quote

yyyy-MM-dd, e.g. 2020-01-15

yyyy-MM-dd HH:mm:ss , e.g. 2020-01-15 03:44:40

With double-quote

UNIX numeric format - the value represents the number of days that elapsed since January 1, 1970. Each day considered to have 86,400 seconds. E.g. 18276 will be rendered as 2020-01-15 when queried.

UNIX numeric format - the value represents the number of milliseconds that elapsed since January 1, 1970. E.g. 1579059880000 will be rendered as 2020-01-15 03:44:40 when queried.

Did this answer your question?