This article provides details on uploading and managing your structured and unstructured files on the IXUP Data Sandbox platform.

In this article you will learn about:

  • Data Vault drive

  • Supported File Formats

    • Optimal file sizes

    • Code or Algorithms using a Docker container

  • Uploading structured or unstructured data

  • Deleting data

Data Vault drive

When your organisation is created on the Data Sandbox platform, a drive is automatically created with two folders:

  • files - to manage your unstructured data

  • databases - to manage your structured or semi-structured data in a table structure

/{my-organisation-drive}
  /files
  /databases

Supported File Formats

The IXUP Data Sandbox platform supports the following file formats:

| Structured/Semi-structured | Type | Notes |
| --- | --- | --- |
| Structured | Delimited (CSV) | Comma delimiter is supported, i.e. CSV. |
| Semi-structured | Parquet | Includes automatic detection and processing of compressed Parquet files. |
| Unstructured | Any file type | E.g. scripts, Docker container, etc. |

Related article: How does Data Vault detect schema from CSV files?

Optimal file sizes

For structured or semi-structured data, we recommend keeping the header row and breaking up large files into smaller files rather than trying to load one large file to the IXUP Data Sandbox platform.
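
The recommendation above can be sketched locally before upload. This is a minimal sketch, not an IXUP tool: the function name `split_csv_with_header` and the `part_` suffix are hypothetical, and it assumes a CSV whose first line is the header and whose fields contain no embedded line breaks.

```shell
# split_csv_with_header INPUT.csv ROWS_PER_FILE OUTPUT_PREFIX
# Hypothetical helper: writes OUTPUT_PREFIXpart_<suffix>.csv files,
# each starting with the header row of INPUT.csv.
split_csv_with_header() (
  input=$1
  rows=$2
  prefix=$3

  # Remember the header row so it can be repeated in every piece.
  header=$(head -n 1 "$input")

  # Split only the data rows (everything after the header).
  tail -n +2 "$input" | split -l "$rows" - "${prefix}part_"

  # Prepend the header to each piece and give it a .csv extension.
  for piece in "${prefix}part_"*; do
    [ -e "$piece" ] || continue   # no data rows, nothing to do
    { printf '%s\n' "$header"; cat "$piece"; } > "${piece}.csv"
    rm "$piece"
  done
)
```

For example, `split_csv_with_header orders.csv 1000000 orders_` would produce `orders_part_aa.csv`, `orders_part_ab.csv`, and so on, each of which can be loaded on its own because it carries the header row.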

Queries run more efficiently when data scanning can be parallelized and when blocks of data can be read sequentially. Ensuring that your file formats are splittable helps with parallelism regardless of how large your files may be.

However, if your files are too small (generally less than 128 MB), the execution engine may spend additional time on the overhead of opening S3 files, listing directories, getting object metadata, setting up data transfer, reading file headers, reading compression dictionaries, and so on. On the other hand, if your files are not splittable and are too large, query processing waits until a single reader has finished reading the entire file, which can reduce parallelism.
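
To see whether a local folder contains files below that size before uploading, a check along these lines may help. It is only a sketch: `list_small_files` is a hypothetical name, and the default threshold assumes that 128 MB means 128 × 1024 × 1024 bytes.

```shell
# list_small_files DIR [THRESHOLD_BYTES]
# Hypothetical helper: prints every regular file under DIR that is
# smaller than THRESHOLD_BYTES (default: 134217728 bytes, assumed 128 MB).
list_small_files() (
  dir=$1
  threshold=${2:-134217728}

  # "-size -Nc" matches files of fewer than N bytes.
  find "$dir" -type f -size "-${threshold}c"
)
```

Files reported by `list_small_files ./export` are candidates for merging before they are loaded.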

One remedy for the small file problem is to use the S3DistCp utility on Amazon EMR. You can use it to combine smaller files into larger objects. You can also use S3DistCp to move large amounts of data in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS.

Some benefits of having larger files include faster listing, fewer Amazon S3 requests, and less metadata to manage.
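
S3DistCp runs on Amazon EMR. If your small CSV files are still on your own machine, a similar result can be approximated locally before upload. The sketch below is not S3DistCp: `merge_csv_files` is a hypothetical helper, and it assumes every input file has the same header row and ends with a line break.

```shell
# merge_csv_files OUTPUT.csv INPUT1.csv [INPUT2.csv ...]
# Hypothetical helper: concatenates CSV files that share one header,
# keeping a single header row at the top of OUTPUT.csv.
merge_csv_files() (
  output=$1
  shift

  # Take the header from the first input file only.
  head -n 1 "$1" > "$output"

  # Append the data rows (everything after the header) of every input.
  for f in "$@"; do
    tail -n +2 "$f" >> "$output"
  done
)
```

For example, `merge_csv_files combined.csv day1.csv day2.csv day3.csv` writes one file with a single header followed by all data rows.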

For example, the following table compares query runtimes between two tables, one backed by a single large file and one by 100,000 small files. Both tables contain approximately 8 GB of data, stored in text format.

Related article: Top 10 Performance Tuning Tips for Amazon Athena

Code or Algorithms using a Docker container

The IXUP Data Sandbox platform supports running Rootless Docker securely within CPU Linux Workspaces. Rootless Docker works by starting the Docker daemon as the current user, as opposed to root in traditional installations.

  • Prepare your Docker container locally and save as a .tar file (example below)

docker save <image name>:<tag> > <name>.tar

  • Upload it via SFTP or S3 copy.
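
The two steps above can be combined in a small script. This is a sketch under assumptions rather than an official IXUP procedure: the function name, the SFTP target and the remote folder are hypothetical placeholders to replace with the connection details of your own Data Vault drive. It uses `docker save -o`, which writes the archive to a file in the same way as the redirect shown above, and `sftp -b -`, which reads its commands from standard input.

```shell
# save_and_upload_image IMAGE TAG SFTP_TARGET REMOTE_DIR
# Hypothetical helper. SFTP_TARGET (e.g. user@host) and REMOTE_DIR are
# placeholders; use the values for your own Data Vault drive.
save_and_upload_image() (
  image=$1
  tag=$2
  target=$3
  remote_dir=$4

  # Build a file name such as my-image_1.0.tar ("/" in image names becomes "_").
  archive=$(printf '%s_%s.tar' "$image" "$tag" | tr '/' '_')

  # Export the image to a .tar archive; stop if Docker reports an error.
  docker save -o "$archive" "${image}:${tag}" || exit 1

  # Upload the archive in SFTP batch mode.
  printf 'put "%s" "%s/%s"\n' "$archive" "$remote_dir" "$archive" | sftp -b - "$target"
)
```

A call might look like `save_and_upload_image my-algorithm 1.0 user@sftp.example.com /files/containers`, where the host and folder are illustrative only.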

Uploading structured or unstructured data

Depending on the type of data, you will need to upload your files to one of two folders in your Data Vault drive:

  • files - to manage your unstructured data

  • databases - to manage your structured or semi-structured data in a table structure

/<organisation-data-vault-drive>
  /files
    /<any-folder-name>
      /<any-file-name>.<any-extension>
  /databases
    /<data-vault-database>
      /<data-vault-table>
        /<file-name>.<csv|parquet>
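
To make the layout concrete, the sketch below builds the destination path for a file from the folder structure shown above. `vault_destination` is a hypothetical helper, not part of the platform, and the drive, folder, database and table names passed to it are whatever exists in your own Data Vault drive.

```shell
# vault_destination files     DRIVE FOLDER   FILE
# vault_destination databases DRIVE DATABASE TABLE FILE
# Hypothetical helper: prints the destination path inside the drive.
vault_destination() (
  kind=$1
  drive=$2
  case "$kind" in
    files)
      # Unstructured data: any folder name, any file extension.
      printf '/%s/files/%s/%s\n' "$drive" "$3" "$(basename "$4")"
      ;;
    databases)
      # Structured or semi-structured data: only CSV or Parquet files.
      file=$(basename "$5")
      case "$file" in
        *.csv|*.parquet) ;;
        *) echo "databases expects a .csv or .parquet file: $file" >&2; exit 1 ;;
      esac
      printf '/%s/databases/%s/%s/%s\n' "$drive" "$3" "$4" "$file"
      ;;
    *)
      echo "kind must be 'files' or 'databases'" >&2
      exit 1
      ;;
  esac
)
```

For example, `vault_destination databases acme sales orders ./2024.parquet` prints `/acme/databases/sales/orders/2024.parquet`, where `acme`, `sales` and `orders` are illustrative names.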

Structured and unstructured files can be uploaded onto the IXUP Data Sandbox platform in two ways. The upload methods mentioned in this article are SFTP and S3 copy.

Deleting data

If you have loaded the wrong file into the IXUP Data Sandbox platform, or if you would like to remove a file, you can delete the file via SFTP connection.
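
A deletion over SFTP can be scripted in batch mode, for example as below. This is a sketch only: `delete_vault_file` is a hypothetical helper, and the SFTP target and remote path are placeholders for your own connection details and the file you want to remove.

```shell
# delete_vault_file SFTP_TARGET REMOTE_PATH
# Hypothetical helper. SFTP_TARGET (e.g. user@host) is a placeholder.
delete_vault_file() (
  target=$1
  remote_path=$2

  # Send a single "rm" command to sftp in batch mode ("-b -" reads
  # the commands from standard input).
  printf 'rm "%s"\n' "$remote_path" | sftp -b - "$target"
)
```

For example, `delete_vault_file user@sftp.example.com /files/reports/old-report.pdf` removes that one file; the host and path shown are illustrative.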
