What is a local data source?
A local data source refers to data stored and hosted locally by your organization. Data Republic currently supports AWS S3 as a local data source with the Customer Cloud Suite.
In this article you will learn how to:
- Create a local data source in Senate
- Connect to your local data source (AWS S3 bucket)
- Crawl metadata
Before you begin, make sure your organization has configured your AWS S3 bucket policy to enable Data Republic to connect to it.
Create a local data source in Senate
To link your existing data source(s) to a Project on Senate, you must first create a local data source in Senate to point to your S3 bucket(s).
- Within Senate, click Manage Data from the left navigation menu. On the Manage Data tab, click the +Add local data source button to create a new local data source.
- Give your data source a name and a description.
- Click the Add local data source button to submit.
Do I need to create a local data source for each Project?
No, you do not need to create a local data source for each Project.
Once you have connected your local data source to Senate, you can create a Data Package to select the files you want to make available in a Workspace (once a Data License is approved). The structured data files will appear as Tables in the Workspace for you to query.
Connect to your local S3 bucket
After you have applied the policy to your local S3 bucket(s), you are ready to connect to them:
- Enter the S3 bucket URL(s) to connect. If the connection is successful, you will see a green tick.
- Click +Add another S3 bucket URL if you wish to add multiple S3 URLs.
- Click Remove if you no longer wish to add the S3 URL.
- Check that all the information is correct before clicking Submit. After submitting, you will not be able to make any changes to this data source, i.e. the URL cannot be edited or removed.
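Because the URLs cannot be changed after submission, it can help to sanity-check them beforehand. The following is a minimal sketch only: it assumes the URLs use the `s3://bucket/prefix` form (the format Senate actually expects is not stated here), the bucket-name pattern is a simplified version of the AWS naming rules, and `parse_s3_url` is a hypothetical helper, not part of Senate.

```python
import re
from urllib.parse import urlparse

# Simplified bucket-name check (assumption): 3-63 characters, lowercase
# letters, digits, dots and hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def parse_s3_url(url):
    """Split an s3:// URL into (bucket, prefix), or raise ValueError."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {url!r}")
    bucket = parsed.netloc
    if not _BUCKET_RE.match(bucket):
        raise ValueError(f"bucket name looks invalid: {bucket!r}")
    # The prefix is everything after the bucket, without the leading slash.
    prefix = parsed.path.lstrip("/")
    return bucket, prefix


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for candidate in ["s3://my-bucket/data/sales/", "https://example.com/data"]:
        try:
            print(candidate, "->", parse_s3_url(candidate))
        except ValueError as err:
            print(candidate, "-> rejected:", err)
```

A local check like this only catches formatting mistakes. The green tick in Senate remains the confirmation that the bucket can actually be reached.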
What is Crawling?
Data Republic uses AWS Glue to crawl the data. Crawling the data retrieves the associated metadata (e.g. table definition and schema) and stores it in the AWS Glue Data Catalog. Once catalogued, your data is immediately searchable, queryable and available for ETL.
The crawler sits in Data Republic’s AWS account, as does the Data Catalog.
Note: If you update the schema without crawling the data again, queries against it will fail.
Note: The crawler is restricted to only pick up Parquet (preferred) and CSV files.
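To preview which objects in a bucket are likely to be picked up, the file-type restriction can be approximated locally. This is a sketch under assumptions: it matches on the file extension, case-insensitively, which may differ from how the crawler actually identifies Parquet and CSV files, and the function names and sample keys are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: files are recognised by extension. Only the two formats
# named in the note above are treated as crawlable.
CRAWLABLE_SUFFIXES = {".parquet", ".csv"}


def is_crawlable(key):
    """Return True if the object key ends in a supported extension."""
    return PurePosixPath(key).suffix.lower() in CRAWLABLE_SUFFIXES


def split_keys(keys):
    """Partition object keys into (crawlable, skipped) lists."""
    crawlable = [k for k in keys if is_crawlable(k)]
    skipped = [k for k in keys if not is_crawlable(k)]
    return crawlable, skipped


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    keys = [
        "sales/2020/part-0001.parquet",
        "sales/2020/summary.CSV",
        "sales/2020/readme.txt",
        "sales/2020/archive.csv.gz",
    ]
    crawlable, skipped = split_keys(keys)
    print("crawlable:", crawlable)
    print("skipped:", skipped)
```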
The three Crawler statuses are as follows:
- Ready - crawler is waiting to crawl metadata
- Running - crawler is active and currently crawling
- Stopping - crawler is coming to an end
To crawl the metadata:
1. Click Crawl Metadata. The crawler will start to crawl the metadata.
2. The crawler will run, and the status will automatically refresh every 5 seconds.
3. When the crawl has finished, you will see 'Succeed' displayed next to the last crawl.
4. Your local data source is now ready to be packaged for exchange.
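The status refresh described in the steps above can be pictured as a simple polling loop. The sketch below is illustrative only: `get_status` is a hypothetical callable standing in for whatever reports the crawler status (Senate does this refresh for you in the UI), and the rule "finished once the status returns to Ready after Running or Stopping" is an assumption based on the three statuses listed earlier.

```python
import time

POLL_INTERVAL_SECONDS = 5  # matches the refresh interval described above


def wait_for_crawl(get_status, interval=POLL_INTERVAL_SECONDS,
                   max_polls=120, sleep=time.sleep):
    """Poll get_status() until the crawler has run and returned to Ready.

    Returns the sequence of distinct statuses observed, in order.
    Raises TimeoutError if the crawl has not finished after max_polls.
    """
    seen = []
    started = False
    for _ in range(max_polls):
        status = get_status()
        # Record the status only when it changes.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in ("Running", "Stopping"):
            started = True
        elif status == "Ready" and started:
            return seen
        sleep(interval)
    raise TimeoutError("crawler did not finish within the polling limit")


if __name__ == "__main__":
    # Simulated status feed, for illustration only.
    feed = iter(["Ready", "Running", "Running", "Stopping", "Ready"])
    history = wait_for_crawl(lambda: next(feed), sleep=lambda seconds: None)
    print(" -> ".join(history))
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to elapse.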