How to Import JSON files From AWS S3 into BigQuery

Need to ingest a bunch of JSON files from S3 into popular data warehouses? See how we can do this with Sling. It even auto-creates the table DDL!

BigQuery & Your S3 Files

BigQuery is a cloud-based, fully managed, serverless data warehouse that enables you to analyze large and complex datasets using SQL queries. It is a scalable, highly performant, and cost-effective solution that can handle petabyte-scale data with ease. Since BigQuery is part of the Google Cloud Platform (GCP) and is built on top of Google's infrastructure, it is highly reliable and available.

However, suppose you have files that you wish to ingest into BigQuery, but they are located on AWS S3. In fact, they could be located on any S3-compatible system, such as AWS S3, DigitalOcean Spaces, Backblaze B2, Cloudflare R2, MinIO, or Wasabi. What to do? Manually transfer all those files to a Google Cloud Storage bucket? And what about the destination table that needs to be created? Will you manually define all the columns that are part of the JSON file? There is a better way: use Sling.

Sling CLI

Sling CLI is a versatile tool that enables you to transfer data quickly and efficiently between files and databases, as well as between databases. It is capable of reading files in the JSON format, among other file types. Please see here for the full list of compatible connectors.

Installation

Installing Sling is straightforward, regardless of the operating system you are using. This is because it is built with the Go programming language and compiles to a single binary file, making it easy to install and run on any system.

# On Mac
brew install slingdata-io/sling/sling

# On Windows Powershell
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# On Linux
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling

Please see here for additional installation options (such as downloading binaries). Additionally, there is a Python wrapper library available for Sling that allows you to interact with it from within a Python environment. This can be useful if you prefer to work with Python for data management tasks.
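
For reference, the Python wrapper is published on PyPI; a minimal install sketch:

# installs the sling Python package, which wraps the same CLI
pip install sling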

Once installed, you can use the sling command to access Sling's functionality from the command line.
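
To confirm the binary is available on your PATH, print the usage screen:

# prints the CLI usage and available sub-commands
sling --help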

Loading the data

Let us assume that we want to ingest the following JSON array file, which includes nested objects:

[
 {
   "_id": "638f4cabc6a57ef261c4d28c",
   "isActive": false,
   "balance": "3,560.61",
   "picture": "http://placehold.it/32x32",
   "age": 26,
   "name": "Winters Koch",
   "company": {
     "name": "EXOZENT",
     "email": "winterskoch@exozent.com",
     "phone": "+1 (808) 519-3916",
     "address": "744 Strickland Avenue, Mooresburg, Connecticut, 1850",
     "about": "Aute cupidatat sint incididunt ullamco sint velit consectetur consectetur nostrud eiusmod velit nostrud voluptate ipsum. Et velit officia excepteur excepteur eu nisi occaecat. Sunt culpa eu nostrud in sint occaecat nulla labore pariatur adipisicing ex irure voluptate qui.\r\n",
     "registered": "2020-07-12T09:54:18 +03:00",
     "latitude": 43.588661,
     "longitude": -136.979445
   },
   "tags": [
     "deserunt",
     "duis"
   ]
 },
...
]

Let us also assume that the file is located in our S3 bucket (at path s3://sling-bucket/records.json). Ingesting it is as easy as running the following commands.

# let's set our connections. We can set them with `sling conns`
$ sling conns set AWS_S3 type=s3 bucket=sling-bucket access_key_id=$ACCESS_KEY_ID secret_access_key="$SECRET_ACCESS_KEY"
10:14PM INF connection `AWS_S3` has been set in /Users/fritz/.sling/env.yaml. Please test with `sling conns test AWS_S3`

$ sling conns set BIGQUERY type=bigquery project=my-project dataset=public key_file=/path/to/service.account.json
10:14PM INF connection `BIGQUERY` has been set in /Users/fritz/.sling/env.yaml. Please test with `sling conns test BIGQUERY`
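
As an aside, the sling conns set command simply writes these properties to the ~/.sling/env.yaml file mentioned in the output above, so you can also define connections by editing that file directly. A sketch of the resulting file, mirroring the properties we passed (credential values elided):

connections:
  AWS_S3:
    type: s3
    bucket: sling-bucket
    access_key_id: ...
    secret_access_key: ...

  BIGQUERY:
    type: bigquery
    project: my-project
    dataset: public
    key_file: /path/to/service.account.json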

# let's check and test our connections
$ sling conns list
+------------+------------------+-----------------+
| CONN NAME  | CONN TYPE        | SOURCE          |
+------------+------------------+-----------------+
| AWS_S3     | FileSys - S3     | sling env yaml  |
| BIGQUERY   | DB - BigQuery    | sling env yaml  |
+------------+------------------+-----------------+

$ sling conns test AWS_S3
10:15PM INF success!

$ sling conns test BIGQUERY
10:15PM INF success!

# great, now we can run our task
$ sling run --src-conn AWS_S3 --src-stream s3://sling-bucket/records.json --tgt-conn BIGQUERY --tgt-object public.records --mode full-refresh
10:18PM INF connecting to target database (bigquery)
10:18PM INF reading from source file system (s3)
10:18PM INF writing to target database [mode: full-refresh]
10:18PM INF streaming data
10:18PM INF importing into bigquery via local storage
10:18PM INF dropped table public.records
10:18PM INF created table public.records
10:18PM INF inserted 500 rows in 9 secs
10:18PM INF execution succeeded

Cool, huh? We didn't even have to set anything up. If we wanted to ingest numerous files inside a folder (say s3://sling-bucket/path/to/my/folder), we could simply provide the folder path instead of the file path, and Sling would read all the files in the folder, as shown below.
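
A quick sketch of the folder variant, reusing the same flags as our single-file run:

# ingest every file under the folder prefix into the same table
sling run --src-conn AWS_S3 --src-stream s3://sling-bucket/path/to/my/folder/ --tgt-conn BIGQUERY --tgt-object public.records --mode full-refresh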

Flattening our JSON Records

To flatten a JSON file when using Sling, you can specify the flatten: true option in the source options. This causes Sling to create individual columns for each key in the records, rather than storing the data in a nested format. For more information on Sling's configuration options, please see the following documentation: docs.slingdata.io/sling-cli/configuration. To pass the flatten: true option, use the --src-options flag, as shown below. We'll also use the -d flag to turn on debug mode, so we can see the DDL being created.

$ sling run -d --src-conn AWS_S3 --src-stream s3://sling-bucket/records.json --src-options 'flatten: true' --tgt-conn BIGQUERY --tgt-object public.records --mode full-refresh
10:23PM INF connecting to target database (bigquery)
10:23PM INF reading from source file system (s3)
10:23PM DBG reading datastream from s3://sling-bucket/records.json
10:23PM INF writing to target database [mode: full-refresh]
10:23PM DBG drop table if exists public.records_tmp
10:23PM DBG table public.records_tmp dropped
10:23PM DBG create table public.records_tmp (`_id` string,
`age` int64,
`balance` string,
`company__about` string,
`company__address` string,
`company__email` string,
`company__latitude` numeric,
`company__longitude` numeric,
`company__name` string,
`company__phone` string,
`company__registered` string,
`isactive` bool,
`name` string,
`picture` string,
`tags` json,
`_sling_loaded_at` int64)
10:23PM INF streaming data
10:23PM INF importing into bigquery via local storage
10:23PM DBG Loading /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/bigquery/public.records_tmp/2022-12-07T222336.731/part.01.0001.csv.gz
10:23PM DBG select count(1) cnt from public.records_tmp
10:23PM DBG comparing checksums []string{"_id", "age", "balance", "company__about", "company__address", "company__email", "company__latitude", "company__longitude", "company__name", "company__phone", "company__registered", "isactive", "name", "picture", "tags", "_sling_loaded_at"} vs []string{"_id", "age", "balance", "company__about", "company__address", "company__email", "company__latitude", "company__longitude", "company__name", "company__phone", "company__registered", "isActive", "name", "picture", "tags", "_sling_loaded_at"}: []string{"_id", "age", "balance", "company__about", "company__address", "company__email", "company__latitude", "company__longitude", "company__name", "company__phone", "company__registered", "isactive", "name", "picture", "tags", "_sling_loaded_at"}
10:23PM DBG drop table if exists public.records
10:23PM DBG table public.records dropped
10:23PM INF dropped table public.records
10:23PM DBG create table public.records (`_id` string,
`age` int64,
`balance` string,
`company__about` string,
`company__address` string,
`company__email` string,
`company__latitude` numeric,
`company__longitude` numeric,
`company__name` string,
`company__phone` string,
`company__registered` string,
`isactive` bool,
`name` string,
`picture` string,
`tags` json,
`_sling_loaded_at` int64)
10:23PM INF created table public.records
10:23PM DBG insert into `public`.`records` (`_id`, `age`, `balance`, `company__about`, `company__address`, `company__email`, `company__latitude`, `company__longitude`, `company__name`, `company__phone`, `company__registered`, `isactive`, `name`, `picture`, `tags`, `_sling_loaded_at`) select `_id`, `age`, `balance`, `company__about`, `company__address`, `company__email`, `company__latitude`, `company__longitude`, `company__name`, `company__phone`, `company__registered`, `isactive`, `name`, `picture`, `tags`, `_sling_loaded_at` from `public`.`records_tmp`
10:23PM DBG inserted rows into `public.records` from temp table `public.records_tmp`
10:23PM INF inserted 500 rows in 25 secs
10:23PM DBG drop table if exists public.records_tmp
10:23PM DBG table public.records_tmp dropped
10:23PM INF execution succeeded

Wonderful, notice the DDL: each nested key has been flattened into its own column, with a double underscore joining the parent and child key names (e.g. company.name became company__name), while the tags array is kept as a json column:

create table public.records (`_id` string,
`age` int64,
`balance` string,
`company__about` string,
`company__address` string,
`company__email` string,
`company__latitude` numeric,
`company__longitude` numeric,
`company__name` string,
`company__phone` string,
`company__registered` string,
`isactive` bool,
`name` string,
`picture` string,
`tags` json,
`_sling_loaded_at` int64)
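
To spot-check the flattened columns, you can query the table directly. For instance, with Google's bq CLI (assuming the Cloud SDK is installed; the column selection is just illustrative):

# standard SQL query against the newly created table
bq query --use_legacy_sql=false \
  'select name, company__name, company__latitude from `my-project.public.records` limit 5'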

Conclusion

As demonstrated, Sling is compatible with a wide range of storage systems. You can ingest not only JSON files from cloud systems, but also CSV and XML files, as well as data from various database systems (such as PostgreSQL, Redshift, Oracle, SQL Server, Snowflake, and more). A database-to-database run uses the same familiar flags; here is a sketch with a hypothetical POSTGRES connection:
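
# POSTGRES is a hypothetical connection; you would create it first, e.g.
#   sling conns set POSTGRES type=postgres host=... user=... password=... database=...
sling run --src-conn POSTGRES --src-stream public.my_table --tgt-conn BIGQUERY --tgt-object public.my_table --mode full-refresh

Give it a spin! If you have any questions, comments, or issues, please feel free to email us at support @ slingdata.io.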