How to Quickly Import Local JSON files into PostgreSQL

Need to flatten and ingest your JSON file? See how we can load JSON files into databases from your local drive using a nifty tool called Sling.

The Most Versatile Database System

PostgreSQL was not always as popular as it is today. Over the years, the open-source database has steadily gained adoption and praise for being an extremely stable, versatile, and cost-efficient system. Many teams even use it as their de facto data warehouse. It can easily handle terabytes of data, which is enough for the vast majority of businesses out there.

JSON File Structure

JSON is a file format that grew in popularity with the rise of the internet. Because it is easy to read and can represent any desired tree structure, it is used across systems, especially as a means of delivering messages. It is also often used for storing log messages, so it makes sense to have advanced capabilities for reading JSON files.

Using Sling CLI

Sling CLI is a powerful tool which allows you to efficiently move data from files into databases, between databases, and from databases into files. One of the file types it can read is JSON. Please see here for the full list of compatible connectors.

Installation

Installing Sling is super-easy regardless of the operating system you are using. Since it is written in Go, it compiles to a single binary.

# On Mac
brew install slingdata-io/sling/sling

# On Windows Powershell
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# On Linux
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling

Please see here for additional installation options (such as downloading binaries). There is also a Python wrapper library, which is useful if you prefer interacting with Sling from Python. Once installed, we should be able to run the sling command.

Loading from Local Drive

Let us assume that we desire to ingest the following JSON array file, which includes nested objects:

[
 {
   "_id": "638f4bed04fa9d18613af369",
   "isActive": false,
   "balance": "1,916.14",
   "picture": "http://placehold.it/32x32",
   "age": 30,
   "name": "Ebony Jensen",
   "company": {
     "name": "ZAPHIRE",
     "email": "ebonyjensen@zaphire.com",
     "phone": "+1 (980) 571-3202",
     "address": "958 Elmwood Avenue, Gouglersville, Massachusetts, 9126",
     "about": "Sunt velit adipisicing dolore aliqua aliquip. Amet aliqua Lorem ea est laboris magna. Commodo aliqua sint cupidatat qui ut cupidatat nostrud sunt proident duis sunt commodo eu.\r\n",
     "registered": "2018-05-06T02:20:10 +03:00",
     "latitude": -1.126455,
     "longitude": -161.187586
   },
   "tags": [
     "minim",
     "commodo"
   ]
 },
...
]

If the file is located on our local hard drive (at the path /tmp/records.json), it's as easy as running the command below. If you'd like to ingest numerous files inside a folder (say /path/to/my/folder), you can provide the folder path instead of the file path and Sling will read all the files in it. Just make sure to add the file:// prefix. See below.

# first let's set our PG connection. Sling can pick up connection URLs from environment variables
$ export POSTGRES='postgresql://postgres:myPassword@pghost:5432/postgres?sslmode=disable'

# let's check and test our PG connection
$ sling conns list
+------------+------------------+-----------------+
| CONN NAME  | CONN TYPE        | SOURCE          |
+------------+------------------+-----------------+
| POSTGRES   | DB - PostgreSQL  | env variable    |
+------------+------------------+-----------------+

$ sling conns test POSTGRES
8:05AM INF success!

# great, now we can run our task
$ sling run --src-stream file:///tmp/records.json --tgt-conn POSTGRES --tgt-object public.records --mode full-refresh
11:09AM INF connecting to target database (postgres)
11:09AM INF reading from source file system (file)
11:09AM INF writing to target database [mode: full-refresh]
11:09AM INF streaming data
11:09AM INF dropped table public.records
11:09AM INF created table public.records
11:09AM INF inserted 500 rows in 0 secs [1,556 r/s]
11:09AM INF execution succeeded

Easy, huh? Let's do this again in debug mode (with the -d flag) to see the created table DDL, and this time let's pipe in the data with the cat command:

$ cat /tmp/records.json | sling run -d --tgt-conn POSTGRES --tgt-object public.records --mode full-refresh
11:10AM DBG type is file-db
11:10AM INF connecting to target database (postgres)
11:10AM INF reading from stream (stdin)
11:10AM DBG reading datastream from /tmp/records.json
11:10AM INF writing to target database [mode: full-refresh]
11:10AM DBG drop table if exists public.records_tmp cascade
11:10AM DBG table public.records_tmp dropped
11:10AM DBG create table if not exists public.records_tmp ("data" jsonb)
11:10AM INF streaming data
11:10AM DBG select count(1) cnt from public.records_tmp
11:10AM DBG drop table if exists public.records cascade
11:10AM DBG table public.records dropped
11:10AM INF dropped table public.records
11:10AM DBG create table if not exists public.records ("data" jsonb)
11:10AM INF created table public.records
11:10AM DBG insert into "public"."records" ("data") select "data" from "public"."records_tmp"
11:10AM DBG inserted rows into `public.records` from temp table `public.records_tmp`
11:10AM INF inserted 500 rows in 0 secs [1,778 r/s]
11:10AM DBG drop table if exists public.records_tmp cascade
11:10AM DBG table public.records_tmp dropped
11:10AM INF execution succeeded

We can see the DDL is create table if not exists public.records ("data" jsonb).
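Since all the records land in a single jsonb column named data, nested fields can be queried directly with PostgreSQL's JSON operators. As an illustration (a generic query of our own, not output from Sling), something like this would work:

```sql
-- ->> extracts a field as text; -> returns jsonb, so it can be
-- chained to reach nested objects like "company".
select
  data->>'name'             as name,
  (data->>'age')::int       as age,
  data->'company'->>'email' as company_email
from public.records
limit 5;
```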

Flattening our JSON Records

Now, let's flatten the same file as we ingest it. When flattening, Sling creates an individual column for each key in the records. We can do this by adding the --src-options 'flatten: true' flag. See here for all options:

$ cat /tmp/records.json | sling run -d --src-options 'flatten: true' --tgt-conn POSTGRES --tgt-object public.records --mode full-refresh
11:13AM DBG type is file-db
11:13AM INF connecting to target database (postgres)
11:13AM INF reading from stream (stdin)
11:13AM DBG reading datastream from /tmp/records.json
11:13AM INF writing to target database [mode: full-refresh]
11:13AM DBG drop table if exists public.records_tmp cascade
11:13AM DBG table public.records_tmp dropped
11:13AM DBG create table if not exists public.records_tmp ("_id" varchar(255),
"age" integer,
"balance" varchar(255),
"company__about" text,
"company__address" varchar(255),
"company__email" varchar(255),
"company__latitude" numeric,
"company__longitude" numeric,
"company__name" varchar(255),
"company__phone" varchar(255),
"company__registered" varchar(255),
"isactive" bool,
"name" varchar(255),
"picture" varchar(255),
"tags" jsonb)
11:13AM INF streaming data
11:13AM DBG select count(1) cnt from public.records_tmp
11:13AM DBG drop table if exists public.records cascade
11:13AM DBG table public.records dropped
11:13AM INF dropped table public.records
11:13AM DBG create table if not exists public.records ("_id" varchar(255),
"age" integer,
"balance" varchar(255),
"company__about" text,
"company__address" varchar(255),
"company__email" varchar(255),
"company__latitude" numeric,
"company__longitude" numeric,
"company__name" varchar(255),
"company__phone" varchar(255),
"company__registered" varchar(255),
"isactive" bool,
"name" varchar(255),
"picture" varchar(255),
"tags" jsonb)
11:13AM INF created table public.records
11:13AM DBG insert into "public"."records" ("_id", "age", "balance", "company__about", "company__address", "company__email", "company__latitude", "company__longitude", "company__name", "company__phone", "company__registered", "isactive", "name", "picture", "tags") select "_id", "age", "balance", "company__about", "company__address", "company__email", "company__latitude", "company__longitude", "company__name", "company__phone", "company__registered", "isactive", "name", "picture", "tags" from "public"."records_tmp"
11:13AM DBG inserted rows into `public.records` from temp table `public.records_tmp`
11:13AM INF inserted 500 rows in 0 secs [1,153 r/s]
11:13AM DBG drop table if exists public.records_tmp cascade
11:13AM DBG table public.records_tmp dropped
11:13AM INF execution succeeded

Nice, notice the DDL:

create table if not exists public.records ("_id" varchar(255),
"age" integer,
"balance" varchar(255),
"company__about" text,
"company__address" varchar(255),
"company__email" varchar(255),
"company__latitude" numeric,
"company__longitude" numeric,
"company__name" varchar(255),
"company__phone" varchar(255),
"company__registered" varchar(255),
"isactive" bool,
"name" varchar(255),
"picture" varchar(255),
"tags" jsonb)

Conclusion

As demonstrated, Sling CLI is an effective tool for quickly ingesting JSON data. You can ingest not only JSON files, but also CSV and XML files, as well as data from various database systems. If you have any questions or comments, or are facing issues, please feel free to email us at support @ slingdata.io.