Photo by Piotr Łaskawski on Unsplash
How does Sling deal with schema evolution?
See how Sling handles a drifting table schema
Table of contents
Introduction
Sling is the go-to tool to quickly extract and load data, whether you're handling CSV, Parquet, Avro or JSON files, moving data across databases, or even extracting custom queries, right from your command line. For more details, feel free to browse the docs. When it comes to a drifting table schema, Sling is no exception in detecting those changes.
What is schema evolution?
To put it simply, schema evolution in databases refers to the process of modifying the structure of a database schema over time to accommodate changes in requirements, business rules, or application enhancements. This is something that happens frequently, and needs to be handled carefully.
Slinging like a champ
When using Sling to extract/load data in a incremental
manner, it will attempt to match whatever columns are present in both the source stream and target table. If an extra column is present in the source stream, it will add it in the target table. If no columns match from source stream at all, it will error. At least the primary_key
or update_key
must be present in the target table.
See below for a simple example, mimicking the addition and removal of columns.
# Initial data
$ echo 'a,b,c
1,2,3
4,5,6' > test1.csv
$ sling run \
--src-stream file://./test1.csv \
--tgt-conn postgres \
--tgt-object public.test1
<...log output omitted>
$ sling run \
--src-conn postgres \
--src-stream public.test1 \
--stdout
a,b,c,_sling_loaded_at
1,2,3,1707869559
4,5,6,1707869559
# test2.csv is missing column b
echo 'a,c
7,8' > test2.csv
$ sling run \
--src-stream file://./test2.csv \
--tgt-conn postgres \
--tgt-object public.test1 \
--mode incremental \
--primary-key a
<...log output omitted>
$ sling run \
--src-conn postgres \
--src-stream public.test1 \
--stdout
a,b,c,_sling_loaded_at
1,2,3,1707869559
4,5,6,1707869559
7,,8,1707869689
# test3.csv is missing column b, c and has extra column d
$ echo 'a,d
9,10' > test3.csv
$ sling run \
--src-stream file://./test3.csv \
--tgt-conn postgres \
--tgt-object public.test1 \
--mode incremental \
--primary-key a
<...log output omitted>
$ sling run \
--src-conn postgres \
--src-stream public.test1 \
--stdout
a,b,c,_sling_loaded_at,d
1,2,3,1707869559,
4,5,6,1707869559,
7,,8,1707869689,
9,,,1707870320,10
We can see that sling handled the changes properly, in a non-destructive manner. If the source stream were from a database, the same rules would apply, whether a column disappeared or appeared.