Export Data From StarRocks into DuckDB
Use Sling to export data from StarRocks into DuckDB
Introduction
Let's look at how we can easily export data from StarRocks into a local DuckDB database with Sling, a flexible command-line interface (CLI) data integration tool that enables rapid extraction and loading of data directly from the terminal.
StarRocks
StarRocks is a powerful distributed, columnar storage database system designed for real-time analytics. It can scale to Petabytes and connects to systems like HDFS, Apache Spark, Apache Flink and Apache Kafka. It is backed by CelerData, a $60 million VC-funded startup, and aims to be an open-source replacement for Snowflake, BigQuery, and Redshift. This makes it suitable for many analytics use cases, such as business intelligence, ad hoc querying, real-time events and even machine learning / AI-driven data processing.
DuckDB
DuckDB on the other hand, is a lightweight database dubbed as the SQLite of data warehouses. It is a local database saved in a single file (like SQLite) and enables many kinds of analytical use cases right on your machine, since the data is stored in columnar fashion. This is especially convenient when one wants to join several tables and do local analytics without dealing with external latencies and workloads.
Export Data with Sling
As with have it, Sling can Read/Write from both of these databases. Let's go over the steps.
First, let's install Sling. See here to see how to do so. It is usually a simple command, such as:
# On Mac
brew install slingdata-io/sling/sling
# On Windows Powershell
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling
# On Linux
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
&& tar xf sling_linux_amd64.tar.gz \
&& rm -f sling_linux_amd64.tar.gz \
&& chmod +x sling
Next, we'll set up StarRocks. If you don't have a StarRocks instance running, you can quickly launch a development instance on your machine with docker and load data into it.
docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
Now we can create a connection for our StarRocks database with Sling and test connectivity. See here for more details on configuration.
$ sling conns set starrocks url="starrocks://root:@localhost:9030/sling"
10:01PM INF connection `starrocks` has been set in /Users/me/.sling/env.yaml. Please test with `sling conns test starrocks`
$ sling conns test starrocks
10:01PM INF success!
Now let's set up our DuckDB connection. Since it is an embedded database, we don't need to install anything. In fact, Sling will auto-download the binary to interface with DuckDB the first time. See here for more details on configuring DuckDB.
$ sling conns set duckdb url="duckdb:///tmp/sling/duck.db"
10:02PM INF connection `duckdb` has been set in /Users/me/.sling/env.yaml. Please test with `sling conns test duckdb`
$ sling conns test duckdb
10:02PM INF success!
Great, we're ready to export some data into DuckDB.
Replication
Let's assume you already have data in your StarRocks instance, in the database public
(see here on how to load data into StarRocks). If you'd like to export all the tables in that database, you'd create a replication like this:
# replication.yaml
source: starrocks
target: duckdb
defaults:
object: main.{stream_schema}_{stream_table}
mode: full-refresh
streams:
# all tables in schema `public`
public.*:
# only one table
salesforce.account:
object: main.salesforce_account
mode: full-refresh
We can run the replication like this:
sling run -r replication.yaml
Output
10:56PM INF Sling Replication [24 streams] | starrocks -> duckdb
10:56PM INF [1 / 24] running stream "public"."call_center"
10:56PM INF connecting to source database (starrocks)
10:56PM INF connecting to target database (duckdb)
10:56PM INF reading from source database
10:56PM INF writing to target database [mode: full-refresh]
10:56PM INF streaming data
10:56PM INF created table main
.public_call_center
10:56PM INF inserted 62347 rows into main
.public_call_center
in 10 secs [6234 r/s] [20.0 MB]
10:56PM INF execution succeeded
10:56PM INF [2 / 24] running stream "public"."catalog_page"
10:56PM INF connecting to source database (starrocks)
10:56PM INF connecting to target database (duckdb)
10:56PM INF reading from source database
10:56PM INF writing to target database [mode: full-refresh]
10:56PM INF streaming data
10:56PM INF created table main
.public_catalog_page
10:56PM INF inserted 11718 rows into main
.public_catalog_page
in 0 secs [14,126 r/s] [2.0 MB]
10:56PM INF execution succeeded
.......
11:15PM INF [24 / 24] running stream "salesforce"."account"
11:15PM INF connecting to source database (starrocks)
11:15PM INF connecting to target database (duckdb)
11:15PM INF reading from source database
11:15PM INF writing to target database [mode: full-refresh]
11:15PM INF streaming data
11:15PM INF created table main
.salesforce_account
11:15PM INF inserted 71654 rows into main
.salesforce_account
in 2 secs [30,073 r/s] [12 MB]
11:15PM INF execution succeeded
11:20PM INF Sling Replication Completed in 19m 34s | starrocks -> duckdb | 24 Successes | 0 Failures
So easy! See many more examples here: docs.slingdata.io/sling-cli/run/examples
Conclusion
We went over on how easy it was to install Sling, and export data from StarRocks into DuckDB. Feel free to check out other examples here: docs.slingdata.io.