Export Data From StarRocks into DuckDB

Use Sling to export data from StarRocks into DuckDB

Introduction

Let's look at how we can easily export data from StarRocks into a local DuckDB database with Sling, a flexible command-line interface (CLI) data integration tool that enables rapid extraction and loading of data directly from the terminal.

StarRocks

StarRocks is a powerful distributed, columnar storage database system designed for real-time analytics. It scales to petabytes and connects to systems like HDFS, Apache Spark, Apache Flink and Apache Kafka. It is backed by CelerData, a $60 million VC-funded startup, and aims to be an open-source replacement for Snowflake, BigQuery, and Redshift. This makes it suitable for many analytics use cases, such as business intelligence, ad hoc querying, real-time event analytics and even machine learning / AI-driven data processing.

DuckDB

DuckDB, on the other hand, is a lightweight database often dubbed the SQLite of data warehouses. Like SQLite, it is a local database saved in a single file, and since the data is stored in columnar fashion, it enables many kinds of analytical use cases right on your machine. This is especially convenient when you want to join several tables and run local analytics without dealing with network latency or remote workloads.

Export Data with Sling

As luck would have it, Sling can read from and write to both of these databases. Let's go over the steps.

First, let's install Sling. See here for installation instructions. It is usually a single command, such as:

# On Mac
brew install slingdata-io/sling/sling

# On Windows Powershell
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# On Linux
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling
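
Once installed, a quick way to sanity-check that the binary is on your PATH is to print the version (the exact output will vary by release):

sling --version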

Next, we'll set up StarRocks. If you don't have a StarRocks instance running, you can quickly launch a development instance on your machine with docker and load data into it.

docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
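
StarRocks speaks the MySQL wire protocol on port 9030, so any MySQL client can talk to it. As a quick sketch (assuming you have the mysql client installed), you could create the sling database that our connection URL below points at:

# create the `sling` database used in the connection URL in the next step
mysql -h 127.0.0.1 -P 9030 -u root -e "CREATE DATABASE IF NOT EXISTS sling;"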

Now we can create a connection for our StarRocks database with Sling and test connectivity. See here for more details on configuration.

$ sling conns set starrocks url="starrocks://root:@localhost:9030/sling"
10:01PM INF connection `starrocks` has been set in /Users/me/.sling/env.yaml. Please test with `sling conns test starrocks`

$ sling conns test starrocks
10:01PM INF success!

Now let's set up our DuckDB connection. Since it is an embedded database, we don't need to install anything; Sling will auto-download the DuckDB binary the first time it is needed. See here for more details on configuring DuckDB.

$ sling conns set duckdb url="duckdb:///tmp/sling/duck.db"
10:02PM INF connection `duckdb` has been set in /Users/me/.sling/env.yaml. Please test with `sling conns test duckdb`

$ sling conns test duckdb
10:02PM INF success!

Great, we're ready to export some data into DuckDB.
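
Before defining a replication, it can be handy to double-check what Sling sees on both sides. As a quick sketch, the conns sub-commands can list your configured connections and discover the streams (tables) available on the source:

# show all configured connections
sling conns list

# list the streams (tables) available in the StarRocks connection
sling conns discover starrocks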

Replication

Let's assume you already have data in your StarRocks instance, in the database public (see here for how to load data into StarRocks). If you'd like to export all the tables in that database, you'd create a replication like this:

# replication.yaml
source: starrocks
target: duckdb

defaults:
  object: main.{stream_schema}_{stream_table}
  mode: full-refresh

streams:
  # all tables in schema `public`
  public.*:

  # only one table
  salesforce.account:
    object: main.salesforce_account
    mode: full-refresh
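
As a side note, a one-off table copy doesn't strictly need a replication file; the same transfer can be expressed with ad-hoc run flags. Here's a sketch mirroring the salesforce.account stream above (adjust the stream and object names to your own tables):

# copy a single table from StarRocks into DuckDB without a YAML file
sling run \
  --src-conn starrocks \
  --src-stream salesforce.account \
  --tgt-conn duckdb \
  --tgt-object main.salesforce_account \
  --mode full-refresh

For our multi-table export, though, the replication file is the way to go.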

We can run the replication like this:

sling run -r replication.yaml

Output


10:56PM INF Sling Replication [24 streams] | starrocks -> duckdb

10:56PM INF [1 / 24] running stream "public"."call_center"
10:56PM INF connecting to source database (starrocks)
10:56PM INF connecting to target database (duckdb)
10:56PM INF reading from source database
10:56PM INF writing to target database [mode: full-refresh]
10:56PM INF streaming data
10:56PM INF created table main.public_call_center
10:56PM INF inserted 62347 rows into main.public_call_center in 10 secs [6234 r/s] [20.0 MB]
10:56PM INF execution succeeded

10:56PM INF [2 / 24] running stream "public"."catalog_page"
10:56PM INF connecting to source database (starrocks)
10:56PM INF connecting to target database (duckdb)
10:56PM INF reading from source database
10:56PM INF writing to target database [mode: full-refresh]
10:56PM INF streaming data
10:56PM INF created table main.public_catalog_page
10:56PM INF inserted 11718 rows into main.public_catalog_page in 0 secs [14,126 r/s] [2.0 MB]
10:56PM INF execution succeeded

.......

11:15PM INF [24 / 24] running stream "salesforce"."account"
11:15PM INF connecting to source database (starrocks)
11:15PM INF connecting to target database (duckdb)
11:15PM INF reading from source database
11:15PM INF writing to target database [mode: full-refresh]
11:15PM INF streaming data
11:15PM INF created table main.salesforce_account
11:15PM INF inserted 71654 rows into main.salesforce_account in 2 secs [30,073 r/s] [12 MB]
11:15PM INF execution succeeded

11:20PM INF Sling Replication Completed in 19m 34s | starrocks -> duckdb | 24 Successes | 0 Failures
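
To sanity-check the load, you can open the DuckDB file directly. Here's a quick sketch, assuming you have the DuckDB CLI installed (table names follow the main.{stream_schema}_{stream_table} pattern from the replication config):

# count the rows that landed in one of the exported tables
duckdb /tmp/sling/duck.db "SELECT count(*) FROM main.salesforce_account;"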

So easy! See many more examples here: docs.slingdata.io/sling-cli/run/examples

Conclusion

We went over how easy it is to install Sling and export data from StarRocks into DuckDB. Feel free to check out other examples here: docs.slingdata.io.