Reading Apache Iceberg Data with Sling
Leveraging DuckDB for Efficient Iceberg Table Processing
We're excited to announce that Sling now supports reading the Apache Iceberg format, bringing enhanced data lake management capabilities to our users. This addition opens up new possibilities for efficient and flexible data handling in large-scale environments.
What is Sling?
Sling aims to augment the exporting/loading data process into a positive and potentially enjoyable experience. It offers both CLI and YAML-based configurations for easy setup and management of data flows, by focusing on 3 data types interfaces:
From File Systems to Databases
From Databases to Databases
From Databases to File Systems
The list of connections that Sling supports continues to grow. You can see the full list here, but it supports all the major platforms including Clickhouse, DuckDB, Google BigQuery, Google BigTable, MariaDB, MongoDB, MotherDuck, MySQL, Oracle, PostgreSQL, Prometheus, Redshift, Snowflake, SQL Server, SQLite, StarRocks and Trino.
What is Apache Iceberg?
Apache Iceberg is an open table format for huge analytic datasets. It's designed to improve on the limitations of older table formats, offering better performance, reliability, and flexibility for data lakes.
Advantages of Using Iceberg
Schema Evolution: Iceberg allows for easy schema changes without the need for data migration.
Partition Evolution: You can change partition schemes without rewriting data.
Time Travel: Query data as it existed at a specific point in time.
ACID Transactions: Ensures data consistency and reliability.
Performance: Optimized for fast queries on large datasets.
Iceberg's Popularity
Iceberg has gained significant traction in the data engineering community. It's used by major companies like Netflix, Apple, and Adobe, and is supported by popular data processing tools like Spark, Flink, and Presto.
How Sling Uses DuckDB to Read Iceberg
Under the hood, Sling leverages DuckDB's powerful Iceberg integration to read and process Iceberg tables efficiently. DuckDB is working on adding support for writing to Iceberg tables, and we're excited to see what new features this will bring to the table. Here's a brief overview of how this works:
DuckDB Integration: Sling utilizes DuckDB's built-in Iceberg reader, which allows for direct querying of Iceberg tables without the need for additional dependencies.
Iceberg Scan Function: When reading an Iceberg table, Sling uses DuckDB's
iceberg_scan
function. This function is capable of reading Iceberg metadata and data files directly.Query Generation: Sling constructs a SQL query using the
iceberg_scan
function. For example:SELECT * FROM iceberg_scan('path/to/iceberg/table', allow_moved_paths = true)
The
allow_moved_paths
option is set to true to handle cases where data files might have been moved.Column Projection: When specific columns are requested, Sling modifies the query to select only those columns, optimizing read performance.
Type Mapping: Sling maps Iceberg/DuckDB types to its internal column types for consistent data handling across different sources.
Metadata Retrieval: Before executing the main query, Sling uses DuckDB to fetch table metadata, including schema information, by running a
DESCRIBE
query on the Iceberg scan.Streaming Results: Sling streams the results from DuckDB, allowing for efficient processing of large Iceberg tables without loading the entire dataset into memory.
This approach allows Sling to provide seamless support for Iceberg tables, leveraging DuckDB's optimized Iceberg reader while maintaining Sling's flexible and user-friendly interface.
Reading Iceberg with Sling CLI
To work with Iceberg files in Sling, you can use the following CLI flags:
# read a local iceberg table
sling run --src-stream file://path/to/table \
--src-options '{format: iceberg}' \
--stdout --limit 100
# read an iceberg table from aws s3
sling run --src-conn aws_s3 \
--src-stream path/to/table \
--src-options '{format: iceberg}' \
--stdout --limit 100
This command reads an Iceberg table located at path/to/table
and outputs the results to the console (limited to 100 rows).
# read a local iceberg table, write to bigquery
sling run --src-stream file://path/to/table \
--src-options '{format: iceberg}' \
--tgt-conn bigquery \
--tgt-object bq_schema.bq_table
This command reads an Iceberg table located at path/to/table
and writes it to our BigQuery connection.
Reading Iceberg in Replication YAML
You can also specify Iceberg as a format in your replication.yaml
file:
source: aws_s3
target: postgres
defaults:
mode: full-refresh
source_options:
format: iceberg
streams:
path/to/iceberg/table:
object: my_schema.iceberg_table
This configuration will read data from an Iceberg table stored in AWS S3 and load it into a PostgreSQL table named my_schema.iceberg_table
. The full-refresh
mode indicates that the target table will be completely replaced with the data from the source Iceberg table during each replication run. We can also use wildcards to read multiple Iceberg tables. See the variables docs for a complete list of variables.
source: aws_s3
target: postgres
defaults:
mode: full-refresh
source_options:
format: iceberg
streams:
path/to/iceberg_tables/*:
object: my_schema.{stream_file_name}
path/to/more_iceberg_tables/prefix*:
object: my_schema.{stream_file_name}
Running a replication is easily done with the sling run
command:
sling run -d -r replication.yaml
See docs here to get started with Sling!
Conclusion
Iceberg support in Sling, powered by DuckDB, offers a powerful and efficient way to work with large-scale data stored in the Iceberg format. By leveraging Sling's intuitive configuration and DuckDB's performance, users can easily integrate Iceberg tables into their data pipelines and analytics workflows.
As the Apache Iceberg ecosystem continues to grow and evolve, we anticipate expanding Sling's capabilities to include writing to Iceberg tables and supporting more advanced Iceberg features. This will further enhance Sling's position as a versatile tool for modern data engineering tasks.
We encourage users to explore the Iceberg integration in Sling and provide feedback. Your input is valuable in shaping the future development of this feature and ensuring it meets the diverse needs of the data community.