Reading Apache Iceberg Data with Sling

Leveraging DuckDB for Efficient Iceberg Table Processing

We're excited to announce that Sling now supports reading the Apache Iceberg format, bringing enhanced data lake management capabilities to our users. This addition opens up new possibilities for efficient and flexible data handling in large-scale environments.

What is Sling?

Sling aims to turn the process of exporting and loading data into a straightforward, even enjoyable, experience. It offers both CLI and YAML-based configurations for easy setup and management of data flows, focusing on three types of data interfaces:

  • From File Systems to Databases

  • From Databases to Databases

  • From Databases to File Systems

The list of connections that Sling supports continues to grow. You can see the full list here; it includes all the major platforms, such as ClickHouse, DuckDB, Google BigQuery, Google BigTable, MariaDB, MongoDB, MotherDuck, MySQL, Oracle, PostgreSQL, Prometheus, Redshift, Snowflake, SQL Server, SQLite, StarRocks, and Trino.

What is Apache Iceberg?

Apache Iceberg is an open table format for huge analytic datasets. It's designed to improve on the limitations of older table formats, offering better performance, reliability, and flexibility for data lakes.

Advantages of Using Iceberg

  1. Schema Evolution: Iceberg allows for easy schema changes without the need for data migration.

  2. Partition Evolution: You can change partition schemes without rewriting data.

  3. Time Travel: Query data as it existed at a specific point in time (see the snapshot example after this list).

  4. ACID Transactions: Ensures data consistency and reliability.

  5. Performance: Optimized for fast queries on large datasets.
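
To make the time-travel point above a bit more concrete, here is a small illustration using DuckDB's iceberg extension, the same reader Sling relies on (see the next sections). It lists a table's snapshot history, from which a specific point-in-time version can be picked; the table path is a placeholder.

-- install and load DuckDB's iceberg extension
INSTALL iceberg;
LOAD iceberg;

-- list the table's snapshots: each row is a point-in-time version
-- (snapshot id, timestamp, manifest list)
SELECT * FROM iceberg_snapshots('path/to/iceberg/table');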

Iceberg's Popularity

Iceberg has gained significant traction in the data engineering community. It's used by major companies like Netflix, Apple, and Adobe, and is supported by popular data processing tools like Spark, Flink, and Presto.

How Sling Uses DuckDB to Read Iceberg

Under the hood, Sling leverages DuckDB's powerful Iceberg integration to read and process Iceberg tables efficiently. DuckDB is working on adding support for writing to Iceberg tables, and we're excited to see what new features this will bring to the table. Here's a brief overview of how this works:

  1. DuckDB Integration: Sling utilizes DuckDB's built-in Iceberg reader, which allows for direct querying of Iceberg tables without the need for additional dependencies.

  2. Iceberg Scan Function: When reading an Iceberg table, Sling uses DuckDB's iceberg_scan function. This function is capable of reading Iceberg metadata and data files directly.

  3. Query Generation: Sling constructs a SQL query using the iceberg_scan function. For example:

     SELECT * FROM iceberg_scan('path/to/iceberg/table', allow_moved_paths = true)
    

    The allow_moved_paths option is set to true to handle cases where data files might have been moved.

  4. Column Projection: When specific columns are requested, Sling modifies the query to select only those columns, optimizing read performance (see the sketch after this list).

  5. Type Mapping: Sling maps Iceberg/DuckDB types to its internal column types for consistent data handling across different sources.

  6. Metadata Retrieval: Before executing the main query, Sling uses DuckDB to fetch table metadata, including schema information, by running a DESCRIBE query on the Iceberg scan.

  7. Streaming Results: Sling streams the results from DuckDB, allowing for efficient processing of large Iceberg tables without loading the entire dataset into memory.
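
To make steps 3, 4, and 6 a bit more concrete, the SQL that Sling hands to DuckDB looks roughly like the sketch below. The column names are hypothetical, and the exact queries Sling generates may differ.

-- metadata retrieval: DESCRIBE on the Iceberg scan returns the column names and types
DESCRIBE SELECT * FROM iceberg_scan('path/to/iceberg/table', allow_moved_paths = true);

-- column projection: only the requested columns are read
SELECT order_id, order_date, total_amount
FROM iceberg_scan('path/to/iceberg/table', allow_moved_paths = true);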

This approach allows Sling to provide seamless support for Iceberg tables, leveraging DuckDB's optimized Iceberg reader while maintaining Sling's flexible and user-friendly interface.

Reading Iceberg with Sling CLI

To work with Iceberg files in Sling, you can use the following CLI flags:

# read a local iceberg table
sling run --src-stream file://path/to/table \
  --src-options '{format: iceberg}' \
  --stdout --limit 100

# read an iceberg table from aws s3
sling run --src-conn aws_s3 \
  --src-stream path/to/table \
  --src-options '{format: iceberg}' \
  --stdout --limit 100

These commands read an Iceberg table located at path/to/table (from the local filesystem or from S3) and output the results to the console, limited to 100 rows.

# read a local iceberg table, write to bigquery
sling run --src-stream file://path/to/table \
  --src-options '{format: iceberg}' \
  --tgt-conn bigquery \
  --tgt-object bq_schema.bq_table

This command reads an Iceberg table located at path/to/table and writes it to the bq_schema.bq_table table through our BigQuery connection.

Reading Iceberg in Replication YAML

You can also specify Iceberg as a format in your replication.yaml file:

source: aws_s3
target: postgres

defaults:
  mode: full-refresh
  source_options:
    format: iceberg

streams:
  path/to/iceberg/table:
    object: my_schema.iceberg_table

This configuration will read data from an Iceberg table stored in AWS S3 and load it into a PostgreSQL table named my_schema.iceberg_table. The full-refresh mode means the target table is completely replaced with the data from the source Iceberg table on each replication run. We can also use wildcards to read multiple Iceberg tables, as shown below, with the {stream_file_name} runtime variable naming each target table after its source stream. See the variables docs for a complete list of runtime variables.

source: aws_s3
target: postgres

defaults:
  mode: full-refresh
  source_options:
    format: iceberg

streams:
  path/to/iceberg_tables/*:
    object: my_schema.{stream_file_name}

  path/to/more_iceberg_tables/prefix*:
    object: my_schema.{stream_file_name}

Running a replication is easily done with the sling run command:

sling run -d -r replication.yaml

See docs here to get started with Sling!

Conclusion

Iceberg support in Sling, powered by DuckDB, offers a powerful and efficient way to work with large-scale data stored in the Iceberg format. By leveraging Sling's intuitive configuration and DuckDB's performance, users can easily integrate Iceberg tables into their data pipelines and analytics workflows.

As the Apache Iceberg ecosystem continues to grow and evolve, we anticipate expanding Sling's capabilities to include writing to Iceberg tables and supporting more advanced Iceberg features. This will further enhance Sling's position as a versatile tool for modern data engineering tasks.

We encourage users to explore the Iceberg integration in Sling and provide feedback. Your input is valuable in shaping the future development of this feature and ensuring it meets the diverse needs of the data community.