<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Sling Data Blog]]></title><description><![CDATA[This is the blog for Slingdata.io, which covers many use-cases for efficiently moving data from one platform to another via ELT.]]></description><link>https://blog.slingdata.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1765735786800/d3ab959f-ca22-4bff-8bf8-59d1448abfd4.png</url><title>Sling Data Blog</title><link>https://blog.slingdata.io</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 03:05:37 GMT</lastBuildDate><atom:link href="https://blog.slingdata.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Extract data from Databases into DuckLake]]></title><description><![CDATA[Extract data from Databases into DuckLake
In the ever-evolving landscape of data engineering, the tools we use are constantly getting better, faster, and more efficient. DuckLake is one such innovation, building on the phenomenal success of DuckDB to...]]></description><link>https://blog.slingdata.io/extract-data-from-databases-into-ducklake</link><guid isPermaLink="true">https://blog.slingdata.io/extract-data-from-databases-into-ducklake</guid><category><![CDATA[ETL]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[duckDB]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Sat, 28 Jun 2025 09:30:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751104314209/75ec2d7d-662e-4151-9718-7bdba35f0cf2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-extract-data-from-databases-into-ducklake">Extract data from Databases into DuckLake</h1>
<p>In the ever-evolving landscape of data engineering, the tools we use are constantly getting better, faster, and more efficient. DuckLake is one such innovation, building on the phenomenal success of DuckDB to offer a robust, ACID-compliant data lake format. It's designed for scalability and flexibility, supporting various backends for both its catalog and data storage. </p>
<p>But how do you get your data <em>into</em> DuckLake? Whether your data lives in a production PostgreSQL database, a MySQL instance, or any other database, you need a simple and powerful way to extract and load it.</p>
<p>This is where Sling comes in. Sling is a modern data movement tool designed to make transferring data between different sources and destinations as easy as possible. In this article, we'll walk you through how to use Sling to extract data from most databases and load it directly into your DuckLake instance. The list of connections that Sling supports continues to grow. You can see the <a target="_blank" href="https://slingdata.io/en/connectors">full list</a> here, but it supports all the major platforms including <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/clickhouse">Clickhouse</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/duckdb">DuckDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigquery">Google BigQuery</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigtable">Google BigTable</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mariadb">MariaDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mongodb">MongoDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/motherduck">MotherDuck</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mysql">MySQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/oracle">Oracle</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/postgres">PostgreSQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/prometheus">Prometheus</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/redshift">Redshift</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/snowflake">Snowflake</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlserver">SQL Server</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlite">SQLite</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">StarRocks</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/databricks">Databricks</a> and more.</p>
<h3 id="heading-what-is-ducklake">What is DuckLake?</h3>
<p>Before we dive in, let's quickly recap what makes DuckLake special. DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables, bringing database-like reliability to your data lake. See DuckLake's website for more details: <a target="_blank" href="https://ducklake.select/">https://ducklake.select/</a>.</p>
<p>Key features include:</p>
<ul>
<li><strong>Flexible Catalog:</strong> Use DuckDB, SQLite, PostgreSQL, or MySQL as your catalog backend. </li>
<li><strong>Scalable Storage:</strong> Store your data files locally, or in cloud storage like AWS S3, Azure Blob Storage, or Google Cloud Storage. </li>
</ul>
<h3 id="heading-step-1-configure-your-connections">Step 1: Configure Your Connections</h3>
<p>First things first, we need to tell Sling how to connect to our source database and our target DuckLake instance. We'll use the <code>sling conns</code> command, which makes managing connections a breeze. </p>
<h4 id="heading-source-database-postgresql">Source Database: PostgreSQL</h4>
<p>Let's assume our source data is in a PostgreSQL database. See the complete list of databases sling can connect to <a target="_blank" href="https://docs.slingdata.io/connections/database-connections">here</a>.</p>
<p>We can set up a connection named <code>PG_CONN</code> like this:</p>
<pre><code class="lang-bash">sling conns <span class="hljs-built_in">set</span> PG_CONN <span class="hljs-built_in">type</span>=postgres host=mypg.host user=myuser password=mypass port=5432 database=analytics

<span class="hljs-comment"># or use environment variable</span>
<span class="hljs-built_in">export</span> PG_CONN=<span class="hljs-string">'postgresql://myuser:mypass@mypg.host:5432/analytics?sslmode=require'</span>
</code></pre>
<p>You can then run <code>sling conns test pg_conn</code> to ensure it can successfully connect.</p>
<h4 id="heading-target-ducklake">Target: DuckLake</h4>
<p>Configuring DuckLake involves specifying the catalog and the data storage path. See the documentation for details <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/ducklake">here</a>.</p>
<p>For this example, we'll use a local SQLite file for the catalog and a local directory for the data. This setup is great for multi-client access on the same machine.</p>
<p>Here's how to set up a connection named <code>DUCKLAKE_CONN</code>:</p>
<pre><code class="lang-bash">sling conns <span class="hljs-built_in">set</span> DUCKLAKE_CONN <span class="hljs-built_in">type</span>=ducklake catalog_type=sqlite catalog_conn_string=ducklake_catalog.db data_path=./ducklake_data

<span class="hljs-comment"># or use environment variable</span>
<span class="hljs-built_in">export</span> DUCKLAKE_CONN=<span class="hljs-string">'{ 
  type: ducklake, 
  catalog_type: sqlite,
  catalog_conn_string: "ducklake_catalog.db",
  data_path: "./ducklake_data"
}'</span>
</code></pre>
<p>After setting these, you can verify they're configured correctly by running <code>sling conns list</code>.</p>
<p>If you used the <code>sling conns set</code> command, your <a target="_blank" href="https://docs.slingdata.io/sling-cli/environment#sling-env-file-env.yaml"><code>~/.sling/env.yaml</code></a> file should now contain these configurations:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">connections:</span>
  <span class="hljs-attr">PG_CONN:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">host:</span> <span class="hljs-string">mypg.host</span>
    <span class="hljs-attr">user:</span> <span class="hljs-string">myuser</span>
    <span class="hljs-attr">password:</span> <span class="hljs-string">mypass</span>
    <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
    <span class="hljs-attr">database:</span> <span class="hljs-string">analytics</span>

  <span class="hljs-attr">DUCKLAKE_CONN:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">ducklake</span>
    <span class="hljs-attr">catalog_type:</span> <span class="hljs-string">sqlite</span>
    <span class="hljs-attr">catalog_conn_string:</span> <span class="hljs-string">ducklake_catalog.db</span>
    <span class="hljs-attr">data_path:</span> <span class="hljs-string">./ducklake_data</span>
</code></pre>
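<p>Before moving on, it's worth testing the new DuckLake connection the same way:</p>
<pre><code class="lang-bash">sling conns test DUCKLAKE_CONN
</code></pre>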
<h3 id="heading-step-2-create-the-replication-yaml">Step 2: Create the Replication YAML</h3>
<p>Now for the fun part. We'll define our data movement task in a simple YAML file. Sling's replication configs are powerful because you can define defaults and replicate many streams (e.g., tables) at once. See <a target="_blank" href="https://docs.slingdata.io/concepts/replication">here</a> for full documentation.</p>
<p>Let's create a file named <code>db_to_ducklake.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">source:</span> <span class="hljs-string">pg_conn</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">ducklake_conn</span>

<span class="hljs-attr">defaults:</span>
  <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>
  <span class="hljs-attr">object:</span> <span class="hljs-string">'{stream_schema}.{stream_table}'</span> <span class="hljs-comment"># Dynamically name tables in DuckLake</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-comment"># Replicate all tables from the 'public' schema</span>
  <span class="hljs-string">public.*:</span>

  <span class="hljs-comment"># You can also add specific tables or disable some</span>
  <span class="hljs-attr">public.forbidden:</span>
    <span class="hljs-attr">disabled:</span> <span class="hljs-literal">true</span>

  <span class="hljs-attr">analytics.users:</span>
</code></pre>
<p>A few things to note here:</p>
<ul>
<li><code>source</code> and <code>target</code> refer to the connection names we just set up.</li>
<li><code>defaults</code> applies the <code>full-refresh</code> mode to all our streams. This means the target tables in DuckLake will be dropped and recreated on each run.</li>
<li>The <code>object</code> name is the target table, and uses runtime variables <code>{stream_schema}</code> and <code>{stream_table}</code>. Sling will dynamically replace these with the actual schema and table names from the source.</li>
<li>The real power move is <code>public.*</code>. This single line tells Sling to find all tables in the <code>public</code> schema of our PostgreSQL database and replicate every single one of them. (If you'd rather start small, see the sketch after this list for running only a subset of streams.)</li>
</ul>
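<p>Before replicating everything, you can do a trial run on a subset: the Sling CLI accepts a <code>--streams</code> flag to select specific streams from the config at runtime. A quick sketch (the stream names here are illustrative):</p>
<pre><code class="lang-bash"># run only the selected streams defined in the replication config
sling run -r db_to_ducklake.yaml --streams public.users,analytics.users
</code></pre>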
<h3 id="heading-step-3-run-the-replication">Step 3: Run the Replication</h3>
<p>With our connections and replication file ready, all that's left is to run it:</p>
<pre><code class="lang-bash">sling run -r db_to_ducklake.yaml
</code></pre>
<p>That's it! Sling will connect to your PostgreSQL database, read the tables from the <code>public</code> and <code>analytics</code> schemas, and write the data into your DuckLake instance, creating the tables and structuring the data files in the <code>./ducklake_data</code> directory.</p>
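<p>To sanity-check the load, you can query the new tables directly from DuckDB. Here's a minimal sketch, assuming you have the DuckDB CLI with the <code>ducklake</code> extension available, and using the catalog and data path from our example:</p>
<pre><code class="lang-bash"># attach the DuckLake catalog and count rows in a replicated table
duckdb -c "
INSTALL ducklake; LOAD ducklake;
ATTACH 'ducklake:sqlite:ducklake_catalog.db' AS lake (DATA_PATH './ducklake_data');
SELECT count(*) FROM lake.analytics.users;
"
</code></pre>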
<h3 id="heading-going-further-incremental-loads-and-cloud-storage">Going Further: Incremental Loads and Cloud Storage</h3>
<p>This example is just the beginning. You can easily adapt this for more advanced use cases:</p>
<ul>
<li><strong>Incremental Loads:</strong> Change the <code>mode</code> to <code>incremental</code> and specify a <code>primary_key</code> and/or <code>update_key</code> to only process new or updated records, making your pipelines much more efficient.</li>
<li><strong>Cloud Storage:</strong> To use cloud storage for your data, simply update your DuckLake connection's <code>data_path</code> to an S3, GCS, or Azure URI and provide the necessary credentials.</li>
</ul>
<p>For example, to use an S3 bucket, your DuckLake connection in <code>env.yaml</code> might look like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">connections:</span>
  <span class="hljs-attr">DUCKLAKE_CONN_S3:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">ducklake</span>
    <span class="hljs-attr">catalog_type:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">catalog_conn_string:</span> <span class="hljs-string">"host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog"</span>
    <span class="hljs-attr">data_path:</span> <span class="hljs-string">"s3://my-data-lake-bucket/data/"</span>
    <span class="hljs-attr">s3_access_key_id:</span> <span class="hljs-string">"AKIA..."</span>
    <span class="hljs-attr">s3_secret_access_key:</span> <span class="hljs-string">"xxxx"</span>
</code></pre>
<p>And here's a replication with <a target="_blank" href="https://docs.slingdata.io/concepts/replication/modes">incremental mode</a> and custom SQL:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">source:</span> <span class="hljs-string">pg_conn</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">ducklake_conn</span>

<span class="hljs-attr">streams:</span>

  <span class="hljs-attr">analytics.users:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">'{stream_schema}.{stream_table}'</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">incremental</span>
    <span class="hljs-attr">primary_key:</span> <span class="hljs-string">id</span>
    <span class="hljs-attr">update_key:</span> <span class="hljs-string">updated_at</span>

  <span class="hljs-attr">custom_stream:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">another_schema.report</span>
    <span class="hljs-attr">sql:</span> <span class="hljs-string">|
      select ...
      from ...</span>
</code></pre>
<h3 id="heading-conclusion">Conclusion</h3>
<p>DuckLake brings exciting new capabilities to the world of data lakes, and with Sling, populating it from your existing databases is incredibly straightforward. With just a few lines of configuration, you can build scalable, repeatable, and robust data pipelines to feed your modern data stack.</p>
<p>Ready to give it a try? <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">Install the Sling CLI</a> and check out the <a target="_blank" href="https://docs.slingdata.io/">official documentation</a> to get started.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing the Sling Data Platform]]></title><description><![CDATA[The modern data landscape is complex and challenging. Organizations need to move data between various sources and destinations, transform it along the way, and ensure everything runs smoothly in production. Setting up data pipelines traditionally inv...]]></description><link>https://blog.slingdata.io/introducing-the-sling-data-platform</link><guid isPermaLink="true">https://blog.slingdata.io/introducing-the-sling-data-platform</guid><category><![CDATA[ETL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Mon, 02 Dec 2024 10:18:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733134280016/38fd9214-3f32-4c06-9c30-ef6584f92ff5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The modern data landscape is complex and challenging. Organizations need to move data between various sources and destinations, transform it along the way, and ensure everything runs smoothly in production. Setting up data pipelines traditionally involves dealing with multiple tools, complex configurations, and ongoing maintenance headaches.</p>
<p>This is where Sling comes in. As a modern data movement and transformation platform, Sling dramatically simplifies the entire process of setting up and managing data pipelines. Whether you're moving data between databases, synchronizing data warehouses, or transforming data during transfer, Sling provides an elegant solution that works right out of the box.</p>
<h2 id="heading-the-data-pipeline-challenge">The Data Pipeline Challenge</h2>
<p>Building data pipelines traditionally involves numerous challenges:</p>
<ul>
<li><p>Complex setup procedures requiring extensive configuration</p>
</li>
<li><p>Managing multiple tools and technologies</p>
</li>
<li><p>Ensuring data consistency and reliability</p>
</li>
<li><p>Handling different data formats and schemas</p>
</li>
<li><p>Monitoring and maintaining pipelines in production</p>
</li>
<li><p>Scaling operations as data volumes grow</p>
</li>
</ul>
<p>These challenges often lead to increased development time, higher maintenance costs, and reliability issues. Teams spend more time troubleshooting infrastructure than focusing on valuable data insights.</p>
<h2 id="heading-enter-sling-a-modern-solution">Enter Sling: A Modern Solution</h2>
<p>Sling addresses these challenges head-on by providing:</p>
<ul>
<li><p>A unified platform for all database/file system movement needs</p>
</li>
<li><p>Simple, intuitive interfaces through both CLI and UI</p>
</li>
<li><p>Built-in support for numerous databases and storage systems</p>
</li>
<li><p>Automatic schema handling and data type mapping</p>
</li>
<li><p>Production-ready features like monitoring and scheduling</p>
</li>
<li><p>Scalable architecture that grows with your needs</p>
</li>
</ul>
<p>Let's dive deeper into the Sling platform and discover how it can transform your data operations.</p>
<h2 id="heading-understanding-sling-data-platform">Understanding Sling Data Platform</h2>
<p>Sling is a comprehensive data movement and transformation platform designed with modern data needs in mind. At its core, Sling combines powerful functionality with user-friendly interfaces, making it accessible to both developers and data teams.</p>
<h3 id="heading-key-benefits">Key Benefits</h3>
<ol>
<li><p><strong>Simplified Setup</strong></p>
<ul>
<li><p>Simple configuration for many common scenarios</p>
</li>
<li><p>Intuitive YAML-based configuration for complex cases</p>
</li>
<li><p>Visual interface for pipeline creation and management</p>
</li>
</ul>
</li>
<li><p><strong>Reduced Development Time</strong></p>
<ul>
<li><p>Pre-built connectors for popular databases and storage systems</p>
</li>
<li><p>Automated schema handling and type mapping</p>
</li>
<li><p>Built-in transformation capabilities</p>
</li>
</ul>
</li>
<li><p><strong>Enhanced Reliability</strong></p>
<ul>
<li><p>Robust error handling and retry mechanisms</p>
</li>
<li><p>Comprehensive logging and monitoring</p>
</li>
<li><p>Production-grade performance</p>
</li>
</ul>
</li>
<li><p><strong>Scalable Operations</strong></p>
<ul>
<li><p>Distributed agent architecture</p>
</li>
<li><p>Efficient resource utilization</p>
</li>
</ul>
</li>
</ol>
<p>Let's explore the main components that make up the Sling platform and see how they work together to provide a seamless data movement experience.</p>
<h2 id="heading-platform-architecture">Platform Architecture</h2>
<p>Sling's architecture consists of three main components that work together seamlessly: the Platform UI, the Control Server, and the Agents.</p>
<p><img src="https://docs.slingdata.io/~gitbook/image?url=https%3A%2F%2F3453272330-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252F-M93cpHl7B7NPZlDrubS%252Fuploads%252Fgit-blob-c5549131267f785193993ab34df9d77ebfe83a9c%252Fsling-platform-architecture.png%3Falt%3Dmedia&amp;width=768&amp;dpr=4&amp;quality=100&amp;sign=2468d9&amp;sv=1" alt /></p>
<h3 id="heading-sling-platform-ui">Sling Platform UI</h3>
<p>The Sling Platform provides a modern, intuitive web interface for managing your data operations at scale. It's designed for teams who need a centralized way to manage, monitor, and collaborate on data movement tasks.</p>
<p><img src="https://slingdata.test/_image?href=%2F%40fs%2FUsers%2Ffritz%2F__%2FGit%2Fsling-website%2Fsrc%2Fassets%2Fimages%2Fscreenshots%2Fui.editor.dark.png%3ForigWidth%3D1906%26origHeight%3D1482%26origFormat%3Dpng&amp;f=webp" alt="best value" /></p>
<h4 id="heading-using-the-platform">Using the Platform</h4>
<p>The Platform UI makes it easy to:</p>
<ol>
<li><p><strong>Manage Connections</strong></p>
<ul>
<li><p>Create and test database connections</p>
</li>
<li><p>Store credentials securely</p>
</li>
<li><p>Share connections with team members</p>
</li>
<li><p>Monitor connection health</p>
</li>
</ul>
</li>
</ol>
<p><img src="https://slingdata.test/_image?href=%2F%40fs%2FUsers%2Ffritz%2F__%2FGit%2Fsling-website%2Fsrc%2Fassets%2Fimages%2Fscreenshots%2Fui.connections.dark.png%3ForigWidth%3D1864%26origHeight%3D1352%26origFormat%3Dpng&amp;h=700&amp;f=webp" alt="Explore your Data" /></p>
<ol start="2">
<li><p><strong>Design Replications</strong></p>
<ul>
<li><p>Real-time validation and feedback via Editor (IDE)</p>
</li>
<li><p>Create new replications visually</p>
</li>
<li><p>Configure source and target settings</p>
</li>
<li><p>Set up transformations</p>
</li>
<li><p>Define scheduling and triggers</p>
</li>
</ul>
</li>
<li><p><strong>Monitor Operations</strong></p>
<ul>
<li><p>Track replication status</p>
</li>
<li><p>View detailed execution logs</p>
</li>
<li><p>Analyze performance metrics</p>
</li>
<li><p>Set up alerts and notifications</p>
</li>
</ul>
</li>
</ol>
<p><img src="https://slingdata.test/_image?href=%2F%40fs%2FUsers%2Ffritz%2F__%2FGit%2Fsling-website%2Fsrc%2Fassets%2Fimages%2Fscreenshots%2Fui.history.dark.png%3ForigWidth%3D2112%26origHeight%3D1174%26origFormat%3Dpng&amp;h=700&amp;f=webp" alt="See Historical Logs" /></p>
<h3 id="heading-sling-platform-agents">Sling Platform Agents</h3>
<p>Sling Agents are the workers that execute your data operations. They can be deployed anywhere in your infrastructure, providing flexibility and security.</p>
<p><img src="https://slingdata.test/_image?href=%2F%40fs%2FUsers%2Ffritz%2F__%2FGit%2Fsling-website%2Fsrc%2Fassets%2Fimages%2Fscreenshots%2Fui.agent.dark.png%3ForigWidth%3D1182%26origHeight%3D1040%26origFormat%3Dpng&amp;h=700&amp;f=webp" alt="Manage Agents" /></p>
<h4 id="heading-key-features">Key Features</h4>
<ul>
<li><p><strong>Flexible Deployment</strong></p>
<ul>
<li><p>Run in your own infrastructure</p>
</li>
<li><p>Secure access to data sources</p>
</li>
</ul>
</li>
<li><p><strong>Smart Resource Management</strong></p>
<ul>
<li><p>Concurrent streams handling</p>
</li>
<li><p>Efficient memory utilization</p>
</li>
</ul>
</li>
<li><p><strong>Security First</strong></p>
<ul>
<li><p>Encrypted communication</p>
</li>
<li><p>No inbound connections required</p>
</li>
<li><p>Credential isolation</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-deployment-options">Deployment Options</h4>
<p>Agents can be deployed in various ways:</p>
<ol>
<li><p><strong>Local Development</strong></p>
<ul>
<li><p>Run alongside CLI for testing</p>
</li>
<li><p>Quick setup and configuration</p>
</li>
<li><p>Direct debugging capabilities</p>
</li>
</ul>
</li>
<li><p><strong>Production Environment</strong></p>
<ul>
<li><p>Container-based deployment</p>
</li>
<li><p>BYOC or Cloud Hosting</p>
</li>
<li><p>Resource optimization</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-getting-started-with-sling-platform">Getting Started with Sling Platform</h2>
<ol>
<li><p><strong>Sign Up</strong></p>
<ul>
<li><p>Visit <a target="_blank" href="https://platform.slingdata.io">platform.slingdata.io</a></p>
</li>
<li><p>Create your account</p>
</li>
<li><p>Set up your project</p>
</li>
</ul>
</li>
<li><p><strong>Deploy an Agent</strong></p>
<ul>
<li><p>Install the agent in your environment</p>
</li>
<li><p>Configure connection to platform</p>
</li>
<li><p>Test connectivity</p>
</li>
</ul>
</li>
<li><p><strong>Create Connections</strong></p>
<ul>
<li><p>Add your data sources</p>
</li>
<li><p>Configure credentials</p>
</li>
<li><p>Test connections</p>
</li>
</ul>
</li>
<li><p><strong>Create Your First Pipeline</strong></p>
<ul>
<li><p>Use the editor</p>
</li>
<li><p>Create a new Replication</p>
</li>
<li><p>Create a Job and test it</p>
</li>
<li><p>Deploy with a schedule and monitor</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-next-steps">Next Steps</h3>
<p>To learn more about Sling's capabilities:</p>
<p><strong>Explore Documentation</strong></p>
<ul>
<li><p><a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">CLI Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://docs.slingdata.io/sling-platform/getting-started">Platform Guide</a></p>
</li>
<li><p><a target="_blank" href="https://slingdata.io/connectors/">Connection Types</a></p>
</li>
</ul>
<p>Start small with simple replications and gradually explore more advanced features as you become comfortable with the platform. Sling's flexibility means you can grow your usage alongside your data needs.</p>
]]></content:encoded></item><item><title><![CDATA[Efficient Data Lake Management with Sling and Delta Lake]]></title><description><![CDATA[Unlocking Delta Lake Insights with Sling: Efficient Read-Only Access
In the ever-evolving landscape of big data, Delta Lake has emerged as a powerful open-source storage layer that brings reliability and performance to data lakes. Today, we're thrill...]]></description><link>https://blog.slingdata.io/efficient-data-lake-management-with-sling-and-delta-lake</link><guid isPermaLink="true">https://blog.slingdata.io/efficient-data-lake-management-with-sling-and-delta-lake</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[deltalake]]></category><category><![CDATA[ETL]]></category><category><![CDATA[data integration]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Sat, 07 Sep 2024 10:04:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1725703147259/f7af828c-809d-41f2-8199-f80ef4f10531.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-unlocking-delta-lake-insights-with-sling-efficient-read-only-access">Unlocking Delta Lake Insights with Sling: Efficient Read-Only Access</h1>
<p>In the ever-evolving landscape of big data, Delta Lake has emerged as a powerful open-source storage layer that brings reliability and performance to data lakes. Today, we're thrilled to announce that Sling, our versatile data integration tool, now supports reading Delta Lake format, opening up new avenues for data engineers and analysts to harness the power of Delta tables.</p>
<h2 id="heading-what-is-sling">What is Sling?</h2>
<p><a target="_blank" href="https://slingdata.io">Sling</a> aims to augment the exporting/loading data process into a positive and potentially enjoyable experience. It offers both CLI and YAML-based configurations for easy setup and management of data flows, by focusing on 3 data types interfaces:</p>
<ul>
<li><p>From File Systems to Databases</p>
</li>
<li><p>From Databases to Databases</p>
</li>
<li><p>From Databases to File Systems</p>
</li>
</ul>
<p>The list of connections that Sling supports continues to grow. You can see the <a target="_blank" href="https://slingdata.io/en/connectors">full list</a> here, but it supports all the major platforms including <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/clickhouse">Clickhouse</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/duckdb">DuckDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigquery">Google BigQuery</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigtable">Google BigTable</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mariadb">MariaDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mongodb">MongoDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/motherduck">MotherDuck</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mysql">MySQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/oracle">Oracle</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/postgres">PostgreSQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/prometheus">Prometheus</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/redshift">Redshift</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/snowflake">Snowflake</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlserver">SQL Server</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlite">SQLite</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">StarRocks</a> and <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/trino">Trino</a>.</p>
<h2 id="heading-delta-lake-a-game-changer-in-data-lakes">Delta Lake: A Game-Changer in Data Lakes</h2>
<p>Delta Lake, developed by Databricks, addresses many of the challenges faced by traditional data lakes. It introduces ACID transactions, scalable metadata handling, and time travel capabilities to big data workloads. These features make Delta Lake an attractive choice for organizations dealing with large-scale data processing and analytics.</p>
<p>Key benefits of Delta Lake include:</p>
<ol>
<li><strong>ACID Transactions</strong>: Ensures data consistency even with concurrent reads and writes.</li>
<li><strong>Schema Evolution and Enforcement</strong>: Allows for easy schema changes and maintains data quality.</li>
<li><strong>Time Travel</strong>: Enables querying data as it existed at a specific point in time.</li>
<li><strong>Unified Batch and Streaming</strong>: Seamlessly handles both batch and real-time data processing.</li>
<li><strong>Optimized Performance</strong>: Leverages various optimizations for faster queries on large datasets.</li>
</ol>
<h2 id="heading-slings-delta-lake-integration-read-only-power">Sling's Delta Lake Integration: Read-Only Power</h2>
<p>Sling now offers robust support for reading Delta Lake tables, leveraging the power of DuckDB under the hood. This integration allows users to easily incorporate Delta Lake data into their existing data pipelines and analytics workflows.</p>
<p>It's important to note that Sling's current implementation is read-only. While you can't write or modify Delta tables using Sling, you can efficiently extract data from Delta Lake for further processing or analysis.</p>
<h3 id="heading-how-sling-reads-delta-tables">How Sling Reads Delta Tables</h3>
<p>Sling utilizes DuckDB's Delta Lake reader to efficiently process Delta tables. Here's a brief overview of how it works (a runnable sketch follows the list):</p>
<ol>
<li><strong>DuckDB Integration</strong>: Sling uses DuckDB's built-in Delta reader, allowing for direct querying of Delta tables without additional dependencies.</li>
<li><strong>Delta Scan Function</strong>: Sling leverages DuckDB's <code>delta_scan</code> function to read Delta metadata and data files.</li>
<li><strong>Query Optimization</strong>: Sling constructs optimized SQL queries using the <code>delta_scan</code> function, ensuring efficient data retrieval.</li>
<li><strong>Streaming Results</strong>: Results are streamed from DuckDB, enabling efficient processing of large Delta tables without loading the entire dataset into memory.</li>
</ol>
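<p>To make this concrete, here's roughly the kind of query you could run yourself in the DuckDB CLI (a minimal sketch; it assumes the <code>delta</code> extension is available, and the table path and column names are illustrative):</p>
<pre><code class="lang-bash"># query a Delta table directly with DuckDB's delta_scan function
duckdb -c "
INSTALL delta; LOAD delta;
SELECT order_id, amount
FROM delta_scan('./path/to/delta/table')
LIMIT 100;
"
</code></pre>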
<h2 id="heading-using-sling-with-delta-lake-practical-examples">Using Sling with Delta Lake: Practical Examples</h2>
<p>Let's explore how you can use Sling to read Delta Lake tables in various scenarios.</p>
<h3 id="heading-reading-delta-tables-with-sling-cli">Reading Delta Tables with Sling CLI</h3>
<p>To read Delta Lake files using Sling's command-line interface, you can use the following commands:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Read a local Delta table</span>
sling run --src-stream file://path/to/table \
  --src-options <span class="hljs-string">'{format: delta}'</span> \
  --stdout --<span class="hljs-built_in">limit</span> 100

<span class="hljs-comment"># Read a Delta table from AWS S3</span>
sling run --src-conn aws_s3 \
  --src-stream path/to/table \
  --src-options <span class="hljs-string">'{format: delta}'</span> \
  --stdout --<span class="hljs-built_in">limit</span> 100
</code></pre>
<p>These commands will read the specified Delta table and output the first 100 rows to the console.</p>
<h3 id="heading-incorporating-delta-lake-in-replication-yaml">Incorporating Delta Lake in Replication YAML</h3>
<p>For more complex data integration tasks, you can specify Delta as a format in your <code>replication.yaml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">source:</span> <span class="hljs-string">aws_s3</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">postgres</span>

<span class="hljs-attr">defaults:</span>
  <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>
  <span class="hljs-attr">source_options:</span>
    <span class="hljs-attr">format:</span> <span class="hljs-string">delta</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-attr">path/to/delta/table:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">my_schema.delta_table</span>

  <span class="hljs-string">path/to/delta_tables/*:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">my_schema.{stream_file_name}</span>
</code></pre>
<p>This configuration reads data from Delta tables stored in AWS S3 and loads it into PostgreSQL tables. The <code>full-refresh</code> mode indicates that the target table will be completely replaced with the data from the source Delta table during each replication run.</p>
<p>To execute the replication, use:</p>
<pre><code class="lang-bash">sling run -d -r replication.yaml
</code></pre>
<p>See docs <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> to get started with Sling!</p>
<h2 id="heading-real-world-use-case-analytics-on-e-commerce-data">Real-World Use Case: Analytics on E-commerce Data</h2>
<p>Imagine you're working with a large e-commerce platform that stores its transaction data in Delta format on AWS S3. You need to perform daily analytics on this data using your PostgreSQL data warehouse. Here's how you could use Sling to streamline this process:</p>
<ol>
<li><strong>Set up a replication YAML file</strong> to read from your Delta tables and write to PostgreSQL.</li>
<li><strong>Schedule daily Sling runs</strong> to keep your analytics database up-to-date (see the cron sketch after this list).</li>
<li><strong>Leverage Delta Lake's time travel</strong> feature by specifying a timestamp in your Sling configuration to analyze historical data.</li>
<li><strong>Use Sling's column selection</strong> feature to optimize data transfer by only reading the columns you need for your analytics.</li>
</ol>
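<p>For the scheduling step, any scheduler works since Sling is a single binary. As a sketch, a crontab entry for a daily 6 AM run might look like this (paths are illustrative):</p>
<pre><code class="lang-bash"># m h dom mon dow  command
0 6 * * * /usr/local/bin/sling run -r /opt/pipelines/replication.yaml &gt;&gt; /var/log/sling_daily.log 2&gt;&amp;1
</code></pre>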
<p>This setup allows you to take advantage of Delta Lake's reliability and performance while using Sling's simplicity and flexibility for your data integration needs.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Sling's new capability to read Delta Lake format opens up exciting possibilities for data engineers and analysts. By combining Delta Lake's robust features with Sling's efficient data integration capabilities, organizations can streamline their data workflows and gain valuable insights from their data lakes.</p>
<p>While the current implementation is read-only, it provides a powerful tool for extracting and analyzing data stored in Delta format. As we continue to develop Sling, we're excited about the potential for expanding our Delta Lake support in the future via DuckDB.</p>
]]></content:encoded></item><item><title><![CDATA[Reading Apache Iceberg Data with Sling]]></title><description><![CDATA[We're excited to announce that Sling now supports reading the Apache Iceberg format, bringing enhanced data lake management capabilities to our users. This addition opens up new possibilities for efficient and flexible data handling in large-scale en...]]></description><link>https://blog.slingdata.io/reading-apache-iceberg-data-with-sling</link><guid isPermaLink="true">https://blog.slingdata.io/reading-apache-iceberg-data-with-sling</guid><category><![CDATA[apacheiceberg]]></category><category><![CDATA[ELT]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Wed, 28 Aug 2024 11:24:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/M-EwSRl8BK8/upload/475897f248b4e9729aabdc1abeef172d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're excited to announce that Sling now supports reading the <a target="_blank" href="https://iceberg.apache.org/">Apache Iceberg</a> format, bringing enhanced data lake management capabilities to our users. This addition opens up new possibilities for efficient and flexible data handling in large-scale environments.</p>
<h2 id="heading-what-is-sling">What is Sling?</h2>
<p><a target="_blank" href="https://slingdata.io">Sling</a> aims to augment the exporting/loading data process into a positive and potentially enjoyable experience. It offers both CLI and YAML-based configurations for easy setup and management of data flows, by focusing on 3 data types interfaces:</p>
<ul>
<li><p>From File Systems to Databases</p>
</li>
<li><p>From Databases to Databases</p>
</li>
<li><p>From Databases to File Systems</p>
</li>
</ul>
<p>The list of connections that Sling supports continues to grow. You can see the <a target="_blank" href="https://slingdata.io/en/connectors">full list</a> here, but it supports all the major platforms including <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/clickhouse">Clickhouse</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/duckdb">DuckDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigquery">Google BigQuery</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigtable">Google BigTable</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mariadb">MariaDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mongodb">MongoDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/motherduck">MotherDuck</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mysql">MySQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/oracle">Oracle</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/postgres">PostgreSQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/prometheus">Prometheus</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/redshift">Redshift</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/snowflake">Snowflake</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlserver">SQL Server</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlite">SQLite</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">StarRocks</a> and <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/trino">Trino</a>.</p>
<h2 id="heading-what-is-apache-iceberg">What is Apache Iceberg?</h2>
<p><a target="_blank" href="https://iceberg.apache.org/">Apache Iceberg</a> is an open table format for huge analytic datasets. It's designed to improve on the limitations of older table formats, offering better performance, reliability, and flexibility for data lakes.</p>
<h2 id="heading-advantages-of-using-iceberg">Advantages of Using Iceberg</h2>
<ol>
<li><p><strong>Schema Evolution</strong>: Iceberg allows for easy schema changes without the need for data migration.</p>
</li>
<li><p><strong>Partition Evolution</strong>: You can change partition schemes without rewriting data.</p>
</li>
<li><p><strong>Time Travel</strong>: Query data as it existed at a specific point in time.</p>
</li>
<li><p><strong>ACID Transactions</strong>: Ensures data consistency and reliability.</p>
</li>
<li><p><strong>Performance</strong>: Optimized for fast queries on large datasets.</p>
</li>
</ol>
<h2 id="heading-icebergs-popularity">Iceberg's Popularity</h2>
<p>Iceberg has gained significant traction in the data engineering community. It's used by major companies like Netflix, Apple, and Adobe, and is supported by popular data processing tools like Spark, Flink, and Presto.</p>
<h2 id="heading-how-sling-uses-duckdb-to-read-iceberg">How Sling Uses DuckDB to Read Iceberg</h2>
<p>Under the hood, Sling leverages DuckDB's powerful Iceberg integration to read and process Iceberg tables efficiently. DuckDB is working on adding support for writing to Iceberg tables, and we're excited to see what new features this will bring to the table. Here's a brief overview of how this works:</p>
<ol>
<li><p><strong>DuckDB Integration</strong>: Sling utilizes DuckDB's built-in Iceberg reader, which allows for direct querying of Iceberg tables without the need for additional dependencies.</p>
</li>
<li><p><strong>Iceberg Scan Function</strong>: When reading an Iceberg table, Sling uses DuckDB's <code>iceberg_scan</code> function. This function is capable of reading Iceberg metadata and data files directly.</p>
</li>
<li><p><strong>Query Generation</strong>: Sling constructs a SQL query using the <code>iceberg_scan</code> function. For example:</p>
<pre><code class="lang-sql"> <span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> iceberg_scan(<span class="hljs-string">'path/to/iceberg/table'</span>, allow_moved_paths = <span class="hljs-literal">true</span>)
</code></pre>
<p> The <code>allow_moved_paths</code> option is set to true to handle cases where data files might have been moved.</p>
</li>
<li><p><strong>Column Projection</strong>: When specific columns are requested, Sling modifies the query to select only those columns, optimizing read performance.</p>
</li>
<li><p><strong>Type Mapping</strong>: Sling maps Iceberg/DuckDB types to its internal column types for consistent data handling across different sources.</p>
</li>
<li><p><strong>Metadata Retrieval</strong>: Before executing the main query, Sling uses DuckDB to fetch table metadata, including schema information, by running a <code>DESCRIBE</code> query on the Iceberg scan.</p>
</li>
<li><p><strong>Streaming Results</strong>: Sling streams the results from DuckDB, allowing for efficient processing of large Iceberg tables without loading the entire dataset into memory.</p>
</li>
</ol>
<p>This approach allows Sling to provide seamless support for Iceberg tables, leveraging DuckDB's optimized Iceberg reader while maintaining Sling's flexible and user-friendly interface.</p>
<h2 id="heading-reading-iceberg-with-sling-cli">Reading Iceberg with Sling CLI</h2>
<p>To work with Iceberg files in Sling, you can use the following CLI flags:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># read a local iceberg table</span>
sling run --src-stream file://path/to/table \
  --src-options <span class="hljs-string">'{format: iceberg}'</span> \
  --stdout --<span class="hljs-built_in">limit</span> 100

<span class="hljs-comment"># read an iceberg table from aws s3</span>
sling run --src-conn aws_s3 \
  --src-stream path/to/table \
  --src-options <span class="hljs-string">'{format: iceberg}'</span> \
  --stdout --<span class="hljs-built_in">limit</span> 100
</code></pre>
<p>This command reads an Iceberg table located at <code>path/to/table</code> and outputs the results to the console (limited to 100 rows).</p>
<pre><code class="lang-bash"><span class="hljs-comment"># read a local iceberg table, write to bigquery</span>
sling run --src-stream file://path/to/table \
  --src-options <span class="hljs-string">'{format: iceberg}'</span> \
  --tgt-conn bigquery \
  --tgt-object bq_schema.bq_table
</code></pre>
<p>This command reads an Iceberg table located at <code>path/to/table</code> and writes it to our BigQuery connection.</p>
<h2 id="heading-reading-iceberg-in-replication-yaml">Reading Iceberg in Replication YAML</h2>
<p>You can also specify Iceberg as a format in your <code>replication.yaml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">source:</span> <span class="hljs-string">aws_s3</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">postgres</span>

<span class="hljs-attr">defaults:</span>
  <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>
  <span class="hljs-attr">source_options:</span>
    <span class="hljs-attr">format:</span> <span class="hljs-string">iceberg</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-attr">path/to/iceberg/table:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">my_schema.iceberg_table</span>
</code></pre>
<p>This configuration will read data from an Iceberg table stored in AWS S3 and load it into a PostgreSQL table named <code>my_schema.iceberg_table</code>. The <code>full-refresh</code> mode indicates that the target table will be completely replaced with the data from the source Iceberg table during each replication run. We can also use wildcards to read multiple Iceberg tables. See the <a target="_blank" href="https://docs.slingdata.io/sling-cli/run/configuration/variables">variables docs</a> for a complete list of variables.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">source:</span> <span class="hljs-string">aws_s3</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">postgres</span>

<span class="hljs-attr">defaults:</span>
  <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>
  <span class="hljs-attr">source_options:</span>
    <span class="hljs-attr">format:</span> <span class="hljs-string">iceberg</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-string">path/to/iceberg_tables/*:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">my_schema.{stream_file_name}</span>

  <span class="hljs-string">path/to/more_iceberg_tables/prefix*:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">my_schema.{stream_file_name}</span>
</code></pre>
<p>Running a replication is easily done with the <code>sling run</code> command:</p>
<pre><code class="lang-bash">sling run -d -r replication.yaml
</code></pre>
<p>See docs <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> to get started with Sling!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Iceberg support in Sling, powered by DuckDB, offers a powerful and efficient way to work with large-scale data stored in the Iceberg format. By leveraging Sling's intuitive configuration and DuckDB's performance, users can easily integrate Iceberg tables into their data pipelines and analytics workflows.</p>
<p>As the Apache Iceberg ecosystem continues to grow and evolve, we anticipate expanding Sling's capabilities to include writing to Iceberg tables and supporting more advanced Iceberg features. This will further enhance Sling's position as a versatile tool for modern data engineering tasks.</p>
<p>We encourage users to explore the Iceberg integration in Sling and provide feedback. Your input is valuable in shaping the future development of this feature and ensuring it meets the diverse needs of the data community.</p>
]]></content:encoded></item><item><title><![CDATA[Export Data From Prometheus into any Database]]></title><description><![CDATA[Introduction
Sling aims to turn the process of exporting and loading data into a positive and potentially enjoyable experience. It focuses on three types of data interfaces:

From File Systems to Databases
From Databases to Databases
From Databases to File S...]]></description><link>https://blog.slingdata.io/export-data-from-prometheus-into-any-database</link><guid isPermaLink="true">https://blog.slingdata.io/export-data-from-prometheus-into-any-database</guid><category><![CDATA[#prometheus]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[ETL]]></category><category><![CDATA[ELT]]></category><category><![CDATA[Sling]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Sat, 13 Apr 2024 03:00:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712804132453/db90c9ec-ef91-42f3-98de-502243c8c75e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p><a target="_blank" href="https://slingdata.io">Sling</a> aims to augment the exporting/loading data process into a positive and potentially enjoyable experience. It focuses on 3 of data types interfaces:</p>
<ul>
<li>From File Systems to Databases</li>
<li>From Databases to Databases</li>
<li>From Databases to File Systems</li>
</ul>
<p>The list of connections that Sling supports continues to grow. You can see the <a target="_blank" href="https://slingdata.io/en/connectors">full list</a> here, but it supports all the major platforms including <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/clickhouse">Clickhouse</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/duckdb">DuckDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigquery">Google BigQuery</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigtable">Google BigTable</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mariadb">MariaDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mongodb">MongoDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/motherduck">MotherDuck</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mysql">MySQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/oracle">Oracle</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/postgres">PostgreSQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/prometheus">Prometheus</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/redshift">Redshift</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/snowflake">Snowflake</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlserver">SQL Server</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlite">SQLite</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">StarRocks</a> and <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/trino">Trino</a>.</p>
<h1 id="heading-prometheus">Prometheus</h1>
<p>Prometheus is an open-source time-series database originally built by SoundCloud. It is a powerful and flexible monitoring solution ideal for gathering, querying, and alerting on metrics data in dynamic and distributed environments. Its pull-based model, multi-dimensional data model, PromQL query language, and alerting capabilities make it a popular choice for monitoring modern cloud-native applications and infrastructure.</p>
<h1 id="heading-export-data-from-prom">Export data from Prom</h1>
<p>A common issue when working with Prometheus is the difficulty of exporting data out of it. Fortunately, Sling handles this without skipping a beat.</p>
<h2 id="heading-install-sling">Install Sling</h2>
<p>See <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for details on how to install Sling. It is usually a simple command, such as:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># On Linux</span>
curl -LO <span class="hljs-string">'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz'</span> \
  &amp;&amp; tar xf sling_linux_amd64.tar.gz \
  &amp;&amp; rm -f sling_linux_amd64.tar.gz \
  &amp;&amp; chmod +x sling
</code></pre>
<p>You should be able to run the <code>sling</code> command at this point.</p>
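<p>A quick sanity check is to list your configured connections (the list will be empty on a fresh install):</p>
<pre><code class="lang-bash">sling conns list
</code></pre>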
<h2 id="heading-set-up-connections">Set up Connections</h2>
<p>Let's now set up the connections. In this blog post, we will use Postgres as the destination database; however, the steps are the same if you'd like to load into a different database.</p>
<h3 id="heading-setting-up-prometheus">Setting up Prometheus</h3>
<p>See <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/prometheus">here</a> for more details on the keys accepted. But we can simply do:</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">set</span> prometheus <span class="hljs-built_in">type</span>=prometheus http_url=<span class="hljs-string">"http://localhost:9090"</span> api_key=<span class="hljs-string">"xxxxxxxxxxxxxxxxxxxxxx"</span>

$ sling conns <span class="hljs-built_in">test</span> prometheus
6:01PM INF success!

<span class="hljs-comment"># get list of metrics</span>
$ sling conns discover prometheus --column
+------------+------------+------------+-----+----------------------------------+-------------+--------------+
| DATABASE   | SCHEMA     | TABLE      |  ID | COLUMN                           | NATIVE TYPE | GENERAL TYPE |
+------------+------------+------------+-----+----------------------------------+-------------+--------------+
| prometheus | prometheus | prometheus |   1 | go_gc_duration_seconds           | summary     | bigint       |
| prometheus | prometheus | prometheus |   2 | go_goroutines                    | gauge       | bigint       |
| prometheus | prometheus | prometheus |   3 | go_info                          | gauge       | bigint       |
| prometheus | prometheus | prometheus |   4 | go_memstats_alloc_bytes          | gauge       | bigint       |
| prometheus | prometheus | prometheus |   5 | go_memstats_alloc_bytes_total    | counter     | bigint       |
| prometheus | prometheus | prometheus |   6 | go_memstats_buck_hash_sys_bytes  | gauge       | bigint       |
| prometheus | prometheus | prometheus |   7 | go_memstats_frees_total          | counter     | bigint       |
.....
</code></pre>
<h3 id="heading-setting-up-postgres">Setting up Postgres</h3>
<p>Similarly, we will again use the <code>sling conns set</code> command, this time for our PG connection. See <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/postgres">here</a> for more details.</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">set</span> postgres url=<span class="hljs-string">"postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable"</span>

$ sling conns <span class="hljs-built_in">test</span> postgres
6:03PM INF success!
</code></pre>
<p>Great, we are now ready to move data!</p>
<h2 id="heading-run-with-cli-flags">Run with CLI Flags</h2>
<pre><code class="lang-bash"><span class="hljs-comment"># export results to stdout, with start time from 2 months ago: "now-2M"</span>
$ sling run --src-conn prometheus \
    --src-stream <span class="hljs-string">'sum(go_gc_duration_seconds) by (job, instance, quantile) # {"start": "now-2M"}'</span> \
    --stdout --<span class="hljs-built_in">limit</span> 10 -d

<span class="hljs-comment"># load into PG</span>
$ sling run --src-conn prometheus \
    --src-stream <span class="hljs-string">'sum(go_gc_duration_seconds) by (job, instance, quantile) # {"start": "now-2M"}'</span> \
    --tgt-conn postgres --tgt-object public.gc_duration_by_job \
    --mode full-refresh
</code></pre>
<h3 id="heading-time-filters">Time Filters</h3>
<p>To add time filters, simply append a suffix to your query. The suffix accepts a JSON value with the keys <code>start</code>, <code>end</code> and <code>step</code>:</p>
<ul>
<li><p><code># {"start": "now-2M"}</code></p>
</li>
<li><p><code># {"start": "now-2M", "end": "now-1d"}</code></p>
</li>
<li><p><code># {"start": "now-2M", "end": "now-1d", "step": "1d"}</code></p>
</li>
</ul>
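<p>For example, combining all three keys in a CLI run (the same query and flags used above) looks like this:</p>
<pre><code class="lang-bash"># daily data points, from 2 months ago until 1 day ago
sling run --src-conn prometheus \
  --src-stream 'sum(go_gc_duration_seconds) by (job, instance, quantile) # {"start": "now-2M", "end": "now-1d", "step": "1d"}' \
  --stdout --limit 10
</code></pre>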
<h2 id="heading-run-with-replication">Run with Replication</h2>
<p>We can also use a <a target="_blank" href="https://docs.slingdata.io/sling-cli/run/configuration/replication">replication</a>. Replications are the best way to use Sling in a reusable manner. The <code>defaults</code> key lets you define your inputs once, with the ability to override any of them in a particular stream. Both YAML and JSON files are accepted. </p>
<pre><code class="lang-yaml"><span class="hljs-comment"># replication.yaml</span>
<span class="hljs-attr">source:</span> <span class="hljs-string">prometheus</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">postgres</span>

<span class="hljs-attr">defaults:</span>
  <span class="hljs-attr">object:</span> <span class="hljs-string">prometheus.{stream_name}</span>
  <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-attr">gc_duration_by_job:</span>
    <span class="hljs-attr">sql:</span> <span class="hljs-string">'sum(go_gc_duration_seconds) by (job, instance, quantile) # {"start": "now-2M", "end": "now-1d", "step": "1d"}'</span>

<span class="hljs-comment"># incremental load, last 2 days of data, hourly</span>
  <span class="hljs-attr">go_memstats_alloc_bytes_total:</span>
    <span class="hljs-attr">sql:</span> <span class="hljs-string">'sum(go_memstats_alloc_bytes_total) by (job, instance, quantile) # {"start": "now-2d"}'</span>
    <span class="hljs-attr">primary_key:</span> [<span class="hljs-string">timestamp</span>, <span class="hljs-string">job</span>, <span class="hljs-string">instance</span>, <span class="hljs-string">quantile</span>]
    <span class="hljs-attr">update_key:</span> <span class="hljs-string">timestamp</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">incremental</span>
</code></pre>
<p>We can run the replication like this:</p>
<pre><code class="lang-bash">sling run -r replication.yaml
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>We went over how easy it is to export data from Prometheus with Sling. Feel free to check out other examples here: https://docs.slingdata.io.</p>
]]></content:encoded></item><item><title><![CDATA[Export Data From StarRocks into DuckDB]]></title><description><![CDATA[Introduction
Let's look at how we can easily export data from StarRocks into a local DuckDB database with Sling, a flexible command-line interface (CLI) data integration tool that enables rapid extraction and loading of data directly from the termina...]]></description><link>https://blog.slingdata.io/export-data-from-starrocks-into-duckdb</link><guid isPermaLink="true">https://blog.slingdata.io/export-data-from-starrocks-into-duckdb</guid><category><![CDATA[duckDB]]></category><category><![CDATA[starrocks]]></category><category><![CDATA[ETL]]></category><category><![CDATA[ELT]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Fri, 12 Apr 2024 19:06:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712745662067/951a11b8-32f1-4b2d-b168-5e6da7794639.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Let's look at how we can easily export data from StarRocks into a local DuckDB database with <a target="_blank" href="https://slingdata.io">Sling</a>, a flexible command-line interface (CLI) data integration tool that enables rapid extraction and loading of data directly from the terminal.</p>
<h1 id="heading-starrocks">StarRocks</h1>
<p><a target="_blank" href="https://docs.starrocks.io/">StarRocks</a> is a powerful distributed, columnar storage database system designed for real-time analytics. It can scale to Petabytes and connects to systems like HDFS, Apache Spark, Apache Flink and Apache Kafka. It is backed by CelerData, a $60 million VC-funded startup, and aims to be an open-source replacement for Snowflake, BigQuery, and Redshift. This makes it suitable for many analytics use cases, such as business intelligence, ad hoc querying, real-time events and even machine learning / AI-driven data processing.</p>
<h1 id="heading-duckdb">DuckDB</h1>
<p><a target="_blank" href="https://duckdb.org/">DuckDB</a> on the other hand, is a lightweight database dubbed as the SQLite of data warehouses. It is a local database saved in a single file (like SQLite) and enables many kinds of analytical use cases right on your machine, since the data is stored in columnar fashion. This is especially convenient when one wants to join several tables and do local analytics without dealing with external latencies and workloads.</p>
<h1 id="heading-export-data-with-sling">Export Data with Sling</h1>
<p>As it happens, Sling can read from and write to both of these databases. Let's go over the steps.</p>
<p>First, let's install Sling. See <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for how to do so. It is usually a simple command, such as:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># On Linux</span>
curl -LO <span class="hljs-string">'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz'</span> \
  &amp;&amp; tar xf sling_linux_amd64.tar.gz \
  &amp;&amp; rm -f sling_linux_amd64.tar.gz \
  &amp;&amp; chmod +x sling
</code></pre><p>Next, we'll set up StarRocks. If you don't have a StarRocks instance running, you can quickly launch a development instance on your machine with <a target="_blank" href="https://docs.docker.com/get-docker/">docker</a> and load data into it.</p>
<pre><code class="lang-bash">docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
</code></pre>
<p>Now we can create a connection for our StarRocks database with Sling and test connectivity. See <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">here</a> for more details on configuration.</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">set</span> starrocks url=<span class="hljs-string">"starrocks://root:@localhost:9030/sling"</span>
10:01PM INF connection `starrocks` has been <span class="hljs-built_in">set</span> <span class="hljs-keyword">in</span> /Users/me/.sling/env.yaml. Please <span class="hljs-built_in">test</span> with `sling conns <span class="hljs-built_in">test</span> starrocks`

$ sling conns <span class="hljs-built_in">test</span> starrocks
10:01PM INF success!
</code></pre>
<p>Now let's set up our DuckDB connection. Since it is an embedded database, we don't need to install anything. In fact, Sling will auto-download the binary to interface with DuckDB the first time. See <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/duckdb">here</a> for more details on configuring DuckDB.</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">set</span> duckdb url=<span class="hljs-string">"duckdb:///tmp/sling/duck.db"</span>
10:02PM INF connection `duckdb` has been <span class="hljs-built_in">set</span> <span class="hljs-keyword">in</span> /Users/me/.sling/env.yaml. Please <span class="hljs-built_in">test</span> with `sling conns <span class="hljs-built_in">test</span> duckdb`

$ sling conns <span class="hljs-built_in">test</span> duckdb
10:02PM INF success!
</code></pre>
<p>Great, we're ready to export some data into DuckDB.</p>
<h2 id="heading-replication">Replication</h2>
<p>Let's assume you already have data in your StarRocks instance, in the database <code>public</code> (see <a target="_blank" href="https://blog.slingdata.io/load-data-into-starrocks-from-any-database">here</a> for how to load data into StarRocks). If you'd like to export all the tables in that database, you'd create a <a target="_blank" href="https://docs.slingdata.io/sling-cli/run/configuration/replication">replication</a> like this:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># replication.yaml</span>
<span class="hljs-attr">source:</span> <span class="hljs-string">starrocks</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">duckdb</span>

<span class="hljs-attr">defaults:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">main.{stream_schema}_{stream_table}</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-comment"># all tables in schema `public`</span>
  <span class="hljs-string">public.*:</span>

 <span class="hljs-comment"># only one table</span>
  <span class="hljs-attr">salesforce.account:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">main.salesforce_account</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>
</code></pre>
<p>We can run the replication like this:</p>
<pre><code class="lang-bash">sling run -r replication.yaml
</code></pre>
<details><summary>Output</summary>
<pre><code>10:56PM INF Sling Replication [24 streams] | starrocks -&gt; duckdb

10:56PM INF [1 / 24] running stream "public"."call_center"
10:56PM INF connecting to source database (starrocks)
10:56PM INF connecting to target database (duckdb)
10:56PM INF reading from source database
10:56PM INF writing to target database [mode: full-refresh]
10:56PM INF streaming data
10:56PM INF created table `main`.`public_call_center`
10:56PM INF inserted 62347 rows into `main`.`public_call_center` in 10 secs [6234 r/s] [20.0 MB]
10:56PM INF execution succeeded

10:56PM INF [2 / 24] running stream "public"."catalog_page"
10:56PM INF connecting to source database (starrocks)
10:56PM INF connecting to target database (duckdb)
10:56PM INF reading from source database
10:56PM INF writing to target database [mode: full-refresh]
10:56PM INF streaming data
10:56PM INF created table `main`.`public_catalog_page`
10:56PM INF inserted 11718 rows into `main`.`public_catalog_page` in 0 secs [14,126 r/s] [2.0 MB]
10:56PM INF execution succeeded

.......

11:15PM INF [24 / 24] running stream "salesforce"."account"
11:15PM INF connecting to source database (starrocks)
11:15PM INF connecting to target database (duckdb)
11:15PM INF reading from source database
11:15PM INF writing to target database [mode: full-refresh]
11:15PM INF streaming data
11:15PM INF created table `main`.`salesforce_account`
11:15PM INF inserted 71654 rows into `main`.`salesforce_account` in 2 secs [30,073 r/s] [12 MB]
11:15PM INF execution succeeded

11:20PM INF Sling Replication Completed in 19m 34s | starrocks -&gt; duckdb | 24 Successes | 0 Failures
</code></pre>
</details>
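<p>To sanity-check the load, we can query one of the new tables directly. A minimal sketch, assuming you have the <code>duckdb</code> CLI installed locally:</p>
<pre><code class="lang-bash"># count the rows that landed in the DuckDB file we configured earlier
duckdb /tmp/sling/duck.db -c "select count(*) from main.public_call_center;"
</code></pre>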
<p>So easy! See many more examples here: https://docs.slingdata.io/sling-cli/run/examples</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>We went over how easy it is to install Sling and export data from StarRocks into DuckDB. Feel free to check out other examples here: https://docs.slingdata.io.</p>
]]></content:encoded></item><item><title><![CDATA[Load Data into StarRocks from Any Database]]></title><description><![CDATA[Introduction
Let's look at how we can easily load data into StarRocks from most major databases with Sling, a versatile CLI data integration tool which allows you to quickly extract and load data right from the terminal.
Sling is a tool with...]]></description><link>https://blog.slingdata.io/load-data-into-starrocks-from-any-database</link><guid isPermaLink="true">https://blog.slingdata.io/load-data-into-starrocks-from-any-database</guid><category><![CDATA[starrocks]]></category><category><![CDATA[ETL]]></category><category><![CDATA[ELT]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Wed, 10 Apr 2024 11:19:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/li9Zr9Ft99Y/upload/2cea8b966fa2b8b1b245ff379cf70708.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Let's look at how we can easily load data into StarRocks from most major databases with <a target="_blank" href="https://slingdata.io">Sling</a>, a versatile CLI data integration tool which allows you to quickly extract and load data right from the terminal.</p>
<p>Sling is a tool whose goal is to make the experience of ingesting data a positive, even pleasant one. Sling focuses on three types of data interfaces:</p>
<ul>
<li><p>From File Systems to Databases</p>
</li>
<li><p>From Databases to Databases</p>
</li>
<li><p>From Databases to File Systems</p>
</li>
</ul>
<p>The list of connections that Sling supports continues to grow. You can see the <a target="_blank" href="https://slingdata.io/en/connectors">full list</a> here, but it supports all the major platforms including <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/clickhouse">Clickhouse</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/duckdb">DuckDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigquery">Google BigQuery</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/bigtable">Google BigTable</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mariadb">MariaDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mongodb">MongoDB</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/motherduck">MotherDuck</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mysql">MySQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/oracle">Oracle</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/postgres">PostgreSQL</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/prometheus">Prometheus</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/redshift">Redshift</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/snowflake">Snowflake</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlserver">SQL Server</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/sqlite">SQLite</a>, <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">StarRocks</a> and <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/trino">Trino</a>.</p>
<h1 id="heading-starrocks">StarRocks</h1>
<p><a target="_blank" href="https://docs.starrocks.io/">StarRocks</a> is a powerful distributed, columnar storage database system designed for real-time analytics. One of the key features is that it supports several <a target="_blank" href="https://docs.starrocks.io/docs/table_design/table_types/">table type</a> designs which can meet varying business requirements:</p>
<ul>
<li><strong>Duplicate Key table</strong>: stores each record as a separate row</li>
<li><strong>Aggregate table</strong>: stores the aggregated record as a row</li>
<li><strong>Primary Key table</strong>: keeps only the most recently loaded record per key</li>
</ul>
<p>This makes it suitable for many analytics use cases, such as business intelligence, ad hoc querying, real-time events and even machine learning / AI-driven data processing.</p>
<h1 id="heading-load-data-with-sling">Load Data with Sling</h1>
<p>First, let us install Sling. See <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for details on how to do so. It is usually a simple command, such as:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># On Linux</span>
curl -LO <span class="hljs-string">'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz'</span> \
  &amp;&amp; tar xf sling_linux_amd64.tar.gz \
  &amp;&amp; rm -f sling_linux_amd64.tar.gz \
  &amp;&amp; chmod +x sling
</code></pre>
<p>You should be able to run the <code>sling</code> command at this point.</p>
<hr />
<p>Next, we'll set up StarRocks. If you don't have a StarRocks instance running, you can quickly launch a development instance on your machine with <a target="_blank" href="https://docs.docker.com/get-docker/">docker</a> and load data into it.</p>
<pre><code class="lang-bash">docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
</code></pre>
<p>Now we can create a connection for our StarRocks database with Sling and test connectivity. It is important to set the <code>fe_url</code> so we can <a target="_blank" href="https://docs.starrocks.io/docs/sql-reference/sql-statements/data-manipulation/STREAM_LOAD/">Stream Load</a> into StarRocks. See <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/starrocks">here</a> for more details on configuration.</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">set</span> starrocks url=<span class="hljs-string">"starrocks://root:@localhost:9030/sling"</span> fe_url=<span class="hljs-string">"http://root:@localhost:8030"</span>
8:01PM INF connection `starrocks` has been <span class="hljs-built_in">set</span> <span class="hljs-keyword">in</span> /Users/me/.sling/env.yaml. Please <span class="hljs-built_in">test</span> with `sling conns <span class="hljs-built_in">test</span> starrocks`

<span class="hljs-comment"># test our connection</span>
$ sling conns <span class="hljs-built_in">test</span> starrocks
8:01PM INF success!
</code></pre>
<hr />
<p>Great. In this tutorial, we will be extracting data from a MySQL database. We can set up the connection as shown below (see <a target="_blank" href="https://docs.slingdata.io/connections/database-connections/mysql">here</a> for more details).</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">set</span> mysql url=<span class="hljs-string">"mysql://admin:password@localhost:3306/mysql"</span>
8:02PM INF connection `mysql` has been <span class="hljs-built_in">set</span> <span class="hljs-keyword">in</span> /Users/me/.sling/env.yaml. Please <span class="hljs-built_in">test</span> with `sling conns <span class="hljs-built_in">test</span> mysql`

<span class="hljs-comment"># test our connection</span>
$ sling conns <span class="hljs-built_in">test</span> mysql
8:02PM INF success!

<span class="hljs-comment"># discover our tables</span>
$ sling conns discover mysql
+-----+--------------------+----------------------------------------------+-------+---------+
|   <span class="hljs-comment"># | SCHEMA             | NAME                                         | TYPE  | COLUMNS |</span>
+-----+--------------------+----------------------------------------------+-------+---------+
|   1 | information_schema | ADMINISTRABLE_ROLE_AUTHORIZATIONS            | table |       9 |
|   2 | information_schema | APPLICABLE_ROLES                             | table |       9 |
|   3 | information_schema | CHARACTER_SETS                               | table |       4 |
|   4 | information_schema | CHECK_CONSTRAINTS                            | table |       4 |
|   5 | information_schema | COLLATIONS                                   | table |       7 |
|   6 | information_schema | COLLATION_CHARACTER_SET_APPLICABILITY        | table |       2 |
.....
</code></pre>
<p>Great, we're ready to load some data into StarRocks!</p>
<h2 id="heading-replication">Replication</h2>
<p>We'll be using a <a target="_blank" href="https://docs.slingdata.io/sling-cli/run/configuration/replication">replication</a> to define which tables Sling should load. Replications are the best way to use Sling in a reusable manner. The <code>defaults</code> key lets you define your inputs once, with the ability to override any of them in a particular stream. Both YAML and JSON files are accepted. </p>
<pre><code class="lang-yaml"><span class="hljs-comment"># replication.yaml</span>
<span class="hljs-attr">source:</span> <span class="hljs-string">mysql</span>
<span class="hljs-attr">target:</span> <span class="hljs-string">starrocks</span>

<span class="hljs-attr">defaults:</span>
    <span class="hljs-attr">object:</span> <span class="hljs-string">main.{stream_schema}_{stream_table}</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">full-refresh</span>

<span class="hljs-attr">streams:</span>
  <span class="hljs-comment"># all tables in schema `mysql`</span>
  <span class="hljs-string">mysql.*:</span>

 <span class="hljs-comment"># only one table with specific duplicate keys</span>
  <span class="hljs-attr">finance.account_sales:</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">truncate</span>
    <span class="hljs-attr">target_options:</span>
      <span class="hljs-attr">table_keys:</span>
        <span class="hljs-attr">duplicate:</span> [<span class="hljs-string">account_id</span>, <span class="hljs-string">sale_id</span>]
</code></pre>
<p>We can run the replication like this:</p>
<pre><code class="lang-bash">sling run -r replication.yaml
</code></pre>
<details><summary>Output</summary>
<pre><code>09:14PM INF Sling Replication [5 streams] | mysql -&gt; starrocks

09:14PM INF [1 / 5] running stream "mysql"."accounts"
09:14PM INF connecting to source database (mysql)
09:14PM INF connecting to target database (starrocks)
09:14PM INF reading from source database
09:14PM INF writing to target database [mode: full-refresh]
09:14PM INF streaming data
09:14PM INF importing into StarRocks via stream load
09:14PM INF created table `main`.`mysql_accounts`
09:14PM INF inserted 62347 rows into `main`.`mysql_accounts` in 10 secs [6234 r/s] [20.0 MB]
09:14PM INF execution succeeded

09:14PM INF [2 / 5] running stream "mysql"."orders"
09:14PM INF connecting to source database (mysql)
09:14PM INF connecting to target database (starrocks)
09:14PM INF reading from source database
09:14PM INF writing to target database [mode: full-refresh]
09:14PM INF streaming data
09:14PM INF importing into StarRocks via stream load
09:14PM INF created table `main`.`mysql_orders`
09:14PM INF inserted 716540 rows into `main`.`mysql_orders` in 40 secs [17,973 r/s] [120 MB]
09:14PM INF execution succeeded

.......

09:15PM INF [5 / 5] running stream "finance"."account_sales"
09:15PM INF connecting to source database (mysql)
09:15PM INF connecting to target database (starrocks)
09:15PM INF reading from source database
09:15PM INF writing to target database [mode: full-refresh]
09:15PM INF streaming data
09:15PM INF importing into StarRocks via stream load
09:15PM INF created table `main`.`finance_account_sales`
09:15PM INF inserted 11718 rows into `main`.`finance_account_sales` in 0 secs [14,126 r/s] [2.0 MB]
09:15PM INF execution succeeded

09:15PM INF Sling Replication Completed in 1m 34s | mysql -&gt; starrocks | 5 Successes | 0 Failures
</code></pre>
</details>
<p>So easy! See many more examples here: https://docs.slingdata.io/sling-cli/run/examples</p>
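<p>Since StarRocks speaks the MySQL wire protocol, you can sanity-check the load with any MySQL client. A quick sketch, assuming the <code>mysql</code> client is installed and reusing the credentials from our connection above:</p>
<pre><code class="lang-bash"># query the StarRocks FE on port 9030 (root user, empty password, as configured earlier)
mysql -h 127.0.0.1 -P 9030 -u root -e "select count(*) from main.mysql_accounts;"
</code></pre>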
<h1 id="heading-conclusion">Conclusion</h1>
<p>We went over how easy it is to install Sling and load data from MySQL into StarRocks. Feel free to check out other examples here: https://docs.slingdata.io.</p>
]]></content:encoded></item><item><title><![CDATA[Using JMESPath with Sling for Loading Nested JSON data]]></title><description><![CDATA[Introduction
Sling is an easy-to-use, lightweight data loading tool, typically run from the CLI. It focuses on data movement between Database to Database, File System to Database and Database to File System. See here for the list of Connectors.
Today...]]></description><link>https://blog.slingdata.io/using-jmespath-with-sling-for-loading-nested-json-data</link><guid isPermaLink="true">https://blog.slingdata.io/using-jmespath-with-sling-for-loading-nested-json-data</guid><category><![CDATA[json]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Wed, 14 Feb 2024 10:05:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1707904965068/8f8e47b7-9644-4248-ae40-6ce1513cc624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p><a target="_blank" href="https://slingdata.io/">Sling</a> is an easy-to-use, lightweight data loading tool, typically run from the CLI. It focuses on data movement between Database to Database, File System to Database and Database to File System. See <a target="_blank" href="https://slingdata.io/en/connectors">here</a> for the list of Connectors.</p>
<p>Today, we're going to be looking at parsing a complex JSON file (a <a target="_blank" href="https://www.getdbt.com/">dbt</a> <a target="_blank" href="https://docs.getdbt.com/reference/artifacts/manifest-json">manifest file</a>), extracting a subset of the data and writing it into a CSV file for further analysis.</p>
<h1 id="heading-jmespath">JMESPath</h1>
<p><a target="_blank" href="https://jmespath.org/">JMESPath</a> is the <a target="_blank" href="https://npmtrends.com/JSONPath-vs-automapper-vs-automapper-ts-vs-jmespath-vs-json-query-vs-jsonata-vs-jsonpath-vs-jsonpath-plus-vs-morphism-vs-node-json-transform">most popular</a> query / transformation language for JSON. It has many <a target="_blank" href="https://jmespath.org/libraries.html">libraries</a> ready to use, including Go, which is what Sling is built with. Some key features of JMESPath:</p>
<ol>
<li><p><strong>Filtering and Projection:</strong> You can use JMESPath expressions to filter and project specific elements or attributes from JSON data. This allows you to focus on the relevant parts of a JSON structure.</p>
</li>
<li><p><strong>Functions:</strong> JMESPath includes a set of built-in functions that can be used in expressions for various tasks, such as string manipulation, mathematical operations, and more. These functions enhance the flexibility of JMESPath queries.</p>
</li>
<li><p><strong>Multi-level Queries:</strong> JMESPath supports querying JSON documents with nested structures. You can navigate through arrays and objects to access the data at different levels within the JSON hierarchy.</p>
</li>
<li><p><strong>Pipes and Operators:</strong> JMESPath expressions can include pipes (<code>|</code>) and various operators for combining and transforming data. This allows you to create more complex queries and transformations.</p>
</li>
</ol>
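<p>To get a concrete feel for these features before tackling the manifest, here's a minimal, hypothetical demo of a JMESPath projection using Sling itself (the file name and keys below are made up for illustration):</p>
<pre><code class="lang-bash"># a tiny JSON file shaped like a miniature manifest
echo '{"nodes": {"model.a": {"name": "a", "config": {"materialized": "table"}}}}' &gt; /tmp/mini.json

# project just the fields we care about; Sling prints the result as CSV on stdout
sling run \
  --src-stream file:///tmp/mini.json \
  --src-options '{ jmespath: "nodes.*.{name: name, materialized: config.materialized}" }' \
  --stdout
</code></pre>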
<h1 id="heading-running-sling">Running Sling</h1>
<p>Let us assume we are working from our dbt project folder, and that we have run the <a target="_blank" href="https://docs.getdbt.com/reference/commands/compile"><code>dbt compile</code></a> command. This would have generated a <code>target</code> folder, containing the beefy <code>manifest.json</code> file. We will be extracting the <a target="_blank" href="https://docs.getdbt.com/docs/build/models">models</a> dataset from that file.</p>
<p>After <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">installing</a> sling, we are good to go. From the root of our dbt project (folder containing the <code>dbt_project.yaml</code>), run the following command:</p>
<pre><code class="lang-bash">sling run \
  --src-stream file://./target/manifest.json \
  --tgt-object file://./target/models.csv \
  --src-options <span class="hljs-string">'{
      flatten: true,
      jmespath: "nodes.*.{resource_type: resource_type, database: database, schema: schema, name: name, relation_name: relation_name, original_file_path: original_file_path, materialized: config.materialized }",
    }'</span>
</code></pre>
<p>Let's go over the inputs we provided:</p>
<ul>
<li><p><code>--src-stream</code>: this is the source stream that we want to read from, which is the dbt manifest file in the <code>target</code> folder.</p>
</li>
<li><p><code>--tgt-object</code>: this is the destination file path that we want to write to. Here we are writing to a CSV file, but Sling can write to JSON and Parquet as well. We'd just need to change the extension to <code>.json</code> or <code>.parquet</code>.</p>
</li>
<li><p><code>--src-options</code>: Here, we specify the <a target="_blank" href="https://docs.slingdata.io/sling-cli/run/configuration#source">source options</a> for Sling to use.</p>
<ul>
<li><p><code>flatten</code>: this tells Sling to flatten nested data, creating a column for each nested node.</p>
</li>
<li><p><code>jmespath</code>: this is where we define the <a target="_blank" href="https://jmespath.org/">JMESPath</a> transform logic.</p>
</li>
</ul>
</li>
</ul>
<p>Let's take a look at the output.</p>
<pre><code class="lang-bash">6:34AM INF reading from <span class="hljs-built_in">source</span> file system (file)
6:34AM INF writing to target file system (file)
6:34AM INF wrote 41 rows to file://./target/models.csv [1,432 r/s]
6:34AM INF execution succeeded
</code></pre>
<p>Great! Your data is ready for further analysis. Let's look at a sample of the output CSV file:</p>
<pre><code class="lang-bash">$ head ./target/models.csv
database,materialized,name,original_file_path,relation_name,resource_type,schema
MY_DATABASE,incremental,track_events_raw,models/track_events_raw.sql,MY_DATABASE.dbt_dev.track_events_raw,model,dbt_dev
MY_DATABASE,<span class="hljs-built_in">test</span>,test_mapping_global_graph_uuid,tests/test_mapping_global_graph_uuid.sql,,<span class="hljs-built_in">test</span>,dev_dbt_test__audit
MY_DATABASE,incremental,track_events,models/track_events.sql,MY_DATABASE.dbt_dev.track_events,model,dbt_dev
MY_DATABASE,table,mapping_invalid_shopify,models/mapping/mapping_invalid_shopify.sql,MY_DATABASE.dbt_dev.mapping_invalid_shopify,model,dbt_dev
MY_DATABASE,table,mapping_global,models/mapping/mapping_global.sql,MY_DATABASE.dbt_dev.mapping_global,model,dbt_dev
MY_DATABASE,<span class="hljs-built_in">test</span>,not_null_edges_edge_b,models/schema.yml,,<span class="hljs-built_in">test</span>,dev_dbt_test__audit
MY_DATABASE,<span class="hljs-built_in">test</span>,unique_id_graph_edge,models/schema.yml,,<span class="hljs-built_in">test</span>,dev_dbt_test__audit
MY_DATABASE,table,mapping_invalid_anonymous,models/mapping/mapping_invalid_anonymous.sql,MY_DATABASE.dbt_dev.mapping_invalid_anonymous,model,dbt_dev
MY_DATABASE,<span class="hljs-built_in">test</span>,test_mapping_global_graph_unique,tests/test_mapping_global_graph_unique.sql,,<span class="hljs-built_in">test</span>,dev_dbt_test__audit
</code></pre>
<p>Neat! In seconds, we were able to create a perfect CSV file, with just the columns we want from the nested JSON nodes. And that's not all: Sling can readily load this into a target database as well, as sketched below. See the <a target="_blank" href="https://docs.slingdata.io/sling-cli/run">docs</a> for more examples!</p>
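<p>Here is a sketch of that, assuming a database connection named <code>postgres</code> has already been set up (the connection name and target table here are placeholders):</p>
<pre><code class="lang-bash"># same extraction, but landing in a database table instead of a CSV file
sling run \
  --src-stream file://./target/manifest.json \
  --tgt-conn postgres \
  --tgt-object public.dbt_models \
  --src-options '{
      flatten: true,
      jmespath: "nodes.*.{resource_type: resource_type, database: database, schema: schema, name: name, relation_name: relation_name, original_file_path: original_file_path, materialized: config.materialized }",
    }'
</code></pre>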
]]></content:encoded></item><item><title><![CDATA[How does Sling deal with schema evolution?]]></title><description><![CDATA[Introduction
Sling is the go-to tool to quickly extract and load data, whether you're handling CSV, Parquet, Avro or JSON files, moving data across databases, or even extracting custom queries, right from your command line. For more details, feel fre...]]></description><link>https://blog.slingdata.io/how-does-sling-deal-with-schema-evolution</link><guid isPermaLink="true">https://blog.slingdata.io/how-does-sling-deal-with-schema-evolution</guid><category><![CDATA[Databases]]></category><category><![CDATA[ETL]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Wed, 14 Feb 2024 02:31:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/gL7oJLJOb_I/upload/6fb43778317b2e40c0acd2fdec606fc3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Sling is the go-to tool to quickly extract and load data, whether you're handling CSV, Parquet, Avro or JSON files, moving data across databases, or even extracting custom queries, right from your command line. For more details, feel free to browse the <a target="_blank" href="https://docs.slingdata.io/">docs</a>. And when it comes to a drifting table schema, Sling detects and handles those changes as well.</p>
<h2 id="heading-what-is-schema-evolution">What is schema evolution?</h2>
<p>To put it simply, schema evolution in databases refers to the process of modifying the structure of a database schema over time to accommodate changes in requirements, business rules, or application enhancements. This is something that happens frequently, and needs to be handled carefully.</p>
<h2 id="heading-slinging-like-a-champ">Slinging like a champ</h2>
<p>When using Sling to extract/load data in an <a target="_blank" href="https://docs.slingdata.io/sling-cli/run/configuration#incremental-mode-strategies"><code>incremental</code></a> manner, it will attempt to match whatever columns are present in both the source stream and the target table. If an extra column is present in the source stream, it will add it to the target table. If no columns from the source stream match at all, it will error; at a minimum, the <code>primary_key</code> or <code>update_key</code> must be present in the target table.</p>
<p>See below for a simple example, mimicking the addition and removal of columns.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Initial data</span>

$ <span class="hljs-built_in">echo</span> <span class="hljs-string">'a,b,c
1,2,3
4,5,6'</span> &gt; test1.csv

$ sling run \
  --src-stream file://./test1.csv \
  --tgt-conn postgres \
  --tgt-object public.test1

&lt;...<span class="hljs-built_in">log</span> output omitted&gt;

$ sling run \
  --src-conn postgres \
  --src-stream public.test1 \
  --stdout

a,b,c,_sling_loaded_at
1,2,3,1707869559
4,5,6,1707869559
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># test2.csv is missing column b</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">'a,c
7,8'</span> &gt; test2.csv

$ sling run \
  --src-stream file://./test2.csv \
  --tgt-conn postgres \
  --tgt-object public.test1 \
  --mode incremental \
  --primary-key a

&lt;...<span class="hljs-built_in">log</span> output omitted&gt;

$ sling run \
  --src-conn postgres \
  --src-stream public.test1 \
  --stdout

a,b,c,_sling_loaded_at
1,2,3,1707869559
4,5,6,1707869559
7,,8,1707869689
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># test3.csv is missing column b, c and has extra column d</span>

$ <span class="hljs-built_in">echo</span> <span class="hljs-string">'a,d
9,10'</span> &gt; test3.csv

$ sling run \
  --src-stream file://./test3.csv \
  --tgt-conn postgres \
  --tgt-object public.test1 \
  --mode incremental \
  --primary-key a

&lt;...<span class="hljs-built_in">log</span> output omitted&gt;

$ sling run \
  --src-conn postgres \
  --src-stream public.test1 \
  --stdout

a,b,c,_sling_loaded_at,d
1,2,3,1707869559,
4,5,6,1707869559,
7,,8,1707869689,
9,,,1707870320,10
</code></pre>
<p>We can see that sling handled the changes properly, in a non-destructive manner. If the source stream were from a database, the same rules would apply, whether a column disappeared or appeared.</p>
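<p>If you'd like to confirm the evolved table structure yourself, you can run a custom query against the target (a sketch, reusing the <code>postgres</code> connection from above):</p>
<pre><code class="lang-bash"># list the columns of the evolved table straight from information_schema
sling run --src-conn postgres \
  --src-stream "select column_name, data_type from information_schema.columns where table_name = 'test1'" \
  --stdout
</code></pre>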
]]></content:encoded></item><item><title><![CDATA[See how to conveniently import JSON Files into MySQL]]></title><description><![CDATA[MySQL and JSON
While MySQL is no longer a young buck, it is surprisingly still a widely-used open-source relational database management system (RDBMS). Some of the key features of MySQL include its ability to support high-performance data handling an...]]></description><link>https://blog.slingdata.io/import-json-files-into-mysql</link><guid isPermaLink="true">https://blog.slingdata.io/import-json-files-into-mysql</guid><category><![CDATA[MySQL]]></category><category><![CDATA[ETL]]></category><category><![CDATA[json]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Wed, 14 Dec 2022 15:10:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1670466365513/4k0ntAeJc.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-mysql-and-json">MySQL and JSON</h1>
<p>While MySQL is no longer a young buck, it is surprisingly still a widely-used open-source relational database management system (RDBMS). Some of the key features of MySQL include its ability to support high-performance data handling and large-scale data processing, its support for multiple storage engines, and its ability to be easily integrated with other software and applications.</p>
<p>JSON, on the other hand, is a lightweight, human-readable, and easy-to-use data interchange format based on a subset of the JavaScript programming language. One of the key advantages of using JSON is that it is easy to work with, both for developers and for machines. It is a hierarchical data format, which means that data is organized into a tree-like structure with nested elements. This makes it easy to represent complex data structures and relationships, and allows for efficient data manipulation and querying. It is used everywhere these days!</p>
<h1 id="heading-your-loading-situation">Your Loading Situation</h1>
<p>So you are in a situation where you not only need to load your JSON files into a MySQL database, but you also need to flatten each file. Well, wonder no more! Sling can be a useful tool to help you accomplish this task, as it is a command-line tool that allows you to efficiently transfer data between files and databases.</p>
<p>Sling can flatten your JSON file, auto-create the table DDL, and then load the data into the new table straight from your local drive. This can be a quick and easy way to get your data into a format that is ready for analysis and processing.</p>
<h1 id="heading-installing-sling-cli">Installing Sling CLI</h1>
<p>Sling is very easy to install, no matter which operating system you are using. It's written in the <code>go</code> programming language, so it compiles into a single binary file. It also supports many databases. Please see <a target="_blank" href="https://docs.slingdata.io/connections/file-connections">here</a> for the full list of compatible connectors.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># On Linux</span>
curl -LO <span class="hljs-string">'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz'</span> \
  &amp;&amp; tar xf sling_linux_amd64.tar.gz \
  &amp;&amp; rm -f sling_linux_amd64.tar.gz \
  &amp;&amp; chmod +x sling
</code></pre>
<p>Please see <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for additional installation options (such as downloading binaries). There is also a <a target="_blank" href="https://pypi.org/project/sling/">Python wrapper</a> library, which is useful if you prefer interacting with Sling inside of Python. Once installed, we should be able to run the <code>sling</code> command.</p>
<h1 id="heading-loading-from-our-local-drive">Loading from our Local Drive</h1>
<p>Let us assume that we desire to ingest the following JSON array file, which includes nested objects:</p>
<pre><code class="lang-json">[
 {
   <span class="hljs-attr">"_id"</span>: <span class="hljs-string">"638f4cab1c024be3cadd3ca5"</span>,
   <span class="hljs-attr">"isActive"</span>: <span class="hljs-literal">true</span>,
   <span class="hljs-attr">"balance"</span>: <span class="hljs-string">"3,148.57"</span>,
   <span class="hljs-attr">"picture"</span>: <span class="hljs-string">"http://placehold.it/32x32"</span>,
   <span class="hljs-attr">"age"</span>: <span class="hljs-number">35</span>,
   <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Joann Kim"</span>,
   <span class="hljs-attr">"company"</span>: {
     <span class="hljs-attr">"name"</span>: <span class="hljs-string">"PROXSOFT"</span>,
     <span class="hljs-attr">"email"</span>: <span class="hljs-string">"joannkim@proxsoft.com"</span>,
     <span class="hljs-attr">"phone"</span>: <span class="hljs-string">"+1 (836) 517-2388"</span>,
     <span class="hljs-attr">"address"</span>: <span class="hljs-string">"951 Ellery Street, Norwood, Palau, 2947"</span>,
     <span class="hljs-attr">"about"</span>: <span class="hljs-string">"Labore id et sunt cupidatat dolore aute. Sit laborum nulla pariatur nisi dolore consectetur ex exercitation cupidatat ex ex reprehenderit duis. Eiusmod ut aliquip laborum enim proident ex cupidatat ut velit qui amet dolor tempor enim.\r\n"</span>,
     <span class="hljs-attr">"registered"</span>: <span class="hljs-string">"2020-01-15T02:07:00 +03:00"</span>,
     <span class="hljs-attr">"latitude"</span>: <span class="hljs-number">-60.00954</span>,
     <span class="hljs-attr">"longitude"</span>: <span class="hljs-number">-55.92312</span>
   },
   <span class="hljs-attr">"tags"</span>: [
     <span class="hljs-string">"elit"</span>,
     <span class="hljs-string">"veniam"</span>
   ]
 },
...
]
</code></pre>
<p>In our example, our file will be located at path <code>/tmp/records.json</code>. And if you'd like to ingest several similarly structured files inside a folder (say <code>/path/to/my/folder</code>), you can simply input the folder path instead, and Sling will read all the files in it! Just make sure to add the <code>file://</code> prefix. See below.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># first let's set our MYSQL connection. Sling can pick up connection URLs from environment variables</span>
$ <span class="hljs-built_in">export</span> MYSQL=<span class="hljs-string">' mysql://admin:password@mysql.host:3306/mysql'</span>

<span class="hljs-comment"># let's check and test our MYSQL connection</span>
$ sling conns list
+------------+-------------+---------------+
| CONN NAME  | CONN TYPE   | SOURCE        |
+------------+-------------+---------------+
| MYSQL      | DB - MySQL  | env variable  |
+------------+-------------+---------------+

$ sling conns <span class="hljs-built_in">test</span> MYSQL
11:19PM INF success!

<span class="hljs-comment"># awesome, now we can run our task</span>
$ sling run --src-stream file:///tmp/records.json --tgt-conn MYSQL --tgt-object mysql.records --mode full-refresh
11:19PM INF connecting to target database (mysql)
11:19PM INF reading from <span class="hljs-built_in">source</span> file system (file)
11:19PM INF writing to target database [mode: full-refresh]
11:19PM INF streaming data
11:19PM INF dropped table mysql.records
11:19PM INF created table mysql.records
11:19PM INF inserted 500 rows <span class="hljs-keyword">in</span> 2 secs
11:19PM INF execution succeeded
</code></pre>
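<p>And as mentioned earlier, pointing <code>--src-stream</code> at a folder works the same way (a sketch; <code>/path/to/my/folder</code> is a placeholder):</p>
<pre><code class="lang-bash"># ingest all similarly-structured JSON files in a folder at once
sling run --src-stream file:///path/to/my/folder --tgt-conn MYSQL --tgt-object mysql.records --mode full-refresh
</code></pre>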
<p>How easy was that? Now let's run the same single-file load again with <code>debug</code> mode enabled (flag <code>-d</code>) to see the created table DDL, and this time let's pipe in the data with the <a target="_blank" href="https://en.wikipedia.org/wiki/Cat_(Unix)"><code>cat</code></a> command:</p>
<pre><code class="lang-bash">$ cat /tmp/records.json | sling run -d --tgt-conn MYSQL --tgt-object mysql.records --mode full-refresh
11:20PM INF connecting to target database (mysql)
11:20PM INF reading from stream (stdin)
11:20PM INF writing to target database [mode: full-refresh]
11:20PM DBG drop table <span class="hljs-keyword">if</span> exists mysql.records_tmp
11:20PM DBG table mysql.records_tmp dropped
11:20PM DBG create table <span class="hljs-keyword">if</span> not exists mysql.records_tmp (`data` json)
11:20PM INF streaming data
11:20PM DBG select count(1) cnt from mysql.records_tmp
11:20PM DBG drop table <span class="hljs-keyword">if</span> exists mysql.records
11:20PM DBG table mysql.records dropped
11:20PM INF dropped table mysql.records
11:20PM DBG create table <span class="hljs-keyword">if</span> not exists mysql.records (`data` json)
11:20PM INF created table mysql.records
11:20PM DBG insert into `mysql`.`records` (`data`) select `data` from `mysql`.`records_tmp`
11:20PM DBG inserted rows into `mysql.records` from temp table `mysql.records_tmp`
11:20PM INF inserted 500 rows <span class="hljs-keyword">in</span> 2 secs [170 r/s]
11:20PM DBG drop table <span class="hljs-keyword">if</span> exists mysql.records_tmp
11:20PM DBG table mysql.records_tmp dropped
11:20PM INF execution succeeded
</code></pre>
<p>We can see that the DDL used was <code>create table if not exists mysql.records (`data` json)</code>.</p>
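<p>Even in this raw form, nested values remain queryable via MySQL's built-in JSON functions. For example (a sketch, assuming the <code>mysql</code> client and the same connection details as above):</p>
<pre><code class="lang-bash"># pull a nested value out of the raw json column
mysql -h mysql.host -P 3306 -u admin -p -e \
  "select json_unquote(json_extract(data, '$.name')) as name from mysql.records limit 3;"
</code></pre>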
<h1 id="heading-flattening-our-json-records">Flattening our JSON Records</h1>
<p>This time, let's flatten the records when ingesting. When we flatten, Sling will create individual columns for each of the keys in the record. We can do so by adding <code>--src-options 'flatten: true'</code> as a flag. See <a target="_blank" href="https://docs.slingdata.io/sling-cli/configuration">here</a> for all options:</p>
<pre><code class="lang-bash">$ sling run -d --src-stream file:///tmp/records.json --src-options <span class="hljs-string">'flatten: true'</span> --tgt-conn MYSQL --tgt-object mysql.records --mode full-refresh
11:23PM INF connecting to target database (mysql)
11:23PM INF reading from <span class="hljs-built_in">source</span> file system (file)
11:23PM DBG reading datastream from /tmp/records.json
11:23PM INF writing to target database [mode: full-refresh]
11:23PM DBG drop table <span class="hljs-keyword">if</span> exists mysql.records_tmp
11:23PM DBG table mysql.records_tmp dropped
11:23PM DBG create table <span class="hljs-keyword">if</span> not exists mysql.records_tmp (`_id` mediumtext,
`age` bigint,
`balance` mediumtext,
`company__about` mediumtext,
`company__address` mediumtext,
`company__email` mediumtext,
`company__latitude` decimal(30,9),
`company__longitude` decimal(30,9),
`company__name` mediumtext,
`company__phone` mediumtext,
`company__registered` mediumtext,
`isactive` char(5),
`name` mediumtext,
`picture` mediumtext,
`tags` json,
`_sling_loaded_at` bigint)
11:23PM INF streaming data
11:23PM DBG select count(1) cnt from mysql.records_tmp
11:23PM DBG drop table <span class="hljs-keyword">if</span> exists mysql.records
11:23PM DBG table mysql.records dropped
11:23PM INF dropped table mysql.records
11:23PM DBG create table <span class="hljs-keyword">if</span> not exists mysql.records (`_id` mediumtext,
`age` bigint,
`balance` mediumtext,
`company__about` mediumtext,
`company__address` mediumtext,
`company__email` mediumtext,
`company__latitude` decimal(30,9),
`company__longitude` decimal(30,9),
`company__name` mediumtext,
`company__phone` mediumtext,
`company__registered` mediumtext,
`isactive` varchar(255),
`name` mediumtext,
`picture` mediumtext,
`tags` json,
`_sling_loaded_at` bigint)
11:23PM INF created table mysql.records
11:23PM DBG insert into `mysql`.`records` (`_id`, `age`, `balance`, `company__about`, `company__address`, `company__email`, `company__latitude`, `company__longitude`, `company__name`, `company__phone`, `company__registered`, `isactive`, `name`, `picture`, `tags`, `_sling_loaded_at`) select `_id`, `age`, `balance`, `company__about`, `company__address`, `company__email`, `company__latitude`, `company__longitude`, `company__name`, `company__phone`, `company__registered`, `isactive`, `name`, `picture`, `tags`, `_sling_loaded_at` from `mysql`.`records_tmp`
11:23PM DBG inserted rows into `mysql.records` from temp table `mysql.records_tmp`
11:23PM INF inserted 500 rows <span class="hljs-keyword">in</span> 3 secs [151 r/s]
11:23PM DBG drop table <span class="hljs-keyword">if</span> exists mysql.records_tmp
11:23PM DBG table mysql.records_tmp dropped
11:23PM INF execution succeeded
</code></pre>
<p>Amazing, we can see the DDL now is:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> mysql.records (<span class="hljs-string">`_id`</span> mediumtext,
<span class="hljs-string">`age`</span> <span class="hljs-built_in">bigint</span>,
<span class="hljs-string">`balance`</span> mediumtext,
<span class="hljs-string">`company__about`</span> mediumtext,
<span class="hljs-string">`company__address`</span> mediumtext,
<span class="hljs-string">`company__email`</span> mediumtext,
<span class="hljs-string">`company__latitude`</span> <span class="hljs-built_in">decimal</span>(<span class="hljs-number">30</span>,<span class="hljs-number">9</span>),
<span class="hljs-string">`company__longitude`</span> <span class="hljs-built_in">decimal</span>(<span class="hljs-number">30</span>,<span class="hljs-number">9</span>),
<span class="hljs-string">`company__name`</span> mediumtext,
<span class="hljs-string">`company__phone`</span> mediumtext,
<span class="hljs-string">`company__registered`</span> mediumtext,
<span class="hljs-string">`isactive`</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">`name`</span> mediumtext,
<span class="hljs-string">`picture`</span> mediumtext,
<span class="hljs-string">`tags`</span> <span class="hljs-keyword">json</span>,
<span class="hljs-string">`_sling_loaded_at`</span> <span class="hljs-built_in">bigint</span>)
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>As demonstrated, Sling has wide compatibility with various storage systems. You can ingest not only JSON files, but also CSV and XML files, as well as data from various database systems. If you have any questions, comments and/or are facing issues, please feel free to email us at <code>support</code> @ <code>slingdata.io</code>.</p>
]]></content:encoded></item><item><title><![CDATA[How to Import JSON files From AWS S3 into BigQuery]]></title><description><![CDATA[BigQuery & Your S3 Files
BigQuery is a cloud-based, fully managed, serverless data warehouse that enables you to analyze large and complex datasets using SQL queries. It is a scalable, highly-performant, and cost-effective solution that can handle pe...]]></description><link>https://blog.slingdata.io/import-json-files-from-s3-into-bigquery</link><guid isPermaLink="true">https://blog.slingdata.io/import-json-files-from-s3-into-bigquery</guid><category><![CDATA[bigquery]]></category><category><![CDATA[Amazon S3]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[ETL]]></category><category><![CDATA[json]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Mon, 12 Dec 2022 14:31:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1670463772422/snRcl07jh.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-bigquery-amp-your-s3-files">BigQuery &amp; Your S3 Files</h1>
<p>BigQuery is a cloud-based, fully managed, serverless data warehouse that enables you to analyze large and complex datasets using SQL queries. It is a scalable, highly-performant, and cost-effective solution that can handle petabyte-scale data with ease. And since BigQuery is part of the Google Cloud Platform (GCP) and is built on top of Google's infrastructure, it is highly reliable and available.</p>
<p>However, suppose that you have files that you wish to ingest into BigQuery, but the files are located on AWS S3. In fact, they could be located on any S3-compatible system, such as AWS S3, <a target="_blank" href="https://www.digitalocean.com/products/spaces">DigitalOcean Spaces</a>, <a target="_blank" href="https://www.backblaze.com/b2/cloud-storage.html">BackBlaze B2</a>, <a target="_blank" href="https://www.cloudflare.com/products/r2/">Cloudflare R2</a>, <a target="_blank" href="https://min.io/">MinIO</a> or <a target="_blank" href="https://wasabi.com/">Wasabi</a>. What to do? Manually transfer all those to a Google Cloud Storage bucket? How about the destination table that needs to be created? Will you manually define all the columns that are part of the JSON file? There is a better way: use Sling.</p>
<h1 id="heading-sling-cli">Sling CLI</h1>
<p>Sling CLI is a versatile tool that enables you to transfer data quickly and efficiently between files and databases, as well as between databases. It is capable of reading files in the JSON format, among other file types. Please see <a target="_blank" href="https://docs.slingdata.io/connections/file-connections">here</a> for the full list of compatible connectors.</p>
<h2 id="heading-installation">Installation</h2>
<p>Installing Sling is straightforward, regardless of the operating system you are using. This is because it is built using the <code>go</code> programming language and compiles to a single binary file, making it easy to install and run on any system.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># On Linux</span>
curl -LO <span class="hljs-string">'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz'</span> \
  &amp;&amp; tar xf sling_linux_amd64.tar.gz \
  &amp;&amp; rm -f sling_linux_amd64.tar.gz \
  &amp;&amp; chmod +x sling
</code></pre>
<p>Please see <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for additional installation options (such as downloading binaries). Additionally, there is a <a target="_blank" href="https://pypi.org/project/sling/">Python wrapper</a> library available for Sling that allows you to interact with it from within a Python environment. This can be useful if you prefer to work with Python for data management tasks.</p>
<p>Once installed, you should be able to use the <code>sling</code> command to access Sling's functionality from the command line.</p>
<h2 id="heading-loading-the-data">Loading the data</h2>
<p>Let us assume that we desire to ingest the following JSON array file, which includes nested objects:</p>
<pre><code class="lang-json">[
 {
   <span class="hljs-attr">"_id"</span>: <span class="hljs-string">"638f4cabc6a57ef261c4d28c"</span>,
   <span class="hljs-attr">"isActive"</span>: <span class="hljs-literal">false</span>,
   <span class="hljs-attr">"balance"</span>: <span class="hljs-string">"3,560.61"</span>,
   <span class="hljs-attr">"picture"</span>: <span class="hljs-string">"http://placehold.it/32x32"</span>,
   <span class="hljs-attr">"age"</span>: <span class="hljs-number">26</span>,
   <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Winters Koch"</span>,
   <span class="hljs-attr">"company"</span>: {
     <span class="hljs-attr">"name"</span>: <span class="hljs-string">"EXOZENT"</span>,
     <span class="hljs-attr">"email"</span>: <span class="hljs-string">"winterskoch@exozent.com"</span>,
     <span class="hljs-attr">"phone"</span>: <span class="hljs-string">"+1 (808) 519-3916"</span>,
     <span class="hljs-attr">"address"</span>: <span class="hljs-string">"744 Strickland Avenue, Mooresburg, Connecticut, 1850"</span>,
     <span class="hljs-attr">"about"</span>: <span class="hljs-string">"Aute cupidatat sint incididunt ullamco sint velit consectetur consectetur nostrud eiusmod velit nostrud voluptate ipsum. Et velit officia excepteur excepteur eu nisi occaecat. Sunt culpa eu nostrud in sint occaecat nulla labore pariatur adipisicing ex irure voluptate qui.\r\n"</span>,
     <span class="hljs-attr">"registered"</span>: <span class="hljs-string">"2020-07-12T09:54:18 +03:00"</span>,
     <span class="hljs-attr">"latitude"</span>: <span class="hljs-number">43.588661</span>,
     <span class="hljs-attr">"longitude"</span>: <span class="hljs-number">-136.979445</span>
   },
   <span class="hljs-attr">"tags"</span>: [
     <span class="hljs-string">"deserunt"</span>,
     <span class="hljs-string">"duis"</span>
   ]
 },
...
]
</code></pre>
<p>Let us also assume that the file is located on our S3 bucket (at path <code>s3://sling-bucket/records.json</code>). Ingesting it is as easy as running the following commands.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># let's set our connections. We can set them with `sling conns`</span>
$ sling conns <span class="hljs-built_in">set</span> AWS_S3 <span class="hljs-built_in">type</span>=s3 bucket=sling-bucket access_key_id=<span class="hljs-variable">$ACCESS_KEY_ID</span> secret_access_key=<span class="hljs-string">"<span class="hljs-variable">$SECRET_ACCESS_KEY</span>"</span>
10:14PM INF connection `AWS_S3` has been <span class="hljs-built_in">set</span> <span class="hljs-keyword">in</span> /Users/fritz/.sling/env.yaml. Please <span class="hljs-built_in">test</span> with `sling conns <span class="hljs-built_in">test</span> AWS_S3`

$ sling conns <span class="hljs-built_in">set</span> BIGQUERY <span class="hljs-built_in">type</span>=bigquery project=my-project dataset=public key_file=/path/to/service.account.json
10:14PM INF connection `BIGQUERY` has been <span class="hljs-built_in">set</span> <span class="hljs-keyword">in</span> /Users/fritz/.sling/env.yaml. Please <span class="hljs-built_in">test</span> with `sling conns <span class="hljs-built_in">test</span> BIGQUERY`

<span class="hljs-comment"># let's check and test our connections</span>
$ sling conns list
+------------+------------------+-----------------+
| CONN NAME  | CONN TYPE        | SOURCE          |
+------------+------------------+-----------------+
| AWS_S3     | FileSys - S3     | sling env yaml  |
| BIGQUERY   | DB - BigQuery    | sling env yaml  |
+------------+------------------+-----------------+

$ sling conns <span class="hljs-built_in">test</span> AWS_S3
10:15PM INF success!

$ sling conns <span class="hljs-built_in">test</span> BIGQUERY
10:15PM INF success!

<span class="hljs-comment"># great, now we can run our task</span>
$ sling run --src-conn AWS_S3 --src-stream s3://sling-bucket/records.json --tgt-conn BIGQUERY --tgt-object public.records --mode full-refresh
10:18PM INF connecting to target database (bigquery)
10:18PM INF reading from <span class="hljs-built_in">source</span> file system (s3)
10:18PM INF writing to target database [mode: full-refresh]
10:18PM INF streaming data
10:18PM INF importing into bigquery via <span class="hljs-built_in">local</span> storage
10:18PM INF dropped table public.records
10:18PM INF created table public.records
10:18PM INF inserted 500 rows <span class="hljs-keyword">in</span> 9 secs
10:18PM INF execution succeeded
</code></pre>
<p>Cool, huh? We didn't even have to create the destination table ourselves. If we were to ingest numerous files inside a folder (say <code>s3://sling-bucket/path/to/my/folder</code>), we could just input that instead of the file path; Sling will read all files in the folder.</p>
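<p>For example, a folder-level run would look like this (a sketch; the folder path is the illustrative one mentioned above, and the run output is omitted):</p>
<pre><code class="lang-bash"># point --src-stream at a folder prefix instead of a single file
sling run --src-conn AWS_S3 --src-stream s3://sling-bucket/path/to/my/folder/ --tgt-conn BIGQUERY --tgt-object public.records --mode full-refresh
</code></pre>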
<h2 id="heading-flattening-our-json-records">Flattening our JSON Records</h2>
<p>To flatten a JSON file when using Sling, you can specify <code>flatten: true</code> in the source options. This will cause Sling to create individual columns for each key in the records, rather than storing the data in a nested format. For more information on Sling's configuration options, please see the <a target="_blank" href="https://docs.slingdata.io/sling-cli/configuration">documentation</a>. To pass the <code>flatten: true</code> option, we use the <code>--src-options</code> flag, as shown below. We'll also use the <code>-d</code> flag to turn on <code>debug</code> mode, so we can see the DDL being created.</p>
<pre><code class="lang-bash">$ sling run -d --src-conn AWS_S3 --src-stream s3://sling-bucket/records.json --src-options <span class="hljs-string">'flatten: true'</span> --tgt-conn BIGQUERY --tgt-object public.records --mode full-refresh
10:23PM INF connecting to target database (bigquery)
10:23PM INF reading from <span class="hljs-built_in">source</span> file system (s3)
10:23PM DBG reading datastream from s3://sling-bucket/records.json
10:23PM INF writing to target database [mode: full-refresh]
10:23PM DBG drop table <span class="hljs-keyword">if</span> exists public.records_tmp
10:23PM DBG table public.records_tmp dropped
10:23PM DBG create table public.records_tmp (`_id` string,
`age` int64,
`balance` string,
`company__about` string,
`company__address` string,
`company__email` string,
`company__latitude` numeric,
`company__longitude` numeric,
`company__name` string,
`company__phone` string,
`company__registered` string,
`isactive` bool,
`name` string,
`picture` string,
`tags` json,
`_sling_loaded_at` int64)
10:23PM INF streaming data
10:23PM INF importing into bigquery via <span class="hljs-built_in">local</span> storage
10:23PM DBG Loading /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/bigquery/public.records_tmp/2022-12-07T222336.731/part.01.0001.csv.gz
10:23PM DBG select count(1) cnt from public.records_tmp
10:23PM DBG comparing checksums []string{<span class="hljs-string">"_id"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"balance"</span>, <span class="hljs-string">"company__about"</span>, <span class="hljs-string">"company__address"</span>, <span class="hljs-string">"company__email"</span>, <span class="hljs-string">"company__latitude"</span>, <span class="hljs-string">"company__longitude"</span>, <span class="hljs-string">"company__name"</span>, <span class="hljs-string">"company__phone"</span>, <span class="hljs-string">"company__registered"</span>, <span class="hljs-string">"isactive"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"picture"</span>, <span class="hljs-string">"tags"</span>, <span class="hljs-string">"_sling_loaded_at"</span>} vs []string{<span class="hljs-string">"_id"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"balance"</span>, <span class="hljs-string">"company__about"</span>, <span class="hljs-string">"company__address"</span>, <span class="hljs-string">"company__email"</span>, <span class="hljs-string">"company__latitude"</span>, <span class="hljs-string">"company__longitude"</span>, <span class="hljs-string">"company__name"</span>, <span class="hljs-string">"company__phone"</span>, <span class="hljs-string">"company__registered"</span>, <span class="hljs-string">"isActive"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"picture"</span>, <span class="hljs-string">"tags"</span>, <span class="hljs-string">"_sling_loaded_at"</span>}: []string{<span class="hljs-string">"_id"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"balance"</span>, <span class="hljs-string">"company__about"</span>, <span class="hljs-string">"company__address"</span>, <span class="hljs-string">"company__email"</span>, <span class="hljs-string">"company__latitude"</span>, <span class="hljs-string">"company__longitude"</span>, <span class="hljs-string">"company__name"</span>, <span class="hljs-string">"company__phone"</span>, <span class="hljs-string">"company__registered"</span>, <span class="hljs-string">"isactive"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"picture"</span>, <span class="hljs-string">"tags"</span>, <span class="hljs-string">"_sling_loaded_at"</span>}
10:23PM DBG drop table <span class="hljs-keyword">if</span> exists public.records
10:23PM DBG table public.records dropped
10:23PM INF dropped table public.records
10:23PM DBG create table public.records (`_id` string,
`age` int64,
`balance` string,
`company__about` string,
`company__address` string,
`company__email` string,
`company__latitude` numeric,
`company__longitude` numeric,
`company__name` string,
`company__phone` string,
`company__registered` string,
`isactive` bool,
`name` string,
`picture` string,
`tags` json,
`_sling_loaded_at` int64)
10:23PM INF created table public.records
10:23PM DBG insert into `public`.`records` (`_id`, `age`, `balance`, `company__about`, `company__address`, `company__email`, `company__latitude`, `company__longitude`, `company__name`, `company__phone`, `company__registered`, `isactive`, `name`, `picture`, `tags`, `_sling_loaded_at`) select `_id`, `age`, `balance`, `company__about`, `company__address`, `company__email`, `company__latitude`, `company__longitude`, `company__name`, `company__phone`, `company__registered`, `isactive`, `name`, `picture`, `tags`, `_sling_loaded_at` from `public`.`records_tmp`
10:23PM DBG inserted rows into `public.records` from temp table `public.records_tmp`
10:23PM INF inserted 500 rows <span class="hljs-keyword">in</span> 25 secs
10:23PM DBG drop table <span class="hljs-keyword">if</span> exists public.records_tmp
10:23PM DBG table public.records_tmp dropped
10:23PM INF execution succeeded
</code></pre>
<p>Wonderful, notice the DDL:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> public.records (<span class="hljs-string">`_id`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`age`</span> int64,
<span class="hljs-string">`balance`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`company__about`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`company__address`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`company__email`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`company__latitude`</span> <span class="hljs-built_in">numeric</span>,
<span class="hljs-string">`company__longitude`</span> <span class="hljs-built_in">numeric</span>,
<span class="hljs-string">`company__name`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`company__phone`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`company__registered`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`isactive`</span> <span class="hljs-built_in">bool</span>,
<span class="hljs-string">`name`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`picture`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`tags`</span> <span class="hljs-keyword">json</span>,
<span class="hljs-string">`_sling_loaded_at`</span> int64)
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>As demonstrated, Sling is compatible with a wide range of storage systems. You can ingest not only JSON files from cloud storage, but also CSV and XML files, and move data between various database systems (such as PostgreSQL, Redshift, Oracle, SQL Server, Snowflake, and more). Give it a spin! If you have any questions or comments, or are facing issues, please feel free to email us at <code>support</code> @ <code>slingdata.io</code>.</p>
]]></content:encoded></item><item><title><![CDATA[How to Quickly Import Local JSON files into PostgreSQL]]></title><description><![CDATA[The Most Versatile Database System
PostgreSQL was not always as popular as it is today. Over the years, the open-source database has increasingly seen adoption and praise as it has been an extremely stable, versatile and cost-efficient system. Many t...]]></description><link>https://blog.slingdata.io/import-local-json-files-into-postgresql</link><guid isPermaLink="true">https://blog.slingdata.io/import-local-json-files-into-postgresql</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[json]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[ETL]]></category><category><![CDATA[automation]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Thu, 08 Dec 2022 15:00:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1670460767341/z8IyqT8Ov.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-the-most-versatile-database-system">The Most Versatile Database System</h1>
<p>PostgreSQL was not always as <a target="_blank" href="https://db-engines.com/en/ranking_trend/relational+dbms">popular</a> as it is today. Over the years, the open-source database has seen increasing adoption and praise, as it has proven to be an extremely stable, versatile and cost-efficient system. Many teams even use it as their de facto <a target="_blank" href="https://en.wikipedia.org/wiki/Data_warehouse">Data Warehouse</a>. It can easily handle terabytes of data, which is more than enough for most businesses out there.</p>
<h1 id="heading-json-file-structure">JSON File Structure</h1>
<p>JSON is a file format that grew in popularity with the rise of the internet. Due to its readability and its flexibility to represent any desired tree structure, it is used across systems, especially as a means to deliver messages. It is also often used for storing log messages, so it makes sense to have advanced capabilities for reading JSON files.</p>
<h1 id="heading-using-sling-cli">Using Sling CLI</h1>
<p>Sling CLI is a powerful tool which allows you to efficiently move data from files into databases, from databases to databases, and from databases to files. One of the file types it can read is JSON. Please see <a target="_blank" href="https://docs.slingdata.io/connections/file-connections">here</a> for the full list of compatible connectors.</p>
<h2 id="heading-installation">Installation</h2>
<p>Installing Sling is super-easy regardless of the operating system that you are using. Since it is built in <code>go</code>, it compiles to a single binary.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># On Linux</span>
curl -LO <span class="hljs-string">'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz'</span> \
  &amp;&amp; tar xf sling_linux_amd64.tar.gz \
  &amp;&amp; rm -f sling_linux_amd64.tar.gz \
  &amp;&amp; chmod +x sling
</code></pre>
<p>Please see <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for additional installation options (such as downloading binaries). There is also a <a target="_blank" href="https://pypi.org/project/sling/">Python wrapper</a> library, which is useful if you prefer interacting with Sling inside of Python. Once installed, we should be able to run the <code>sling</code> command.</p>
<h2 id="heading-loading-from-local-drive">Loading from Local Drive</h2>
<p>Let us assume that we desire to ingest the following JSON array file, which includes nested objects:</p>
<pre><code class="lang-json">[
 {
   <span class="hljs-attr">"_id"</span>: <span class="hljs-string">"638f4bed04fa9d18613af369"</span>,
   <span class="hljs-attr">"isActive"</span>: <span class="hljs-literal">false</span>,
   <span class="hljs-attr">"balance"</span>: <span class="hljs-string">"1,916.14"</span>,
   <span class="hljs-attr">"picture"</span>: <span class="hljs-string">"http://placehold.it/32x32"</span>,
   <span class="hljs-attr">"age"</span>: <span class="hljs-number">30</span>,
   <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Ebony Jensen"</span>,
   <span class="hljs-attr">"company"</span>: {
     <span class="hljs-attr">"name"</span>: <span class="hljs-string">"ZAPHIRE"</span>,
     <span class="hljs-attr">"email"</span>: <span class="hljs-string">"ebonyjensen@zaphire.com"</span>,
     <span class="hljs-attr">"phone"</span>: <span class="hljs-string">"+1 (980) 571-3202"</span>,
     <span class="hljs-attr">"address"</span>: <span class="hljs-string">"958 Elmwood Avenue, Gouglersville, Massachusetts, 9126"</span>,
     <span class="hljs-attr">"about"</span>: <span class="hljs-string">"Sunt velit adipisicing dolore aliqua aliquip. Amet aliqua Lorem ea est laboris magna. Commodo aliqua sint cupidatat qui ut cupidatat nostrud sunt proident duis sunt commodo eu.\r\n"</span>,
     <span class="hljs-attr">"registered"</span>: <span class="hljs-string">"2018-05-06T02:20:10 +03:00"</span>,
     <span class="hljs-attr">"latitude"</span>: <span class="hljs-number">-1.126455</span>,
     <span class="hljs-attr">"longitude"</span>: <span class="hljs-number">-161.187586</span>
   },
   <span class="hljs-attr">"tags"</span>: [
     <span class="hljs-string">"minim"</span>,
     <span class="hljs-string">"commodo"</span>
   ]
 },
...
]
</code></pre>
<p>If the file is located on our local hard drive (at path <code>/tmp/records.json</code>), it’s as easy as running the command below. If you'd like to ingest numerous files inside a folder (say <code>/path/to/my/folder</code>), you could just input that instead of the file path; Sling will read all files in the folder. Just make sure to add the <code>file://</code> prefix, as shown below.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># first let's set our PG connection. Sling can pick up connection URLs from environment variables</span>
$ <span class="hljs-built_in">export</span> POSTGRES=<span class="hljs-string">'postgresql://postgres:myPassword@pghost:5432/postgres?sslmode=disable'</span>

<span class="hljs-comment"># let's check and test our PG connection</span>
$ sling conns list
+------------+------------------+-----------------+
| CONN NAME  | CONN TYPE        | SOURCE          |
+------------+------------------+-----------------+
| POSTGRES   | DB - PostgreSQL  | env variable    |
+------------+------------------+-----------------+

$ sling conns <span class="hljs-built_in">test</span> POSTGRES
8:05AM INF success!

<span class="hljs-comment"># great, now we can run our task</span>
$ sling run --src-stream file:///tmp/records.json --tgt-conn POSTGRES --tgt-object public.records --mode full-refresh
11:09AM INF connecting to target database (postgres)
11:09AM INF reading from <span class="hljs-built_in">source</span> file system (file)
11:09AM INF writing to target database [mode: full-refresh]
11:09AM INF streaming data
11:09AM INF dropped table public.records
11:09AM INF created table public.records
11:09AM INF inserted 500 rows <span class="hljs-keyword">in</span> 0 secs [1,556 r/s]
11:09AM INF execution succeeded
</code></pre>
<p>Easy, huh? Let's do this again in <code>debug</code> mode (with flag <code>-d</code>) to see the created table DDL, and this time let's pipe in the data with the <code>cat</code> command:</p>
<pre><code class="lang-bash">$ cat /tmp/records.json | sling run -d --tgt-conn POSTGRES --tgt-object public.records --mode full-refresh
11:10AM DBG <span class="hljs-built_in">type</span> is file-db
11:10AM INF connecting to target database (postgres)
11:10AM INF reading from stream (stdin)
11:10AM DBG reading datastream from /tmp/records.json
11:10AM INF writing to target database [mode: full-refresh]
11:10AM DBG drop table <span class="hljs-keyword">if</span> exists public.records_tmp cascade
11:10AM DBG table public.records_tmp dropped
11:10AM DBG create table <span class="hljs-keyword">if</span> not exists public.records_tmp (<span class="hljs-string">"data"</span> jsonb)
11:10AM INF streaming data
11:10AM DBG select count(1) cnt from public.records_tmp
11:10AM DBG drop table <span class="hljs-keyword">if</span> exists public.records cascade
11:10AM DBG table public.records dropped
11:10AM INF dropped table public.records
11:10AM DBG create table <span class="hljs-keyword">if</span> not exists public.records (<span class="hljs-string">"data"</span> jsonb)
11:10AM INF created table public.records
11:10AM DBG insert into <span class="hljs-string">"public"</span>.<span class="hljs-string">"records"</span> (<span class="hljs-string">"data"</span>) select <span class="hljs-string">"data"</span> from <span class="hljs-string">"public"</span>.<span class="hljs-string">"records_tmp"</span>
11:10AM DBG inserted rows into `public.records` from temp table `public.records_tmp`
11:10AM INF inserted 500 rows <span class="hljs-keyword">in</span> 0 secs [1,778 r/s]
11:10AM DBG drop table <span class="hljs-keyword">if</span> exists public.records_tmp cascade
11:10AM DBG table public.records_tmp dropped
11:10AM INF execution succeeded
</code></pre>
<p>We can see the DDL is <code>create table if not exists public.records ("data" jsonb)</code>.</p>
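<p>Since the whole record lands in a single <code>jsonb</code> column, we can still query nested values with PostgreSQL's JSON operators. For example (a sketch reusing the <code>$POSTGRES</code> URL we exported earlier; the field names come from the sample file):</p>
<pre><code class="lang-bash"># pull a couple of nested fields out of the jsonb "data" column
psql "$POSTGRES" -c "select data-&gt;&gt;'name' as name, data-&gt;'company'-&gt;&gt;'email' as email from public.records limit 3"
</code></pre>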
<h2 id="heading-flattening-our-json-records">Flattening our JSON Records</h2>
<p>Now, let's flatten the same file when ingesting. When we flatten, Sling will create individual columns for each of the keys in the records. We can do so by adding <code>--src-options 'flatten: true'</code> as a flag. See <a target="_blank" href="https://docs.slingdata.io/sling-cli/configuration">here</a> for all options:</p>
<pre><code class="lang-bash">$ cat /tmp/records.json | sling run -d --src-options <span class="hljs-string">'flatten: true'</span> --tgt-conn POSTGRES --tgt-object public.records --mode full-refresh
11:13AM DBG <span class="hljs-built_in">type</span> is file-db
11:13AM INF connecting to target database (postgres)
11:13AM INF reading from stream (stdin)
11:13AM DBG reading datastream from /tmp/records.json
11:13AM INF writing to target database [mode: full-refresh]
11:13AM DBG drop table <span class="hljs-keyword">if</span> exists public.records_tmp cascade
11:13AM DBG table public.records_tmp dropped
11:13AM DBG create table <span class="hljs-keyword">if</span> not exists public.records_tmp (<span class="hljs-string">"_id"</span> varchar(255),
<span class="hljs-string">"age"</span> <span class="hljs-built_in">integer</span>,
<span class="hljs-string">"balance"</span> varchar(255),
<span class="hljs-string">"company__about"</span> text,
<span class="hljs-string">"company__address"</span> varchar(255),
<span class="hljs-string">"company__email"</span> varchar(255),
<span class="hljs-string">"company__latitude"</span> numeric,
<span class="hljs-string">"company__longitude"</span> numeric,
<span class="hljs-string">"company__name"</span> varchar(255),
<span class="hljs-string">"company__phone"</span> varchar(255),
<span class="hljs-string">"company__registered"</span> varchar(255),
<span class="hljs-string">"isactive"</span> bool,
<span class="hljs-string">"name"</span> varchar(255),
<span class="hljs-string">"picture"</span> varchar(255),
<span class="hljs-string">"tags"</span> jsonb)
11:13AM INF streaming data
11:13AM DBG select count(1) cnt from public.records_tmp
11:13AM DBG drop table <span class="hljs-keyword">if</span> exists public.records cascade
11:13AM DBG table public.records dropped
11:13AM INF dropped table public.records
11:13AM DBG create table <span class="hljs-keyword">if</span> not exists public.records (<span class="hljs-string">"_id"</span> varchar(255),
<span class="hljs-string">"age"</span> <span class="hljs-built_in">integer</span>,
<span class="hljs-string">"balance"</span> varchar(255),
<span class="hljs-string">"company__about"</span> text,
<span class="hljs-string">"company__address"</span> varchar(255),
<span class="hljs-string">"company__email"</span> varchar(255),
<span class="hljs-string">"company__latitude"</span> numeric,
<span class="hljs-string">"company__longitude"</span> numeric,
<span class="hljs-string">"company__name"</span> varchar(255),
<span class="hljs-string">"company__phone"</span> varchar(255),
<span class="hljs-string">"company__registered"</span> varchar(255),
<span class="hljs-string">"isactive"</span> bool,
<span class="hljs-string">"name"</span> varchar(255),
<span class="hljs-string">"picture"</span> varchar(255),
<span class="hljs-string">"tags"</span> jsonb)
11:13AM INF created table public.records
11:13AM DBG insert into <span class="hljs-string">"public"</span>.<span class="hljs-string">"records"</span> (<span class="hljs-string">"_id"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"balance"</span>, <span class="hljs-string">"company__about"</span>, <span class="hljs-string">"company__address"</span>, <span class="hljs-string">"company__email"</span>, <span class="hljs-string">"company__latitude"</span>, <span class="hljs-string">"company__longitude"</span>, <span class="hljs-string">"company__name"</span>, <span class="hljs-string">"company__phone"</span>, <span class="hljs-string">"company__registered"</span>, <span class="hljs-string">"isactive"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"picture"</span>, <span class="hljs-string">"tags"</span>) select <span class="hljs-string">"_id"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"balance"</span>, <span class="hljs-string">"company__about"</span>, <span class="hljs-string">"company__address"</span>, <span class="hljs-string">"company__email"</span>, <span class="hljs-string">"company__latitude"</span>, <span class="hljs-string">"company__longitude"</span>, <span class="hljs-string">"company__name"</span>, <span class="hljs-string">"company__phone"</span>, <span class="hljs-string">"company__registered"</span>, <span class="hljs-string">"isactive"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"picture"</span>, <span class="hljs-string">"tags"</span> from <span class="hljs-string">"public"</span>.<span class="hljs-string">"records_tmp"</span>
11:13AM DBG inserted rows into `public.records` from temp table `public.records_tmp`
11:13AM INF inserted 500 rows <span class="hljs-keyword">in</span> 0 secs [1,153 r/s]
11:13AM DBG drop table <span class="hljs-keyword">if</span> exists public.records_tmp cascade
11:13AM DBG table public.records_tmp dropped
11:13AM INF execution succeeded
</code></pre>
<p>Nice, notice the DDL:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> public.records (<span class="hljs-string">"_id"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"age"</span> <span class="hljs-built_in">integer</span>,
<span class="hljs-string">"balance"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"company__about"</span> <span class="hljs-built_in">text</span>,
<span class="hljs-string">"company__address"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"company__email"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"company__latitude"</span> <span class="hljs-built_in">numeric</span>,
<span class="hljs-string">"company__longitude"</span> <span class="hljs-built_in">numeric</span>,
<span class="hljs-string">"company__name"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"company__phone"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"company__registered"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"isactive"</span> <span class="hljs-built_in">bool</span>,
<span class="hljs-string">"name"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"picture"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"tags"</span> jsonb)
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>As demonstrated, Sling CLI is an effective tool to quickly ingest JSON data. You can ingest not only JSON files, but also CSV and XML files, as well as data from various database systems. If you have any questions or comments, or are facing issues, please feel free to email us at <code>support</code> @ <code>slingdata.io</code>.</p>
]]></content:encoded></item><item><title><![CDATA[Want to Ingest Files into a Database? Use Sling to lessen the pain.]]></title><description><![CDATA[Ingesting Data Files Today
For as long as data practitioners have been loading files into Relational Database Management Systems (RDBMS), it has been necessary to first manually create the destination table, via Data Definition Language (DDL). This p...]]></description><link>https://blog.slingdata.io/ingest-files-into-database</link><guid isPermaLink="true">https://blog.slingdata.io/ingest-files-into-database</guid><category><![CDATA[Databases]]></category><category><![CDATA[ETL]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[json]]></category><category><![CDATA[xml]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Sun, 27 Nov 2022 13:27:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/lRoX0shwjUQ/upload/v1669484131959/dySvawb8k.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-ingesting-data-files-today">Ingesting Data Files Today</h1>
<p>For as long as data practitioners have been loading files into Relational Database Management Systems (RDBMS), it has been necessary to first manually create the destination table via Data Definition Language (DDL). This process is definitely not consistent, as it can range from fairly quick to extremely tedious, especially for massive files with lots of columns. And what about unstructured / semi-structured files? Forget about it: when data is deeply nested, it can be a huge pain to extract the values into their own proper columns (and determine their proper types).</p>
<h2 id="heading-manually-defining-the-ddl">Manually Defining the DDL</h2>
<p>Let’s go through a typical scenario of doing this today. Normally, a tool that allows you to ingest a CSV file first needs to sample the first hundred or so rows and have you confirm the column types. Now imagine if this file has 50+ columns! Ok, no biggie, you confirm the column types and the tool starts ingesting the 10 million row file. It’s chugging along well, and 2 million rows in, column 46, which was typed as an <code>integer</code>, suddenly encounters a <code>string</code> value, and the database errors out. Does that sound familiar? It’s a nightmare dealing with this, since the types determined from a sample cannot be guaranteed throughout the rest of the rows in the file. Thankfully, there is a nifty tool called <code>sling</code>.</p>
<h1 id="heading-using-sling-cli-to-load-files-into-your-databases">Using Sling CLI to Load files into your Databases</h1>
<p>Sling CLI is a powerful tool which allows you to efficiently move data from files into databases, from databases to databases, and from databases to files. It also accepts content via the standard input (<code>stdin</code>) and can output via the standard output (<code>stdout</code>). In the next few sections, we will go over a few use-cases of ingesting files into different databases using Sling CLI. You'll notice that we do not have to define a table DDL. Sling automatically does this for us, as it continuously profiles the column values and can adjust the column type as required.</p>
<p>Furthermore, each database system has its own bulk-loading methodology. For example, PostgreSQL has the <a target="_blank" href="https://www.postgresql.org/docs/current/sql-copy.html"><code>COPY</code> function</a> where the client streams in records, Snowflake (SF) has the <a target="_blank" href="https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html"><code>COPY INTO</code> function</a> where SF ingests quickly from a Stage location, and BigQuery has <a target="_blank" href="https://cloud.google.com/bigquery/docs/loading-data">its own APIs</a> for bulk loading (such as from Google Cloud Storage or locally from an SDK client stream). Sling follows each of those methodologies so that data is ingested as efficiently as possible in a streaming fashion (holding only a small amount of data in memory).</p>
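<p>For reference, below is roughly what the manual <code>COPY</code> route looks like for PostgreSQL via <code>psql</code> (a sketch with illustrative table and file names, assuming a <code>$POSTGRES</code> connection URL in the environment); Sling issues the equivalent streaming load for you:</p>
<pre><code class="lang-bash"># the manual equivalent of what sling automates for postgres
psql "$POSTGRES" -c "\copy public.my_table from '/tmp/test_file.csv' csv header"
</code></pre>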
<p>OK, let us switch gears and get into the meat. We will go over the following use-cases:</p>
<ul>
<li>Loading a Google Cloud Storage CSV file into a PostgreSQL Database</li>
<li>Loading a Local TSV File (via stdin) into a Snowflake Database</li>
<li>Loading an Amazon S3 JSON file into a BigQuery Database</li>
</ul>
<p>But first things first: we must install <code>sling</code> on our machine and configure our credentials.</p>
<h2 id="heading-installing-sling-on-our-machine">Installing <code>sling</code> on our machine</h2>
<p>Since it is built in Go, it is offered as a binary for whatever system you are using.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># Using Python Wrapper via pip</span>
pip install sling
</code></pre>
<p>If you have a Linux system or desire to download the binary manually, please head over <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started#binary-download">here</a>. Once installed, you should be able to run the <code>sling</code> command.</p>
<h2 id="heading-configuring-our-credentials">Configuring our credentials</h2>
<p>The next step is to set up our credentials file (<code>env.yaml</code>), which needs to be located in the <a target="_blank" href="https://docs.slingdata.io/sling-cli/environment#sling-env-file-env.yaml">Sling home folder</a>. The <code>env.yaml</code> file serves as a central location to store all our local credentials. If you want to try one of the use-cases demonstrated, simply ensure that the needed connections are in the <code>env.yaml</code> file. For our demonstration purposes, the <code>env.yaml</code> file below contains the following connections:</p>
<ul>
<li><strong>DO_SPACES</strong>: a S3 Compatible bucket file system offered by DigitalOcean. This would also work with any other S3-compatible product such as <a target="_blank" href="https://www.cloudflare.com/products/r2/">CloudFlare R2</a>, <a target="_blank" href="https://min.io/">MinIO</a>, <a target="_blank" href="https://www.backblaze.com/b2/cloud-storage.html">Backblaze B2</a>, and many others including AWS S3.</li>
<li><strong>GOOGLE_STORAGE</strong>: a Google Cloud Storage bucket</li>
<li><strong>POSTGRES</strong>: a PostgreSQL database. This could be any PostgreSQL database, such as <a target="_blank" href="https://aws.amazon.com/pt/rds/postgresql/">Amazon RDS</a>, <a target="_blank" href="https://supabase.com/">Supabase</a>, <a target="_blank" href="https://fly.io/docs/postgres/">Fly.io</a>, <a target="_blank" href="https://www.digitalocean.com/products/managed-databases-postgresql">DigitalOcean</a>, or your own local instance!</li>
<li><strong>SNOWFLAKE</strong>: a Snowflake warehouse instance.</li>
<li><strong>BIGQUERY</strong>: a BigQuery warehouse instance.</li>
</ul>
<details>
<summary><strong>env.yaml</strong></summary>

<pre><code>connections:
  DO_SPACES:
    type: s3
    bucket: sling-bucket
    endpoint: nyc3.digitaloceanspaces.com
    access_key_id: "my_access_key_id"
    secret_access_key: "my_secret_access_key"

  GOOGLE_STORAGE:
    type: gs
    bucket: sling-bucket
    gc_key_file: /Users/fritz/.sling/sling-project-123-ce219ceaef9512.json

  POSTGRES:
    type: postgres
    username: fritz
    host: 10.13.123.234 
    password: "mypass123"
    port: 5432
    database: postgres_db
    schema: public

  BIGQUERY:
    type: bigquery
    project: sling-project-123
    location: US
    dataset: public
    gc_key_file: /Users/fritz/.sling/sling-project-123-ce219ceaef9512.json

  SNOWFLAKE:
    type: snowflake
    account: tzb01234.us-east-1
    username: sling
    password: "mypass123"
    database: sling
    schema: public</code></pre>

</details>

<p>Once the credentials are set, we can list our connections and test them with the <code>sling conns</code> command:</p>
<pre><code class="lang-bash">$ sling conns list
+------------------+------------------+-------------------+
| CONN NAME        | CONN TYPE        | SOURCE            |
+------------------+------------------+-------------------+
| BIGQUERY         | DB - BigQuery    | sling env yaml    |
| DO_SPACES        | FileSys - S3     | sling env yaml    |
| GOOGLE_STORAGE   | FileSys - Google | sling env yaml    |
| POSTGRES         | DB - PostgreSQL  | sling env yaml    |
| SNOWFLAKE        | DB - Snowflake   | sling env yaml    |
+------------------+------------------+-------------------+

$ sling conns <span class="hljs-built_in">test</span> BIGQUERY
3:53PM INF success!

$ sling conns <span class="hljs-built_in">test</span> DO_SPACES
3:54PM INF success!

...
</code></pre>
<p>Great, now that we have tested our connections, let's go through a few examples. Remember, if you are facing any issues with accessing your connections and successfully testing them, feel free to email us (<code>support</code> @ <code>slingdata.io</code>).</p>
<h2 id="heading-loading-a-google-cloud-storage-csv-file-into-a-postgresql-database">Loading a Google Cloud Storage CSV file into a PostgreSQL Database</h2>
<p>Let us say that we need to load files from <a target="_blank" href="https://cloud.google.com/storage">Google Cloud Storage (GCS)</a> into a PostgreSQL instance. The first step to get this flowing is to set our credentials file, as described above.</p>
<p>Now let us test our connections.</p>
<pre><code class="lang-bash">$ sling conns <span class="hljs-built_in">test</span> GOOGLE_STORAGE
10:52AM INF success!

$ sling conns <span class="hljs-built_in">test</span> POSTGRES
10:52AM INF success!
</code></pre>
<p>Let us discover the available file streams with the <code>sling conns discover</code> command:</p>
<pre><code class="lang-bash">$ sling conns discover GOOGLE_STORAGE
10:52AM INF Found 2 streams:
 - gs://sling-bucket/LargeDataset.csv
 - gs://sling-bucket/test_file.csv
</code></pre>
<p>Great, let's run our task. We want to copy the CSV file <code>gs://sling-bucket/test_file.csv</code> into table <code>public.my_table</code>, so we can run this:</p>
<pre><code class="lang-bash">$ sling run --src-conn GOOGLE_STORAGE --src-stream gs://sling-bucket/test_file.csv --tgt-conn POSTGRES --tgt-object public.my_table --mode full-refresh
11:01AM INF connecting to target database (postgres)
11:01AM INF reading from <span class="hljs-built_in">source</span> file system (gs)
11:01AM INF writing to target database [mode: full-refresh]
11:01AM INF streaming data
11:01AM INF dropped table public.my_table
11:01AM INF created table public.my_table
11:01AM INF inserted 18 rows <span class="hljs-keyword">in</span> 1 sec
11:01AM INF execution succeeded
</code></pre>
<p>We could run the same task by using a YAML config file, as well as enable <code>debug</code> mode (with flag <code>-d</code>) for more verbose logging:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Let's create our task configuration file</span>
$ <span class="hljs-built_in">echo</span> <span class="hljs-string">'
source:
  conn: GOOGLE_STORAGE
  stream: "gs://sling-bucket/test_file.csv"
target:
  conn: POSTGRES
  object: public.my_table
mode: full-refresh
'</span> &gt; /tmp/my_task.yaml

<span class="hljs-comment"># we are also providing the `-d` flag for debug mode.</span>
<span class="hljs-comment"># this allows us to have more logging details.</span>
$ sling run -d -c /tmp/my_task.yaml
11:04AM DBG <span class="hljs-built_in">type</span> is file-db
11:04AM INF connecting to target database (postgres)
11:04AM INF reading from <span class="hljs-built_in">source</span> file system (gs)
11:04AM DBG reading datastream from gs://sling-bucket/test_file.csv
11:04AM INF writing to target database [mode: full-refresh]
11:04AM DBG drop table <span class="hljs-keyword">if</span> exists public.my_table_tmp cascade
11:04AM DBG table public.my_table_tmp dropped
11:04AM DBG create table <span class="hljs-keyword">if</span> not exists public.my_table_tmp (<span class="hljs-string">"id"</span> bigint,
<span class="hljs-string">"first_name"</span> text,
<span class="hljs-string">"last_name"</span> text,
<span class="hljs-string">"email"</span> text,
<span class="hljs-string">"target"</span> bool,
<span class="hljs-string">"create_dt"</span> timestamp,
<span class="hljs-string">"rating"</span> numeric)
11:04AM INF streaming data
11:04AM DBG select count(1) cnt from public.my_table_tmp
11:04AM DBG drop table <span class="hljs-keyword">if</span> exists public.my_table cascade
11:04AM DBG table public.my_table dropped
11:04AM INF dropped table public.my_table
11:04AM DBG create table <span class="hljs-keyword">if</span> not exists public.my_table (<span class="hljs-string">"id"</span> <span class="hljs-built_in">integer</span>,
<span class="hljs-string">"first_name"</span> varchar(255),
<span class="hljs-string">"last_name"</span> varchar(255),
<span class="hljs-string">"email"</span> varchar(255),
<span class="hljs-string">"target"</span> bool,
<span class="hljs-string">"create_dt"</span> timestamp,
<span class="hljs-string">"rating"</span> numeric)
11:04AM INF created table public.my_table
11:04AM DBG insert into <span class="hljs-string">"public"</span>.<span class="hljs-string">"my_table"</span> (<span class="hljs-string">"id"</span>, <span class="hljs-string">"first_name"</span>, <span class="hljs-string">"last_name"</span>, <span class="hljs-string">"email"</span>, <span class="hljs-string">"target"</span>, <span class="hljs-string">"create_dt"</span>, <span class="hljs-string">"rating"</span>) select <span class="hljs-string">"id"</span>, <span class="hljs-string">"first_name"</span>, <span class="hljs-string">"last_name"</span>, <span class="hljs-string">"email"</span>, <span class="hljs-string">"target"</span>, <span class="hljs-string">"create_dt"</span>, <span class="hljs-string">"rating"</span> from <span class="hljs-string">"public"</span>.<span class="hljs-string">"my_table_tmp"</span>
11:04AM DBG inserted rows into `public.my_table` from temp table `public.my_table_tmp`
11:04AM INF inserted 18 rows <span class="hljs-keyword">in</span> 1 secs [17 r/s]
11:04AM DBG drop table <span class="hljs-keyword">if</span> exists public.my_table_tmp cascade
11:04AM DBG table public.my_table_tmp dropped
11:04AM INF execution succeeded
</code></pre>
<p>Notice that since we enabled debug mode with <code>-d</code>, we can see how the table DDL has been automatically determined:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> public.my_table (<span class="hljs-string">"id"</span> <span class="hljs-built_in">integer</span>,
<span class="hljs-string">"first_name"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"last_name"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"email"</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">255</span>),
<span class="hljs-string">"target"</span> <span class="hljs-built_in">bool</span>,
<span class="hljs-string">"create_dt"</span> <span class="hljs-built_in">timestamp</span>,
<span class="hljs-string">"rating"</span> <span class="hljs-built_in">numeric</span>)
</code></pre>
<p>If you'd like to see more details on configuration options, see <a target="_blank" href="https://docs.slingdata.io/sling-cli/configuration">here</a>. If you are thinking of syncing many tables, it's probably best to use a <a target="_blank" href="https://docs.slingdata.io/sling-cli/replication">replication configuration file</a>, which is also a YAML file.</p>
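<p>A minimal replication file might look something like this (a sketch; the connection and table names are illustrative, and the full set of keys is covered in the replication docs linked above):</p>
<pre><code class="lang-bash"># write a minimal replication config
cat &gt; /tmp/replication.yaml &lt;&lt;'EOF'
source: POSTGRES
target: SNOWFLAKE

defaults:
  mode: full-refresh

streams:
  public.accounts:
    object: public.accounts
  public.users:
    object: public.users
EOF

# run all streams in one invocation
sling run -r /tmp/replication.yaml
</code></pre>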
<h2 id="heading-ingesting-a-local-tsv-file-via-stdin-into-a-snowflake-database">Ingesting a Local TSV File (via stdin) into a Snowflake Database</h2>
<p>Ingesting a file via <code>stdin</code> is super easy. As expected, you can just pipe your file stream Unix style and execute a <code>sling run</code> command specifying the target connection and destination object details.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># let's first test our connection</span>
$ sling conns <span class="hljs-built_in">test</span> SNOWFLAKE
11:09AM INF success!

<span class="hljs-comment"># Let's pipe our file into a new table</span>
$ cat /tmp/accounts.tsv | sling run --tgt-conn SNOWFLAKE --tgt-object sling.accounts --mode full-refresh
11:09AM INF connecting to target database (snowflake)
11:09AM INF reading from stream (stdin)
11:09AM INF writing to target database [mode: full-refresh]
11:09AM INF streaming data
11:09AM INF dropped table sling.accounts
11:09AM INF created table sling.accounts
11:09AM INF inserted 94 rows <span class="hljs-keyword">in</span> 5 secs
11:09AM INF execution succeeded
</code></pre>
<p>Notice how we didn't specify the delimiter. Whether it's a comma-delimited file (CSV) or a tab-delimited file (TSV), sling will attempt to auto-detect the delimiter (when it's a <code>|</code>, <code>,</code>, <code>&lt;tab&gt;</code> or <code>;</code>). If you'd like to specify the delimiter manually, you can do so by adding the <code>--src-options</code> flag, as shown below:</p>
<pre><code class="lang-bash">$ cat /tmp/accounts.tsv | sling run --tgt-conn SNOWFLAKE --tgt-object sling.accounts --mode full-refresh --src-options <span class="hljs-string">'{"delimiter": "\t"}'</span>
11:09AM INF connecting to target database (snowflake)
11:09AM INF reading from stream (stdin)
11:09AM INF writing to target database [mode: full-refresh]
11:09AM INF streaming data
11:09AM INF dropped table sling.accounts
11:09AM INF created table sling.accounts
11:09AM INF inserted 94 rows <span class="hljs-keyword">in</span> 5 secs
11:09AM INF execution succeeded
</code></pre>
<p>To get a full list of advanced options, please <a target="_blank" href="https://docs.slingdata.io/sling-cli/configuration#advanced-configuration">head over here</a>.</p>
<h2 id="heading-loading-digitalocean-spaces-s3-json-files-into-a-bigquery-warehouse">Loading DigitalOcean Spaces (S3) JSON files into a BigQuery Warehouse</h2>
<p>In our third example, we will be ingesting JSON files from a DigitalOcean Spaces bucket into a BigQuery instance. Let us first discover our files while applying a filter for a specific folder with the <code>--folder</code> flag.</p>
<pre><code class="lang-bash">$ sling conns discover DO_SPACES --folder s3://sling-bucket/02-05-2022
4:36PM INF Found 21 streams:
 - s3://sling-bucket/02-05-2022/1644049120.1uXKxCrhN2WGAt2fojy6k2fqDSb.26879111-e935-4957-a99e-6b6241e6a84a.json.gz
 - s3://sling-bucket/02-05-2022/1644049299.1uXKxCrhN2WGAt2fojy6k2fqDSb.0185f4a2-fe46-4ea6-bb2b-58cee66fa14d.json.gz
 - s3://sling-bucket/02-05-2022/1644049390.1uXKxCrhN2WGAt2fojy6k2fqDSb.323c8d1b-bc66-4d17-8a26-6fde546cf54f.json.gz
 - s3://sling-bucket/02-05-2022/1644049480.1uXKxCrhN2WGAt2fojy6k2fqDSb.73cf9b8c-be06-431d-98bd-c440932d530d.json.gz
 - s3://sling-bucket/02-05-2022/1644049571.1uXKxCrhN2WGAt2fojy6k2fqDSb.68fbbf20-8a87-4836-95c5-5a4b762c97cb.json.gz
 - s3://sling-bucket/02-05-2022/1644049662.1uXKxCrhN2WGAt2fojy6k2fqDSb.23397ceb-c55b-46d9-bdf7-063ce6c54cdd.json.gz
 - s3://sling-bucket/02-05-2022/1644049752.1uXKxCrhN2WGAt2fojy6k2fqDSb.d23a1079-330e-4009-a18e-ca3ac1d087f2.json.gz
 - s3://sling-bucket/02-05-2022/1644049933.1uXKxCrhN2WGAt2fojy6k2fqDSb.be115fa8-4a75-4353-a5d7-9b1ec9e0e792.json.gz
 - s3://sling-bucket/02-05-2022/1644050114.1uXKxCrhN2WGAt2fojy6k2fqDSb.191e971e-7a18-4faa-8ccd-ddca99fe934c.json.gz
 - s3://sling-bucket/02-05-2022/1644063457.1uXKxCrhN2WGAt2fojy6k2fqDSb.e68674d0-d14c-4277-aef2-88bf778f6bb5.json.gz
 - s3://sling-bucket/02-05-2022/1644063818.1uXKxCrhN2WGAt2fojy6k2fqDSb.8545e07c-3881-4dcd-9976-b860ed85d7c9.json.gz
 - s3://sling-bucket/02-05-2022/1644064268.1uXKxCrhN2WGAt2fojy6k2fqDSb.695c8516-6d8c-46ce-ac01-9d89b4dd13e8.json.gz
 - s3://sling-bucket/02-05-2022/1644064359.1uXKxCrhN2WGAt2fojy6k2fqDSb.cd481db0-2066-4179-ba8c-8dd5f45e190e.json.gz
 - s3://sling-bucket/02-05-2022/1644097266.1uXKxCrhN2WGAt2fojy6k2fqDSb.7b06270d-5c24-40fe-958b-f2175f8018f3.json.gz
 - s3://sling-bucket/02-05-2022/1644097446.1uXKxCrhN2WGAt2fojy6k2fqDSb.18e17223-b55e-4283-a7ce-84af1f525435.json.gz
 - s3://sling-bucket/02-05-2022/1644097537.1uXKxCrhN2WGAt2fojy6k2fqDSb.4bd41934-7be2-4b10-8539-580c54092f73.json.gz
 - s3://sling-bucket/02-05-2022/1644097628.1uXKxCrhN2WGAt2fojy6k2fqDSb.49facabd-6a6f-410d-b4fc-2aebf8ccdad8.json.gz
 - s3://sling-bucket/02-05-2022/1644098354.1uXKxCrhN2WGAt2fojy6k2fqDSb.865bf8b5-7361-4779-8199-6ec20543058d.json.gz
 - s3://sling-bucket/02-05-2022/1644099094.1uXKxCrhN2WGAt2fojy6k2fqDSb.2af2cfb4-8371-4938-b560-e282f8c94d16.json.gz
 - s3://sling-bucket/02-05-2022/1644099185.1uXKxCrhN2WGAt2fojy6k2fqDSb.538e1f27-306f-4942-b79d-c012b7e9518f.json.gz
 - s3://sling-bucket/02-05-2022/1644099275.1uXKxCrhN2WGAt2fojy6k2fqDSb.55cf698c-41c7-4f1b-89df-2567506cd02b.json.gz
</code></pre>
<p>Great, now let's ingest all 21 JSON files in that folder, without flattening the nested columns (with <code>debug</code> mode enabled, flag <code>-d</code>):</p>
<pre><code class="lang-bash">$ sling run -d --src-conn DO_SPACES --src-stream s3://sling-bucket/02-05-2022/ --tgt-conn BIGQUERY --tgt-object public.json_files --mode full-refresh
4:39PM DBG <span class="hljs-built_in">type</span> is file-db
4:39PM INF connecting to target database (bigquery)
4:39PM INF reading from <span class="hljs-built_in">source</span> file system (s3)
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049120.1uXKxCrhN2WGAt2fojy6k2fqDSb.26879111-e935-4957-a99e-6b6241e6a84a.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049299.1uXKxCrhN2WGAt2fojy6k2fqDSb.0185f4a2-fe46-4ea6-bb2b-58cee66fa14d.json.gz
4:39PM INF writing to target database [mode: full-refresh]
4:39PM DBG drop table <span class="hljs-keyword">if</span> exists public.json_files_tmp
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049390.1uXKxCrhN2WGAt2fojy6k2fqDSb.323c8d1b-bc66-4d17-8a26-6fde546cf54f.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049480.1uXKxCrhN2WGAt2fojy6k2fqDSb.73cf9b8c-be06-431d-98bd-c440932d530d.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049571.1uXKxCrhN2WGAt2fojy6k2fqDSb.68fbbf20-8a87-4836-95c5-5a4b762c97cb.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049662.1uXKxCrhN2WGAt2fojy6k2fqDSb.23397ceb-c55b-46d9-bdf7-063ce6c54cdd.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049752.1uXKxCrhN2WGAt2fojy6k2fqDSb.d23a1079-330e-4009-a18e-ca3ac1d087f2.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049933.1uXKxCrhN2WGAt2fojy6k2fqDSb.be115fa8-4a75-4353-a5d7-9b1ec9e0e792.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644050114.1uXKxCrhN2WGAt2fojy6k2fqDSb.191e971e-7a18-4faa-8ccd-ddca99fe934c.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644063457.1uXKxCrhN2WGAt2fojy6k2fqDSb.e68674d0-d14c-4277-aef2-88bf778f6bb5.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644063818.1uXKxCrhN2WGAt2fojy6k2fqDSb.8545e07c-3881-4dcd-9976-b860ed85d7c9.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644064268.1uXKxCrhN2WGAt2fojy6k2fqDSb.695c8516-6d8c-46ce-ac01-9d89b4dd13e8.json.gz
4:39PM DBG table public.json_files_tmp dropped
4:39PM DBG create table public.json_files_tmp (`data` json)
4:39PM INF streaming data
4:39PM INF importing into bigquery via <span class="hljs-built_in">local</span> storage
4:39PM DBG writing to /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/bigquery/public.json_files_tmp/2022-11-26T163956.037
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644064359.1uXKxCrhN2WGAt2fojy6k2fqDSb.cd481db0-2066-4179-ba8c-8dd5f45e190e.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097266.1uXKxCrhN2WGAt2fojy6k2fqDSb.7b06270d-5c24-40fe-958b-f2175f8018f3.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097446.1uXKxCrhN2WGAt2fojy6k2fqDSb.18e17223-b55e-4283-a7ce-84af1f525435.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097537.1uXKxCrhN2WGAt2fojy6k2fqDSb.4bd41934-7be2-4b10-8539-580c54092f73.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097628.1uXKxCrhN2WGAt2fojy6k2fqDSb.49facabd-6a6f-410d-b4fc-2aebf8ccdad8.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644098354.1uXKxCrhN2WGAt2fojy6k2fqDSb.865bf8b5-7361-4779-8199-6ec20543058d.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644099094.1uXKxCrhN2WGAt2fojy6k2fqDSb.2af2cfb4-8371-4938-b560-e282f8c94d16.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644099185.1uXKxCrhN2WGAt2fojy6k2fqDSb.538e1f27-306f-4942-b79d-c012b7e9518f.json.gz
4:39PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644099275.1uXKxCrhN2WGAt2fojy6k2fqDSb.55cf698c-41c7-4f1b-89df-2567506cd02b.json.gz
4:39PM DBG Loading /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/bigquery/public.json_files_tmp/2022-11-26T163956.037/part.01.0001.csv.gz
4:40PM DBG select count(1) cnt from public.json_files_tmp
4:40PM DBG drop table <span class="hljs-keyword">if</span> exists public.json_files
4:40PM DBG table public.json_files dropped
4:40PM INF dropped table public.json_files
4:40PM DBG create table public.json_files (`data` json)
4:40PM INF created table public.json_files
4:40PM DBG insert into `public`.`json_files` (`data`) select `data` from `public`.`json_files_tmp`
4:40PM DBG inserted rows into `public.json_files` from temp table `public.json_files_tmp`
4:40PM INF inserted 27 rows <span class="hljs-keyword">in</span> 22 secs [1 r/s]
4:40PM DBG drop table <span class="hljs-keyword">if</span> exists public.json_files_tmp
4:40PM DBG table public.json_files_tmp dropped
4:40PM INF execution succeeded
</code></pre>
<p>We can see that there is only one column in the created table, since by default <code>sling</code> does not flatten nested column values:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> public.json_files (<span class="hljs-string">`data`</span> <span class="hljs-keyword">json</span>)
</code></pre>
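<p>Even without flattening, the nested values remain queryable; BigQuery's JSON functions can extract fields directly from the <code>data</code> column. For example (a sketch, assuming the <code>bq</code> CLI is authenticated against the same project and dataset; the <code>event</code> field comes from these files, as the flattened run below shows):</p>
<pre><code class="lang-bash"># extract a nested field from the json "data" column
bq query --use_legacy_sql=false 'select json_value(data, "$.event") as event from public.json_files limit 5'
</code></pre>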
<p>Now let's ingest all the files again, this time adding <code>{"flatten": true}</code> to the <code>--src-options</code> flag in order to flatten the nested values into their own columns:</p>
<pre><code class="lang-bash">$ sling run -d --src-conn DO_SPACES --src-stream s3://sling-bucket/02-05-2022/ --src-options <span class="hljs-string">'{"flatten": true}'</span> --tgt-conn BIGQUERY --tgt-object public.json_files --mode full-refresh
4:49PM DBG <span class="hljs-built_in">type</span> is file-db
4:49PM INF connecting to target database (bigquery)
4:49PM INF reading from <span class="hljs-built_in">source</span> file system (s3)
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049120.1uXKxCrhN2WGAt2fojy6k2fqDSb.26879111-e935-4957-a99e-6b6241e6a84a.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049299.1uXKxCrhN2WGAt2fojy6k2fqDSb.0185f4a2-fe46-4ea6-bb2b-58cee66fa14d.json.gz
4:49PM INF writing to target database [mode: full-refresh]
4:49PM DBG drop table <span class="hljs-keyword">if</span> exists public.json_files_tmp
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049390.1uXKxCrhN2WGAt2fojy6k2fqDSb.323c8d1b-bc66-4d17-8a26-6fde546cf54f.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049480.1uXKxCrhN2WGAt2fojy6k2fqDSb.73cf9b8c-be06-431d-98bd-c440932d530d.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049571.1uXKxCrhN2WGAt2fojy6k2fqDSb.68fbbf20-8a87-4836-95c5-5a4b762c97cb.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049662.1uXKxCrhN2WGAt2fojy6k2fqDSb.23397ceb-c55b-46d9-bdf7-063ce6c54cdd.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049752.1uXKxCrhN2WGAt2fojy6k2fqDSb.d23a1079-330e-4009-a18e-ca3ac1d087f2.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644049933.1uXKxCrhN2WGAt2fojy6k2fqDSb.be115fa8-4a75-4353-a5d7-9b1ec9e0e792.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644050114.1uXKxCrhN2WGAt2fojy6k2fqDSb.191e971e-7a18-4faa-8ccd-ddca99fe934c.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644063457.1uXKxCrhN2WGAt2fojy6k2fqDSb.e68674d0-d14c-4277-aef2-88bf778f6bb5.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644063818.1uXKxCrhN2WGAt2fojy6k2fqDSb.8545e07c-3881-4dcd-9976-b860ed85d7c9.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644064268.1uXKxCrhN2WGAt2fojy6k2fqDSb.695c8516-6d8c-46ce-ac01-9d89b4dd13e8.json.gz
4:49PM DBG table public.json_files_tmp dropped
4:49PM DBG create table public.json_files_tmp (`anonymousid` string,
`context__library__name` string,
`context__library__version` string,
`event` string,
`messageid` string,
`originaltimestamp` timestamp,
`properties__application` string,
`properties__cmd` string,
`properties__error` string,
`properties__job_end_time` timestamp,
`properties__job_mode` string,
`properties__job_rows_count` int64,
`properties__job_src_type` string,
`properties__job_start_time` timestamp,
`properties__job_status` string,
`properties__job_tgt_type` string,
`properties__job_type` string,
`properties__os` string,
`properties__version` string,
`receivedat` timestamp,
`request_ip` string,
`rudderid` string,
`sentat` timestamp,
`timestamp` string,
`<span class="hljs-built_in">type</span>` string,
`userid` string)
4:49PM INF streaming data
4:49PM INF importing into bigquery via <span class="hljs-built_in">local</span> storage
4:49PM DBG writing to /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/bigquery/public.json_files_tmp/2022-11-26T163956.037
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644064359.1uXKxCrhN2WGAt2fojy6k2fqDSb.cd481db0-2066-4179-ba8c-8dd5f45e190e.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097266.1uXKxCrhN2WGAt2fojy6k2fqDSb.7b06270d-5c24-40fe-958b-f2175f8018f3.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097446.1uXKxCrhN2WGAt2fojy6k2fqDSb.18e17223-b55e-4283-a7ce-84af1f525435.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097537.1uXKxCrhN2WGAt2fojy6k2fqDSb.4bd41934-7be2-4b10-8539-580c54092f73.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644097628.1uXKxCrhN2WGAt2fojy6k2fqDSb.49facabd-6a6f-410d-b4fc-2aebf8ccdad8.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644098354.1uXKxCrhN2WGAt2fojy6k2fqDSb.865bf8b5-7361-4779-8199-6ec20543058d.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644099094.1uXKxCrhN2WGAt2fojy6k2fqDSb.2af2cfb4-8371-4938-b560-e282f8c94d16.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644099185.1uXKxCrhN2WGAt2fojy6k2fqDSb.538e1f27-306f-4942-b79d-c012b7e9518f.json.gz
4:49PM DBG reading datastream from s3://sling-bucket/02-05-2022/1644099275.1uXKxCrhN2WGAt2fojy6k2fqDSb.55cf698c-41c7-4f1b-89df-2567506cd02b.json.gz
4:49PM DBG Loading /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/bigquery/public.json_files_tmp/2022-11-26T163956.037/part.01.0001.csv.gz
4:50PM DBG select count(1) cnt from public.json_files_tmp
4:50PM DBG drop table <span class="hljs-keyword">if</span> exists public.json_files
4:50PM DBG table public.json_files dropped
4:50PM INF dropped table public.json_files
4:50PM DBG create table public.json_files (`anonymousid` string,
`context__library__name` string,
`context__library__version` string,
`event` string,
`messageid` string,
`originaltimestamp` timestamp,
`properties__application` string,
`properties__cmd` string,
`properties__error` string,
`properties__job_end_time` timestamp,
`properties__job_mode` string,
`properties__job_rows_count` int64,
`properties__job_src_type` string,
`properties__job_start_time` timestamp,
`properties__job_status` string,
`properties__job_tgt_type` string,
`properties__job_type` string,
`properties__os` string,
`properties__version` string,
`receivedat` timestamp,
`request_ip` string,
`rudderid` string,
`sentat` timestamp,
`timestamp` string,
`<span class="hljs-built_in">type</span>` string,
`userid` string)
4:50PM INF created table public.json_files
4:50PM DBG inserted rows into `public.json_files` from temp table `public.json_files_tmp`
4:50PM INF inserted 27 rows <span class="hljs-keyword">in</span> 22 secs [1 r/s]
4:50PM DBG drop table <span class="hljs-keyword">if</span> exists public.json_files_tmp
4:50PM DBG table public.json_files_tmp dropped
4:50PM INF execution succeeded
</code></pre>
<p>We can see that the created table now has many columns, one for each nested key, with a <code>__</code> separator joining the parent and child names. Nifty huh!?</p>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> public.json_files (<span class="hljs-string">`anonymousid`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`context__library__name`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`context__library__version`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`event`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`messageid`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`originaltimestamp`</span> <span class="hljs-built_in">timestamp</span>,
<span class="hljs-string">`properties__application`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__cmd`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__error`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__job_end_time`</span> <span class="hljs-built_in">timestamp</span>,
<span class="hljs-string">`properties__job_mode`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__job_rows_count`</span> int64,
<span class="hljs-string">`properties__job_src_type`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__job_start_time`</span> <span class="hljs-built_in">timestamp</span>,
<span class="hljs-string">`properties__job_status`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__job_tgt_type`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__job_type`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__os`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`properties__version`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`receivedat`</span> <span class="hljs-built_in">timestamp</span>,
<span class="hljs-string">`request_ip`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`rudderid`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`sentat`</span> <span class="hljs-built_in">timestamp</span>,
<span class="hljs-string">`timestamp`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`type`</span> <span class="hljs-keyword">string</span>,
<span class="hljs-string">`userid`</span> <span class="hljs-keyword">string</span>)
</code></pre>
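<p>To make the naming convention concrete, here is a tiny illustration with made-up values (not from the actual dataset): a nested record like the one below yields the double-underscore column names when flattened.</p>
<pre><code class="lang-bash"># a hypothetical nested JSON record...
echo '{"context": {"library": {"name": "analytics-go", "version": "3.0.1"}}}' > nested.json

# ...flattens into columns joined by the `__` separator:
#   context__library__name, context__library__version
</code></pre>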
<h1 id="heading-conclusion">Conclusion</h1>
<p>We have demonstrated how easy it is to use the Sling CLI tool to load files from various storage sources. There are even more possibilities and configurations; check out our <a target="_blank" href="https://docs.slingdata.io">docs site</a> for more details. If you have any questions, issues or comments, feel free to email us at <code>support</code> @ <code>slingdata.io</code>.</p>
]]></content:encoded></item><item><title><![CDATA[Exporting data from BigTable and Loading it into your Data Warehouse]]></title><description><![CDATA[BigTable
You may not have heard about Google’s BigTable but it is a thriving NoSQL database offering from the juggernaut, allowing exceptional read/write performance when properly tuned. According to Google Cloud’s own blog post, BigTable manages ove...]]></description><link>https://blog.slingdata.io/bigtable-to-data-warehouse</link><guid isPermaLink="true">https://blog.slingdata.io/bigtable-to-data-warehouse</guid><category><![CDATA[bigtable]]></category><category><![CDATA[ETL]]></category><category><![CDATA[ELT]]></category><category><![CDATA[snowflake]]></category><category><![CDATA[bigquery]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Wed, 23 Nov 2022 19:59:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/L4gN0aeaPY4/upload/v1669231835077/1uhjn-dCH.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-bigtable">BigTable</h2>
<p>You may not have heard about <a target="_blank" href="https://cloud.google.com/bigtable">Google’s BigTable</a>, but it is a thriving <a target="_blank" href="https://en.wikipedia.org/wiki/NoSQL">NoSQL</a> database offering from the juggernaut, delivering exceptional read/write performance when properly tuned. According to Google Cloud’s own <a target="_blank" href="https://cloud.google.com/blog/products/databases/cloud-bigtable-now-even-easier-to-manage-with-autoscaling">blog post</a>, BigTable manages over 10 exabytes of data and serves more than 5 billion requests per second. It also offers features such as auto-scaling, aimed at optimizing costs and improving manageability.</p>
<h2 id="heading-extract-load-process">Extract-Load Process</h2>
<p>So you have a situation where you need to copy or move data out of BigTable and import it into your Data Warehouse. How should you proceed? A few Google searches make it apparent that there is a lack of tooling that properly handles BigTable as a source. Fortunately, this is not the case for Sling. The Go-powered Extract-Load (EL) tool allows you to natively connect to BigTable and export the unstructured data sets in a tabular format (CSV, TSV) or into the most popular RDBMS databases such as PostgreSQL, BigQuery, Redshift, SQL Server and Snowflake. See below for an illustration of how Sling does this when importing into Snowflake.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668383054901/obaHTSRpb.png" alt="image.png" /></p>
<h3 id="heading-using-sling-cli">Using Sling CLI</h3>
<p>Sling CLI is a command line tool (as the name suggests) which gives you great flexibility for custom automation, at the expense of doing things a little more manually. The first step is to install <code>sling</code>. Since it is built in Go, it is offered as a binary for whatever system you are using.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># Using Python Wrapper via pip</span>
pip install sling
</code></pre>
<p>If you are on Linux or prefer to download the binary manually, please head over <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started#binary-download">here</a>. Once you have it installed, the next step is to set your credentials. Sling primarily uses an <code>env.yaml</code> file located in the <code>~/.sling/</code> folder. Here is an example:</p>
<details>
<summary><strong>~/.sling/env.yaml</strong></summary>
<pre><code>connections:

  MY_BIGTABLE:
    type: bigtable
    project: sling-project-123
    location: US
    instance: big-table-instance
    gc_key_file: /Users/me/.sling/sling-project-123-ce219ceaef9512.json

  MY_SNOWFLAKE:
    type: snowflake
    username: fritz
    password: my_pass23
    account: abc123456.us-east-1
    database: sling
    schema: public</code></pre>
</details>

<p>Once the credentials are set, we can list our connections and test connectivity with the <code>sling conns list</code> and <code>sling conns test</code> commands:</p>
<pre><code class="lang-bash">$ sling conns list
+---------------+----------------+----------------+
| CONN NAME     | CONN TYPE      | SOURCE         |
+---------------+----------------+----------------+
| MY_BIGTABLE   | DB - BigTable  | sling env yaml |
| MY_SNOWFLAKE  | DB - Snowflake | sling env yaml |
+---------------+----------------+----------------+

$ sling conns <span class="hljs-built_in">test</span> MY_BIGTABLE
5:13PM INF success!

$ sling conns <span class="hljs-built_in">test</span> MY_SNOWFLAKE
5:14PM INF success!
</code></pre>
<p>Great, now we are ready to run our Extract-Load (EL) task.</p>
<pre><code class="lang-bash">$ sling run --src-conn MY_BIGTABLE --src-stream test_table3 --tgt-conn MY_SNOWFLAKE --tgt-object public.test_table3 --mode full-refresh
10:08AM INF connecting to <span class="hljs-built_in">source</span> database (bigtable)
10:08AM INF connecting to target database (snowflake)
10:08AM INF reading from <span class="hljs-built_in">source</span> database
10:08AM INF writing to target database [mode: full-refresh]
10:09AM INF streaming data
1s 10,000 33798 r/s
10:09AM INF dropped table public.test_table3
10:09AM INF created table public.test_table3
10:09AM INF inserted 10000 rows <span class="hljs-keyword">in</span> 11 secs [901 r/s]
10:09AM INF execution succeeded
</code></pre>
<p>To learn more about Sling CLI, please see <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a>.</p>
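<p>As a side note, retyping flags for recurring jobs gets tedious. The same task can be captured in a replication configuration file and run with the <code>-r</code> flag. Below is a minimal sketch reusing the connection, stream and object names from the example above; see the docs for the full replication spec:</p>
<pre><code class="lang-bash"># write a reusable replication config
cat > bigtable_to_snowflake.yaml <<'EOF'
source: MY_BIGTABLE
target: MY_SNOWFLAKE

defaults:
  mode: full-refresh

streams:
  test_table3:
    object: public.test_table3
EOF

# run the replication instead of passing flags each time
sling run -r bigtable_to_snowflake.yaml
</code></pre>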
<h2 id="heading-conclusion">Conclusion</h2>
<p>We have demonstrated an easy way to move your data from BigTable to Snowflake. Similar steps can be used for any other destination database supported by Sling. To see a list of compatible databases, please visit <a target="_blank" href="https://docs.slingdata.io/connections/database-connections">this page</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Exporting Data from BigQuery to Snowflake, the Easy Way]]></title><description><![CDATA[Cloud Data Warehouses
In the past few years, we've seen a rapid growth in the usage of cloud data warehouses (as well as the "warehouse-first" paradigm). Two popular cloud DWH platforms are BigQuery and Snowflake. Check out the chart below to see the...]]></description><link>https://blog.slingdata.io/bigquery-to-snowflake</link><guid isPermaLink="true">https://blog.slingdata.io/bigquery-to-snowflake</guid><category><![CDATA[bigquery]]></category><category><![CDATA[data integration]]></category><category><![CDATA[ELT]]></category><category><![CDATA[ETL]]></category><category><![CDATA[extract]]></category><category><![CDATA[load]]></category><category><![CDATA[Sling]]></category><category><![CDATA[snowflake]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Fri, 11 Nov 2022 20:51:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1668200665148/SATSDoyfO.jpeg?auto=compress" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-cloud-data-warehouses">Cloud Data Warehouses</h2>
<p>In the past few years, we've seen a rapid growth in the usage of cloud data warehouses (as well as the "warehouse-first" paradigm). Two popular cloud DWH platforms are BigQuery and Snowflake. Check out the chart below to see their evolution over time.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668200756198/TKaTCq2bG.jpeg?auto=compress" alt="Image: Gartner via Adam Ronthal (@aronthal) on Twitter." />
<em>Image: Gartner via Adam Ronthal (@aronthal) on <a target="_blank" href="https://twitter.com/ARonthal/status/1514595630072619014/photo/1">Twitter</a>.</em></p>
<p>BigQuery, standing at #4 as of 2021, is a fully-managed, serverless data warehouse service offered by Google Cloud Platform (GCP). It enables easy and scalable analysis over petabytes of data and has long been known for its ease of use and maintenance-free nature.</p>
<p>Snowflake is a similar service offered by the company Snowflake Inc. One of the principal differences is that Snowflake allows you to host the instance in either Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform (GCP). This is a great advantage if you are already established in a non-GCP environment.</p>
<h2 id="heading-exporting-and-loading-the-data">Exporting and Loading the data</h2>
<p>As circumstances have it, it is sometimes necessary or desirable to copy data from a BigQuery environment into a Snowflake environment. Let's break down the various logical steps needed to move this data properly, since neither of the competing services has an integrated function to do this easily. For the sake of our example, we will assume that our destination Snowflake environment is hosted on AWS.</p>
<h3 id="heading-step-by-step-procedure">Step By Step Procedure</h3>
<p>In order to migrate data from BigQuery to Snowflake (AWS), these are the essential steps:</p>
<ol>
<li>Identify the table or query, and execute an <code>EXPORT DATA OPTIONS</code> query to export to Google Cloud Storage (GCS).</li>
<li>Run a script in a VM or on a local machine to copy the GCS data to Snowflake's internal stage. We could also read straight from GCS with a storage integration (which may be preferable for your use case), but this involves another layer of secure access configuration.</li>
<li>Manually generate <code>CREATE TABLE</code> DDL with correct column data types and execute in Snowflake.</li>
<li>Execute a <code>COPY</code> query in Snowflake to import staged files.</li>
<li>Optionally clean up (delete) temporary data in GCP and Internal Stage.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668201625003/9l4DKdJj9.png" alt />
<em>Image: Steps to manually export from BigQuery to Snowflake.</em></p>
<p>As demonstrated above, several steps involving independent systems are needed to make this happen, as sketched below. This can be cumbersome to automate, especially generating the correct DDL (#3) with the proper column types in the destination system (which I personally find the most burdensome; try doing this for tables with 50+ columns).</p>
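<p>To make those steps concrete, here is a rough sketch of the manual route using the standard <code>bq</code>, <code>gsutil</code> and <code>snowsql</code> CLIs. The bucket, stage and table names are hypothetical placeholders, and the DDL in step 3 still has to be written by hand:</p>
<pre><code class="lang-bash"># 1. export the BigQuery table to GCS as compressed CSV
bq query --use_legacy_sql=false "
  EXPORT DATA OPTIONS(
    uri='gs://my-export-bucket/activity/*.csv.gz',
    format='CSV', compression='GZIP', header=true)
  AS SELECT * FROM public.activity"

# 2. pull the files down, then push them to a Snowflake internal stage
gsutil -m cp 'gs://my-export-bucket/activity/*.csv.gz' /tmp/activity/
snowsql -q "PUT file:///tmp/activity/*.csv.gz @~/activity"

# 3. manually write and run the CREATE TABLE DDL with matching column types

# 4. import the staged files
snowsql -q "COPY INTO public.activity FROM @~/activity
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 COMPRESSION = GZIP)"
</code></pre>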
<p>Fortunately, there is an easier way to do this, and it is by using a nifty tool called <em>Sling</em>. Sling is a data integration tool which allows easy and efficient movement of data (Extract &amp; Load) from/to Databases and Storage Platforms. There are two ways of using it: Sling CLI &amp; Sling Cloud. We will perform the same procedure as above, simply by providing inputs to <code>sling</code>; it will handle the intricate steps for us automatically!</p>
<h3 id="heading-using-sling-cli">Using Sling CLI</h3>
<p>If you are a fanatic of the command line, Sling CLI is for you. It is built in <code>go</code> (which makes it super-fast), and it works with files and databases. It can also work with Unix pipes (reading from standard input and writing to standard output); see the quick sketch after the help output below.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On Mac</span>
brew install slingdata-io/sling/sling

<span class="hljs-comment"># On Windows Powershell</span>
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

<span class="hljs-comment"># Using Python Wrapper via pip</span>
pip install sling
</code></pre>
<p>Please see <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a> for other installation options (including Linux). There is also a <a target="_blank" href="https://pypi.org/project/sling/">Python wrapper</a> library, which is useful if you prefer interacting with Sling inside of Python.</p>
<p>Once installed, we should be able to run the <code>sling</code> command, which should give us this output:</p>
<pre><code class="lang-bash">sling - An Extract-Load tool | https://slingdata.io
Slings data from a data <span class="hljs-built_in">source</span> to a data target.
Version 0.86.52

  Usage:
    sling [conns|run|update]

  Subcommands:
    conns    Manage <span class="hljs-built_in">local</span> connections
    run      Execute an ad-hoc task
    update   Update Sling to the latest version

  Flags:
       --version   Displays the program version string.
    -h --<span class="hljs-built_in">help</span>      Displays <span class="hljs-built_in">help</span> with available flag, subcommand, and positional value parameters.
</code></pre>
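<p>As a quick aside, here is what the Unix-pipe workflow mentioned above can look like. A minimal sketch, assuming a <code>MY_PG</code> connection is configured; the file and table names are placeholders:</p>
<pre><code class="lang-bash"># load a local CSV from stdin into a Postgres table
cat users.csv | sling run --tgt-conn MY_PG --tgt-object public.users

# or read a table and write CSV to stdout
sling run --src-conn MY_PG --src-stream public.users --stdout > users.csv
</code></pre>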
<p>Now there are many ways to <a target="_blank" href="https://docs.slingdata.io/sling-cli/configuration">configure tasks</a>, but for our scope in this article, we first need to add connection credentials for BigQuery and Snowflake (a one-time chore). We can do this by opening the file <code>~/.sling/env.yaml</code> and adding the credentials, which should look like this:</p>
<details>
<summary><strong>~/.sling/env.yaml</strong></summary>

<pre><code>connections:

  BIGQUERY:
    type: bigquery
    project: sling-project-123
    location: US
    dataset: public
    gc_key_file: ~/.sling/sling-project-123-ce219ceaef9512.json
    gc_bucket: sling_us_bucket # this is optional but recommended for bulk export. 

  SNOWFLAKE:
    type: snowflake
    username: fritz
    password: my_pass23
    account: abc123456.us-east-1
    database: sling
    schema: public</code></pre>

</details>

<p>Great, now let's test our connections:</p>
<pre><code class="lang-bash">$ sling conns list
+------------+------------------+-----------------+
| CONN NAME  | CONN TYPE        | SOURCE          |
+------------+------------------+-----------------+
| BIGQUERY   | DB - BigQuery    | sling env yaml  |
| SNOWFLAKE  | DB - Snowflake   | sling env yaml  |
+------------+------------------+-----------------+

$ sling conns <span class="hljs-built_in">test</span> BIGQUERY
6:42PM INF success!

$ sling conns <span class="hljs-built_in">test</span> SNOWFLAKE
6:42PM INF success!
</code></pre>
<p>Fantastic, now that we have our connections set up, we can run our task:</p>
<pre><code class="lang-bash">$ sling run --src-conn BIGQUERY --src-stream <span class="hljs-string">"select user.name, activity.* from public.activity join public.user on user.id = activity.user_id where user.type != 'external'"</span> --tgt-conn SNOWFLAKE --tgt-object <span class="hljs-string">'public.activity_user'</span> --mode full-refresh
11:37AM INF connecting to <span class="hljs-built_in">source</span> database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from <span class="hljs-built_in">source</span> database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_user
11:38AM INF created table public.activity_user
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded
</code></pre>
<p>Wow, that was easy! Sling automatically did all the steps we described earlier. We can even export the Snowflake data back to our shell stdout (in CSV format) by providing just the table identifier (<code>public.activity_user</code>) to the <code>--src-stream</code> flag, and count the lines to validate our data:</p>
<pre><code class="lang-bash">$ sling run --src-conn SNOWFLAKE --src-stream public.activity_user --stdout | wc -l
11:39AM INF connecting to <span class="hljs-built_in">source</span> database (snowflake)
11:39AM INF reading from <span class="hljs-built_in">source</span> database
11:39AM INF writing to target stream (stdout)
11:39AM INF wrote 77668 rows
11:39AM INF execution succeeded
77669 <span class="hljs-comment"># CSV output includes a header row (77668 + 1)</span>
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We are in an era where data is gold, and moving data from one platform to another shouldn't be difficult. As we have demonstrated, Sling offers a powerful alternative by reducing the friction associated with data integration. We'll cover how to export from Snowflake and load into BigQuery in another post.</p>
]]></content:encoded></item><item><title><![CDATA[Sling CLI Connection Management]]></title><description><![CDATA[Introduction to Sling CLI
Sling CLI is a command line tool which allows easy and efficient movement of data (Extract & Load) from/to Databases and Storage Platforms. It's trivial to get started, you can simply run pip install sling if you have Python...]]></description><link>https://blog.slingdata.io/sling-cli-connection-management</link><guid isPermaLink="true">https://blog.slingdata.io/sling-cli-connection-management</guid><category><![CDATA[connection]]></category><category><![CDATA[#Connector                           ]]></category><category><![CDATA[ELT]]></category><category><![CDATA[ETL]]></category><category><![CDATA[extract]]></category><category><![CDATA[Go Language]]></category><category><![CDATA[load]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Fri, 11 Nov 2022 20:45:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1668199126999/FvZjV-PzW.png?auto=compress" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction-to-sling-cli">Introduction to Sling CLI</h2>
<p>Sling CLI is a command line tool which allows easy and efficient movement of data (Extract &amp; Load) from/to Databases and Storage Platforms. It's trivial to get started: simply run <code>pip install sling</code> if you have Python's pip installed, or download the binary for your machine <a target="_blank" href="https://docs.slingdata.io/sling-cli/getting-started">here</a>.</p>
<h2 id="heading-connection-credentials">Connection Credentials</h2>
<p>In order to use <code>sling</code>, we must first configure connection credentials, and Sling CLI looks for them in various places. This allows a “plug &amp; play” experience if you are already using another tool such as <code>dbt</code> or have connection URLs set in environment variables. It is, however, recommended to use Sling’s <code>env.yaml</code> file, as it provides a more consistent and flexible experience.</p>
<h3 id="heading-sling-env-file">Sling Env File</h3>
<p>The first time you run the <code>sling</code> command, the <code>.sling</code> folder is created in the current user’s home directory (<code>~/.sling</code>), which in turn holds a file called <code>env.yaml</code>. The structure of Sling’s Env file is simple: you put your connections’ credentials under the <code>connections</code> key, as shown below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">connections:</span>
  <span class="hljs-attr">marketing_pg:</span>
    <span class="hljs-attr">url:</span> <span class="hljs-string">'postgres://...'</span> 
    <span class="hljs-attr">ssh_tunnel:</span> <span class="hljs-string">'ssh://...'</span> <span class="hljs-comment"># optional</span>

  <span class="hljs-comment"># or dbt profile styled</span>
  <span class="hljs-attr">marketing_pg:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">postgres</span>        
    <span class="hljs-attr">host:</span> [<span class="hljs-string">hostname</span>]      
    <span class="hljs-attr">user:</span> [<span class="hljs-string">username</span>]      
    <span class="hljs-attr">password:</span> [<span class="hljs-string">password</span>]  
    <span class="hljs-attr">port:</span> [<span class="hljs-string">port</span>]          
    <span class="hljs-attr">dbname:</span> [<span class="hljs-string">database</span> <span class="hljs-string">name</span>]
    <span class="hljs-attr">schema:</span> [<span class="hljs-string">dbt</span> <span class="hljs-string">schema</span>]  
    <span class="hljs-attr">ssh_tunnel:</span> <span class="hljs-string">'ssh://...'</span> 

  <span class="hljs-attr">finance_bq:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">bigquery</span>
    <span class="hljs-attr">method:</span> <span class="hljs-string">service-account</span>
    <span class="hljs-attr">project:</span> [<span class="hljs-string">GCP</span> <span class="hljs-string">project</span> <span class="hljs-string">id</span>]
    <span class="hljs-attr">dataset:</span> [<span class="hljs-string">the</span> <span class="hljs-string">name</span> <span class="hljs-string">of</span> <span class="hljs-string">your</span> <span class="hljs-string">dbt</span> <span class="hljs-string">dataset</span>]
    <span class="hljs-attr">keyfile:</span> [<span class="hljs-string">/path/to/bigquery/keyfile.json</span>]

<span class="hljs-comment"># global variables, available to all connections at runtime (optional)</span>
<span class="hljs-attr">variables:</span>
  <span class="hljs-attr">aws_access_key:</span> <span class="hljs-string">'...'</span>
  <span class="hljs-attr">aws_secret_key:</span> <span class="hljs-string">'...'</span>
</code></pre>
<p>Please see <a target="_blank" href="https://docs.slingdata.io/connections/database-connections">here</a> for all the accepted connection types and their respective required fields.</p>
<p>When using the <code>sling conns list</code> command with Sling Env credentials, the <code>SOURCE</code> column will show as <code>sling env yaml</code>.</p>
<h3 id="heading-environment-variables">Environment variables</h3>
<p>If you’d rather use environment variables, it suffices to set them in your shell environment the usual way:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Mac / Linux</span>
<span class="hljs-built_in">export</span> MY_PG=<span class="hljs-string">'postgresql://user:mypassw@pg.host:5432/db1'</span>
<span class="hljs-built_in">export</span> MY_SNOWFLAKE=<span class="hljs-string">'snowflake://user:mypassw@sf.host/db1'</span>
<span class="hljs-built_in">export</span> ORACLE_DB=<span class="hljs-string">'oracle://user:mypassw@orcl.host:1521/db1'</span>

<span class="hljs-comment"># Windows Powershell</span>
<span class="hljs-built_in">set</span> MY_PG <span class="hljs-string">'postgresql://user:mypassw@pg.host:5432/db1'</span>
<span class="hljs-built_in">set</span> MY_SNOWFLAKE <span class="hljs-string">'snowflake://user:mypassw@sf.host/db1'</span>
<span class="hljs-built_in">set</span> ORACLE_DB <span class="hljs-string">'oracle://user:mypassw@orcl.host:1521/db1'</span>
</code></pre>
<p>When using the <code>sling conns list</code> command with environment variables, the <code>SOURCE</code> column will show as <code>env variable</code>.</p>
<h3 id="heading-dbt-profiles">DBT Profiles</h3>
<p><code>dbt</code> is another popular tool that many data professionals use on a daily basis, and supporting existing local profiles allows easy cross-use. The typical location for the dbt credentials is the <code>~/.dbt/profiles.yml</code> file. See <a target="_blank" href="https://docs.getdbt.com/dbt-cli/configure-your-profile">here</a> for more details.</p>
<p>If you have <code>dbt</code> credentials in place and use the <code>sling conns list</code> command, the <code>SOURCE</code> column will show as <code>dbt profiles yaml</code>.</p>
<h2 id="heading-the-conns-sub-command">The <code>conns</code> Sub-Command</h2>
<p>Now that you have credentials set, sling offers a <code>conns</code> sub-command to interact with the connections. We can perform the following operations: <code>list</code>, <code>test</code> and <code>discover</code>.</p>
<pre><code class="lang-bash">$ sling conns -h
conns - Manage <span class="hljs-built_in">local</span> connections

  Usage:
    conns [discover|list|<span class="hljs-built_in">test</span>]

  Subcommands:
    discover   list available streams <span class="hljs-keyword">in</span> connection
    list       list <span class="hljs-built_in">local</span> connections detected
    <span class="hljs-built_in">test</span>       <span class="hljs-built_in">test</span> a <span class="hljs-built_in">local</span> connection

  Flags:
       --version   Displays the program version string.
    -h --<span class="hljs-built_in">help</span>      Displays <span class="hljs-built_in">help</span> with available flag, subcommand, and positional value parameters.
</code></pre>
<h3 id="heading-listing-connections">Listing Connections</h3>
<p>It's convenient to see and list all connections available in our environment. We can simply run the <code>sling conns list</code> command. Here is an example:</p>
<pre><code class="lang-bash">$ sling conns list
+----------------------+------------------+-------------------+
| CONN NAME            | CONN TYPE        | SOURCE            |
+----------------------+------------------+-------------------+
| AWS_S3               | FileSys - S3     | sling env yaml    |
| AZURE_STORAGE        | FileSys - Azure  | sling env yaml    |
| BIGQUERY             | DB - BigQuery    | sling env yaml    |
| BIONIC_DB1           | DB - PostgreSQL  | dbt profiles yaml |
| BTD_S3               | FileSys - S3     | sling env yaml    |
| CLICKHOUSE           | DB - Clickhouse  | sling env yaml    |
| DEMO_POSTGRES        | DB - PostgreSQL  | sling env yaml    |
| SNOWFLAKE            | DB - Snowflake   | env variable      |
| STEAMPIPE            | DB - PostgreSQL  | sling env yaml    |
+----------------------+------------------+-------------------+
</code></pre>
<h3 id="heading-testing-connections">Testing Connections</h3>
<p>The Sling CLI tool also allows testing connections. Once we know the connection name, we can use the <code>sling conns test</code> command:</p>
<pre><code>$ sling conns test -h
test - test a local connection

  <span class="hljs-attr">Usage</span>:
    test [name]

  Positional Variables:
    name   The name <span class="hljs-keyword">of</span> the connection to test (Required)
  <span class="hljs-attr">Flags</span>:
       --version   Displays the program version string.
    -h --help      Displays help <span class="hljs-keyword">with</span> available flag, subcommand, and positional value parameters.
</code></pre><p>Here is an actual example:</p>
<pre><code>$ sling conns test MSSQL
<span class="hljs-number">6</span>:<span class="hljs-number">42</span>PM INF success!
</code></pre><h3 id="heading-discovering-connection-streams">Discovering Connection Streams</h3>
<p>This is another nifty sub-command, <code>sling conns discover</code>, which allows one to see which data streams are available for <code>sling</code> to read from a particular connection.</p>
<pre><code>$ sling conns discover -h
discover - list available streams <span class="hljs-keyword">in</span> connection

  <span class="hljs-attr">Usage</span>:
    discover [name]

  Positional Variables:
    name   The name <span class="hljs-keyword">of</span> the connection to test (Required)
  <span class="hljs-attr">Flags</span>:
       --version   Displays the program version string.
    -h --help      Displays help <span class="hljs-keyword">with</span> available flag, subcommand, and positional value parameters.
    -f --filter    filter stream name by pattern (e.g. account_*)
       --folder    discover streams <span class="hljs-keyword">in</span> a specific folder (<span class="hljs-keyword">for</span> file connections)
       --schema    discover streams <span class="hljs-keyword">in</span> a specific schema (<span class="hljs-keyword">for</span> database connections)
</code></pre><p>For database connections, it will list the available tables and views. For storage connections, it will list the non-recursive file objects located in the specified source folder. Below are some examples.</p>
<h4 id="heading-database-connections">Database Connections</h4>
<pre><code>$ sling conns discover CLICKHOUSE
<span class="hljs-number">6</span>:<span class="hljs-number">57</span>PM INF Found <span class="hljs-number">68</span> streams:
 - <span class="hljs-string">"default"</span>.<span class="hljs-string">"docker_logs"</span>
 - <span class="hljs-string">"default"</span>.<span class="hljs-string">"sling_docker_logs"</span>
 - <span class="hljs-string">"system"</span>.<span class="hljs-string">"aggregate_function_combinators"</span>
 - <span class="hljs-string">"system"</span>.<span class="hljs-string">"asynchronous_metric_log"</span>
 - <span class="hljs-string">"system"</span>.<span class="hljs-string">"asynchronous_metrics"</span>
 - <span class="hljs-string">"system"</span>.<span class="hljs-string">"build_options"</span>
 - <span class="hljs-string">"system"</span>.<span class="hljs-string">"clusters"</span>
 ....
</code></pre><p>If we want to filter for a specific schema, we can do:</p>
<pre><code>$ sling conns discover CLICKHOUSE --schema <span class="hljs-keyword">default</span>
<span class="hljs-number">8</span>:<span class="hljs-number">29</span>PM INF Found <span class="hljs-number">2</span> streams:
 - <span class="hljs-string">"default"</span>.<span class="hljs-string">"docker_logs"</span>
 - <span class="hljs-string">"default"</span>.<span class="hljs-string">"sling_docker_logs"</span>
</code></pre><h4 id="heading-storage-connections">Storage Connections</h4>
<pre><code>$ sling conns discover AWS_S3
<span class="hljs-number">6</span>:<span class="hljs-number">52</span>PM INF Found <span class="hljs-number">7</span> streams:
 - s3:<span class="hljs-comment">//my-sling-bucket/logging/</span>
 - s3:<span class="hljs-comment">//my-sling-bucket/part.01.0001.csv</span>
 - s3:<span class="hljs-comment">//my-sling-bucket/sling/</span>
 - s3:<span class="hljs-comment">//my-sling-bucket/temp/</span>
 - s3:<span class="hljs-comment">//my-sling-bucket/test.fs.write/</span>
 - s3:<span class="hljs-comment">//my-sling-bucket/test/</span>
 - s3:<span class="hljs-comment">//my-sling-bucket/test_1000.csv</span>
</code></pre><p>If we want to see the files in a sub-folder, we can do this:</p>
<pre><code>$ sling conns discover AWS_S3 --folder s3:<span class="hljs-comment">//my-sling-bucket/logging/</span>
<span class="hljs-number">6</span>:<span class="hljs-number">55</span>PM INF Found <span class="hljs-number">1</span> streams:
 - s3:<span class="hljs-comment">//my-sling-bucket/logging/1/1.log.gz</span>
</code></pre><h2 id="heading-running-el-tasks">Running EL Tasks</h2>
<p>Now that your connections are set, you are ready to run some Extract and Load tasks! We cover this in detail in a separate post; you can read about it <a target="_blank" href="https://docs.slingdata.io/sling-cli/running-tasks">here</a>. From the command line, you can also run <code>sling run -e</code>, which will print a bunch of examples; a quick illustrative task is shown below.</p>
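<p>For instance, an ad-hoc task using two of the connections listed earlier might look like this (the table name is just a placeholder):</p>
<pre><code class="lang-bash">sling run \
  --src-conn DEMO_POSTGRES \
  --src-stream public.orders \
  --tgt-conn SNOWFLAKE \
  --tgt-object public.orders \
  --mode full-refresh
</code></pre>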
<h2 id="heading-conclusion">Conclusion</h2>
<p>All things considered, Sling CLI makes it easy to manage and interact with various types of connections from your shell. If you have any questions or suggestions, feel free to contact us <a target="_blank" href="https://slingdata.io/contact">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Introduction to Sling]]></title><description><![CDATA[Sling is a go-powered, modern data integration tool Extracting and Loading data from popular data sources to destinations with high performance and ease.

Why Sling?

Blazing fast performance - Core engine is written in Go and adopts a streaming desi...]]></description><link>https://blog.slingdata.io/introduction-to-sling</link><guid isPermaLink="true">https://blog.slingdata.io/introduction-to-sling</guid><category><![CDATA[ELT]]></category><category><![CDATA[ETL]]></category><category><![CDATA[extract]]></category><category><![CDATA[Go Language]]></category><category><![CDATA[load]]></category><dc:creator><![CDATA[Fritz Larco]]></dc:creator><pubDate>Fri, 11 Nov 2022 20:34:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1668198162819/jq1IWtIcA.jpg?auto=compress" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Sling is a Go-powered, modern data integration tool for Extracting and Loading data from popular data sources to destinations with high performance and ease.</p>
</blockquote>
<h3 id="heading-why-sling">Why Sling?</h3>
<ul>
<li><strong>Blazing fast performance</strong> - Core engine is written in Go and adopts a streaming design, making it super efficient by holding minimal data in memory.</li>
<li><strong>Replicate data quickly</strong> - Easily replicate data from a source database or file connection to a destination database or file.</li>
<li><strong>Transparent &amp; Low Cost</strong> - Sling operates on an efficient and low-cost model. Our goal is to be transparent with the cost of using our platform.</li>
</ul>
<p><a target="_blank" href="https://docs.slingdata.io">Learn more about how Sling works</a></p>
]]></content:encoded></item></channel></rss>