Introducing Sling API Specs

Sling has always made it easy to move data between databases and file systems. Postgres to Snowflake, S3 to BigQuery, MySQL to Iceberg. One YAML file, one command, done. But when people asked us about pulling data from REST APIs, the answer used to be "you'll need a separate tool for that."

Not anymore.

API specs are Sling's way of describing a REST API in YAML. Once you have a spec, the API behaves like any other source. You point a replication at it, and Sling pulls the data into Postgres, Snowflake, S3, or whatever target you want. Same sling run command. Same configuration style. Same incremental sync, deduplication, and column typing you already use.

This post walks through what an API spec looks like, how it works under the hood, and how to use the official Stripe spec to load Stripe data into a database in about three minutes.

Why YAML for APIs

REST APIs all do roughly the same thing: send an HTTP request, parse the response, follow pagination links, repeat. The boring parts are nearly identical across providers. What changes is the URL pattern, the auth method, the JSON shape, and whatever pagination convention that particular vendor settled on years ago.

A spec captures those differences in one file. Once it's written, you don't think about it again.

name: "Example API"
description: "A minimal Sling API spec"

defaults:
  state:
    base_url: https://api.example.com/v1

  request:
    headers:
      Accept: "application/json"
      Authorization: "Bearer {secrets.api_token}"

endpoints:
  users:
    request:
      url: '{state.base_url}/users'
      parameters:
        page: '{state.page}'
        limit: '{state.limit}'

    state:
      page: 1
      limit: 100

    pagination:
      next_state:
        page: '{state.page + 1}'
      stop_condition: "length(response.records) < state.limit"

    response:
      records:
        jmespath: "data.users[]"
        primary_key: ["id"]

A few things to notice:

defaults applies to every endpoint unless an endpoint overrides it. Auth headers go here so you don't repeat them.
state is mutable per endpoint. Each request can read and update it, which is how pagination works.
pagination.next_state describes how to advance. stop_condition describes when to stop.
response.records.jmespath tells Sling where the actual rows live in the JSON payload.
primary_key lets Sling deduplicate when you sync incrementally.

That's most of the surface area. There's more (queues, iterate blocks, OAuth2, sync state, response rules), but you can write a working spec without any of it.

A real example: Stripe

Sling ships with an official Stripe spec, so you don't actually have to write one yourself. But the Stripe spec is also a good way to see how the moving parts fit together, because Stripe touches almost every feature of the format. Cursor pagination, incremental sync, parent-child endpoints, queues, response processors.

Here's the top of stripe.yaml:

name: "Stripe API"
description: "API for extracting data from Stripe payment processing platform."

queues:
  - credit_note_ids
  - customer_ids
  - invoice_ids

defaults:
  state:
    base_url: https://api.stripe.com/v1
    limit: 100
    anchor_date: '{coalesce(date_parse(inputs.anchor_date), date_add(now(), -1, "year"))}'
    anchor_unix: '{date_format(state.anchor_date, "%s")}'

  request:
    method: "GET"
    headers:
      Accept: "application/json"
      Authorization: Bearer {secrets.api_key}
      Stripe-Version: '{coalesce(secrets.api_version, "2023-10-16")}'
    parameters:
      limit: '{state.limit}'
      starting_after: '{state.starting_after}'
      created[gt]: '{state.created_gt}'
    rate: 20
    concurrency: 5

  response:
    records:
      jmespath: "data[]"
      primary_key: ["id"]

  pagination:
    next_state:
      starting_after: '{jmespath(response.records, "[-1].id")}'
    stop_condition: jmespath(response.json, "has_more") == false || length(response.records) == 0

The defaults block does a lot of work. Every endpoint inherits the auth header, the rate limit (20 req/sec), the concurrency (5 parallel requests), the pagination strategy (Stripe's starting_after cursor), and the basic response shape. Endpoints below this can be tiny.

Look at customer:

customer:
  description: "Retrieve list of customers"
  docs: https://docs.stripe.com/api/customers

  state:
    created_gt: '{coalesce(sync.last_created, state.anchor_unix)}'

  sync: [last_created]

  request:
    url: '{state.base_url}/customers?expand[]=data.discount'

  response:
    records:
      jmespath: "data[]"
      primary_key: ["id"]
      update_key: "created"

    processors:
      - expression: "date_parse(record.created)"
        output: "record.created_date"

      - expression: "record.created"
        output: "state.last_created"
        aggregation: "maximum"

      - expression: "record.id"
        output: "queue.customer_ids"

A handful of things are happening in this block:

created_gt reads from sync.last_created, which Sling persists between runs. First run uses the anchor date (one year ago by default). Every run after picks up where the last left off.
processors run on each record. The first one parses Stripe's Unix timestamp into a real date. The second tracks the maximum created value so the next run knows where to resume.
The third processor pushes every customer ID onto a queue, which a different endpoint can consume.

The queue trick is how you handle parent-child relationships. customer_balance_transaction reads from that queue:

customer_balance_transaction:
  description: "Retrieve customer balance transactions"
  docs: https://docs.stripe.com/api/customers/balance_transactions

  iterate:
    over: "queue.customer_ids"
    into: "state.customer_id"

  request:
    url: '{state.base_url}/customers/{state.customer_id}/balance_transactions'

  response:
    processors:
      - expression: "state.customer_id"
        output: "record.customer_id"

For each customer ID that came out of the parent endpoint, Sling fires a request to fetch that customer's balance transactions, then stamps the customer ID onto each row so you can join later. No glue scripts, no Python loop. The spec describes the relationship and Sling executes it.

Running it

You don't need to write any of this to use Stripe. The spec ships with Sling. You just point a connection at it.

Add an entry to ~/.sling/env.yaml:

connections:
  stripe:
    type: api
    spec: stripe
    secrets:
      api_key: sk_live_xxxxxxxxxxxxxxxx

  my_postgres:
    url: postgres://user:pass@host:5432/mydb

Test the connection:

$ sling conns test stripe

This downloads the latest Stripe spec from the Sling registry, runs a single-record probe against every endpoint, and reports back. On my machine it took about 90 seconds against a real Stripe account. The output looks like this:

INF testing endpoint: "customer_subscription_update"
INF testing endpoint: "customer_update"
INF testing endpoint: "invoice_update"
INF testing endpoint: "payment_method_update"
INF testing endpoint: "payout_update"
INF testing endpoint: "refund_update"
INF testing endpoint: "invoice_line_item"
INF testing endpoint: "setup_attempt"
INF testing endpoint: "subscription_item"
INF success!

You can list available endpoints:

$ sling conns discover stripe

+----+------------------------------+
|  # | ENDPOINT                     |
+----+------------------------------+
|  1 | account                      |
|  2 | charge                       |
|  3 | coupon                       |
|  4 | credit_note                  |
|  5 | customer                     |
|  6 | customer_balance_transaction |
|  7 | dispute                      |
|  8 | event                        |
|  9 | invoice                      |
| 10 | invoice_item                 |
...

Now write a replication. This file syncs every Stripe endpoint into Postgres tables under the stripe schema:

source: stripe
target: my_postgres

defaults:
  object: stripe.{stream_name}
  mode: incremental
  primary_key: [id]

streams:
  '*':

The '*' wildcard means "every endpoint." Sling figures out which ones support incremental sync (most of them) and which don't (account info, plans, products), and behaves accordingly.

$ sling run -r stripe_to_postgres.yaml

If you want a single endpoint instead, you can skip the replication file:

$ sling run --src-conn stripe --src-stream customer \
            --tgt-conn my_postgres --tgt-object stripe.customer \
            --mode full-refresh

I ran this against a small test account and got 5 rows in about a second:

INF connecting to source api (stripe)
INF writing to target file system (file)
INF wrote 5 rows to file:///tmp/stripe-customers.csv in 1 secs [3 r/s]
INF execution succeeded

That's the whole loop. There's no client library to install and no pagination loop to write yourself.

Forking the Stripe spec

The Stripe spec has 30+ endpoints, but maybe you only care about charges and customers. Or maybe Stripe shipped a new endpoint last week and you don't want to wait. You can fork the spec.

After running sling conns test stripe once, the spec lives at ~/.sling/api/specs/stripe.yaml. Copy it:

$ cp ~/.sling/api/specs/stripe.yaml ./my_stripe.spec.yaml

Edit it however you want. Then point your connection at the local file:

connections:
  stripe:
    type: api
    spec: file://./my_stripe.spec.yaml
    secrets:
      api_key: sk_live_xxxxxxxxxxxx

Specs can also live on S3, GitHub, or any HTTP URL. If your team has internal APIs, drop the spec in a Git repo and reference it by URL. Everyone gets the same definition.

What else specs can do

The Stripe walkthrough covers the common case: a vendor REST API with cursor pagination and a stable schema. Specs handle the weirder cases too:

OAuth2 flows. Client credentials, refresh tokens, on-the-fly token exchange. The auth section in a spec can describe a multi-step login dance.
Dynamic endpoints. When the list of resources isn't known until you query the API (think Airtable bases, Salesforce custom objects), you can generate endpoints at runtime.
Response rules. Retry on 429s with exponential backoff. Stop on a specific error string. Skip records that match a condition.
Multiple response formats. JSON is the default but specs can parse CSV, XML, and JSON Lines too.
Setup and teardown sequences. Some APIs need you to create a job, poll it, and then download a result file. Specs can describe that flow as a sequence of requests.

The full reference is in the API spec docs. The structure, authentication, request, response, queue, and dynamic-endpoint pages each cover one corner of the format.

The official spec library

We're maintaining a growing list of official specs you can reference by ID. The current set includes Stripe, HubSpot, Salesforce, GitHub, Shopify, Airtable, Polygon, AWS SES, dbt Cloud, Gmail, and dozens more. The list is at docs.slingdata.io/connections/api-connections. The source for every official spec is in the sling-api-spec GitHub repo if you want to read them or contribute.

If a spec you need isn't there, you can write one yourself, or open a GitHub issue requesting it. We've been adding new ones every couple of weeks.

Why this matters

Most data teams I talk to have at least one Python script somewhere that hits a vendor API, paginates through results, and dumps them in a database. It usually started as a quick fix and turned into a thing somebody owns. The script breaks when the API adds a field, or when the auth token rotates, or when the cron host gets reimaged.

Specs are an attempt to make that kind of script unnecessary. Describe the API once. Run it like any other Sling job. When something breaks, fix the YAML, not the codebase.

You'll need a CLI Pro license or Platform Plan to use API specs, since they're a Pro feature. Everything else in Sling stays open source.

Try the Stripe spec on a test account and let me know what you think. The full docs are at docs.slingdata.io/concepts/api-specs, and the #sling Discord is the fastest way to ask questions or request a new connector.

Sling API Specs: Pull Data From Any REST API With YAML

Introducing Sling API Specs

Why YAML for APIs

A real example: Stripe

Running it

Forking the Stripe spec

What else specs can do

The official spec library

Why this matters

Comments

More from this blog

Extract data from Databases into DuckLake

Introducing the Sling Data Platform

Efficient Data Lake Management with Sling and Delta Lake

Reading Apache Iceberg Data with Sling

Command Palette

Introducing Sling API Specs

Why YAML for APIs

A real example: Stripe

Running it

Forking the Stripe spec

What else specs can do

The official spec library

Why this matters

Comments

More from this blog