Snowflake Stages and External Tables with JSON Blobs in AWS
snowflake, dbt, aws
A quick look at how to create a stage and external table in Snowflake. We look at AWS and JSON blobs in S3 storage.
Snowflake has a few ways to interact with external data; two of them are stages and external tables. The docs for most things in Snowflake are excellent, so I'm not going to copy them here — this is just a guide on how to set up large JSON blobs with these tools. We will be creating an External Table in Snowflake that queries the underlying JSON blobs residing in an AWS S3 bucket.
Setup
To get started we will be operating with AWS and Snowflake, so access to these would be required. Once you have these we will be working in steps:
- create an internal Stage and COPY INTO a table from this Stage
- create an External Stage
- create an External Table
To start we should create a PoC workspace; we will be using whichever warehouse is the default for your user for loading data:
```sql
USE ROLE sysadmin;
CREATE OR REPLACE DATABASE stages_and_external_tables_poc;
USE DATABASE stages_and_external_tables_poc;
CREATE SCHEMA IF NOT EXISTS external_poc;
USE SCHEMA external_poc;
```
Now we can get started.
1. Stages
Creating an internal stage is as simple as following the docs. We will be using some mock data I have pulled from mockaroo.
To start, here's the code to create the internal stage: we will create a minimal internal stage and copy the files in (we will add bells and whistles later).
Table Stages
Each table implicitly has a stage associated with it, named @%<table_name>. So we can just create a single-column table to store our JSON blobs, explicitly telling it we're using JSON blobs rather than CSVs. For the file paths below I'm using a Mac; adjust the paths for your own file structure:
```sql
CREATE OR REPLACE TABLE src_mock_data
(
    file_content VARIANT
)
STAGE_FILE_FORMAT = ( TYPE = JSON );

PUT file:///Users/toby.devlin/dev/data/mocks/mock_1.json @%src_mock_data;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_2.json @%src_mock_data;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_3.json @%src_mock_data;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_4.json @%src_mock_data;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_5.json @%src_mock_data;

LIST @%src_mock_data;
```
Now we can copy the files we've just uploaded into this table. Everything gets copied into the single VARIANT column we created; JSON can only be loaded into one VARIANT column, whereas CSVs can be split across multiple columns.
```sql
COPY INTO src_mock_data FROM @%src_mock_data;

SELECT * FROM src_mock_data;
```
Note that the copy doesn't remove the files from the stage, as we can see if we run LIST @%src_mock_data; again. There are ways to dictate how the staged files are treated and how semi-structured data is ingested; see the copy options for more info.
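As an illustration (not part of the original load above), a hedged sketch of a couple of those options: PURGE deletes staged files after a successful load, and STRIP_OUTER_ARRAY splits a top-level JSON array into one row per element.

```sql
-- Illustrative only: these format and copy options are examples of the knobs
-- available, not settings the earlier COPY INTO used.
COPY INTO src_mock_data
    FROM @%src_mock_data
    FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE)
    PURGE = TRUE
    ON_ERROR = ABORT_STATEMENT;
```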
Named Stages
Named stages are almost the same as table stages, but you have to define them explicitly. They live in a schema, show up when queried with SHOW STAGES, and their settings are shown with DESCRIBE STAGE.
```sql
CREATE OR REPLACE STAGE poc_internal_stage FILE_FORMAT = ( TYPE = JSON );

SHOW STAGES;
DESCRIBE STAGE poc_internal_stage;

PUT file:///Users/toby.devlin/dev/data/mocks/mock_1.json @poc_internal_stage;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_2.json @poc_internal_stage;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_3.json @poc_internal_stage;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_4.json @poc_internal_stage;
PUT file:///Users/toby.devlin/dev/data/mocks/mock_5.json @poc_internal_stage;

LIST @poc_internal_stage;
```
Again, we can copy these files into a table, but we will need to create the table if one doesn't exist. This example also shows COPY with a SELECT statement, allowing arbitrary transformations to be executed. $n is the "column", always 1 in our JSON use case, and :<element> references a path within the JSON itself. We could also use selectors for specific files in the stage, such as data.$1:id FROM @poc_internal_stage/mock_1 AS data, for example (a sketch of this appears after the next block).
```sql
CREATE OR REPLACE TABLE src_mock_data_2
(
    file_content     VARIANT,
    file_name        STRING,
    record_id        INT,
    record_last_name STRING
);

COPY INTO src_mock_data_2
    FROM (SELECT $1, metadata$filename, $1:id, $1:last_name FROM @poc_internal_stage);

SELECT * FROM src_mock_data_2;
```
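Here's that per-file selector as a rough sketch. One assumption: PUT auto-compresses uploads by default, so the staged file name likely carries a .gz suffix — check the LIST output for the exact names.

```sql
-- Query a single staged file directly, relying on the stage's JSON file format.
-- The mock_1.json.gz path assumes PUT's default AUTO_COMPRESS renamed the file;
-- adjust to whatever LIST @poc_internal_stage reports.
SELECT t.$1:id::INT           AS record_id,
       t.$1:last_name::STRING AS record_last_name
FROM @poc_internal_stage/mock_1.json.gz AS t;
```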
2. External Stages
As we progress to externalizing the stages, the concept remains the same: we have files with data, and we want that data in a table. However, now these files may live in a cloud object store. I'm most familiar with AWS, so I will create an S3 bucket to store these objects. Below is the Terraform for creating a private bucket; it also uploads the local file data using the aws_s3_object resource. Be sure to replace the file paths as above.
Note: you will need to provide the Terraform user with policies in line with IAMFullAccess and AmazonS3FullAccess to complete the next steps, either directly or via an assumed role; I'm using a shortcut and attaching these to the user.
```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.21.0"
    }
  }
}

provider "aws" {
  region     = "us-east-1"
  access_key = "" # replace me
  secret_key = ""
}

module "s3_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "3.3.0"

  bucket = "poc-snowflake-external-stage-bucket"
  acl    = "private"

  versioning = {
    enabled = false
  }
}

resource "aws_s3_object" "object" {
  bucket = module.s3_bucket.s3_bucket_id
  key    = "mock_${count.index + 1}.json"
  source = "/Users/toby.devlin/dev/data/mocks/mock_${count.index + 1}.json"
  etag   = filemd5("/Users/toby.devlin/dev/data/mocks/mock_${count.index + 1}.json")
  count  = 5
}
```
Now we have the bucket up and running, we can create the stage in Snowflake and grant it access. Along with the bucket, we will also create an IAM user to interact with the bucket rather than use our Terraform provisioner user. This step is optional: if you want to use your admin user, you can, provided it has the permissions.
Note: the file setup_stage.sql will contain the secret access keys for this user. You should ensure the values in this file remain secret; in the real world it may be worth placing them into something like AWS Secrets Manager or manually creating the IAM creds, as there are some security flaws in this shortcut approach.
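One commonly recommended alternative, which this post doesn't use, is a Snowflake storage integration: the stage assumes an IAM role instead of embedding long-lived keys in a SQL file. A rough sketch, where the role ARN is a placeholder you would create and trust separately in AWS:

```sql
-- Sketch only: needs a privileged role (e.g. ACCOUNTADMIN) and an IAM role
-- already configured to trust your Snowflake account; the ARN is a placeholder.
CREATE STORAGE INTEGRATION poc_s3_integration
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<snowflake_access_role>'
    STORAGE_ALLOWED_LOCATIONS = ('s3://poc-snowflake-external-stage-bucket/');

-- The stage would then use STORAGE_INTEGRATION = poc_s3_integration in place
-- of the CREDENTIALS clause shown later on.
```

For this PoC, though, we'll stick with the key-based shortcut: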
```hcl
resource "aws_iam_user" "snowflake_s3_accessor" {
  name = "snowflake_s3_accessor"
  path = "/snowflake/"
}

resource "aws_iam_access_key" "snowflake_s3_accessor" {
  user = aws_iam_user.snowflake_s3_accessor.id
}

data "aws_iam_policy_document" "snowflake_s3_access" {
  statement {
    sid = "${replace(title(module.s3_bucket.s3_bucket_id), "-", "")}Access"
    actions = [
      "s3:ListBucket",
      "s3:GetBucketLocation"
    ]
    resources = [
      "arn:aws:s3:::${module.s3_bucket.s3_bucket_id}",
    ]
  }

  statement {
    sid = "${replace(title(module.s3_bucket.s3_bucket_id), "-", "")}ItemAccess"
    actions = [
      "s3:PutObject",
      "s3:GetObject",
      "s3:GetObjectVersion",
      "s3:DeleteObject",
      "s3:DeleteObjectVersion"
    ]
    resources = [
      "arn:aws:s3:::${module.s3_bucket.s3_bucket_id}/*",
    ]
  }
}

resource "aws_iam_user_policy" "snowflake_s3_access" {
  name   = "snowflake_s3_bucket_access"
  user   = aws_iam_user.snowflake_s3_accessor.name
  policy = data.aws_iam_policy_document.snowflake_s3_access.json
}

resource "local_sensitive_file" "iam_creds_out" {
  filename = "setup_stage.sql"
  content  = <<-EOT
    CREATE OR REPLACE STAGE poc_external_stage
        URL = 's3://${module.s3_bucket.s3_bucket_id}'
        CREDENTIALS = (
            AWS_KEY_ID = '${aws_iam_access_key.snowflake_s3_accessor.id}'
            AWS_SECRET_KEY = '${aws_iam_access_key.snowflake_s3_accessor.secret}'
        )
        FILE_FORMAT = (TYPE = JSON);
  EOT
}
```
Now we have everything we need to get started, we can begin the Snowflake side of things. We want to create the stage in a similar way to before, but using the external stage syntax. All the values needed can be found in setup_stage.sql.
```sql
CREATE OR REPLACE STAGE poc_external_stage
    URL = 's3://poc-snowflake-external-stage-bucket'
    CREDENTIALS = (AWS_KEY_ID = '<your_access_key>' AWS_SECRET_KEY = '<your_secret_key>')
    FILE_FORMAT = ( TYPE = JSON );

SHOW STAGES;
DESCRIBE STAGE poc_external_stage;

LIST @poc_external_stage;
```
Now the stage is created, we can see that the existing files have already been associated with it by running LIST @poc_external_stage;. From here the same approach as before can be taken to copy in these files, just as if it were an internal named stage.
```sql
CREATE OR REPLACE TABLE src_mock_data_3
(
    file_content     VARIANT,
    file_name        STRING,
    record_id        INT,
    record_last_name STRING
);

COPY INTO src_mock_data_3
    FROM (SELECT $1, metadata$filename, $1:id, $1:last_name FROM @poc_external_stage);

SELECT * FROM src_mock_data_3;
```
3. External Tables
External tables are, syntactically, very similar to normal tables. As the docs describe, they essentially read the files from the remote store when queried. As with the sections above, there are steps we can take to improve the performance, security and isolation of these resources, but here we will focus on getting things up and running.
The first step is to create the table itself.
```sql
CREATE OR REPLACE EXTERNAL TABLE src_mock_data_external
(
    file_name        STRING AS (metadata$filename),
    record_id        INT    AS (VALUE:"id"::INT),
    record_last_name STRING AS (VALUE:"last_name"::VARCHAR)
)
LOCATION = @poc_external_stage
FILE_FORMAT = (TYPE = JSON);
```
The content will automatically be populated into a table we can query. To test out new data, place new files into the bucket, run the refresh command ALTER EXTERNAL TABLE src_mock_data_external REFRESH;, and then run the select query again to see the updates to the underlying data.
```sql
SHOW EXTERNAL TABLES;

SELECT * FROM src_mock_data_external;
```
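As a quick sketch of that round trip (mock_6.json is hypothetical here, standing in for any new file dropped into the bucket):

```sql
-- Hypothetical: after uploading a new mock_6.json to the bucket, refresh the
-- external table's metadata and re-query to see the new rows appear.
ALTER EXTERNAL TABLE src_mock_data_external REFRESH;

SELECT COUNT(*) AS row_count FROM src_mock_data_external;
```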
This manual refresh has to be done each time, unless updates are published to the AWS SQS queue that Snowflake creates for the table. This queue can be notified by anything, but the recommended way is to publish S3 change events to it.
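If you want to wire that up, the queue ARN lives in the notification_channel column of the SHOW EXTERNAL TABLES output; a rough sketch of pulling it out (assuming AUTO_REFRESH was left at its default of TRUE):

```sql
-- Grab the SQS ARN to point S3 event notifications (s3:ObjectCreated:*) at.
-- Column names coming back from SHOW are lowercase, hence the double quotes.
SHOW EXTERNAL TABLES LIKE 'src_mock_data_external';

SELECT "name", "notification_channel"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```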
Caveats & Considerations
- When working with extremely large JSON blobs, anything larger than the max VARIANT size (16,777,216 bytes) will cause stages to fail their COPY INTO commands, and External Tables will silently fail to load the data into the table (see the sketch after this list).
- We have taken some shortcuts, such as writing the AWS integration query to a local file; this can be hardened with proper secrets management and Terraform state storage.
- Setting up change notifications on external stages is probably a very useful tool, meaning you could even process these files in multiple systems at once.
- Understanding the COPY INTO settings and the various Stage settings allows for flexible operations on staged files and reduces processing after load.
- The billing associated with Stages is part of Snowflake's serverless billing and should be understood before any heavyweight processing.
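A hedged way to spot files that would hit that limit before loading, using the size column (in bytes) that LIST returns:

```sql
-- Files over the 16,777,216-byte VARIANT limit are candidates to split or
-- pre-process before loading; LIST's result set is read back via RESULT_SCAN.
LIST @poc_external_stage;

SELECT "name", "size"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "size" > 16777216;
```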