
Connect Snowplow to Snowflake

I have already mentioned that we use Snowplow as the backbone system for collecting our events. What I did not mention is that setting up Snowplow paired with Redshift last time was a long and awful ride, mostly because of rather poor Snowplow documentation and some mess with the existing Snowplow components. I was thinking about writing a proper tutorial on setting up Snowplow once we finished, but the documentation issues were too complex to reconstruct in one list several weeks later. So I let it go.

Last week I needed to check whether it is possible to set up the Snowplow Loader with Snowflake. I was expecting issues with the documentation and was prepared to write down all my pain along the way. So here it is, my tutorial "How to set up Snowplow components with Snowflake". I hope it saves some time for anyone who follows it step by step. But before we start, a couple of notes:

  1. Start with the official documentation: before you begin, read all 5 articles. It is not a step-by-step tutorial; sometimes you need knowledge from the later articles to tune an action mentioned in the first one.
  2. Accept that nothing should be started locally (or at least, it will need some special tricks). I spent a lot of time just trying to get all the jars mentioned in the official documentation. It cannot be done: some of the jars are stored in private buckets, and for the others it is really hard to find a download link (links are spread all around the old and new documentation).

Let's start with the setup article: work through the document up to Setting up Snowflake and finish the Snowflake storage integration. I would add that you should also perform a couple of actions the docs miss. After creating the integration, grant its ownership to SYSADMIN. And make sure your Snowflake stage name does not contain - (it makes the loader produce an error).
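For reference, the extra grant is a single Snowflake statement; a minimal sketch, assuming your integration is called snowplow_integration (a placeholder, use the name you created in the setup article):

-- transfer ownership of the storage integration so SYSADMIN can manage it
GRANT OWNERSHIP ON INTEGRATION snowplow_integration TO ROLE SYSADMIN;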

It is strange, but the Setting up Snowflake section itself should be skipped. It is easy to misread it and think you need to perform some CLI command to set up Snowflake, but that is not so. Just skip it and move on.

Your next stop is DynamoDB. I was not sure what the Snowplow developers meant by the word manifest. It confused me a lot and gave me the idea that I was missing some part, because there was no definition of the term... Later, I found out that manifest is simply the name of the config field where you put the DynamoDB table name. Anyway, all you need is to create a DynamoDB table with one key attribute called RunId and write down the name of the table. You will need it later.
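If you prefer the AWS CLI, here is a rough sketch of that step (the table name snowplow-run-manifest and the on-demand billing mode are my placeholders, pick whatever suits you):

# create the manifest table with RunId as the partition key
aws dynamodb create-table \
    --table-name snowplow-run-manifest \
    --attribute-definitions AttributeName=RunId,AttributeType=S \
    --key-schema AttributeName=RunId,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST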

The next stop is Snowflake itself. It looks like you need to create the atomic.events table in Snowflake on your own (or do you?). It is not clear what you should do here... Nothing. It is mentioned in the setup instructions just FYI; the table will be created automatically.

Next, Dataflow Runner has two big JSON configs. Filling in all the gaps is not a problem. I would recommend raising amiVersion from 5.9.0 to 5.18.0: it unlocks EC2 instances of the 5th generation (with 5.9.0 you can only use the old 4th-gen instances). It is also better to amend the default Spark configuration in cluster.json. It can significantly increase cluster performance, but it should be tuned for your particular task. Here is my config, which works well for me, just for reference (a fragment showing the amiVersion change follows it):

"configurations":[
  {
    "classification":"core-site",
    "properties":{
      "Io.file.buffer.size":"65536"
    }
  },
  {
    "classification":"mapred-site",
    "properties":{
      "Mapreduce.user.classpath.first":"true"
    }
  },
  {
    "classification":"yarn-site",
    "properties":{
      "yarn.resourcemanager.am.max-attempts":"1"
    }
  },
  {
    "classification":"spark",
    "properties":{
      "maximizeResourceAllocation":"false"
    }
  },
  {
    "classification":"spark-defaults",
    "properties":{
      "spark.dynamicAllocation.enabled": "false",
      "spark.executor.instances": "6",
      "spark.yarn.executor.memoryOverhead": "1024",
      "spark.executor.memory": "9G",
      "spark.executor.cores": "5",
      "spark.yarn.driver.memoryOverhead": "2048",
      "spark.driver.memory": "13G",
      "spark.default.parallelism": "60",
      "spark.network.timeout": "800"
    }
  }
]
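And the amiVersion bump mentioned above is a one-line change in cluster.json. A trimmed sketch, assuming the standard template layout where amiVersion sits under the ec2 section (all other fields of the official template left out and kept as they are):

{
  "data": {
    "ec2": {
      "amiVersion": "5.18.0"
    }
  }
}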

Next, it looks like you need to run something from your CLI (dataflow-runner), but you do not. It should be started in the cloud, on a machine with access to Amazon EMR. So just leave it as is for now. In the playbook, you can see a mention of the loader-config file; it is created on the next page.

Do you remember the DynamoDB manifest? This is where you put it. All the other fields are obvious.
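To make it concrete, the manifest entry is just one key inside the loader config; a trimmed sketch (every other field left out, keep the rest of the official template as is):

{
  "data": {
    "manifest": "your-dynamodb-table"
  }
}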

The next page is Cross-batch deduplication. It was not important for me, so I skipped it.

Backpopulating the manifest is very important if you are migrating from another storage to Snowflake. Instead of importing all raw Snowplow events with the EMR cluster, you can copy them from your existing storage directly into the Snowflake table and mark that data as already imported for the Snowflake loader. Be careful: there are several errors on that page.

First of all, you won't be able to complete these CLI commands:

$ pip install boto3 snowplow_analytics_sdk
$ wget https://raw.githubusercontent.com/snowplow-incubator/snowplow-snowflake-loader/release/0.4.0/backfill.py   # Won't actually be downloaded as repository is private

The link is incorrect (judging by the comment, the author knew it was broken but forgot to fix it). Instead, use:

pip3 install boto3 snowplow_analytics_sdk

wget https://raw.githubusercontent.com/snowplow-incubator/snowplow-snowflake-loader/release/0.7.1/backfill.py

Next, the CLI command also has an error:

./backfill.py \
    --startdate 2017-08-22-01-01-01 \
    --region $AWS_REGION \
    --manifest-table-name $DYNAMODB_MANIFEST \
    --enriched-archive $TRANSFORMER_INPUT \
    --aws-access-key-id=$AWS_ACCESS_KEY_ID \
    --aws-secret-access-key=$AWS_SECRET_KEY

You won't be able to run the Python file as a regular bash script, so the correct command looks like this:

python3 backfill.py --startdate 2020-12-23-01-01-01 --region us-east-1 --manifest-table-name your-dynamodb-table --enriched-archive s3://path/to/archive --aws-access-key-id=${AWS_ACCESS_KEY_ID} --aws-secret-access-key=${AWS_SECRET_ACCESS_KEY}

At last, you are ready to push your configs to an EC2 instance that will start the EMR cluster. Your push should contain (see the copy example after the list):
cluster-config.json
playbook.json
iglu-resolver.json (take a look at my other article)
run_emr_cluster.sh
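A rough sketch of getting those files onto the box (the host and target path are placeholders, use whatever instance you run the loader from):

# copy the configs and the runner script to the EC2 instance
scp cluster-config.json playbook.json iglu-resolver.json run_emr_cluster.sh \
    ec2-user@your-emr-runner-host:~/snowplow/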

In the last .sh file you should put all the CLI commands you have seen throughout the official tutorial. Take a look at mine as an example:

<some regular commands>...
pip3 install boto3 snowplow_analytics_sdk

wget https://raw.githubusercontent.com/snowplow-incubator/snowplow-snowflake-loader/release/0.7.1/backfill.py

python3 backfill.py --startdate 2020-12-23-01-01-01 --region us-east-1 --manifest-table-name your-dynamodb-table --enriched-archive s3://path/to/archive --aws-access-key-id=${AWS_ACCESS_KEY_ID} --aws-secret-access-key=${AWS_SECRET_ACCESS_KEY}

aws s3 cp s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.7.1.jar .

java -jar snowplow-snowflake-loader-0.7.1.jar setup --base64 --config $(cat ./loader-config.json | base64 -w 0) --resolver $(cat ./iglu-resolver.json | base64 -w 0)

/dataflow/dataflow-runner run-transient --emr-config ./cluster-config.json --emr-playbook ./playbook.json