
Extracting data from BigQuery into Redshift

Another morning, another challenge! :) This time I had to integrate data collected in Google Ads into our data warehouse, which is based on Redshift. I started with Google and found several more or less tricky ways to get the data into Redshift:

  1. Use a paid ETL service.
  2. Write a self-hosted script that captures data from Google Ads directly via the API.
  3. Transfer Google Ads data into Google BigQuery and then turn my task into "Transfer data from BigQuery into Redshift".

As you can see from the title, I started to dig into the last option, as it looked like something I could easily implement. We use the singer.io framework in our tech stack, and I was sure that Singer had a BigQuery adapter! So I would just put another adapter into my Singer repo and the data would start to flow.

Singer actually has an official BigQuery adapter... but it is a target adapter (in Singer terms this means the adapter can write into BigQuery, not extract from it). However, I noticed that there is a very simple BigQuery extract adapter written by a single person (extract adapters in Singer are called 'taps'). I gave it a try, and it works in a very simple and reliable way. Still, it had several issues:

  1. It does not fully support the Singer contract, so not all Singer target adapters are compatible with it.
  2. There is no ready package on PyPI: you have to build and install it yourself.
  3. The description is outdated and copy-pasted from another tap.

I was able to resolve all those issues and made a pull request with the updates to the original repo, and it was merged by the author. I also made a package, so you can now install it via 'pip install'.

Under the hood, it just extracts the data as CSV files via the BigQuery API, puts them into an S3 bucket, and runs a COPY command to import the data into Redshift.
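Just to illustrate the last step: the load boils down to a Redshift COPY statement along these lines (table, bucket and credentials here are made-up placeholders; the target generates the real statement for you):

-- staged CSV on S3 is loaded into the target table
COPY google_ads.google_ads_stats
FROM 's3://my-etl-bucket/tap_bigquery/google_ads_stats.csv'
CREDENTIALS 'aws_access_key_id=<KEY>;aws_secret_access_key=<SECRET>'
CSV;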

To create the pipeline between BigQuery and Redshift locally you need the following commands:

>python -m venv ./app/env_bigquery
>source ./app/env_bigquery/bin/activate; pip install tap-bigquery;
>python -m venv ./app/env_redshift
>source ./app/env_redshift/bin/activate; pip install pipelinewise-target-redshift;

>export GOOGLE_APPLICATION_CREDENTIALS="./client_secret.json"

>tap_bigquery -c config.json -d > catalog.json

>tap_bigquery -c config.json --catalog catalog.json --start_datetime '2020-05-01T00:00:00Z' --end_datetime '2020-05-01T01:00:00Z' | target-redshift --config redshift_config.json
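For orientation, here is roughly what the two config files look like. This is a sketch based on the repos' READMEs; the project, bucket and credential values are placeholders, so check each README for the exact set of options. config.json (tap side) is a list of streams, each pointing to a BigQuery table and naming the datetime_key used for the extraction window:

{
  "streams": [
    {
      "name": "google_ads_stats",
      "table": "`my-project.my_dataset.google_ads_stats`",
      "columns": ["campaign_id", "clicks", "cost", "event_time"],
      "datetime_key": "event_time"
    }
  ]
}

redshift_config.json (target side) holds the Redshift connection details plus the S3 bucket used for staging, roughly:

{
  "host": "my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
  "port": 5439,
  "user": "etl_user",
  "password": "********",
  "dbname": "analytics",
  "aws_access_key_id": "<KEY>",
  "aws_secret_access_key": "<SECRET>",
  "s3_bucket": "my-etl-bucket",
  "default_target_schema": "google_ads"
}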

It is quite easy: install the tap and the target in separate Python environments, export the secret key of the Google service account (check the readme in the tap repo), and amend the config for the tap. Next, run the tap in discovery mode (the -d option); it will produce the schema of the BigQuery tables as output. Check it! I had to amend the types of several fields. As the final step, just start tap-bigquery with all the options, including start and end date-time (they limit the extracted data by the datetime_key you can specify for every table in the config), and redirect its output into the Redshift target.
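To give an idea of what to check in the discovery output, here is a trimmed catalog.json fragment where a numeric field that was discovered as a string has been amended to a nullable number (the stream and column names are made-up):

{
  "streams": [
    {
      "tap_stream_id": "google_ads_stats",
      "stream": "google_ads_stats",
      "schema": {
        "type": "object",
        "properties": {
          "cost": {"type": ["null", "number"]},
          "event_time": {"type": ["null", "string"], "format": "date-time"}
        }
      }
    }
  ]
}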

For incremental extraction you have to use the start_datetime and end_datetime keys (proper support for the Singer state message is not implemented). I have created the following shell script to work around it:

#!/bin/bash

# Load endDate saved by the previous run (if any)
source ./state.conf

if [ -z "$endDate" ]
then
      echo "\$endDate is empty, use default value"
      endDate=$(date --utc +%F --date="15 days ago")
      echo "$endDate"
fi

# Extract the window between the last run and yesterday
startDate=$endDate
endDate=$(date --utc +%F --date="1 day ago")

export GOOGLE_APPLICATION_CREDENTIALS="./client_secret.json"

./env_bigquery/bin/tap_bigquery -c bigquery_config.json --catalog catalog.json \
--start_datetime "$startDate" --end_datetime "$endDate" | \
./env_redshift/bin/target-redshift --config redshift_config.json

# Persist endDate for the next run
declare -p endDate > ./state.conf

It just saves and loads the value for the start_datetime parameter to and from a file via the standard bash commands source and declare. Quite simple.
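For reference, after a run state.conf contains just the output of declare -p, something like:

declare -- endDate="2020-05-14"

On the next run, source brings endDate back, it becomes the new startDate, and the extraction window rolls forward by itself.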