The ability to partition tables in BigQuery has been around for some time, and for people who deal with time series data it is a real boon. It is not only a cost saver but also a great time saver. There are two ways of partitioning a BigQuery table:
- Based on ingestion time, and
- Based on a user specified date column
As I deal with large amounts of web analytics data, for over 200 sites, the ability to day-partition has been very useful. A single month’s data runs into terabytes, and since BigQuery charges on the amount of data being queried (in addition to the amount of data being stored), a partitioned table allows me to specify my period of interest by date, which can reduce the amount of data being queried drastically. However, even with day partitioning, a single day’s worth of data used to be upwards of 40-50 GB. And in the news business, the smallest useful period of analysis is a week, which for my data always ended up close to 350 GB. Since people have daily routines around news consumption that vary more by day of the week than by week of the month, a week-on-week analysis lends itself well to trend analysis.
For me the challenge was that I didn’t always want to analyse all my websites. In fact, even if I wanted to analyse just one of my 200 sites, my queries would effectively read each column across all 200 sites. Sounds wasteful, suboptimal and, not to mention, expensive. Enter clustered tables. In this post, I will cover how clustered tables helped me solve that problem, and then some :)
Creating clustered tables
As far as I know, clustered tables can only be created on time-partitioned tables. While BigQuery’s web UI is quite capable, at the time of writing this post, it is not possible to create clustered tables via the web UI. The way I did it was to create a table first via the command line interface (CLI) and then load my data into it.
For this use case I am assuming the data is stored in GCS as compressed newline-delimited JSON (ndjson) files. While it is possible to create a partitioned table at the time of loading the data (i.e. the table creation and the data load can happen in the same step), it is not possible (as of now) to simultaneously create a clustered table. Hence this is a two-step process: Step 1 is to create an empty table, and Step 2 is to load data into it.
In my dataset, I have over 130 columns but the columns of interest look like this:
## timestamp site_code action
## 1 2018-10-31 13:45:30 UTC Site_1 pageview
## 2 2018-10-30 12:45:34 UTC Site_2 event_1
## 3 2018-10-29 15:05:24 UTC Site_1 event_2
## 4 2018-10-28 11:15:54 UTC Site_3 pageview
## 5 2018-10-27 01:45:11 UTC Site_n event_n
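For the load step later on, each line of the compressed ndjson files sitting in GCS carries these fields as plain key-value pairs. A hypothetical line, trimmed down to just these three columns, might look like this:
{"timestamp": "2018-10-31 13:45:30 UTC", "site_code": "Site_1", "action": "pageview"}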
Given my sample dataset my schema will look like this:
[
{"name" : "timestamp",
"type": "TIMESTAMP",
"mode": "NULLABLE"},
{"name" : "site_code",
"type": "STRING",
"mode": "NULLABLE"},
{"name" : "action",
"type": "STRING",
"mode": "NULLABLE"}
]
For the sake of simplicity, store the above JSON in the current working directory as myschema.json.
Note that my partitioning field is going to be timestamp and my clustering fields are going to be site_code and action. The order in which clustering is done matters. Remember the clustering order when you run queries on the table.
Create a dataset in BigQuery called my-dataset.
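If the dataset doesn’t already exist, it can also be created from the terminal with bq; a minimal sketch:
bq mk --dataset my-dataset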
Now call the Google Cloud SDK’s bq command in your terminal to create the table.
bq mk -t \
  --schema ./myschema.json \
  --time_partitioning_type=DAY \
  --time_partitioning_field=timestamp \
  --require_partition_filter=TRUE \
  --clustering_fields='site_code,action' \
  my-dataset.my-clustered-table
This should create a new table called my-clustered-table within the existing dataset called my-dataset.
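To double-check that the partitioning and clustering settings took effect, the table’s metadata can be inspected; the prettyjson output should include timePartitioning and clustering entries matching the flags passed to bq mk above.
bq show --format=prettyjson my-dataset.my-clustered-table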
Now load the data into the table, again using bq in your terminal.
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --max_bad_records=1000 \
  my-dataset.my-clustered-table \
  gs://my-bucket/my-json-files/*
This should do the job.
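As a quick sanity check after the load, you can count the rows in a single partition. The date below is just an example; note that a filter on timestamp is mandatory here because the table was created with --require_partition_filter=TRUE.
bq query --use_legacy_sql=false \
'SELECT COUNT(*) AS row_count
 FROM `my-dataset.my-clustered-table`
 WHERE timestamp >= TIMESTAMP("2018-10-31")
   AND timestamp < TIMESTAMP("2018-11-01")'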
Now when a query is run, it is important to note two things:
- Date partitioning - remember to specify the date range of interest, or else BigQuery will scan the entire table (and because the table was created with --require_partition_filter=TRUE, a query with no filter on the partition column will fail outright)
- Clustering order - don’t just filter on the clustering fields; filter on them in the same order in which they were specified when the table was created
So, let’s assume that I need to run a query that shows me all the pageviews for my website called Site_1 for the month of Sep-2018. Whether the table is clustered or not, the query looks like this:
--running this query on a non-clustered table and on a clustered table
SELECT
  *
FROM
  `my-project.my-dataset.my-clustered-table`
WHERE
  timestamp >= TIMESTAMP("2018-09-01")
  AND timestamp < TIMESTAMP("2018-10-01") --half-open range covers all of Sep-2018
  AND site_code = "Site_1" --clustering order is important here
  AND action = "pageview"
The above query typically scanned around 1TB of data for me. But the same query on a clustered table, on average, scans less than 100GB, a saving of roughly 90% on the original query. Note that BigQuery is not very accurate at predicting the amount of data that will be scanned when the table is clustered: the upfront estimate cannot account for cluster pruning, so Google plays it safe by quoting the maximum amount of data that might be scanned. In my experience Google predicts that the above query would read 500GB even though it actually reads about a fifth of that.
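If you want to see that estimate yourself before spending anything, bq supports a dry run; for a clustered table the bytes it reports are the upper bound after partition pruning only, while the actual bytes billed appear in the job statistics once the query has run.
bq query --use_legacy_sql=false --dry_run \
'SELECT *
 FROM `my-project.my-dataset.my-clustered-table`
 WHERE timestamp >= TIMESTAMP("2018-09-01")
   AND timestamp < TIMESTAMP("2018-10-01")
   AND site_code = "Site_1"
   AND action = "pageview"'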
Hope this was helpful. Happy querying.