Since I started working at Reflection one of our main goals has been to centralise all of our data into Google Cloud’s BigQuery platform. This would enable our overworked data scientist to undertake analysis and statistics by means of just a few queries and without the need to write long python scripts. As we collect approximately 6-7 GigaBytes of data a day this task was easier said than done.
The BigQuery service is able to query huge amounts of data fast from multiple tables without us worrying about optimising the data structure or table indexing. In contrast to relational databases, BigQuery does not allow primary keys, table indexing or foreign keys and discourages update and deletion of rows. The main expense for the BigQuery service is actually querying, in fact, each TeraByte of data processed costs $5. This can become rather expensive as the system scans through all rows of a table when running a query (no table indexing). Therefore if you were to run 1000 queries over a table containing 10GB of data would cost approximately $10. Our own monthly BigQuery bill runs to around $200 a month.
Our first attempt at consolidating everything was to create new BigQuery tables every day and stream all the data we were collecting row by row. Creating a table every day is recommended to keep querying costs down. The streaming approach was very unreliable and slow for our purposes mainly because of the large amount of data we need to save. Because of this we had to split the job of saving the data into multiple tasks and sometimes a task would fail halfway through and re-run resulting in duplicate entires (remember no primary keys) in the BigQuery table. To make matters worse when the data we were saving was time sensitive we ended up with duplicate entries from two different moments in time. This could have resulted in things like displaying two different ranks for the same app in a single leaderboard. One of the last drawbacks of streaming data into BigQuery is that it costs 0.05 dollars per GigaByte. Speed-wise trying to save a 3 GigaByte file takes approximately 6 hours still we were trying to improve the performance by splitting the job into multiple tasks and the tasks across multiple machines. After some investigation we realised that this is not the only approach to load into BigQuery!
Digging through the Google Cloud documentation we realised that there are 3 ways of loading data into BigQuery:
- Loading a CSV or JSON file from Google Cloud Storage (GCS) into a native table.
- Referencing a table to a list of CSV/JSON files in Google Cloud Storage.
With the exception of streaming, loading data into BigQuery is free. Therefore, our second attempt was to translate the data we collect into CSV, save it into GCS and then load it into a BigQuery native table. This time things went much smoother: the data in BigQuery was sound, the time sensitive data consistent, and uploading the whole of 6-7 GigaBytes of data to BigQuery was relatively fast. In fact, to load a 3 GigaByte file from storage into a BigQuery took approximately 5 minutes.
There is one drawback to this approach: often there was the need to change and/or update past data. In order to do this we would have to to re-upload all the CSV files into the BigQuery table. Fortunately, the ‘referencing a table’ approach came in handy. We first separated the static tables (those created once and filled with data would not change in the future) and dynamic tables (tables that can be updated more or less often). The dynamic tables would then be created by pointing BigQuery to a Google Cloud Storage directory (e.g. “gs://bucket/folder/date/*”). BigQuery would dynamically pick up new files in that folder without the need to recreate and reload the tables. By simply translating the data into CSV and saving it in the right GCS folder at any point in time we would have the BigQuery table updated automatically. The only drawback to this approach is that querying referenced tables can be slower than the native tables. For example querying a 17 GigaByte native table takes approximately 3 seconds whereas querying a referenced table of the same size can take around 33 seconds. There are also other limitation of using this approach: you can find more here External-Data-Source.
Pros & Cons of Three Types of BigQuery Data Loading
- Pros: Easy to save new rows (maybe).
- Cons: Can become costly depending on the amount of data stored; slow to upload large amounts of data.
- Pros: Free.
- Cons: Unable to easily update the data in a table without recreating the whole table.
- Pros: Free, Easy to save update and delete chunks of data.
- Cons: Querying is slower for the user.
There is a nice additional feature that BigQuery provides: when creating native tables and referenced tables you can either construct the table fields and datatypes on your own or let BigQuery do the job for you. This can come in handy, although it is not 100% reliable since the schema is generated by scanning only the first 100 rows of the CSV files. Find more about automatic schema detection here BigQuery Schema Detect.
The Conclusion is…
In conclusion, BigQuery is a very easy and efficient tool for data analysis however getting the data into the platform can be complicated and laborious. This effort was worth it in our case because now our data scientist can access all the data he needs without having to do additional work.