What BI Development Looks Like with BigQuery, Google Cloud APIs, Looker and Fluentd

For the last six months I’ve been working in London, helping build out an analytics service based on Google Cloud Platform, BigQuery and the Looker BI tool. It’s been a very interesting and rewarding experience working within product management in an engineering team that uses agile development practices, develops in Node.js and Clojure, and ships releases every few months.

However, working in product management rather than in development or consulting means I’m more likely to be creating product roadmaps and working on partner enablement than developing with BigQuery and Looker. So I try to use the journey time up to London each day to do some development myself, and to work out which development requests and user stories would be easy or hard for our engineering team to deliver … and because, as a techie and career-long BI and data warehousing developer, this stuff is just too cool and too interesting not to be developing with.

So over the past few months I’ve taken the IoT and wearables data analysis system I put together using Hadoop running on servers back home, and ported it to Google BigQuery and, more recently, Looker running in the cloud. Fluentd acts as the log aggregation engine, running on a Google Compute Engine-hosted VM, collecting, processing and shipping to BigQuery all the data from wearable devices, social media and other services, along with the IoT event data from SmartThings that I previously fed into Logstash and Hadoop. Adding Google Cloud Storage for staging incoming data extracts from services such as Moves and Google Takeout, and running it all on Google Cloud Platform, this updated version of my analytics architecture looks like the diagram below.

The typical cost of running a VM hosting a simple Fluentd server all month is around $80. BigQuery is more or less free for developer use, with charging based largely on how much you query, plus some costs for streaming inserts and data storage; with the first 1TB of queries free each month and minimal storage needs, Google’s cloud data warehouse platform has pretty much taken over from Oracle as my day-to-day development platform. Like Oracle in the days of OTN downloads of on-premise developer tools and databases, with permissive licensing for personal training and prototype applications, you can learn the technology essentially for free and really get to know and understand the tools, without worrying about short trial license periods or having to pay thousands of dollars for full commercial licenses.

I’ve written and presented on Google BigQuery a few times since being introduced to it late last year, so I won’t go into how this distributed, column-store query engine based on Google’s Dremel engine works. But there’s another technology I used in this round of development: an open-source tool called Fluentd, which I used instead of Flume and Logstash for the log aggregation part of this project. I’ve found it particularly useful given its range of input and output data source plugins, and its support for Pub/Sub, BigQuery and other parts of the Google Cloud Platform, which makes it easy to accept data from IFTTT and SmartThings and then stream it, lightly transformed and consistently timestamped, into a set of BigQuery tables.

For example, creating a data feed that brings in metrics on car journeys, via the Dash iOS app and a small Bluetooth device that plugs into your car’s diagnostics port, involves first registering an IFTTT applet to trigger when that app registers a journey-complete event; the applet payload is then an IFTTT Maker webhook that sends the journey metrics to Fluentd in JSON format as an HTTP POST request.
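On the Fluentd side, that webhook just needs an HTTP input listening for the POST. A minimal source definition might look like the sketch below; the port and bind address are assumptions for illustration, not my actual configuration:

```
<source>
  # Fluentd's built-in HTTP input plugin; IFTTT posts to
  # http://<vm-ip>:8888/<tag>, and the URL path becomes the Fluentd tag
  @type http
  port 8888
  bind 0.0.0.0
</source>
```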

Fluentd then receives the JSON document using its HTTP input plugin; processes, transforms and timestamps the record; and sends it as a streaming insert into one of the tables I’ve set up in Google BigQuery. Over that table I’ve then created a SQL view that transforms all of IFTTT’s odd “3h 5m” and “August 14, 2004 at 10:15PM” date and time formats into additional timestamps that Looker and other BI tools can automatically parse into date and time elements.

SELECT date_time,
       TIMESTAMP_TRUNC(date_time, DAY) AS date_day,
       PARSE_TIMESTAMP('%B %d %Y %I:%M%p', REPLACE(REPLACE(StartedAt, ', ', ' '), ' at', '')) AS journey_start_date_time,
       PARSE_TIMESTAMP('%B %d %Y %I:%M%p', REPLACE(REPLACE(EndedAt, ', ', ' '), ' at', '')) AS journey_end_date_time,
       TIMESTAMP_DIFF(PARSE_TIMESTAMP('%B %d %Y %I:%M%p', REPLACE(REPLACE(EndedAt, ', ', ' '), ' at', '')),
                      PARSE_TIMESTAMP('%B %d %Y %I:%M%p', REPLACE(REPLACE(StartedAt, ', ', ' '), ' at', '')), MINUTE) AS journey_mins,
       CAST(REPLACE(DistanceDrivenWithLabel, ' Km', '') AS FLOAT64) AS distance_km,
       CAST(REPLACE(FuelConsumedWithLabel, ' l', '') AS FLOAT64) AS fuel_l,
       CAST(REPLACE(AvgDistancePerHourWithLabel, ' kph', '') AS FLOAT64) AS speed_kph,
       CAST(REPLACE(AvgFuelConsumptionWithLabel, ' kpl', '') AS FLOAT64) AS fuel_econ_kpl
FROM `xxxxx.personal_metrics.fluentd_dash_journeys`
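To see what those nested REPLACE calls are doing to IFTTT’s date strings before PARSE_TIMESTAMP gets them, here’s the same transformation sketched in Ruby; the sample string is illustrative:

```ruby
require "date"

raw = "August 14, 2004 at 10:15PM"             # IFTTT-style timestamp
cleaned = raw.gsub(", ", " ").gsub(" at", "")  # "August 14 2004 10:15PM"

# Same format string as the BigQuery view's PARSE_TIMESTAMP call
parsed = DateTime.strptime(cleaned, "%B %d %Y %I:%M%p")
puts parsed.strftime("%Y-%m-%d %H:%M")         # → 2004-08-14 22:15
```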

One of the nice things you can do with Fluentd is call arbitrary Ruby scripts to enrich incoming data, via the Fluentd script plugin. I’ve used this to analyze all incoming tweets, emails, article saves and other social media communications for sentiment and key entities, using Google’s new Natural Language Processing API. To do this I pass all incoming records of these types through a filter definition in the Fluentd config file, like this…

<filter reformed.socialmedia>
  type script
  path /home/mark/gcp_nlp/nlp_process.rb
</filter>

…with that script then using the Ruby Cloud Natural Language client to calculate the sentiment score and the magnitude of that sentiment, extract the entities from the incoming text body, and pass these back to the output plugin as new Fluentd record keys, written as additional columns into the BigQuery table I use to store all these communication events.

def configure(conf)
  require "google/cloud/language"
end

def start
end

def shutdown
end

def filter(tag, time, record)
  project_id = "xxxxxxxx"
  language = Google::Cloud::Language.new project: project_id
  text = record["post_body"]
  document = language.document text
  sentiment = document.sentiment
  entities = document.entities
  record["sentimentscore"] = "#{sentiment.score}"
  record["sentimentmagnitude"] = "#{sentiment.magnitude}"
  entitylist = ""
  entities.each do |entity|
    entitylist = entitylist + "#{entity.name} "
  end
  record["entityname"] = entitylist.strip
  record
end
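Since running that filter needs live API credentials, here’s the same record-enrichment logic exercised offline, with stand-in sentiment and entity objects; the struct names and sample values are hypothetical, purely for illustration:

```ruby
# Stand-ins for the Cloud Natural Language response objects (hypothetical)
Sentiment = Struct.new(:score, :magnitude)
Entity    = Struct.new(:name)

record    = { "post_body" => "Really enjoying BigQuery and Looker" }
sentiment = Sentiment.new(0.8, 1.6)
entities  = [Entity.new("BigQuery"), Entity.new("Looker")]

# Same enrichment steps as the filter script above
record["sentimentscore"]     = "#{sentiment.score}"
record["sentimentmagnitude"] = "#{sentiment.magnitude}"
entitylist = ""
entities.each { |entity| entitylist += "#{entity.name} " }
record["entityname"] = entitylist.strip

puts record["entityname"]   # → "BigQuery Looker"
```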

Fluentd can also collect the device and sensor events from my SmartThings hub, using a Groovy “SmartApp” running in Samsung’s SmartThings back-end cloud service. All that’s left to do is join the incoming hub events to a lookup table that classifies the IoT events as sensors, switches, colour settings and other event types and groups them by room, and you’ve got a useful set of home environment and activity data to join up with the other datasets.
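That classification join can be sketched like this; the lookup table name and columns below are assumptions for illustration, not my actual schema:

```sql
SELECT TIMESTAMP_TRUNC(e.date_time, DAY) AS date_day,
       l.room,
       l.event_type,           -- e.g. sensor, switch, colour setting
       COUNT(*) AS events
FROM `xxxx.personal_metrics.fluentd_smartthings` e
JOIN `xxxx.personal_metrics.smartthings_device_lookup` l
  ON e.device = l.device
GROUP BY date_day, l.room, l.event_type
```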

Other services such as Moves have APIs that I connect to every few hours using clients such as this one, also on GitHub, running on the same Google Compute Engine VM as the Fluentd log collector; new movement data is copied across to a Google Cloud Storage bucket and inserted every few hours into another BigQuery table, ready for analysis.

I’ve also been collecting weight and sleep readings, but these can be sporadic: weight is collected via a Withings WiFi scale but often only every few days, whilst sleep time in minutes should be recorded every day but sometimes I forget to wear my health band. So I used BigQuery SQL’s analytic windowing functions to calculate my average weight over seven days (a good approach anyway, as your weight can fluctuate day-to-day), and over three days for sleep.

SELECT date_time, weightKg, FatMassKg,
       AVG(weightKg) OVER (ORDER BY ts
         RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW) AS weightKg_avg_7_days  -- 604800 seconds = 7 days
FROM (
  SELECT TIMESTAMP_TRUNC(d.date_time, DAY) AS date_time,
         AVG(w.weightKg) AS weightKg,
         AVG(w.FatMassKg) AS FatMassKg,
         UNIX_SECONDS(TIMESTAMP_TRUNC(d.date_time, DAY)) AS ts
  FROM `aerial-vehicle-148023.personal_metrics.date_dim` d
  LEFT OUTER JOIN `xxxx.personal_metrics.fluentd_weighings` w
    ON d.date_time = TIMESTAMP_TRUNC(w.date_time, DAY)
  GROUP BY d.date_time, ts
)
ORDER BY date_time ASC

These two views are then left-outer joined to the main health daily view so that I’ve got weight and sleep readings every day alongside step count, calories burned and other metrics from my Jawbone UP health band.
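That daily join can be sketched as below; the view names and measure columns are illustrative assumptions, not the actual object names:

```sql
-- Sketch: every day from the health view, with the smoothed
-- weight and sleep averages attached where they exist
SELECT h.date_day,
       h.total_steps,
       h.calories_burned,
       w.weightKg_avg_7_days,
       s.sleep_mins_avg_3_days
FROM `xxxx.personal_metrics.v_health_daily` h
LEFT OUTER JOIN `xxxx.personal_metrics.v_weight_7_day_avg` w
  ON h.date_day = w.date_time
LEFT OUTER JOIN `xxxx.personal_metrics.v_sleep_3_day_avg` s
  ON h.date_day = s.date_time
```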

This leaves me at the point where I’ve got a pretty interesting dataset running in a Google BigQuery cloud-hosted, column-store data warehouse, with the data updated in real-time, enriched and transformed ready for analysis.

For now, though, the thing that’s made this all really interesting is joining it all up and then analyzing it as one big dataset using Looker, another tool I talked about recently on this blog and one which re-introduces an old concept familiar to BI veterans, the “semantic model”, to this new world of distributed, cloud-based big data analytics.

Looker the company and Looker the BI tool are both attracting a lot of attention right now, after Google’s recent investment and the general buzz around big data analytics running on the Google and Amazon cloud platforms. I’ve been working with Looker for around six months now as the preferred BI tool for the platform I’m working on. Looker’s philosophy towards analytics, together with the parallels I can see between the enterprise semantic models I used to create with tools such as Oracle BI and Looker’s updated take on that idea, led me to sign up as a developer partner with Looker in order to understand their platform and technology even more closely; it’s the same thought process I went through just over ten years ago when Oracle acquired Siebel and introduced me to Siebel Analytics, which became the focus of my career and the subject of two books as, in time, Oracle Business Intelligence.

To be clear, and with the benefit of having worked with BI tools for almost 20 years: Looker the BI tool is still very primitive in many ways. In terms of data visualization options and general UI and usability it’s, frankly, pretty basic; its quirky use of terminology such as “looks” (for reports), “explores” (for subject areas) and “views” (for tables), and many activities we’re used to being simple and intuitive in existing BI tools, such as creating and saving catalogs of reports, are non-obvious to end-users or just plain missing. But crucially, all of these missing features and UI quirks can be iterated on in a SaaS product, and instead Looker have got two things right:

  • Their SQL query generation engine was designed from day one to work with distributed query engines like BigQuery
  • They’ve taken the concept of a semantic model, data model abstraction and combining enterprise datasets but crucially, done it in a way that software engineers and data analysts working in their target market can relate to

Looker was designed from the start to query the underlying data source directly, rather than create a mid-tier cache and query that instead. Tools like Tableau and QlikView did well for many years running user queries off a local cache on the user’s desktop, giving those users much faster response times than they were used to with old enterprise BI tools that tried to speed up queries by creating layers of aggregate tables; but this approach just isn’t practical with massive, constantly-updated big data sets.

Looker also generates SQL queries aware of the way platforms like BigQuery charge for their use: by the amount of data queried, rather than by server processors or named users. BigQuery is a column-store database that reads data only for those columns you ask for, but tools like Tableau routinely request all columns in a table, making them impractical for these types of data stores. Looker also aggressively eliminates joins from queries, and supports the nested-column approach preferable for distributed data stores like BigQuery, where joins are much more expensive to process than we’re used to with relational databases.
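To make that charging model concrete, here’s a sketch against the journeys table from earlier; BigQuery bills a query only for the bytes stored in the columns it references:

```sql
-- Scans (and bills for) only these two columns' storage:
SELECT date_time, DistanceDrivenWithLabel
FROM `xxxxx.personal_metrics.fluentd_dash_journeys`;

-- A SELECT * here would bill for every column in the table,
-- which is why column-greedy BI tools get expensive on BigQuery.
```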

Finally, Looker has introduced a concept very familiar to traditional BI developers but new to most developers working on big data analytics projects: the semantic model, metadata layers and data abstraction. Importantly, though, the approach they’ve taken in LookML is designed to appeal to the actual developers building out these types of systems today: software engineers working on web and big data projects who write code, use modern development lifecycles and version everything in a distributed version control system such as Git.

So, to take the dataset I put together in BigQuery: the first step in the development process with Looker is to create what are termed “views” for each of the underlying BigQuery table or view sources. In the example below you can see an abbreviated definition of the “health stats” LookML view (there are more measure definitions in the full view definition file). Things to note in this modelling approach are: (1) it’s all code, with no GUI for developing it graphically, which suits software engineers just fine as it’s easier from day one to code, version and diff (compare) with other code releases; (2) the dimension_group defined at the start takes care of extracting all the different time periods out of a BigQuery timestamp datatype; and (3) the measure type definitions of sum, average and so on give Looker more flexibility in aggregating (and symmetrically aggregating, what we used to call fan-trap avoidance in the old days) measure values.

view: v_health_stats {
  sql_table_name: rittman_dw.v_health_stats ;;

  dimension_group: date_day {
    type: time
    hidden: yes
    timeframes: [raw, time, date, week, month, quarter, year]
    sql: ${TABLE}.date_day ;;
  }

  dimension: pk {
    type: string
    hidden: yes
    primary_key: yes
    sql: ${TABLE}.pk ;;
  }

  measure: total_active_mins {
    type: sum
    sql: ${TABLE}.active_mins ;;
    value_format_name: decimal_2
  }

  measure: avg_sleep_mins_avg_3_days {
    type: average
    label: "Avg Sleep in Mins"
    sql: ${TABLE}.sleep_mins_avg_3_days ;;
    value_format_name: decimal_2
  }

  measure: avg_weightKg_avg_7_days {
    type: average
    label: "Avg WeightKg"
    sql: ${TABLE}.weightKg_avg_7_days ;;
    value_format_name: decimal_2
  }

  measure: total_distance_km {
    type: sum
    sql: ${TABLE}.distance_km ;;
    value_format_name: decimal_2
  }
}
Once you’ve defined all your views, the next step is to define your model, made up of what are termed “explores” — think of models as the equivalent of a universe in Business Objects or a repository in Oracle Business Intelligence, and explores as subject areas linking together sets of views that join on common columns.

connection: "rittman_bigquery"

# include all the views
include: "*.view"
# include all the dashboards
include: "*.dashboard"

label: "Rittman DW"

explore: fluentd_log {
  label: "Rittman BigQuery DW"
  view_label: "1 - Home Metrics"

  join: v_communications {
    type: left_outer
    view_label: "4 - Comms & Social Media Metrics"
    sql_on: ${fluentd_log.date_date} = ${v_communications.date_date} ;;
    relationship: many_to_one
    fields: [v_communications.post_author, v_communications.post_body, ...]
  }

  join: v_health_stats {
    type: left_outer
    view_label: "2 - Health Metrics"
    sql_on: ${fluentd_log.date_date} = ${v_health_stats.date_date} ;;
    relationship: many_to_one
    fields: [v_health_stats.total_steps, v_health_stats.total_distance_km, ...]
  }

  join: v_car_trips {
    type: left_outer
    view_label: "6 - Car Journey Metrics"
    sql_on: ${v_car_trips.date_date} = ${fluentd_log.date_date} ;;
    relationship: many_to_one
    fields: [v_car_trips.fuel_econ_kpl, ...]
  }
}
Truth be told, the net effect of all this isn’t conceptually much different to the enterprise semantic layers I’ve built for Oracle Business Intelligence and similar systems over the years, and LookML is in many ways just today’s implementation of what nQuire did back in the late ’90s with Logical SQL and its logical and physical abstraction layers over physical data sources. But, crucially, Looker’s LookML metadata layer was designed primarily for use through scripting and modern software development practices, and it aligns perfectly with the development lifecycle and tooling used by the types of millennial developers who wouldn’t be seen dead working with the sort of development lifecycle I used to present about a few years ago for an earlier generation of BI tools.

What this gives me, once all the development work is complete and the data is ready to be analyzed in looks and dashboards, is something like the screenshot below; note the semantic model on the left-hand side, with subject areas corresponding to the BigQuery views but with further grouping, friendly labelling and organization into measures and dimension hierarchies and attributes.

As Looker queries the underlying BigQuery data store directly, rather than an aggregated subset of that data, I can query right down to the lowest-level detailed events in the dataset, with new events streaming in all the time. Using Looker’s set of tabular, chart and mapping visualization options I can map out where I’ve been over a period of time and what I did, and correlate that with other data in the dataset, each view within the Looker explore joined together by date so that it becomes one big dataset for analysis, as I tweeted about one evening last week when it all came together.


Now it’s all running there’s more I’d like to do with the various Google Cloud Platform APIs, including using their new Cloud Vision API to classify and make searchable all the family photos I’ve collected over the years, and there are lots of Looker features and capabilities I’ve not yet explored in detail, including Looker Actions, a concept very familiar from my days working with Oracle BI and its Action Framework. For now, though, I thought it would be interesting to share my thoughts on this new platform and the development process from the perspective of a long-term traditional BI and data warehousing developer and architect. Feel free to add comments or ask questions if you’re thinking of starting out with BigQuery, Looker or Fluentd in the future.