October 23, 2019

Linking user databases and usage analytics, to get metrics on your service’s group accounts rather than individuals

Using Segment.io’s Data Warehouses, AWS RDS Postgres and Redshift, and AWS Data Pipeline

The problem

You’re part of a business who want to make data-driven decisions about how to improve their SaaS product. It’s your job to provide great analytics to help with that. But wait! your product is not B2C, it’s B2B and all of your most important metrics are things like

  • % of companies on paid plan
  • % of companies with daily usage by at least one administrator

and other quite Piratey things like that.

Oh no! your standard product analytics in Google Analytics or Heap or whatever only show you what single persons did, not what groups of people did.

How do you proceed?

First, our setup

(so you know if this is relevant to you)

We use

  • an AWS RDS Postgres database to hold user data: who they are, which companies they’re part of, their roles in those companies, etc.
  • Segment.io to receive usage analytics (from front and backend code), who then format and send on the events to whichever cloud analytics we like that week (say, Google Analytics or keen.io or heapanalytics.com).

All pretty normal.

But what about the groups magic?

Somehow you need to link the “current state” info about people and groups from Postgres to the “actions” info from analytics.

If you’re using a similar setup to us, segment.io have two awesome things to help you out:

  1. Segment’s .group() call to link a user with an identified for the (or each) group/company that the user is associated with
  2. Segment’s just-launched Data Warehouses which let you get all your analytics into a nearly-no-management-needed Postgres or AWS Redshift data warehouse

All about that warehouse, no trouble

In fact for us, the new Data Warehouses have pretty much eliminated our need for using .group() at all, because we can just get our analytics data and our user data near enough to each other that single SQL queries can join between them.

One option is that you let Segment pipe your analytics into the same Postgres database as your user data (in its own table space hopefully, folks).

Another (and our chosen) option is to let Segment pump analytics into Redshift, and then to use an AWS Data Pipeline to also copy just the user data you need for analytics into the same Redshift database. This felt better to me, because you can (likely) avoid copying any tables to Redshift with personal information in. For example: if you only need to be able to measure “which companies had an administrator login every day last month”, you might only need to copy your companyperson relationship table, which might look like this:

<div class="code-embed"><div class="text-block-27">person_uuid  company_uuid  role_in_company<br>
------------------------------------------<br>
aa-11-aa     ff-55-ff      ADMINISTRATOR<br>
ab-12-df     fd-98-cv      MEMBER<br>
...          ...           ...</div></div>

This data is no more sensitive than what you’re already sending into your analytics, which should make you feel warm and fuzzy.

Setting it up

Setting up Redshift is fairly straight forward if you’re already using AWS, and getting Segment to talk to it is a doddle.

Getting data out of your RDS Postgres database and into Redshift was a bit trickier for me, because AWS Data Pipelines doesn’t have a readymade template for you. (You could probably just do some shortcut hack using pg_dump and pg_restore, but data pipelines are fun :| so let’s continue.)

You could start your data pipeline with this MySQL template pipeline and edit it — that’s what I went for. Getting that to work with Postgres instead of MySQL just requires adding an extra Data Node of JdbcDatabase type, and then having a second data node of SqlDataNode type, with the Jdbc type as its input. Like this:

<div class="code-embed"><div class="text-block-27">{<br>
 "objects": [<br>
     ...,<br>
     {<br>
       "schedule": {<br>
         "ref": "DefaultSchedule"<br>
       },<br>
       "database": {<br>
         "ref": "DatabaseId_poiuy"<br>
       },<br>
       "name": "SrcRDSPostgresTable",<br>
       "id": "DataNodeId_asdfg",<br>
       "type": "SqlDataNode",<br>
       "table": "#{myRDSTableName}",<br>
       "selectQuery": "SELECT * FROM #{myRDSTableName}"<br>
     },<br>
     {<br>
       "connectionString": "#{myRDSJdbcConnectStr}",<br>
       "databaseName": "#{myRDSDatabaseName}",<br>
       "*password": "#{*myRDSPassword}",<br>
       "name": "RDSPostgresStage",<br>
       "id": "DatabaseId_poiuy",<br>
       "type": "JdbcDatabase",<br>
       "jdbcDriverClass": "org.postgresql.Driver",<br>
       "username": "#{myRDSUsername}"<br>
     },<br>
     ...<br>
 ],<br>
 ...</div></div><br>

The Create Redshift Table activity in that template also assumes you’re using a MySQL database. I took the easy way out of that and just deleted the create table activity altogether, and added some simple

<div class="code-embed"><div class="text-block-27">CREATE TABLE IF NOT EXISTS whatever (...)</div></div>

sql to the CreateTableSQL parameter of my Redshift source, copied from the definitions in my Postgres database.

It took me about a day of experimenting to figure out AWS data pipelines and get one up and running how I needed it.

And then just like magic…

You can fire up your PostgreSQL app of choice (mine: https://www.jetbrains.com/dbe/) and run some lovely Joins across your data:

<div class="code-embed"><div class="text-block-27">SELECT company_uuid<br>
FROM userdata.companyperson AS userdata<br>
INNER JOIN analytics.pressed_big_button AS analytics<br>
ON userdata.person_uuid = analytics.user_id</div></div>

And there you have it: you’re just a few sql queries away from those account-level analytics you need for your SaaS metrics.

And then it’s probably best to leave the bad interpretation of your analytics to somebody else. (c.f. Dilbert)

Thanks

I guess I’m bad at drawing, so:

Person icons made by http://www.freepik.com, http://www.flaticon.com/authors/elegant-themes, http://www.flaticon.com/authors/icomoon from http://www.flaticon.comwww.flaticon.com is licensed by http://creativecommons.org/licenses/by/3.0/ CC BY 3.0

Company logos made by http://logotypemaker.com | licensed by http://creativecommons.org/licenses/by/3.0/ CC BY 3.0

Subscribe to our newsletter

Subscribe