Every company uses some kind of database to store its data, and in order to survive and thrive, it is crucial to use this data effectively in decision making. However, getting value from our data is a lot easier said than done. In this blog post, we'll discuss how Opendoor takes a modern approach to serving analytics, making sure that everyone on the team gets the data they need to make high-quality decisions.
One way to get insight from data is to hire an analyst who has the skills to query and analyze data and then have people ask the analyst to answer any business questions that they have. The analyst then tries to answer the question by issuing an appropriate SQL query.
As straightforward as this is, we think that answering questions through ad hoc requests is an analytics anti-pattern that causes all kinds of painful problems. Answering a business question using a one-off SQL query is very error-prone, and it is also an inefficient use of an analyst's time.
Another way to get insight from data is to use a spreadsheet. People often go with this approach when an analysis is just too complex to be performed with a single query, or when we want to perform multiple analyses in a single place. For example, you might want to compute the IRR for a time series of cash flows at several different levels of aggregation (by region, customer segment, product, etc.)
As common as it is, this spreadsheet approach to data analysis is another analytics anti-pattern that is rife with problems.
In this blog post, I'll try to illustrate some of the bad things that can happen with these standard techniques, and demonstrate how we've used modern tools and techniques at Opendoor to avoid them.
A simple request
Let's see how things can go wrong, even when trying to answer what looks to be a straightforward question about the business.
Suppose someone has asked us to find out the total volume of orders in cents that we have received from each state. Our company has SQL tables for customers, addresses, and orders as below:
customers +----+-------+--------------------+--------------+ | id | name | email | phone | +----+-------+--------------------+--------------+ | 1 | Alice | email@example.com | 555-123-4567 | | 2 | Bob | firstname.lastname@example.org | 555-987-6543 | +----+-------+--------------------+--------------+ addresses +----+-------------+----------------+-------+ | id | customer_id | street_address | state | +----+-------------+----------------+-------+ | 1 | 1 | 123 Foo St | CA | | 2 | 1 | 456 Bar St | AZ | | 3 | 2 | 21 Jump St | CA | +----+-------------+----------------+-------+ orders +----+-------------+--------------+ | id | customer_id | amount_cents | +----+-------------+--------------+ | 1 | 1 | 200 | | 2 | 2 | 400 | +----+-------------+--------------+
We can see from the tables that we have two customers. One of our customers, Alice, has two addresses, while our other customer, Bob, only has one.
We might try a query like
SELECT a.state , SUM(o.amount_cents) total_amount_cents FROM addresses a JOIN customers c ON a.customer_id = c.id JOIN orders o ON c.id = o.customer_id GROUP BY a.state
giving an output of
+-------+--------------------+ | state | total_amount_cents | +-------+--------------------+ | AZ | 200 | | CA | 600 | +-------+--------------------+
Thinking through the joins, we see that order id 1 has been counted twice and therefore our answer is wrong! In order to get the right answer, we need to know which address is associated with each order. Suppose we have another table as follows:
order_shipments +----+----------+------------+ | id | order_id | address_id | +----+----------+------------+ | 1 | 1 | 2 | | 1 | 2 | 3 | +----+----------+------------+
Then the query
SELECT a.state , SUM(o.amount_cents) total_amount_cents FROM addresses a JOIN order_shipments os ON a.id = os.address_id JOIN orders o ON os.order_id = o.id GROUP BY a.state
would give us the correct output:
+-------+--------------------+ | state | total_amount_cents | +-------+--------------------+ | AZ | 200 | | CA | 400 | +-------+--------------------+
It's great that we now have the correct answer, and when we come back to the requester with our results they are very happy. However, there is still plenty to be troubled about. For one, it wasn't all that obvious that the first answer was incorrect. The query executed successfully, and the output looked pretty reasonable. It was also partially correct, as it gave the correct output for Arizona. Even if we had spot checked that the answer for Arizona was correct, we still wouldn't have noticed the mistake.
What can we do about this? One solution would be to generate some fake data where we know (or think we know) what the correct answer is and then write a unit test check the answer against the output of the query. However we would have to calculate the answer for the fake data by hand, which is time consuming, error-prone, and susceptible to edge cases.
As it turns out, there's no simple answer to bugs like this. They are subtle and tricky to hunt down, and there's no avoiding them except by thorough and careful thought.
Mistakes like this can have very serious consequences as well. It's easy to imagine how you can get into trouble by falsely stating your sales volume. Analytics queries that are used as the source of truth for mission-critical metrics like profit and user growth should be treated with the utmost care. There are a lot of great engineering practices that help with this, such as version control and code review. With a good code review process in place, an observant teammate would have noticed the mistake with our initial query. Unfortunately, core analytics are not always handled with the care that they deserve.
There are other concerns too besides correctness. We've done such a good job at answering the requester's question, that we've got them hooked on data. They begin to come around our desk almost every day with one question or another that they want us to answer. It would be great if we could provide support to analytics users in a more sustainable way. Before we can figure out a solution to this issue, our requester comes back with an even more difficult question, one that will require analysis that goes beyond just a single SQL query.
It's often not possible to do all of the data analysis that you need to do within the SQL query, and you'll have to perform additional operations to the initial query output using another tool, like a spreadsheet.
As an example, suppose we are now asked to find the month-over-month growth of the total order volume (in dollars) for each month by state.
It's easy enough to write a SQL query that retrieves the monthly order volume. We should be careful to use the correct joins to make sure we get the right values. Here's the query in the PostreSQL dialect:
SELECT a.state , DATE_TRUNC('month', o.created_at) order_month , SUM(o.amount_cents) total_amount_cents FROM addresses a JOIN order_shipments os ON a.id = os.address_id JOIN orders o ON os.order_id = o.id GROUP BY a.state, DATE_TRUNC('month', o.created_at) ORDER BY state, order_month
This gives us the monthly order volume by state, but we still need to calculate month-over-month growth for each state. A common way to do this is to download the query output locally, load the data into a spreadsheet and perform the calculation there.
This isn't too difficult, but it still leaves something to be desired. For instance, from the point of view of someone who is using this spreadsheet to make decisions, how can they know for sure that the month-over-month growth calculation was done correctly? Sure, they can check the formula by clicking on the appropriate cell, but if the sheet gets any more complicated, with outputs of formulas becoming inputs to yet other formulas, the logic of the spreadsheet calculations quickly becomes a maze that is extremely difficult to follow. The logic of spreadsheets is buried beneath the surface and cumbersome to find, which makes it very difficult to see if the calculations were done correctly. Billion-dollar mistakes have been made as a result of using spreadsheets with faulty logic that went undetected.
Another problem is that there is no way to tell from the sheet where the data came from. It's not obvious that the first three columns were obtained as the output of a SQL query, let alone what the SQL query itself actually was. If we had used the wrong joins to get this data, there would be no way to know. It is possible to connect a spreadsheet to a SQL database, but even then, we still have the issue that all of the inputs to our calculation are hidden across many different locations.
As the company continues to do business, this spreadsheet will quickly go out of date. If our requester wants an updated report, they'll need to ask someone to repeat the process that they went through to generate the spreadsheet.
One alternative to the spreadsheet workflow that we like to use at Opendoor is to do the analysis in an IPython notebook.
There are several advantages to this approach. One is that the query, post-query calculations, and the final results are all visible in one place. There are other advantages too. The calculations are completely transparent and on display in the form of code. There's also a clear order to the calculations which gives the analysis a narrative flow that's easier to follow.
Doing data analysis in code also adds a lot of flexibility. You can perform manipulations on your data that would be difficult in a spreadsheet. It also opens up the window to lots of entirely new possibilities. You can use larger data sets in your analysis, and perform more sophisticated operations like construct and evaluate predictive machine learning models. Code is also much more amenable to version control than a spreadsheet sheet.
Sharing IPython notebooks has become a lot easier recently, now that Github renders IPython notebooks. Sharing our IPython Notebook analysis is as easy as copying and pasting the notebook code into a gist.
Still, the IPython notebook solution has defects. Even though the logic is transparent, you need to know Python to understand it, which may not be true for less technical users. As with the spreadsheet, the results of the analysis are not updated in real time, and they still require an analyst to manually update them.
Another issue is discovery. Unless someone is explicitly sent a link to the notebook, they will never be able to find it. Finally, modifying and experimenting with the notebook is a technical challenge, and certainly more difficult than playing around with a spreadsheet. Getting an IPython Notebook up and running, installing all of the libraries used in the code, and getting the right credentials to connect to the database is probably too much friction for non-technical users of your reports to bother with.
There are several analytics vendors that help create analytics and serve them to your analytics users in real time. Tableau is one of them. We are heavy Tableau users at Opendoor. We have found it to be a tool with great strengths but with many weaknesses as well. Tableau provides what is perhaps the most expressive interface out there for generating information-rich and attractive data visualizations and it does a great job of allowing us to leverage our visual faculties in our data explorations. In practice, this means we can test out our hypotheses and make a lot of different cuts of the data very quickly.
The main problem with Tableau is that all of the inputs that go into a visualization are scattered across many places. It's the same problem that we encountered with spreadsheets except much worse. There are a bewildering number of places where little bits of logic can be hidden in a Tableau dashboard. There's the data connection, which contains the initial SQL queries and join logic. There are Calculated Fields which contain formulas that define custom quantities as well as the Filters panel which determines which values get included in your visualization. Then, there are the myriad placements of the measures and dimensions as well as values selected from assorted drop down menus.
I've found it very difficult to collaborate with other team members on Tableau workbooks. Once I've created a Tableau notebook and enough time has passed that I've forgotten where all the inputs are hidden, it's hard for me to modify it without feeling like I'm about to chop my arm off by accident.
Tableau workbooks are XML files, which are difficult to work with directly. The XML format makes it hard to see what's going on in the workbook in text, and makes it difficult to manage Tableau workbooks using version control.
Looker is another analytics provider that facilitates the creation and distribution of business analytics.
Looker preserves the declarative style of SQL yet adds new analytics capabilities. It allows you to write code that is more modular than raw SQL, which makes it easier to understand and maintain your code. Looker also provides the ability to construct deep and richly-detailed analytics dashboards that are served in real time. All of the inputs that go into Looker's analytics are specified upfront in YAML-style code, from the initial SQL queries, to intermediate calculated quantities, to the filters and display elements that apply to the final results. In contrast with spreadsheets and Tableau, this means that all of the calculations are readily inspected and verified and that projects interact well with version control. It has many of the benefits of the IPython notebook approach, with the additional benefits of being approachable to less technical analysts. For these reasons, I like to use Looker for analytics that are mission critical and need to be reliably reported for the foreseeable future.
Looker has its quirks though. Their web interface isn't a thrill to work with. It's time consuming to navigate from place to place, which puts a significant hit on developer productivity. The interface provides a validation feature that performs some checking on your code, but it doesn't check the validity of your SQL queries, and a lot of the errors that could be detected during validation aren't reported until query time, which hampers productivity even further, and makes it all too easy to break production analytics.
There's no place like home
In order to have analytics that are truly real-time, what you ultimately need is a central location for your analytics users to come and discover reports that are of interest to them. A lot of analytics vendors like Looker and Tableau provide websites that ostensibly serve this purpose. However, forcing your analytics users to set up yet another account to log in to and learn an unfamiliar new web interface is a high price to pay. Often what ends up happening is that dashboards are posted to the vendor website but no one uses them because no one ever spends time navigating the new and unfamiliar website.
At Opendoor, we have an internal tool called the Admin Dashboard that everyone in the company is familiar with. We decided that the best way for people to discover analytics that they might be interested in is to have a page within the Admin Dashboard for serving analytics. This way, people are able to get access to real-time data in a context that they are already familiar without having to setup or log in to a new account.
To accomplish this, we use a CORS proxy to allow the client to request a list of all available dashboards from Looker, and we embed Looker dashboards into the Admin Dashboard using iframes. This way, an analyst can create a dashboard using Looker that is immediately discoverable and viewable within Opendoor's Admin Dashboard without needing to spend time with an engineer.
Bring the data to the team
We're in pretty good shape now. We have a central location for people on the team to discover analytics that are relevant to them without having to wade through an unfamiliar website. Furthermore, all of the data is updated in real time without requiring an analyst to do any repetitive manual work. However, there is still no guarantee that people will actually go look at the data. People get busy, and they can go weeks at a time without thinking about data, and that's just not good. Sometimes a better approach is to have the data come to the team, rather than require the team to seek out the data for themselves.
Since Opendoor primarily uses Slack to communicate, the solution that works best for us is to automatically and periodically send out reports to various Slack channels. We first tried to do this by connecting up Looker's email scheduling functionality to Slack's email integration, but it wasn't possible to get the reports to look the way we wanted them to. As a result, we decided to roll our own automatic reporter app, which has worked out well.
To do this, we built a simple Python app that uses the APScheduler library to periodically broadcast reports to Slack on a customizable schedule. This pattern works well for email reports too, not just Slack. Even if you don't have a ton of technical skill, this is not as difficult to get up and running as you might think.
Serving up quality, impactful analytics comes down to finding the right tradeoffs between rigor, productivity, presentability, performance, usability, and discoverability. There are now a wide variety of powerful tools out there that allow us to find good solutions more quickly and with less effort than ever.