Let’s say you want to know about properties in Phoenix. The first thing you would do is look around for some data about these properties, and you would indeed find a lot: realtor listing databases, tax assessor records, recorder office data, and data from an endless array of third-party providers. Pretty quickly, you would find yourself inundated with heaps and heaps of data.
Next, you would try and do something useful with the data, which is where things get hard.
At times there will be a certain house that you are particularly interested in, and so you would very naturally want to be able to issue a query to your data hoard to the effect of “give me all the data we have on house such-and-such”. This is the thing that we are going to talk about.
One issue that we need to work out is how to identify a house. It turns out that there are several systems for identifying properties. The system that everyone knows and uses is the mailing address (e.g. 123 Main Street, Phoenix AZ, 85037), but there are others. For example, another identity system, used by tax assessors, is the assessor’s parcel number (APN). There are other identifier systems too, such as address id numbers produced internally by third-party data sources.
Let’s call any systematic way of assigning a unique string or number to each home an identity system or just system for short.
Now we can express our goal more concretely: Assume that all of our data is stored in several different Postgres tables. Each row of data in each of our source data tables has house ids in one or more identity systems. Given a house id in one identity system we want to find all rows referring to this house across all tables and all identity systems.
Our objective is a join table with the following schema:
table_name  # name of source data table
table_id    # id of row in source data table
sys_id_1    # id of home under each identity system, possibly NULL
sys_id_2
...
sys_id_n
group_id    # index id common across all records that correspond to the same address
At this point we were really excited because we had this very concrete problem and there was this sense that we could take some of the algorithms we had learned in school and finally put them to use. One algorithm, the union-find algorithm, seemed to fit our situation particularly well.
To give some sense of what this approach entails, let’s go into a bit more detail.
In our case, the union-find algorithm operates on a bipartite graph with one set of nodes corresponding to the rows of the source data tables (blue circles), and the other set corresponding to the house ids of those rows under the various identity systems (orange circles). The underlying disjoint-set data structure supports two efficient operations: union, which links two nodes together, and find, which maps any node to a representative element of its connected component.
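As a minimal sketch of such a structure (our own illustration for this post — `DisjointSet` here is not a particular library), union by size with path compression might look like:

```python
class DisjointSet:
    """Minimal union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}  # node -> parent node
        self.size = {}    # root node -> size of its component

    def _find(self, node):
        # Walk up to the root, then compress the path behind us.
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def add(self, row_node, id_node):
        # Ensure both nodes exist, then union their components.
        for n in (row_node, id_node):
            if n not in self.parent:
                self.parent[n] = n
                self.size[n] = 1
        a, b = self._find(row_node), self._find(id_node)
        if a != b:
            if self.size[a] < self.size[b]:
                a, b = b, a
            self.parent[b] = a
            self.size[a] += self.size[b]

    def find_group_index(self, node):
        # The root serves as the group id for the whole component.
        return self._find(node)


# Two rows sharing a mailing-address node land in the same component.
ds = DisjointSet()
ds.add(("foo", 1), ("mail_addr", "123 Main St"))
ds.add(("cor", 2), ("mail_addr", "123 Main St"))
ds.add(("bar", 1), ("apn", 41))
```

Here a row node is a `(table, row_id)` tuple and an id node is a `(system, id)` tuple, so two rows become linked exactly when they share a house id in some identity system.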
Python code for running this process might look like this:
disjoint_set = DisjointSet()
for table in source_tables:
    data = download_data_from_db(table)
    for row in data:
        for (sys_name, sys_id) in row.sys_ids:
            disjoint_set.add(row, (sys_name, sys_id))

group_ids = [disjoint_set.find_group_index(elem) for elem in disjoint_set.elems]
write_group_ids_to_db(disjoint_set.elems, group_ids)
The disjoint-set data structure is efficient at linking nodes and identifying connected components: it takes O(n log n) time to insert n nodes and to determine a group id for each element’s connected component. However, there are two major issues that ultimately convinced us to develop a different approach, both of which can be seen in the code above. First, there was a large impedance mismatch between how our data was represented in the database and the disjoint-set data structure we wanted to use. This mismatch translated into a huge I/O bottleneck: millions of rows of data are downloaded from the database to construct the disjoint-set data structure, and then millions of results of the final calculation are inserted back into the database. Second, constructing the disjoint-set data structure itself required massive amounts of memory and would not readily scale as Opendoor expands to more real estate markets.
We realized that both of these problems would go away if we could express all of our data transformations in terms of SQL queries on the Postgres database. This is possible due to an incredibly powerful feature of Postgres known as window functions. A window function performs a calculation across a set of rows which are in some sense related to the current row.
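For instance — a toy illustration, not our schema, and using SQLite (which also supports window functions) in place of Postgres so it runs anywhere — `MIN(group_id) OVER (PARTITION BY mail_addr)` gives every row the smallest group id among rows sharing its mailing address, without collapsing the rows the way `GROUP BY` would:

```python
import sqlite3

# A tiny table: two rows share "123 Main St" but have different group_ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (mail_addr TEXT, group_id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("123 Main St", 1), ("456 Pear St", 2), ("123 Main St", 6)])

# Each row keeps its own columns, plus the per-partition minimum.
rows = conn.execute("""
    SELECT mail_addr, group_id,
           MIN(group_id) OVER (PARTITION BY mail_addr) AS min_group_id
    FROM t
    ORDER BY group_id
""").fetchall()
print(rows)
# → [('123 Main St', 1, 1), ('456 Pear St', 2, 2), ('123 Main St', 6, 1)]
```

Both "123 Main St" rows see `min_group_id = 1`, which is exactly the update rule our grouping passes apply.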
To demonstrate this approach, suppose we are trying to link the following data: three source tables, each with two rows of data, and two identity systems (mailing address and APN). We start off by creating a table called address_groups with one row for each row of each source table, containing the house id under each identity system (here, mailing address and APN) as well as a group id. The group id is initially set to the row id of the table.
id  table_name  table_id  mail_addr    apn   group_id
1   foo         1         123 Main St  NULL  1
2   foo         2         456 Pear St  NULL  2
3   bar         1         NULL         41    3
4   bar         2         NULL         97    4
5   cor         1         456 Pear St  41    5
6   cor         2         123 Main St  97    6
The group id will converge to an index for the record groups over the course of several transformations. We start with the mailing address. Using a window function, we update each group id to the minimum group id among rows sharing that mailing address, leaving rows with a NULL mailing address alone.
The query to do this is:
CREATE TABLE grouped_by_mail_addr AS (
  SELECT
    id,
    table_name,
    table_id,
    mail_addr,
    apn,
    CASE WHEN mail_addr IS NOT NULL
         THEN MIN(group_id) OVER (PARTITION BY mail_addr)
         ELSE group_id
    END AS group_id
  FROM address_groups
)
This results in the following intermediate dataset:
id  table_name  table_id  mail_addr    apn   group_id
1   foo         1         123 Main St  NULL  1
2   foo         2         456 Pear St  NULL  2
3   bar         1         NULL         41    3
4   bar         2         NULL         97    4
5   cor         1         456 Pear St  41    2
6   cor         2         123 Main St  97    1
Next we do the same thing, but with the APNs, resulting in:
id  table_name  table_id  mail_addr    apn   group_id
1   foo         1         123 Main St  NULL  1
2   foo         2         456 Pear St  NULL  2
3   bar         1         NULL         41    2
4   bar         2         NULL         97    1
5   cor         1         456 Pear St  41    2
6   cor         2         123 Main St  97    1
If we have error-free data (lol) then with some thought, you can convince yourself that the data will be correctly linked and indexed after one grouping pass on each of the system id columns. In the case above we do have perfect data and we have correctly identified rows 1, 4, 6 and rows 2, 3, 5 as corresponding to the same house.
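Putting the two passes together, here is a self-contained sketch of the worked example above — SQLite stands in for Postgres purely for illustration, and the `grouping_pass` helper is our own wrapper, but the SQL has the same shape:

```python
import sqlite3

# Recreate the six-row address_groups table from the example.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE address_groups (
    id INTEGER, table_name TEXT, table_id INTEGER,
    mail_addr TEXT, apn INTEGER, group_id INTEGER)""")
conn.executemany(
    "INSERT INTO address_groups VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "foo", 1, "123 Main St", None, 1),
     (2, "foo", 2, "456 Pear St", None, 2),
     (3, "bar", 1, None, 41, 3),
     (4, "bar", 2, None, 97, 4),
     (5, "cor", 1, "456 Pear St", 41, 5),
     (6, "cor", 2, "123 Main St", 97, 6)])

def grouping_pass(src, dst, col):
    # One pass: set group_id to the minimum group_id within each
    # partition of `col`, leaving NULLs untouched. (Interpolating
    # table/column names like this is fine only for a sketch.)
    conn.execute(f"""CREATE TABLE {dst} AS
        SELECT id, table_name, table_id, mail_addr, apn,
               CASE WHEN {col} IS NOT NULL
                    THEN MIN(group_id) OVER (PARTITION BY {col})
                    ELSE group_id END AS group_id
        FROM {src}""")

# One pass per identity system.
grouping_pass("address_groups", "grouped_by_mail_addr", "mail_addr")
grouping_pass("grouped_by_mail_addr", "grouped_by_apn", "apn")

final = dict(conn.execute("SELECT id, group_id FROM grouped_by_apn"))
print(final)
# → {1: 1, 2: 2, 3: 2, 4: 1, 5: 2, 6: 1}
```

The final group ids reproduce the table above: rows 1, 4, 6 land in group 1 and rows 2, 3, 5 land in group 2.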
Now, to find all of the data corresponding to 123 Main St, we simply look up its group id (1 in this case), query for all rows in this table with that group id (rows 1, 4 and 6), and then look up the data in each of the source data tables (ids 1, 2 and 2 of foo, bar and cor respectively).
Of course data is never perfect, and so there will often be situations where some of the system ids are incorrect. In this case it will take multiple grouping iterations of each of the system ids for the group ids to converge to the correct values. In the final result, one ends up with some bad groups that have distinct system ids within the same identity system, which amounts to conflating several distinct homes into one. Luckily, this happens in a minority of cases which we monitor daily with diagnostic queries on the grouping table.
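A sketch of that fixed-point idea in plain Python (hypothetical helper names; in production this runs as SQL passes, not in application memory): repeat a min-propagation pass over every identity-system column until no group id changes.

```python
def grouping_pass(rows, col):
    """One min-propagation pass over one identity-system column.
    `rows` is a list of dicts with `col` and 'group_id' keys."""
    mins = {}
    for r in rows:
        if r[col] is not None:
            mins[r[col]] = min(mins.get(r[col], r["group_id"]), r["group_id"])
    for r in rows:
        if r[col] is not None:
            r["group_id"] = mins[r[col]]

def group_until_stable(rows, cols):
    # Repeat passes over every identity system until convergence.
    while True:
        before = [r["group_id"] for r in rows]
        for col in cols:
            grouping_pass(rows, col)
        if [r["group_id"] for r in rows] == before:
            return rows

# A chain M1 - A1 - M2 needs more than one full iteration: the last
# row is only reachable through two hops of shared ids.
rows = [
    {"mail_addr": "M1", "apn": None, "group_id": 1},
    {"mail_addr": "M1", "apn": "A1", "group_id": 2},
    {"mail_addr": "M2", "apn": "A1", "group_id": 3},
    {"mail_addr": "M2", "apn": None, "group_id": 4},
]
group_until_stable(rows, ["mail_addr", "apn"])
```

After convergence all four rows share group id 1, even though no single pass could have linked them all.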
We’re constantly improving our indexing algorithm over time by tracking the bad groups, identifying the bad system ids that are incorrectly linking homes together, and compensating for them in subsequent runs of the grouping algorithm.