In a previous post, we discussed the need for understanding the confidence of our housing price predictions. One strategy for doing this is building a confidence model. In this post, we will describe one possible solution and the way we use it at Opendoor. Specifically, we will describe an error model $$\hat{g}$$ that estimates the prediction errors from a housing valuation model $$\hat{f}$$ that predicts housing prices.

Overall error vs. per-unit error

Most machine learning practitioners have a good understanding of the average error of their predictions. But these are global numbers, and subsets of the data may have much higher or lower error rates. We could "slice" the data in various ways and calculate per-slice error metrics (e.g., per zip code), but what is the right level of slicing?

Since we want fine-grained error estimates on each unit, we could condition on the unit (in our case, a house with characteristics $$x_i$$) and estimate error. This sounds reasonable in theory but has some practical challenges.

Expected prediction error as a measure of per-unit confidence

Let's take a look at expected prediction error (EPE), since it will help us develop some intuition. ESL (page 223) defines the expected prediction error of a house $$x_0$$ as $$\text{Err}(x_0) = \mathbb{E}[(Y - \hat{f}(x_0)) ^ 2 | X = x_0].$$

$$X \in \mathbb{R}^p$$ is the random variable encapsulating our housing features (square footage, location, number of stories, $$\ldots$$), and $$Y \in \mathbb{R}$$ is the random variable modeling the home's value. The two are related via the joint distribution $$\text{Pr}(X, Y)$$. This means that a house with given features can sell for a range of prices, all with varying degrees of likelihood. We can write $$Y = f(X) + \epsilon$$ , where $$f$$ is the true relationship between housing features and the price, $$\mathbb{E}[\epsilon] = 0$$, and $$\text{Var}(\epsilon) = \sigma^2$$. We draw an i.i.d. sample of size $$n$$, from which we produce an estimate of the relationship between the variables, $$\hat{f}(x)$$, which is our fitted valuation model.

For houses with features $$x_0$$, this represents the average squared amount by which our estimates vary from the true values of the houses. We can break down the expected prediction error into three components in a bias-variance decomposition: $$\text{Err}(x_0) = \underbrace{\sigma^2}_{\text{irreducible error}} + \underbrace{\text{Bias} ^ 2(\hat{f}(x_0)) + \text{Var}(\hat{f}(x_0))}_{\text{model + fit contribution}}.$$ Even if we know the true relationship $$f$$, then for a fixed feature space $$\mathcal{X}$$ we can't do any better than the irreducible error due to inherent stochasticity in the relationship between $$X$$ and $$Y$$. The other two terms are properties of our model and the fitting procedure. Given enough data and a sufficiently flexible model, we should be able to make them smaller. The bias measures whether our prediction is centered around the truth, and the variance measures how much the prediction changes around its average over the different possible training datasets.

Expected prediction error in linear regression

To develop intuition, let's assume that the true relationship between $$X$$ and $$Y$$ is linear and that we estimate $$f(X)$$ with linear regression: $$\hat{f}(X) = X ^ T \hat{\beta}$$. The bias term in the expected prediction error disappears, and we can write the average prediction error over the sample set as $$\frac{1}{n} \sum_{i = 1}^{n} \text{Err}(x_i) = \underbrace{\sigma ^ 2}_{\text{overall irreducible error}} + \underbrace{\frac{p}{n} \sigma ^ 2}_{\text{average variance of predictions}}.$$

The variance increases in the number of parameters we have to estimate and decreases as a function of the sample size. Above, we have assumed that $$\sigma^2$$ is constant, that our setting is homoscedastic. But we can easily treat the heteroscedastic case as well.

As another simplification, let's assume that we have a single categorical feature indicating to which group an observation belongs. Group $$i$$ has irreducible error $$\sigma_i^2$$ and $$n_i$$ observations. Now the expected prediction error is $$\text{Err}(x_i) = \underbrace{\sigma_i ^ 2}_{\text{irreducible error for house with features $$x_i$$}} + \underbrace{\frac{\sigma_i ^ 2}{n_i}}_{\text{variance of prediction for house with features $$x_i$$}}.$$ This is a nice, clean separation between two different concepts, the irreducible error for a house and the variance of our prediction.

The irreducible error is mostly driven by the features that are available to us. If you believe in a deterministic universe, then if you have total information (including what's going on in the mind of a prospective buyer), you can predict with certainty what a house will close for (0 irreducible error). The sample size doesn't play a role here, even if the house in question is one-of-a-kind.

By contrast, the variance depends on the modeling technique, learning parameters, sample size, and the irreducible error. Here, the fact that the house is one-of-a-kind will certainly contribute to the variance.

We can imagine four hypothetical housing archetypes that are easy/hard to value for different reasons:

• The McMansion: a common luxury house that could have high variation in interior upgrades and amenities
• The Custom Celebrity House: a rare luxury house that is notoriously hard to value because it is unique
• The Model Match: a house with many nearby neighbors that share the exact same layout and that turnover quickly
• The Small Infill Development: a house with a few neighbors with similar layouts nearby, but in an area where the number of neighbors is constrained by the small amount of available land
Archetype $$\sigma_i$$ $$n_i$$ Expected Prediction Error
McMansion high high high
Custom Celebrity high low very high
Model Match low high low
Infill low low medium

While the setting above is admittedly simple, the basic intuition is largely generalizable. In practice, one will have to address bias, mixtures of common and sparse features, missing data, and all the complexities of real data. In our simulation below, we will generate data with low/high irreducible error and small/large sample sizes. We will demonstrate how to prospectively identify subsets of data with low expected prediction error. In addition, we will show how to improve these estimates by paying attention to variance.

Estimating the expected prediction error

The EPE seems like a pretty reasonable measure of confidence. It is low when we have high confidence and vice versa. It is also defined for each house $$x_i$$, so we can make different decisions for each house. However, it is defined in terms of an expectation with respect to the joint probability distribution $$\text{Pr}(X, Y)$$, which we don't know. This is a common problem in statistics: many elegant theoretical constructs can be hard to estimate in practice. We could try to calculate the average for each unique $$x_i$$ from houses sharing the same attributes. However, there might only be one house for each $$x_i$$. In fact, there's certainly only one detached house at a given latitude/longitude. Fortunately, we can apply our standard modeling techniques to this problem and pool information over neighboring datapoints.

Modeling squared residuals estimates expected prediction error

In our original prediction problem, we wanted to model the conditional expectation $$\mathbb{E}[Y | X = x_0]$$. We did this by fitting a regression model to the observed pairs of data $$(x_i, y_i)$$. Now, we'd like to model $$g = \mathbb{E}[(Y - \hat{f}(x_0)) ^ 2 | X = x_0]$$. We can do this by fitting a regression to the pairs $$(x_i, (y_i - \hat{f}(x_i))^2)$$. Put another way, we can estimate the expected prediction error by modeling the squared residuals. Our fitted error model $$\hat{g}$$ would produce a confidence score $$\hat{g}(x_i)$$ that represents the expected squared residual for our housing price estimate $$\hat{f}(x_i)$$.

A visual walkthrough

We will now share an example by simulating prediction errors to demonstrate how they can be predicted. To start, let’s visualize the expected prediction errors (which we won’t know in practice) for the four different settings described above. This illustrates what we know from the bias-variance decomposition: small sample size and high irreducible error result in higher expected prediction error. Our objective is to come up with a criterion to identify houses belonging to the lowest EPE group.

The figure below shows the expected prediction errors as vertical bars. Each pane represents one pair of irreducible error level and sample size corresponding to our house archetypes.

In practice we will estimate EPE from sample data. These estimates of the expected prediction error will have some variance, so we add the distributions of these estimates to the plot in the figure below. Of course, in practice we don’t know these distributions.

When we estimate the expected prediction error for a set of real houses we are effectively sampling from these distributions. For example, imagine that we observe 10 sets of McMansions over 10 zip codes, and we want to estimate the expected prediction error in each. We can illustrate this process by drawing 10 samples from the McMansion distribution. The draw represents our estimated EPE for a cohort of houses in that category. In the figure below, we illustrate a representative sample of 10 draws from the distribution in each panel.

At Opendoor we use our predictions to buy and sell houses. We’d like to take some action on houses where our expected prediction error is low. A reasonable, but arbitrary, definition of "low-error houses” could be cohorts that have estimated EPE below 0.7.

Unsurprisingly, all 10 houses in the low SD / high N group meet this criterion. However, many cohorts from the other groups are also appear to have low error by the definition. This is distressing, since in practice we could overestimate the confidence we should have in our predictions. In particular, we are especially concerned about the high SD / low N group, which contributed 4 houses with estimated EPE below 0.7. Tightening our criterion won’t help much here because there are still several points in the high SD / low N group with low estimated EPE. This is fundamentally a problem of variance.

Each of the estimated EPEs is based on sample data, so we know that there is some associated uncertainty based on sample sizes. To account for this we can add confidence intervals to our point estimates. In order to avoid overplotting, we show the confidence intervals at different levels of the y-axis, but we connect them to the original points. As we would hope to see, the confidence intervals are especially large in the small N settings. We can then replace the previous definition of low-error with a more conservative one: having an upper confidence bound below the threshold of 0.7. Applying this new rule to our example, we see that we have rejected all the high-variance estimates, and 8 / 9 of our proposals are now in the lowest-EPE group. Our lesson here is that we need to be mindful of the variance of our estimates if we want to minimize the false positive rate.

Bootstrapping to account for the error model variance

In the preceding example we have two models: a housing price prediction model $$\hat{f}$$, and a confidence model $$\hat{g}$$ that estimates the expected prediction error of $$\hat{f}$$. But what about the error from the predictions of $$\hat{g}$$? We would like to account for this second source of error when assessing our predictions.

Our next step is to create an error model that has some notion about the variance inherent in its own predictions. One simple way to do this is to use the bootstrap. We can generate $$B$$ resampled versions of the world and see how the error model's predictions vary. In particular, if we wish to be conservative, we could take some upper confidence bound of the bootstrapped error estimates (e.g. the maximum, some high quantile, or the mean plus a standard deviation). We don't just want the best estimate of the error -- we want to know how high it conceivably could have been under different realizations of the world.

Simulation setup

Our earlier plots showed a cartoon world meant to illustrate the concepts, and we didn't fit any models. Now, we will generate more simulated data and fit the models discussed above.

Here are what the data-generating parameters look like:

$$n_i$$ $$\sigma_i$$ Replicate
10 0.05 1
40 0.05 1
10 0.1 1
40 0.1 1
10 0.05 2
40 0.1 40000

And this is how we intend to make predictions and update our models:

Model subscripts indicate that they have trained on that number of observations. We are training the models in an online fashion, because that is a close match to the reality of housing sales, which are observed sequentially over time. Model superscripts indicate the bootstrap world to which they belong.

We're actually cheating a bit here, because we are not bootstrap resampling at the top level of the process. Specifically, the bootstrapped error models see the same predictions and hence same residuals in the different bootstrap worlds: the resampling treats the housing price predictions as fixed. In reality, as the housing prediction model trains on different realizations of the world, it would make different predictions. So we are actually underestimating the variability of the process, but the code is simpler because we can make easier use of the online bootstrapping capabilities of Vowpal Wabbit.

Results

Here are the abridged results, where an outlier is defined to be a prediction with absolute error greater than .15:

Overall error metrics vs. error model best decile and conservative error model best decile
Description NumInBucket AvgAbsoluteError %Outliers
Overall 4e+06 0.074 11.56
Error Model Best Decile 4e+05 0.055 4.453
Conservative Error Model Best Decile 4e+05 0.051 2.582
Overall error metrics vs. error model best decile and conservative error model best decile (cont.)
Description %HighN_LowSD %HighN_HighSD %LowN_LowSD %LowN_HighSD
Overall 40 40 10 10
Error Model Best Decile 64.83 10.58 18.17 6.421
Conservative Error Model Best Decile 80.19 6.354 11.25 2.207

Let's imagine that we have trained these two error models, and new houses are streaming in, waiting for us to predict their value. Without an error model, we make on average an absolute error of .074, and for 11.6% of houses we are off by more than .15. With the basic error model, we can pick out the best decile (10% of houses with lowest expected error). For these houses, we are off by .055 on average, and 4.5% of these houses are outliers. With the conservative error model based on bootstrapping, our average error is .051, with only 2.6% outliers.

Moreover, the conservative error model identifies the highest percentage of houses (80%) in the group that should have the lowest observed error (both high sample size and low irreducible error). And the percentage in the highest-actual-error bucket is down to 2.2% from 6.4%.

Other approaches

We have presented one approach to estimating confidence, based on residual modeling and bootstrapping. But there are many other possible approaches. For instance, we could imagine fitting two quantile regression models, e.g. a 25%ile prediction and a 75%ile prediction, and looking at the difference between the two. This interquartile range is related to the variance $$\sigma^2$$. In fact, it is approximately $$1.35 \sigma$$ for the Normal distribution. These L-estimators are often known for being robust but have low efficiency.

Another approach is taken by the ad click prediction team at Google. They primarily care about the computational efficiency of the confidence estimates. Their online learning system maintains counts of the number of times it has seen each feature in order to implement per-feature learning rates. They use these counts to come up with a confidence score. In fact, they show that this score is an upper bound for the amount by which their log-odds prediction for an observation changes by learning on that observation, which is a measure of the stability of the estimate.

ML on ML: What could go wrong?

Some of the methodology advocated in this post should have set off some alarm bells. As mentioned previously, we bootstrap resampled "at the wrong level." But this was done as an approximation because it involved less code and still served to illustrate the main points.

Another possible red flag is that we are training a machine learning model on the output of another machine learning model. Granted, in our online framework, we took special care to not "leak information from the future": don't update any models for an observation at time $$t$$ until after all predictions have been made. However, this can still be a bad idea for reasons of machine learning technical debt.

Specifically, we have an unstable data dependency. The error model takes as input predictions from the valuation model, which can qualitatively change in behavior over time. This can be very bad when the valuation model changes without considering downstream ramifications (i.e. on the error model). We also have a correction cascade, where we have created a system dependency on the original valuation model. This can result in a deadlock, where the coupled machine learning systems are in a poor local optimum: it can be difficult to improve both systems simultaneously.

We run all potential changes through our model experimentation framework and generate numerous backtested metrics on both the valuation model and the error model. We also haven't experienced a deadlock between these two models because any improvements in the valuation model tends to drive improvements in the error model. For instance, if overall error decreases, then the error of our top-performing decile tends to decrease as well. And features that perform well in the valuation model also typically benefit the error model.

We are always looking for great new people to join our highly talented team. This post was a collaboration among members of the Data Science team at Opendoor. We'd also like to acknowledge Brad Klingenberg, Leo Pekelis, and JD Ross for reading drafts of this post.

The simulation code is located here.