Update on predicting Boston health code violations

New Approaches

As you might suspect from the title, this is a follow-up to a project I described previously. I have reworked my previous approach and am now using the City of Boston’s complete and detailed records of health inspections, available at data.cityofboston.gov. As before, I store a copy of the data locally. However, the dataset is continuously updated, and since it is accessible via the Socrata Open Data API (SODA), which supports an SQL-like query language, my redesigned code can pull only the new inspection records each time it is run. (The same goes for the dataset of 311 service requests, available at this link.)
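To make the incremental pull concrete, here is a minimal sketch of how such a query could be built. It is illustrative rather than the project's actual code: the dataset ID in `BASE_URL` and the `inspection_date` column name are hypothetical placeholders, since the real ones are not given here. The `$where`, `$order`, `$limit` and `$offset` parameters are standard SODA query parameters.

```python
from datetime import datetime
from urllib.parse import urlencode

# Hypothetical placeholders: substitute the real dataset ID and date column.
BASE_URL = "https://data.cityofboston.gov/resource/DATASET_ID.json"
DATE_FIELD = "inspection_date"


def build_incremental_query(last_seen: datetime, limit: int = 1000, offset: int = 0) -> str:
    """Build a SODA URL that requests only records newer than `last_seen`."""
    timestamp = last_seen.isoformat(timespec="seconds")
    params = {
        "$where": f"{DATE_FIELD} > '{timestamp}'",  # only rows after the newest local record
        "$order": f"{DATE_FIELD} ASC",              # stable ordering so paging is consistent
        "$limit": limit,
        "$offset": offset,
    }
    # Keep "$" unescaped for readability; everything else is percent-encoded.
    return f"{BASE_URL}?{urlencode(params, safe='$')}"


if __name__ == "__main__":
    print(build_incremental_query(datetime(2017, 1, 26)))
```

The newest timestamp in the local copy would be passed as `last_seen`, and `offset` would be advanced by `limit` until a request returns no rows.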

As noted before, one issue with the City of Boston’s inspection records is that many inspections lack latitude and longitude information. I have overcome this by using the Google Maps Geocoding API to convert street addresses into latitude/longitude coordinates.
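A minimal sketch of that address lookup, using only the standard library, might look like the following. The function names are my own for illustration, and a valid API key is assumed. The response layout (`results[0].geometry.location`) follows the Geocoding API's JSON format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the request URL for a single street address."""
    return f"{GEOCODE_URL}?{urlencode({'address': address, 'key': api_key})}"


def extract_lat_lng(payload: dict):
    """Return (lat, lng) from a Geocoding API response, or None if there is no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address: str, api_key: str):
    """Look up one address over the network (requires a valid key)."""
    with urlopen(build_geocode_url(address, api_key), timeout=10) as response:
        return extract_lat_lng(json.load(response))
```

Since many inspections share an address, caching results locally by address string would keep the number of API calls down.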

The model features I am now using are largely the same as before (i.e., urban conditions near each restaurant in the weeks leading up to the inspection), but there are some new additions. These new features are drawn from the specifics of past inspections, such as the violation codes given for past failures. I also use the raw text comments written during the previous inspection, processed with TF-IDF vectorization followed by dimensionality reduction. Finally, a feature selection step is applied to the whole feature set, reducing it by about half.
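For readers unfamiliar with TF-IDF, here is a small standard-library sketch of the idea, using the plain textbook weighting tf × log(N / df). It is only an illustration: in practice a library vectorizer would normally do this (typically with smoothing and normalization), the dimensionality reduction step is not shown, and the sample comments are invented.

```python
import math
import re
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, weighting by tf * log(N / df)."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    # Invented example comments, not taken from the inspection data.
    comments = [
        "food stored at improper temperature",
        "evidence of rodents near food storage",
        "hand sink blocked",
    ]
    for vector in tfidf(comments):
        print(vector)
```

Terms that appear in every comment get a weight of zero, while terms specific to one comment are weighted most heavily, which is what makes the representation useful for distinguishing inspections.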

Lastly, I’ve changed my methodology for splitting data into training and test sets, to better reflect the real-life usage of a predictive model in a context like this one. I now assign all data before a certain date to the training set, and the remainder (that is, the most recent data) to the test set, not to be used in training the model. This parallels the situation in which past information must be used to predict a truly unknown future. Note that under this new arrangement, it should not be surprising to find that the model’s performance has decreased somewhat compared to previous results, since its task is now more akin to extrapolation than interpolation.
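The date-based split described above can be sketched in a few lines. The record layout and the `inspection_date` key are assumptions made for illustration.

```python
from datetime import date


def split_by_date(records, cutoff, date_key="inspection_date"):
    """Put records dated before `cutoff` in the training set and the rest in the test set."""
    train = [r for r in records if r[date_key] < cutoff]
    test = [r for r in records if r[date_key] >= cutoff]
    return train, test


if __name__ == "__main__":
    # Toy records with a hypothetical layout.
    records = [
        {"inspection_date": date(2016, 11, 1), "failed": 1},
        {"inspection_date": date(2016, 11, 25), "failed": 0},
        {"inspection_date": date(2016, 12, 5), "failed": 1},
    ]
    train, test = split_by_date(records, cutoff=date(2016, 11, 26))
    print(len(train), len(test))
```

Unlike a random split, every test record here is strictly later than every training record, so no information from the test period can leak into training.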


In the most recent run of my redesigned code, a gradient boosted trees model was trained with data from January 3, 2012 to November 25, 2016. The model was then tested on 60 days of subsequent held-out data, which at the time was available up to January 26, 2017. The accuracy on the test data was 0.804.

To get a more human-comprehensible sense of the model’s performance, let’s take a focused look at only the most recent 7 days of data. From January 20 through 26, there were a total of 93 inspections. The model’s accuracy on those outcomes is 0.763. More specifically, among inspections that actually failed, the model assigns a probability of failure of about 0.74 on average. Among inspections that actually passed, the model assigns a failure probability of about 0.34 on average. The ROC curve, based on the most recent 7 days of data, is shown here:


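The summary figures quoted above (accuracy, plus the average predicted failure probability among actual failures and actual passes) can be computed as in the sketch below. The 0.5 decision threshold is an assumption for illustration, and the sample values are made up.

```python
from statistics import mean


def summarize(actual_failed, p_fail, threshold=0.5):
    """Return accuracy and the mean predicted failure probability per actual outcome.

    actual_failed: list of 0/1 outcomes (1 = inspection failed)
    p_fail: list of predicted failure probabilities, in the same order
    """
    predicted = [int(p >= threshold) for p in p_fail]
    accuracy = mean(int(a == b) for a, b in zip(actual_failed, predicted))
    mean_p_given_failed = mean(p for a, p in zip(actual_failed, p_fail) if a == 1)
    mean_p_given_passed = mean(p for a, p in zip(actual_failed, p_fail) if a == 0)
    return accuracy, mean_p_given_failed, mean_p_given_passed


if __name__ == "__main__":
    # Made-up outcomes and probabilities, not the real inspection results.
    print(summarize([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.6]))
```

As a sanity check on the figures above, an accuracy of 0.763 over 93 inspections corresponds to 71 correct predictions (71 / 93 ≈ 0.763).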
The geographic distribution of these 7 days of results can be seen in the map below. Click for an interactive version. Inspections are color-coded based on the model’s performance: passed inspections that were correctly predicted to pass are marked green, failed inspections that were correctly predicted to fail are marked red, inspections that failed but were predicted to pass are marked yellow, and inspections that passed but were predicted to fail are marked blue.

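The color scheme used on the map amounts to a lookup over the four combinations of actual and predicted outcome. A small sketch, with a function name of my own choosing:

```python
def marker_color(actually_failed: bool, predicted_fail: bool) -> str:
    """Map an inspection's actual and predicted outcome to its marker color."""
    if actually_failed:
        # Correctly predicted failure vs. a failure the model missed.
        return "red" if predicted_fail else "yellow"
    # False alarm vs. correctly predicted pass.
    return "blue" if predicted_fail else "green"
```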

I will post my updated code for this project on GitHub at some point. If you’d like to see this expedited, send me a note and I’ll get it done much more quickly.


October 2017 Addendum

Not long after writing this post, I started a new position in corporate data science. This has been a wonderful opportunity to tackle many problems both interesting and valuable, a few of which have borne some resemblance to the problem described here. The downside is that I currently have much less time for recreational or amateur data science.