The city of Charlotte publishes a dataset with details on traffic stops conducted by police officers. As of this writing, the available data spans only January to October 2016; nevertheless, it includes over 67,000 traffic stops. For each stop, the provided details include the month of the stop, the reason for the stop, the races and genders of the officer and the driver, the officer’s number of years of service, the driver’s age, the police division, and the outcome of the stop (a warning, a citation, an arrest, etc.).
As someone with rather diverse family ties in Charlotte, my discovery of this dataset intrigued me. I wondered what insights might be gleaned, and whether there is any clear statistical or predictive relationship between the outcome of a traffic stop and other factors mentioned above. This is a brief report of what I’ve seen so far.
The following graph shows the distribution of drivers’ ages, and of the reasons for traffic stops.
The distribution of stopped drivers’ ages does not seem surprising to me, but the graph above seems to show remarkably little change across ages in the reasons drivers are stopped. However, the following graph better highlights how some of those reasons become more or less prevalent as a function of driver age.
More interestingly, there is a striking relationship between the age of a driver and how likely that driver is to be issued a citation or arrested during a traffic stop, as seen here:
On the other side of the coin, we can see a relationship between an officer’s years of service and his or her rate of citations or arrests:
Below, I organize all traffic stops based on the genders of the driver and officer involved.
|Driver|Female officer|Male officer|
|---|---|---|
|Female|2495 (41.08% cited/arrested)|25848 (43.42% cited/arrested)|
|Male|3221 (42.78% cited/arrested)|36009 (43.30% cited/arrested)|
There are slight variations in the rates of citation or arrest among these four traffic stop situations. However, only one difference rises to the level of statistical significance: the rate at which female drivers are cited or arrested differs between male and female officers (p ≈ 0.024, uncorrected for multiple comparisons).
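A two-proportion z-test is one way to check a comparison like this. The following is a minimal sketch, not necessarily the exact test behind the p-value above. It assumes that the first column of the table holds stops by female officers and the second holds stops by male officers, and it reconstructs the counts of cited or arrested drivers by rounding from the published percentages, so those counts are approximations. The function name is illustrative.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value for a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Female drivers only. Totals come from the table; the cited/arrested counts
# are ASSUMED values, reconstructed as round(total * percentage).
n_female_officer = 2495
n_male_officer = 25848
cited_female_officer = round(n_female_officer * 0.4108)
cited_male_officer = round(n_male_officer * 0.4342)

z, p = two_proportion_z_test(
    cited_female_officer, n_female_officer,
    cited_male_officer, n_male_officer,
)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these rounded counts, the result comes out close to the p-value quoted above. Running the same function on the other pairs of cells is a quick way to see that their differences are much weaker.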
Below is a visual representation of how likely a driver from each of four major racial groups is to be cited or arrested during a traffic stop conducted by an officer from each of four similar groups. (The racial categories given in the dataset for officers are not precisely equivalent to those given for drivers.)
As far as I can see, nothing particularly troubling seems to stand out in this visual summary. However, one detail that may be worth noting is that according to the 2010 census (via Wikipedia), Charlotte’s racial makeup is about 45% white and 35% black, while the races of the stopped drivers in this set of data are quite different: 41% white and 54% black.
Since this dataset seems to need little cleaning, it is a fairly simple matter to train machine learning models on it, to see how well the outcome of a traffic stop can be predicted from the other available details about the stop. A gradient boosted trees classifier achieves an accuracy of about 0.65 on held-out test data from the set. The ROC curve for the classifier is shown to the right.
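For readers unfamiliar with the two evaluation measures mentioned above, here is a minimal sketch of how accuracy and an ROC curve (along with the area under it) can be computed from a classifier's predicted scores. The labels and scores are made-up toy values, not drawn from the Charlotte data, and the function names are illustrative only. This is not the code used to produce the results above.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, pred in zip(labels, predictions) if y == pred)
    return correct / len(labels)


def roc_curve(labels, scores):
    """Return (false positive rate, true positive rate) points,
    sweeping the decision threshold from the highest score to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        # Examples with tied scores cross the threshold together.
        current_score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current_score:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: 1 = cited or arrested, 0 = any other outcome (hypothetical values).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

curve = roc_curve(labels, scores)
print("accuracy:", accuracy(labels, scores))
print("AUC:", area_under_curve(curve))
```

An area of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which gives some context for reading an ROC curve alongside a single accuracy number.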
Based on the classifier mentioned above and a few other types of classifiers I have experimented with, the data features that appear to have the most importance in predicting outcomes are the driver’s age, the officer’s years of service, the police division, the reason for the stop, and the month* in which the stop took place. Features such as the race and gender of drivers and officers have consistently low importance in all the classifiers I have trained, which might be seen as good news.
* I would love to see more detailed information about the exact date and time of each traffic stop. (I wonder if there’s any evidence in Charlotte for the common belief that citations are more common near the end of the month…)
I might post my code for this little exploration on GitHub at some point. If you’d like to see this expedited, send me a note and I’ll get it done much more quickly.