Dabbling in data science competitions

Introduction

Most of the time, I find it hard to get too excited about data science competitions, probably because they represent such a limited view of data science.

nds

However, I was recently wanting to test out a few new ideas I had for improving a model at work – a model I’ve alluded to before and will revisit to some degree in a future post. Around that same time, I was notified about a new CrowdAnalytix contest, and decided to take advantage of it to try out ideas. The contest had a very short duration, compared to others I’ve seen; I think less than two weeks passed from the time I began to work until the deadline for final submissions. I actually enjoy short deadlines like this, finding that I’m very motivated (in a positive way) by time constraints. I ended up coming in 12th place out of about 750 contestants. An outline of the code I used is on github, but I don’t really want to focus on it here, because around the time I was wrapping up that contest, I began working on another one, hosted by DrivenData, which happens to be my favorite data science competition website.

The data for the DrivenData problem had a far more interesting and challenging structure than the straightforward CrowdAnalytix task. Without getting into too much unnecessary detail, it was a classification problem, with performance measured by logarithmic loss. The classification task was to be performed at the level of households, and household-level labeled data was provided for model training. However, as an interesting complication, a separate set of individual-level data was also provided, with a variable number of individuals belonging to each household. Taking maximum possible advantage of this second level of data required an extensive and enjoyable process of wrangling the two levels into a usable aggregate form.

Aside from the much larger amount of data processing required, my modeling approach in this second contest had a lot of similarities to the earlier one, since I worked on them at almost the same time. But in this later task I did make a couple of more sophisticated choices, which seem to have paid off. I’ll describe my approach below, but a thorough outline of the code I used is available at this link. In the end, I managed to come in 4th place out of about 2000 participants.

My approach

Data preparation

Because of the high amount of processing I applied to the data, I decided to treat all of it –both the training and the unlabeled cases – as a single unified data set, so that any manipulations would invariably be applied to all the data. The unlabeled cases were given a distinct temporary label for convenience. This is certainly not the approach I would typically take in a real-life situation, but it makes life a little easier here.

The columns in the provided data, both at household and individual level, had utterly cryptic names, which I assume was done for the sake of anonymity. But some columns clearly contained numerical data, while others were categorical. Leaving the household-level data essentially unchanged, I prepared the individual-level data for aggregation by converting its categorical features into indicator variables.

For each household, I aggregated the individual-level data to produce the following new household-level features:

  • the count of individuals,
  • the mean of each continuous variable from individuals, and
  • the sum of each indicator or other discrete numerical variable from individuals.

With everything aggregated to the household level, I formed some additional features by taking the product, difference, and sum of each possible pair of numerical features. The python code for this is beautifully elegant, using the itertools combinations function. For example, given a dataframe df, the following will augment it with new columns calculated from the difference of each pair of existing columns:

for c1, c2 in combinations(df.columns, 2):
    df['%s_minus_%s' % (c1, c2)] = df[c1] - df[c2]

Lastly, I dropped any feature that took on only 1 value in the data set.

Feature importances

The data preparation described above produces thousands of features, not all of which will probably have any predictive value. To help reduce the feature set, I calculated importances by permutation, in a similar way to what I described in a previous post, except now based on log loss rather than AUC. I’ve significantly improved on my previous simplistic approach to permutation importances, so that they are now calculated based on the entire available data set (i.e. all the training data) via cross-validation. The function I use to do this is as follows:

def permutation_importances(X, y, cv = 5):
    p_imp = [0] * X.shape[1]
    sk = StratifiedKFold(n_splits = cv, shuffle = True)
    for itrain, itest in sk.split(X, y):
        clf = RandomForestClassifier(
            n_jobs = -1, n_estimators = X.shape[1] * 3
        ).fit(X.iloc[itrain], y.iloc[itrain])
        ll = log_loss(y.iloc[itest], clf.predict_proba(X.iloc[itest]))
        for i, column in enumerate(X.columns):
            Xtemp = X.iloc[itest].copy()
            Xtemp[column] = Xtemp[column].sample(frac = 1.0).values
            p_imp[i] += log_loss(
                y.iloc[itest], clf.predict_proba(Xtemp)
            ) - ll
    return [i/cv for i in p_imp]

The “importance” of a feature is the amount by which the log loss of a classifier increases when that feature’s values are scrambled. Visualizing the importances of the set of features developed here demonstrates that only a fraction of those features have any predictive value.

feat_imp

Selector-estimator pipeline

To make good use of the feature importances information, the model will consist of two steps: a selector that decides how many features to use, and an estimator that makes classifications based on those features.

The selector I use is scikit-learn’s SelectKBest, with a custom score function. For this purpose, I could simply use the permutation importance function given above, but this would make the selector extremely slow; instead, I pre-compute the feature importances and use the following as the score function.

class scores():
    def __init__(self, imp):
        self.imp = np.array(imp)
    def __call__(self, X, y): 
        return self.imp

This “function” is really a callable object, which does nothing but store the list of importances already calculated, and regurgitate them when called.

The estimator I use is a gradient boosting classifier. These two steps, selector and estimator, are assembled into a “pipeline,” which allows their hyperparameters to be tuned together. The selector has only one hyperparameter: how many features to use. The estimator has about half a dozen more hyperparameters.

In a more real-life context involving complicated data transformations, it might also be wise to add the transformations as another step in the pipeline, before the selector. However, there are a few aspects of this that would need careful consideration, which we won’t get into here for the sake of brevity.

Hyperparameter tuning

To really get to the limits of performance in this contest, I put a lot of manual effort into tuning the pipeline’s hyperparameters, with the aid of scikit-learn’s GridSearchCV. After a few iterations of this process, I decided it would be useful to visualize the results of each grid search, to better inform my next round of refinements. I wrote a little tool for this purpose (see the cv_graph() function in my code); here is an example of what it produces:

param_slice2

In this example, I was experimenting with varying two of the estimator’s hyperparameters. The large dark red dot indicates that the best combination that was tried was max_depth = 3 and min_samples_split = 20. It appears that a maximum depth of 3 is a solid choice here, but I should probably try lower values of the other hyperparameter and look for further improvements.

Predictions and scores

With the pipeline tuned and trained, I applied it to the previously unlabeled part of the data set, and the resulting predictions were processed into the format the competition required, and submitted. Part of the predictions for unlabeled cases was used to determine my rank on the leaderboard while the competition was in progress (the “public” leaderboard). Once the final deadline was passed, my best set of predictions for the rest of the unlabeled cases was used to determine my true rank (on the “private” leaderboard).

Closing remarks

I reached the top 10 on the public leaderboard quite early on in the contest, and not too long afterwards reached 4th place. As I continued to refine my hyperparameters and make more submissions, I found myself slipping down until my public rank was somewhere in the teens, if I remember correctly. Much to my surprise, when the deadline arrived and the private leaderboard was released, I saw that I was in 4th place again, only narrowly missing 3rd place.

My best guess at the source of discrepancies between public and private ranks is that by the end of the contest, many contestants had made enough submissions and gathered enough leaderboard score results that they could start optimizing for their public score. So they may have been overfitting to the part of the data used for the public leaderboard, to the eventual detriment of their private score. It’s just a guess, based on the fact that I considered trying it myself.

In the end, the 1st place winning team came out well in the lead; the 2nd place contestant’s private score was almost 2% worse than theirs. The contestant in 3rd place scored about 0.4% worse than 2nd place, and my score was around 0.2% worse than 3rd place.

Coming back to my original purpose for engaging in these competitions, I am currently working on incorporating a version of the selector-estimator pipeline idea into the model I mentioned, along with some small elements of the data preparation process used here.