Nominally, this was a visualization project for a data science course at WeCloudData, using the Kaggle Give Me Some Credit dataset. But I figured that since I was going to use that dataset anyway, why not use it for its original purpose: predicting which borrowers will default?
I did some data cleaning and feature engineering, then used gradient boosting to get a score of 0.866 (the competition is scored on AUC). It’s pretty much all on the GitHub repo linked above, but here’s a nice-looking t-SNE graph.
t-SNE (t-distributed stochastic neighbour embedding) performs dimensionality reduction by converting the distances between points into probabilities that they are neighbours, then finding a low-dimensional layout that keeps those neighbour probabilities as close as possible to the originals. PCA, by contrast, finds the linear projection that preserves the most variance. This graph is built from all the cleaned and engineered features used in the final model, with the orange “1” points representing the defaulters. The fact that so many of them are clumped together suggests the features separate defaulters reasonably well, so the model should end up being pretty decent. Which it was.
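If you want a feel for what t-SNE is optimizing, here is a rough sketch of its objective. This is not the code from the project: the points are made up, the function names are my own, and it uses one fixed Gaussian bandwidth for every point instead of the per-point, perplexity-tuned bandwidths real t-SNE uses. The idea is to turn pairwise distances into neighbour probabilities in both the original space and the 2-D layout, then measure the mismatch with KL divergence. A layout that keeps clusters together should score lower than one that mixes them up.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def neighbour_probs(points, kernel):
    """Turn pairwise distances into probabilities that sum to 1 over all pairs."""
    n = len(points)
    weights = [
        [0.0 if i == j else kernel(sq_dist(points[i], points[j])) for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def gaussian(d2, sigma=1.0):
    """High-dimensional similarity. Simplification: one sigma for all points."""
    return math.exp(-d2 / (2.0 * sigma ** 2))

def student_t(d2):
    """Low-dimensional similarity: Student-t kernel with one degree of freedom."""
    return 1.0 / (1.0 + d2)

def kl_divergence(P, Q):
    """KL(P || Q), the mismatch t-SNE tries to minimize. Zero-probability terms are skipped."""
    return sum(
        p * math.log(p / q)
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )

# Made-up data: two tight clusters of three points each in 3-D.
high_dim = [
    (0.0, 0.0, 0.0), (0.5, 0.0, 0.2), (0.2, 0.4, 0.0),
    (5.0, 5.0, 5.0), (5.3, 5.0, 5.1), (5.0, 5.4, 5.2),
]

# A 2-D layout that keeps each cluster together...
good_layout = [(0.0, 0.0), (0.5, 0.0), (0.2, 0.4), (6.0, 6.0), (6.4, 6.0), (6.0, 6.4)]
# ...and one that scatters cluster members across both groups.
bad_layout = [(0.0, 0.0), (6.0, 6.0), (0.5, 0.0), (6.4, 6.0), (0.2, 0.4), (6.0, 6.4)]

P = neighbour_probs(high_dim, gaussian)
kl_good = kl_divergence(P, neighbour_probs(good_layout, student_t))
kl_bad = kl_divergence(P, neighbour_probs(bad_layout, student_t))

print("KL for cluster-preserving layout:", kl_good)
print("KL for scrambled layout:", kl_bad)
```

Real t-SNE starts from a random layout and moves the 2-D points by gradient descent to push this number down, which is why points that are neighbours in feature space tend to end up clumped together in the plot.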