Shared taste networks in the Toronto area

Posted in General Stuff

Using the Yelp dataset, I filtered the review data down to Toronto businesses, and then to only the positive reviews (3-5 stars), to see if I could find connections between businesses by linking them through satisfied customers. With Gephi, I created a 2-degree ego network centered on user HikPdCQGk1mk5JsPjur7Nw (Anna) and partitioned it by modularity to try to pull out the most densely connected sub-networks. If I had a computer with more than 16 GB of RAM, I would be able to get even better graphs.
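For anyone who wants to try the filtering and ego-network steps without Gephi, here is a minimal sketch in plain Python. It is not what I actually ran: the field names (business_id, city, user_id, stars) are my assumption about the Yelp JSON files, the toy records and the short user IDs are made up, and treating the network as a user-to-business graph is just one reasonable way to set it up. The modularity partitioning is left out and would still need Gephi or a graph library.

```python
from collections import defaultdict, deque


def positive_edges(reviews, businesses, city="Toronto", min_stars=3):
    """Keep (user_id, business_id) pairs for positive reviews in one city."""
    in_city = {b["business_id"] for b in businesses if b["city"] == city}
    return [
        (r["user_id"], r["business_id"])
        for r in reviews
        if r["business_id"] in in_city and r["stars"] >= min_stars
    ]


def build_graph(edges):
    """Undirected user-business graph stored as adjacency sets."""
    graph = defaultdict(set)
    for user_id, business_id in edges:
        user, business = ("user", user_id), ("business", business_id)
        graph[user].add(business)
        graph[business].add(user)
    return graph


def ego_network(graph, center, depth=2):
    """Breadth-first search: every node within `depth` hops of `center`."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand past the requested radius
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical toy records, only to show the shape of the data.
businesses = [
    {"business_id": "b1", "city": "Toronto"},
    {"business_id": "b2", "city": "Toronto"},
    {"business_id": "b3", "city": "Montreal"},
    {"business_id": "b4", "city": "Toronto"},
]
reviews = [
    {"user_id": "anna", "business_id": "b1", "stars": 5},
    {"user_id": "anna", "business_id": "b2", "stars": 2},  # not positive
    {"user_id": "bob", "business_id": "b1", "stars": 4},
    {"user_id": "bob", "business_id": "b4", "stars": 3},
    {"user_id": "cara", "business_id": "b3", "stars": 5},  # not Toronto
]

edges = positive_edges(reviews, businesses)
graph = build_graph(edges)
ego = ego_network(graph, ("user", "anna"), depth=2)
```

With a radius of 2 on this kind of graph, the ego network holds the businesses the center user liked plus the other users who liked those same businesses. Going one hop further would bring in the other businesses those users liked, which is where the graph starts to get big.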

The whole network looks really cool, but is too broad to get any real information:

 

But if we look at the smaller modularity segments, things get more interesting.

These people like to have a good time, going to concert halls, stadiums, theatres and drinking a lot of booze. And also going to a condom store and a fetish shop. Makes sense.

This one is very interesting, and shows the value of looking at a broad range of businesses instead of filtering down to only restaurants. Here we can see the popular tourist destinations Casa Loma, Toronto Island, and St. Lawrence Market, which people can reach by the TTC. In the future, it might actually be valuable to keep all the reviews for transportation businesses instead of just the positive ones.

Credit Risk Classification

Posted in General Stuff

On GitHub

Nominally, this was for a visualization project for a data science course at WeCloudData using the Kaggle Give Me Some Credit dataset. But I figured since I was going to use that set, why not just use it for its original intent?

I perform some data cleaning and feature engineering and use gradient boosting to get a score of 0.866. It’s pretty much all on the GitHub repo linked above, but here’s a nice looking t-SNE graph.

t-SNE (t-distributed stochastic neighbour embedding) performs dimension reduction by turning the distances between points into probabilities that they are neighbours, and then laying the points out in two dimensions so that those neighbour probabilities are preserved as closely as possible. PCA, by contrast, finds the linear projection that keeps as much of the data's variance as possible. This graph represents all the cleaned and engineered features that would be used in the final model, with the orange “1” points representing the defaulters. The fact that so many of them are clumped together suggests that the model will end up being pretty decent. Which it was.
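For reference, this is the standard textbook form of the t-SNE objective, not anything specific to my notebook. The symbols are my own labels: x_i are the original feature vectors, y_i are the 2-D points in the plot, n is the number of points, and sigma_i is a per-point bandwidth set through the perplexity parameter.

```latex
% Neighbour probabilities in the original feature space
p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}

% Neighbour probabilities in the 2-D embedding (Student-t kernel, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The embedding is chosen to minimize the KL divergence between the two
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because a large p_ij paired with a small q_ij is penalized heavily, points that are close in feature space get pulled together in the plot. Distances between far-apart clumps are much less meaningful, so the graph is best read for local structure only.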

Random data: weather and potato production in Idaho

Posted in Random Data

The potato is one of the most important agricultural products (officially a vegetable, but sometimes hard to think of that way) in the world. With fewer growing requirements than most other staple foods, it was instrumental in fueling the European agricultural revolution and rapid population growth of the 18th and 19th centuries. It probably allowed for European nations to dominate the world for two hundred years. More importantly, potatoes – especially the Russet Burbank variety – are the key component of French fries.

(more…)

What is this? Who am I?

Posted in General Stuff

As you may well have realized by now, this is a website of some sort, but what more can be said about it when it is still almost completely empty? The URL of the site, keithqu.com, gives some clues, and it seems like it might be a name.

In fact, it is. It’s me, Keith Qu. And who might I be? I’m some guy who recently got out of grad school, with a master’s in economics. Before that I studied business, and before that I studied chemical engineering, and before that I studied economics. It’s kind of a long story, but needless to say I’m glad to be done with school for the time being.

So what is the purpose of this blog? Let’s see, I’m done school but I enjoy learning new things, so I figure I may as well write about it for an audience of hopefully at least a dozen people. Also, I need a job, and this seems like a good way to learn some new things while also practicing stuff like writing code, analyzing data sets, doing research, and other things that don’t require knowledge of the differences between the metric and product topologies. It’s probably also a good way of staying out of trouble, by which I mean playing Stellaris all day.

I will start with machine learning. Actually, there will probably be a lot of machine learning. In fact, for the foreseeable future this site will be largely devoted to writing about me learning about machine learning, starting with the probability and statistics review. I would like to do some old fashioned empirical stuff as well.