If the Netflix competition was one in which we competed to see how many consecutive episodes of Netflix originals we could watch in one sitting, I'd be a tough competitor. Unfortunately, that is not the case for CS156b. The class is about machine learning and revolves around the Netflix competition. Basically, we are given roughly 15 GB of data: users, movies, rating times, and ratings. We have to create models that train on this data to predict future ratings for the users. We give presentations every two weeks about our progress, and we compete with each other to see who can get the lowest Root Mean Squared Error (RMSE) on the test set.
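The ranking metric itself is simple to state. Here is a minimal sketch of an RMSE computation in C++ (the function name and signature are my own, not from the class codebase):

```cpp
#include <cmath>
#include <vector>

// Root Mean Squared Error: sqrt of the mean squared difference
// between predicted and actual ratings.
double rmse(const std::vector<double>& pred, const std::vector<double>& truth) {
    double sum = 0.0;
    for (size_t i = 0; i < pred.size(); ++i) {
        double d = pred[i] - truth[i];
        sum += d * d;
    }
    return std::sqrt(sum / pred.size());
}
```

Lower is better, and because the errors are squared before averaging, a few wildly wrong predictions hurt more than many slightly wrong ones.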

This class is pretty open ended in that we can try any model of our choice, but certain packages are off limits. This meant we had to write model classes from scratch in C++. As someone who prefers Python, I was not so keen to use C++, but since the data files are so big, we had to use a language that allows for fast computation. One thing to note about the data is that all the user and movie information is encoded as IDs. As such, you cannot bin users into age groups or movies into genres. You have to use models with latent factors and hope they learn these groupings implicitly.

Most groups go with SVD (singular value decomposition) and its variants. The idea is linear algebra: you arrange the ratings into a giant user-by-movie matrix and factor it into low-rank pieces. Of course, on the subject of matrix factorization, there are many possibilities: probabilistic matrix factorization, generalized matrix factorization, etc. Other models we included were clustering (K-Means), nearest-neighbor methods (K-Nearest Neighbors), and a restricted Boltzmann machine (RBM). We wanted to use neural nets and auto-encoders as well but ran out of time to debug them. Finally, with all these different models, we blended their predictions on a validation/probe set using gradient descent and boosting.

We actually started out at the very bottom of the rankings. We were the only group not to have met the first milestone. However, we met with a lot of TAs to debug code, and eventually we ended up in the middle of the pack: 10th out of 19 groups. (We were team TBD because there was a lot To Be Done.) I am pretty proud of our progress and of everything I learned along the way. I think this was my favorite class this term, and I would highly recommend it to anyone interested in data science and/or machine learning.