Most of us have experienced recommendation systems on the web. When we log in to YouTube, we are automatically presented with a list of recommended videos. When we log in to Amazon or Flipkart, we are presented with a list of items recommended for us. Facebook provides us with a list of friend recommendations. When we search on Google, it suggests search text as we type. These are all different types of recommendations. The success of these recommendation sites depends on the volume and quality of the data available. These days, availability of data is rarely the problem, thanks to Big Data technologies such as Hadoop. Recommendation engines are one of the easiest areas to start with when dealing with machine learning.
How does a Recommendation System work?
Taking Amazon.com as an example, whenever a user visits the site and clicks on a book, an Ajax event is fired. This event makes an entry in the database (generally a NoSQL database): “User X visited Product Y once”. Where do we get the user information? If the user is logged in, we get it from the HttpSession; otherwise we extract it from a browser cookie. If it is cookie based, the recommendation only works as long as the user visits the site from the same device and browser. Whenever a user adds a product to their shopping cart, purchases a product, likes a product on Facebook, or writes a review of a product, similar events are fired. Now we (the website concerned) have data captured for the user: we know what they viewed, how many times they viewed it, which products they might be interested in, which products they actually like, and so on.
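The event capture described above can be sketched in a few lines. This is only an illustration: the in-memory counter stands in for the NoSQL database, and the names `event_counts`, `record_event` and the event types are hypothetical, not taken from any real site.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the event store described above.
# Keys are (user_id, product_id, event_type); values are occurrence counts.
event_counts = Counter()


def record_event(user_id, product_id, event_type):
    """Record one interaction, e.g. "User X visited Product Y once"."""
    event_counts[(user_id, product_id, event_type)] += 1


# Simulate the events fired while a user browses the site.
record_event("user-x", "product-y", "view")
record_event("user-x", "product-y", "view")
record_event("user-x", "product-y", "add_to_cart")

print(event_counts[("user-x", "product-y", "view")])
```

A real system would write each event to a persistent store instead of a process-local counter, but the shape of the data (who, what, which action, how often) stays the same.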
How do we make sense of the data collected?
Recommendation systems work on 3 fields – User, Item and Rating.
We have User and Item, but what about Rating? Most systems work on a scale of 1-5. Each site has its own interpretation of a rating, but the general rule is outlined below. If a user viewed a product, we can give it an implicit rating of 2.5; if they added the product to their shopping cart, we can give it 3.5; if they purchased the product, it is 4.5. Apart from this, the user can explicitly rate the product on the product page. If the user writes a review of a product, we can perform sentiment analysis on the review. This tells us what the customer feels about the product: very negative, negative, neutral, positive, or very positive. We can map this to the same scale of 1-5.
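The rating rules above can be written down as a small lookup. This is a minimal sketch: the numeric values come from the text, while the event names, the function name `rating_for` and the precedence order (explicit rating first, then review sentiment, then the strongest implicit signal) are assumptions made for illustration.

```python
# Implicit ratings from the text; the event names are hypothetical.
IMPLICIT_RATINGS = {"view": 2.5, "add_to_cart": 3.5, "purchase": 4.5}

# Sentiment labels from the text mapped onto the same 1-5 scale.
SENTIMENT_RATINGS = {
    "very negative": 1.0,
    "negative": 2.0,
    "neutral": 3.0,
    "positive": 4.0,
    "very positive": 5.0,
}


def rating_for(events, explicit_rating=None, review_sentiment=None):
    """Return a single 1-5 rating for one (user, item) pair.

    Assumed precedence: an explicit rating wins over review sentiment,
    which wins over the strongest implicit signal. Returns None when
    there is no signal at all.
    """
    if explicit_rating is not None:
        return float(explicit_rating)
    if review_sentiment is not None:
        return SENTIMENT_RATINGS[review_sentiment]
    implicit = [IMPLICIT_RATINGS[e] for e in events if e in IMPLICIT_RATINGS]
    return max(implicit) if implicit else None


print(rating_for(["view", "add_to_cart"]))
print(rating_for(["view"], explicit_rating=4))
```

Each site would tune both the values and the precedence to its own data; the point is only that every interaction ends up as one number per user and item.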
What’s Collaborative Filtering?
Using the ratings we collect, we provide recommendations. Since the ratings from one user determine the recommendations for another user, we call this Collaborative Filtering. There are two types: User Based and Item Based Collaborative Filtering.
In User Based Collaborative Filtering, to recommend products for a given user, we compute the similarity between that user and every other user on the site. Similarity is computed using distance algorithms or correlation algorithms. For instance, if User-X rated Product P1 as 3 and Product P2 as 1, and User-Y rated P1 as 1 and P2 as 5, then the distance between them is high. Euclidean distance is a popular method for computing distance. The raw distance is unbounded, so it is usually normalized to a value between 0 and 1 (0 meaning both users have exactly the same taste and 1 meaning they are exactly opposite). Pearson correlation is an alternative that often works better than Euclidean distance, because it compares how two users' ratings move together rather than their absolute values. It is particularly useful in scenarios like this: User-X is not very generous and always scores 3.5 for excellent, while User-Y is generous and always scores 5 for excellent. In this case, correlation captures the similarity better.
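To make the two measures concrete, here is a minimal sketch in plain Python. The function names, the choice to compare only items both users have rated, and the normalization by the largest possible distance on a 1-5 scale are assumptions for illustration, not the way any particular library does it.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance over the items both users have rated."""
    shared = set(a) & set(b)
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in shared))


def normalized_distance(a, b, min_rating=1.0, max_rating=5.0):
    """Scale the distance into [0, 1] by dividing by the largest
    distance possible for the shared items on the rating scale."""
    shared = set(a) & set(b)
    if not shared:
        return None  # nothing in common, so no distance can be computed
    worst = (max_rating - min_rating) * math.sqrt(len(shared))
    return euclidean_distance(a, b) / worst


def pearson(a, b):
    """Pearson correlation over shared items, in [-1, 1]."""
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        return None
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a user who rates everything the same has no correlation
    return cov / math.sqrt(var_x * var_y)


# The example from the text: opposite tastes on P1 and P2.
user_x = {"P1": 3, "P2": 1}
user_y = {"P1": 1, "P2": 5}
print(euclidean_distance(user_x, user_y), pearson(user_x, user_y))

# A strict and a generous rater with the same preferences (made-up ratings).
strict = {"A": 3.5, "B": 2.0, "C": 1.0}
generous = {"A": 5.0, "B": 3.5, "C": 2.5}
print(normalized_distance(strict, generous), pearson(strict, generous))
```

In the second pair the distance is non-zero even though both users rank the items identically, while the correlation comes out at essentially 1. That is the effect the generous-versus-strict scenario describes.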
In Item Based Collaborative Filtering, we compute similarity for a given item to every other item. This is what we see when we visit Amazon – “People who bought this also bought this”.
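A very simple way to get a feel for item-to-item similarity is to count how often two items are bought by the same user. The sketch below uses that co-purchase count as a stand-in for a real similarity measure; the data, the function names `co_purchase_counts` and `also_bought`, and the tie-breaking rule are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each pair of items was bought by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_bought(item, counts, top_n=3):
    """Items most often bought together with `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    neighbours = counts.get(item, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Made-up purchase histories keyed by user.
baskets = {
    "u1": ["book", "lamp", "pen"],
    "u2": ["book", "pen"],
    "u3": ["book", "lamp"],
    "u4": ["pen", "ink"],
}
counts = co_purchase_counts(baskets)
print(also_bought("book", counts))
```

Because the counts depend only on the items, they can be computed ahead of time and simply looked up when a product page is shown.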
Which one do we choose?
On a site with a large catalogue, the number of products can be very large, so computing the similarity of each product with every other product is a very time-consuming operation. The advantage, however, is that we only need to recompute product similarity when new products are added, and this can run as an hourly or daily batch.
In contrast, to compute user similarity we first need to compute the distance between one user and every other user. We then sort the distances in ascending order and take the top n users (this is the K-Nearest Neighbors approach). For each of those n users, we look at which products they have rated and what ratings they gave, choose their m highest-rated products, and add them to a list. By the end of this exercise, we have up to n*m candidate products.
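The user-based procedure can be sketched as follows. This is an illustration under stated assumptions rather than a production algorithm: it uses plain Euclidean distance over shared items, skips items the target user has already rated (the text does not say whether to do this), and removes duplicates, which is why the result can be shorter than n*m. The ratings and names are made up.

```python
import math


def distance(a, b):
    """Euclidean distance over shared items; None when nothing is shared."""
    shared = set(a) & set(b)
    if not shared:
        return None
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in shared))


def recommend(target, ratings, n=2, m=2):
    """Collect up to n*m candidate items from the n nearest users."""
    # Step 1: distance from the target user to every other user.
    scored = []
    for user, their_ratings in ratings.items():
        if user == target:
            continue
        d = distance(ratings[target], their_ratings)
        if d is not None:
            scored.append((d, user))

    # Step 2: sort ascending and keep the n nearest neighbours.
    neighbours = [user for _, user in sorted(scored)[:n]]

    # Step 3: take each neighbour's m highest-rated items that the
    # target has not rated yet (an assumption), without duplicates.
    candidates = []
    for user in neighbours:
        unseen = {i: r for i, r in ratings[user].items()
                  if i not in ratings[target]}
        top = sorted(unseen.items(), key=lambda kv: (-kv[1], kv[0]))[:m]
        for item, _ in top:
            if item not in candidates:
                candidates.append(item)
    return candidates


ratings = {
    "alice": {"P1": 5, "P2": 1},
    "bob": {"P1": 5, "P2": 1, "P3": 4, "P4": 2},
    "carol": {"P1": 4, "P2": 2, "P3": 5, "P5": 3},
    "dave": {"P1": 1, "P2": 5, "P6": 5},
}
print(recommend("alice", ratings, n=2, m=2))
```

Note that step 1 touches every other user for every request, which is where the cost of the user-based approach comes from.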
What technology do we use?
As you can see, this operation is very CPU-intensive and can run for hours. To parallelize it and make it faster, we can use solutions like Hadoop.
Mahout is a library of machine learning algorithms that runs on Hadoop. It has a configurable set of options for choosing the recommendation algorithm, the similarity method, the number of nearest neighbors, and so on. It is a standard Java library but runs on Hadoop. There is another popular product called PredictionIO, which makes our lives easier. It bundles Hadoop and Mahout and provides a nice user interface to manage the recommendations.
What are the challenges?
Cold Start – As stated earlier, the success of the algorithm depends on the data. If we don’t have any users or ratings in our system yet, we can’t get started. This is termed the cold-start problem. There are solutions available for this, such as recommending the most popular items until enough ratings have been collected.
Noisy Data – The other aspect that influences the success of this algorithm is the quality of the data. If the ratings come from people, how genuine are they? Some reviewers may be rating in return for an incentive.
Privacy – Privacy is compromised to some extent in these applications. Some people do not like to share their private information and routinely clear their browser cookies and cache after a session. Some governments are treating privacy as a bigger issue and enforcing rules around it.
How do we build a sample application and try it?
Install Hadoop and Mahout, or just PredictionIO, on Linux or in a VM. There are a lot of datasets available for practice. The most popular one is the MovieLens dataset, which contains movies and ratings. There are 3 different datasets: one with 100K ratings, another with 1 million ratings and the last one with 10 million ratings. They can be downloaded from the GroupLens website.
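Before setting up Hadoop, it can help to load a ratings file into memory and look at it. The sketch below assumes the layout of the 100K ratings file as I understand it: one rating per line, with user id, item id, rating and timestamp separated by tabs. The larger MovieLens files use `::` as the separator, to my knowledge, so the separator is a parameter. The sample lines are made up for illustration and the function name `parse_ratings` is hypothetical.

```python
def parse_ratings(lines, sep="\t"):
    """Turn 'user<sep>item<sep>rating<sep>timestamp' lines into
    a nested dict of the form {user: {item: rating}}."""
    ratings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, rating, _timestamp = line.split(sep)
        ratings.setdefault(user, {})[item] = float(rating)
    return ratings


# Made-up lines in the assumed tab-separated layout.
sample = [
    "1\t10\t4\t880000000",
    "1\t20\t2\t880000001",
    "2\t10\t5\t880000002",
]
ratings = parse_ratings(sample)
print(ratings)
```

With a downloaded file, the same function can be given an open file object instead of the sample list, since it only needs an iterable of lines. The resulting User, Item and Rating structure is the input every step described earlier works on.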