Email Classifier using Mahout on Hadoop

Machine learning has three branches, one of which is called "Classification".

What is classification?

Classification is a supervised learning technique that learns from existing categorised documents (i.e. the training data set) and tries to predict the category of previously unseen data.

Examples include predicting diseases, filtering spam email and detecting fraudulent bank transactions.

What is supervised learning?

Supervised learning is a machine learning technique in which the system is given a training dataset together with the correct result for each example, and builds its concepts from them. The Naive Bayes classifier is one example.

As humans, we have probably been doing supervised learning without realising it. We do not open emails with subject lines such as "YOU WON THE LOTTERY" or "CHEAP MEDICINES". From prior experience, we know that these words in a subject line indicate spam. The words need not appear in the same order every time; the order keeps changing, but we have seen enough emails with similar wording to recognise the pattern.

Supervised learning functions in a similar manner. To build a classifier, for example an email spam classifier, we train it on data that has already been labelled as "Spam" or "Non-Spam", and then use that classifier to make predictions on unseen emails.
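
To make the idea concrete, here is a minimal sketch of a Naive Bayes spam classifier written in plain Python. It is not Mahout's implementation: the function names, the add-one smoothing and the six made-up subject lines are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case a subject line and split it into words."""
    return text.lower().split()

def train(labelled_subjects):
    """Count words per label from (subject, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of emails
    for subject, label in labelled_subjects:
        label_counts[label] += 1
        word_counts[label].update(tokenize(subject))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, subject):
    """Return the label with the highest log-probability."""
    word_counts, label_counts, vocabulary = model
    total_emails = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in tokenize(subject):
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training examples, for illustration only.
training_data = [
    ("you won the lottery", "spam"),
    ("cheap medicines online", "spam"),
    ("win cheap lottery tickets", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training_data)
print(predict(model, "lottery medicines cheap"))
print(predict(model, "monday project meeting"))
```

The classifier never sees the exact test subjects during training; it only relies on how often each word appeared under each label, which mirrors the "similar wording" intuition described above.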

The following steps are involved in building a classifier:

1) Get/build the training set

To build a classifier, we need training data that is similar to the actual data to be classified. Note that a classifier can only be as good as its training data. For example, an email spam classifier requires subject lines and their labels (spam/non-spam).

2) Selecting the features/dimensions

Once we have the dataset, we need to select the features/dimensions that will be used to build the classification model. For example: a) for an email spam filter, it could be the words in the subject line; b) for bank transactions, it could be the amount, account number, location of the transaction, etc.

3) Dimension reduction/data preparation

Once we have identified the dimensions, we need to bring the data into a format that the algorithm can use. At this stage we can also split the input data set into training and test data sets.

4) Build and train the classifier

Build a classifier and train it using the training data set.

5) Validate

Once we have the classifier ready, run it on the test data set and verify that it works well. If not, we might have to change the selected model or features.

Here is an email classifier built on Mahout that uses the free email data set from http://spamassassin.apache.org/publiccorpus/. The data on this website is already classified into spam and non-spam (termed 'ham').

We can use Mahout to build a Naive Bayes classifier to classify the emails.

Download the spam and ham corpus

curl -O http://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2

curl -O http://spamassassin.apache.org/publiccorpus/20030228_easy_ham.tar.bz2

Extract them; we will end up with two directories, spam and easy_ham:

tar xvf 20030228_spam.tar.bz2

tar xvf 20030228_easy_ham.tar.bz2

Create a directory for the dataset:

mkdir dataset

Move the spam and easy_ham directories into the dataset directory:

mv easy_ham/ spam/ dataset/

Copy the dataset to HDFS:

hadoop fs -put dataset

Convert the dataset into SequenceFile.

mahout seqdirectory -i dataset -o dataset-seq

Convert the SequenceFile into vectors.

mahout seq2sparse -i dataset-seq -o dataset-vectors  -lnorm -nv  -wt tfidf
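
As a rough illustration of what TF-IDF weighting does, the sketch below computes term weights for three tiny made-up documents in plain Python. It uses the textbook form tf * log(N / df); the exact formula and normalisation that Mahout applies may differ, and the function name and sample documents are assumptions of this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)
        # Weight = term frequency * log(total docs / docs containing the term).
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

# Made-up documents, already split into words.
docs = [
    "cheap medicines cheap prices".split(),
    "cheap lottery tickets".split(),
    "meeting agenda".split(),
]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

In this sample, "cheap" appears twice in the first document but also occurs in another document, so it ends up with a lower weight than "medicines", which appears only once but in a single document.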

Split the dataset into two: one for training and one for testing. Records are assigned at random, with 85% used for training and 15% for testing.

mahout split -i dataset-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 15 --overwrite --sequenceFiles -xm sequential
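
The idea of a random hold-out split can be sketched in a few lines of Python. This is only an illustration of the concept, not a description of how Mahout selects records internally; the function name, the seed parameter and the sample records are assumptions.

```python
import random

def random_split(records, test_pct, seed=None):
    """Shuffle records and hold out test_pct percent of them for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    # The first n_test shuffled records become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# 100 stand-in records; a fixed seed makes the split repeatable.
train, test = random_split(range(100), test_pct=15, seed=42)
print(len(train), len(test))
```

Keeping the test records out of training is what makes the later accuracy figure meaningful, because the classifier is judged on emails it has not seen.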

Training the classifier:

mahout trainnb -i train-vectors -el -o model -li labelindex -ow -c

Test against the test dataset:

mahout testnb -i test-vectors -m model -l labelindex -ow -o testing-test -c

Output Confusion Matrix

a	b		<--Classified as
382	1	383	a = easy_ham
1	69	70	b = spam

Interpretations of the Matrix

Out of 453 emails, 451 were classified correctly. Treating spam as the positive class:
382 were ham and were classified as ham (true negatives)
69 were spam and were classified as spam (true positives)
1 record was spam, but it was classified as ham (false negative)
1 record was ham, but it was classified as spam (false positive)

The matrix shows that the classifier classified the test data set with 99.5585% accuracy (451 of 453 emails).
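
The accuracy figure can be re-derived from the four cells of the confusion matrix. The short Python sketch below does that and also computes precision and recall, treating spam as the positive class; that choice and the variable names are assumptions of this example.

```python
# Counts taken from the confusion matrix above
# (rows = actual class, columns = predicted class).
ham_as_ham, ham_as_spam = 382, 1
spam_as_ham, spam_as_spam = 1, 69

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
# Accuracy: correctly classified emails divided by all emails.
accuracy = (ham_as_ham + spam_as_spam) / total

# Assumption of this sketch: spam is the positive class.
precision = spam_as_spam / (spam_as_spam + ham_as_spam)  # predicted spam that really is spam
recall = spam_as_spam / (spam_as_spam + spam_as_ham)     # actual spam that was caught

print(f"total={total} accuracy={accuracy:.4%}")
print(f"spam precision={precision:.4f} recall={recall:.4f}")
```

Because ham is far more common than spam in this test set, precision and recall for spam give a more cautious picture than overall accuracy alone.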
