Email Classifier using Mahout on Hadoop

  • Written By Coforge-Salesforce BU
  • 30/03/2015

Machine learning has three main branches, and one of them is called “Classification”.

What is classification?

Classification is a supervised learning technique that learns from existing categorised documents (i.e. the training data set) and then tries to predict the category of previously unseen data.

Examples include predicting diseases, filtering spam email and detecting fraudulent bank transactions.


What is supervised learning?

Supervised learning is a machine learning technique in which the training dataset is supplied together with the correct result for each example, so that the system can build up the underlying concepts. The Naive Bayes classifier is one example.

As humans, we have probably been doing supervised learning without realising it. We do not open mails with the subject line “YOU WON THE LOTTERY” or “CHEAP MEDICINES”. From prior experience, we know that these words in the subject line indicate that the email is spam. The words need not appear in the same sequence every time; the sequence keeps changing, but we have seen enough emails with similar wording to recognise them.

Supervised learning works in a similar manner. To build a classifier, for example an email spam classifier, we train it on data that has already been labelled as “Spam” or “Non-Spam”, and then use that classifier to make predictions on unseen emails.
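To make the idea concrete, here is a minimal sketch of a Naive Bayes classifier trained on a handful of labelled subject lines. It is an illustration only, not Mahout's implementation; the function names `train` and `predict` and the sample subject lines are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label words from (subject, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for subject, label in examples:
        label_counts[label] += 1
        for word in subject.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, subject):
    """Return the label with the highest log-probability for the subject."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the label
        n_words = sum(word_counts[label].values())
        for word in subject.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero the score
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training data (illustrative only)
training_data = [
    ("you won the lottery", "spam"),
    ("cheap medicines online", "spam"),
    ("win cheap lottery tickets", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training_data)
print(predict(model, "lottery medicines cheap"))
print(predict(model, "status of the monday meeting"))
```

Note that the first test subject uses the spam words in an order never seen during training, which is the point made above: the classifier relies on which words occur, not on their sequence.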


The following steps are involved in building a classifier:

1) Get/build the training set

To build a classifier, we need training data that is similar to the actual data to be classified. Note that the classifier can only be as good as its training data. For example, for an email spam classifier we need the subject lines and their labels (spam/non-spam).

2) Selecting the features/dimensions

Once we have the dataset, we need to select the features/dimensions that will be used to build the classification model. For example: a) for an email spam filter, the features could be the words in the subject line; b) for bank transactions, they could be the amount, account number, location of the transaction, etc.

3) Dimension reduction/data preparation

Once we have identified the dimensions, we need to bring the data into a format that the algorithm can use, and split the input data set into a test data set and a training data set.

4) Build and train the classifier

Build a classifier and train it using the training data set.

5) Validate

Once we have the classifier ready, run it on the test data set and verify that it works well. If not, we might have to change the selected model or features.
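Steps 2 and 3 can be sketched in a few lines. The example below is a rough illustration under stated assumptions: `extract_features` and `split_dataset` are hypothetical helper names, the records are synthetic, and the 15% test fraction simply mirrors the split used later in this article.

```python
import random
from collections import Counter


def extract_features(subject):
    """Bag-of-words features: lower-cased word -> number of occurrences."""
    return Counter(subject.lower().split())


def split_dataset(records, test_fraction=0.15, seed=42):
    """Shuffle a copy of the records and return (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic labelled records (illustrative only)
records = [(f"subject line {i}", "spam" if i % 4 == 0 else "ham") for i in range(20)]
train_set, test_set = split_dataset(records)

print(len(train_set), len(test_set))
print(extract_features("Cheap cheap medicines"))
```

Keeping the test records out of training matters for step 5: validating on data the classifier has already seen would overstate how well it works.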

Here is an email classifier built on Mahout. It uses a free email data set whose messages are already classified into spam and non-spam (termed ‘ham’).

We can use Mahout to build a Naive Bayes classifier to classify the emails.

  1. Download the spam and ham corpus archives (20030228_spam.tar.bz2 and 20030228_easy_ham.tar.bz2).
curl -O
curl -O
  2. Extract them; we will end up with two directories, spam and easy_ham.
tar xvf 20030228_spam.tar.bz2
tar xvf 20030228_easy_ham.tar.bz2
  3. Create a directory for the dataset.
mkdir dataset
  4. Move the spam and easy_ham directories into dataset.
mv easy_ham/ spam/ dataset/
  5. Copy the dataset to HDFS.
hadoop fs -put dataset
  6. Convert the dataset into a SequenceFile.
mahout seqdirectory -i dataset -o dataset-seq
  7. Convert the SequenceFile into vectors.
mahout seq2sparse -i dataset-seq -o dataset-vectors -lnorm -nv -wt tfidf
  8. Split the dataset into two: one for training and one for testing. The split is random, with 85% of the records used for training and 15% for testing.
mahout split -i dataset-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 15 --overwrite --sequenceFiles -xm sequential
  9. Train the classifier.
mahout trainnb -i train-vectors -el -o model -li labelindex -ow -c
  10. Test against the test dataset.
mahout testnb -i test-vectors -m model -l labelindex -ow -o testing-test -c
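
The `-wt tfidf` option in the vectorisation step asks for TF-IDF weighting. As a rough sketch of the idea, the snippet below uses the textbook formula (term count multiplied by the log of the inverse document frequency). Mahout's exact weighting and normalisation may differ in detail, and `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in term_freq.items()}
        )
    return vectors


# Hypothetical documents (illustrative only)
docs = ["cheap lottery", "cheap meeting", "cheap cheap report"]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

A word such as "cheap" that appears in every document gets a weight of zero, while rarer words such as "lottery" are weighted more heavily, so the classifier focuses on terms that actually distinguish the documents.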

Output Confusion Matrix

a      b      <--Classified as
382    1      |  383   a = easy_ham
1      69     |  70    b = spam

Interpretations of the Matrix

  • Out of 453 emails, 451 were classified correctly
  • 382 were ham and were correctly classified as ham (true negatives, taking spam as the positive class)
  • 69 were spam and were correctly classified as spam (true positives)
  • 1 record was spam but was classified as ham (false negative)
  • 1 record was ham but was classified as spam (false positive)

This matrix shows that the classifier classified the test data set with 99.5585% accuracy (451 of 453 emails).
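
The accuracy figure can be checked directly from the four cells of the confusion matrix. The sketch below does that, and also derives precision and recall for the spam class; the dictionary layout and helper names are assumptions chosen for this illustration, not Mahout output.

```python
# Confusion matrix cells keyed by (actual label, predicted label),
# using the counts reported above.
confusion = {
    ("easy_ham", "easy_ham"): 382,
    ("easy_ham", "spam"): 1,
    ("spam", "easy_ham"): 1,
    ("spam", "spam"): 69,
}


def accuracy(matrix):
    """Fraction of all records whose predicted label equals the actual label."""
    correct = sum(n for (actual, predicted), n in matrix.items() if actual == predicted)
    return correct / sum(matrix.values())


def precision(matrix, label):
    """Of the records predicted as `label`, the fraction that really are `label`."""
    predicted_as = sum(n for (_, predicted), n in matrix.items() if predicted == label)
    return matrix[(label, label)] / predicted_as


def recall(matrix, label):
    """Of the records that really are `label`, the fraction predicted as `label`."""
    actually = sum(n for (actual, _), n in matrix.items() if actual == label)
    return matrix[(label, label)] / actually


print(f"accuracy: {accuracy(confusion):.4%}")  # 99.5585%
print(f"spam precision: {precision(confusion, 'spam'):.4f}")
print(f"spam recall: {recall(confusion, 'spam'):.4f}")
```

Because ham emails greatly outnumber spam in this test set, per-class precision and recall are worth checking alongside overall accuracy.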
