Text Classification using Scikit-Learn (sklearn)

This notebook classifies emails received on a mass distribution group by their subject line, using hand-labelled categories as the target (supervised learning). The solution covers preprocessing (stopword removal and lemmatization with nltk) and feature extraction with a count vectorizer followed by a TF-IDF transformer. It is a vanilla implementation that can be extended to a variety of text classification problems.

Things that can be tweaked to improve accuracy…

  • Add more parameter configurations to GridSearchCV
  • Increase the number of folds used by GridSearchCV (the default is 3-fold cross validation).
  • Increase the size of the dataset (the current dataset has only about 500 emails).
  • The classes in the dataset are skewed, with widely varying proportions; the dataset can be balanced by oversampling the smaller classes, or the class weights can be adjusted if the classifier supports it (see the sketch after this list).
  • Try different classifiers or model stacking
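
As an illustration of the class-weighting option above, here is a minimal sketch (not part of the original notebook) that swaps in a class-weighted SGDClassifier inside the same kind of pipeline used later. The step names mirror the ones below; the weighting itself is standard sklearn behaviour.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# class_weight='balanced' re-weights each class inversely to its frequency,
# which counteracts the skewed category counts shown further down.
balanced_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(class_weight='balanced')),
])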

Quick Info…

  • Dataset: The dataset is a CSV with columns ‘Subject’ and ‘Category’ (the target variable) for about 500 emails. I’m not sharing the dataset because it consists of real emails from my inbox; replace it with your own dataset that has these two columns.
  • Pipeline and GridSearchCV: sklearn.pipeline.Pipeline and sklearn.model_selection.GridSearchCV are two of the most useful features of sklearn. Pipelines let you run a series of steps on data without manually creating intermediate objects or handling the parameters, return values and data hand-off between steps. GridSearchCV helps with parameter tuning and also performs cross validation (3-fold by default). Together, Pipelines and GridSearchCV remove a lot of code complexity and make a solution much more readable.
In [1]:
import numpy as np
import pandas as pd
from pprint import pprint
from time import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# Not using stemming, since it did not give a noticeable performance improvement.
#from nltk.stem.porter import *
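
The preprocessing below relies on NLTK's English stopword list and WordNet data. If they are not already on the machine, they can be fetched once with nltk.download (standard NLTK resource names; only needed if the corpora are missing):

nltk.download('stopwords')
nltk.download('wordnet')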
In [2]:
emails = pd.read_csv('emails.csv')
em = emails.dropna(axis=0)  # drop rows with a missing Subject or Category
em.sample(3)
Out[2]:
                                               Subject     Category
1103             3 BHK duplex house available for rent  Real-Estate
1259    [Lease Transfer] Hyundai i10 Sportz Car with …   Automobile
877                             Advice about Ooty trip   Travel-Fun
In [3]:
em['Category'].value_counts()
Out[3]:
Real-Estate       231
Automobile         86
Travel-Fun         45
Recommendation     43
Sale               39
Other              19
Relocation         16
Name: Category, dtype: int64
In [4]:
def pre_process_text(textArray):
    #If using stemming...
    #stemmer = PorterStemmer()
    wnl = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word
    processed_text = []
    for text in textArray:
        # Lowercase and tokenize on whitespace
        words_list = (str(text).lower()).split()
        # Drop stopwords and lemmatize whatever remains
        final_words = [wnl.lemmatize(word) for word in words_list if word not in stop_words]
        #If using stemming...
        #final_words = [stemmer.stem(word) for word in words_list if word not in stop_words]
        final_words_str = " ".join(final_words)
        processed_text.append(final_words_str)
    return processed_text

em['Subject'] = pre_process_text(em['Subject'])
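
For reference, the preprocessing lowercases the subject, drops English stopwords and lemmatizes what remains, so a call on a raw subject behaves roughly like this (illustrative, not output from the original run):

pre_process_text(['Advice about Ooty trip'])
# -> ['advice ooty trip']   ('about' is an English stopword and is dropped)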
In [5]:
categories = [ 'Real-Estate', 'Automobile', 'Travel-Fun', 'Recommendation', 'Sale', 'Other', 'Relocation']
In [6]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
]);
In [7]:
# Every additional parameter value multiplies the number of candidate models, and with it the training time.
# I'm running on a relatively slow computer, so I've kept the value lists short.

parameters = {
    'vect__max_df': (0.5, 1.0),#0.6, 0.7, 0.8, 0.9, 1.0),
    'vect__max_features': (None, 1000, 5000),#2000, 3000, 4000, 5000, 6000, 10000, 20000, 30000, 40000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),#, (1, 3)),  # unigrams or bigrams or trigrams
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.1, 0.01, 0.001),#, 0.0001, 0.00001, 0.000001, 0.0000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50)#, 100, 200, 300, 400, 500, 100),
}
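
Note: this notebook was run against an older scikit-learn. In releases 0.21 and later SGDClassifier no longer accepts n_iter, so the last grid entry has to be expressed as max_iter instead, roughly like this:

# For scikit-learn >= 0.21, where n_iter was removed in favour of max_iter
parameters_newer = dict(parameters)
del parameters_newer['clf__n_iter']
parameters_newer['clf__max_iter'] = (10, 50)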
In [8]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, refit=True)

print("Grid Search started\n---------------------------------------")
print("Pipeline:", [name for name, _ in pipeline.steps])
print("Grid Search Parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(np.array(em['Subject']), np.array(em['Category']))
print("done in %0.3fs\n----------------------------------------------" % (time() - t0))

print("Best Score: %0.3f\n-------------------------------------------" % grid_search.best_score_)
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Grid Search started
---------------------------------------
Pipeline: ['vect', 'tfidf', 'clf']
Grid Search Parameters:
{'clf__alpha': (0.1, 0.01, 0.001),
 'clf__n_iter': (10, 50),
 'clf__penalty': ('l2', 'elasticnet'),
 'tfidf__norm': ('l1', 'l2'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 1.0),
 'vect__max_features': (None, 1000, 5000),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 576 candidates, totalling 1728 fits
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 656 tasks      | elapsed:   16.2s
[Parallel(n_jobs=-1)]: Done 1656 tasks      | elapsed:   37.8s
[Parallel(n_jobs=-1)]: Done 1728 out of 1728 | elapsed:   39.5s finished
done in 40.082s
----------------------------------------------
Best Score: 0.898
-------------------------------------------
Best Parameters:
	clf__alpha: 0.001
	clf__n_iter: 50
	clf__penalty: 'l2'
	tfidf__norm: 'l2'
	tfidf__use_idf: False
	vect__max_df: 0.5
	vect__max_features: 1000
	vect__ngram_range: (1, 1)
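
Because refit=True, grid_search.best_estimator_ is the full pipeline refitted on all the data with the best parameters, so it can be used directly for predictions (as below) or persisted for later use. A quick sketch with joblib (the file name is arbitrary):

import joblib
joblib.dump(grid_search.best_estimator_, 'email_classifier.joblib')  # save the fitted pipeline
clf = joblib.load('email_classifier.joblib')                         # reload later; call clf.predict(...)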
In [9]:
test_set = [
    'RE: items for sale',
    'Coorg trip advice',
    'movie tickets for sale',
    'Advice needed for treatment of hair fall',
    'Moving out sale',
    'RE: Selling Honda City'
]
In [10]:
grid_search.best_estimator_.predict(np.array(test_set))
Out[10]:
array(['Sale', 'Travel-Fun', 'Travel-Fun', 'Recommendation', 'Relocation',
       'Automobile'], 
      dtype='<U14')
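
The grid search only reports a cross-validated accuracy; given the skewed categories, a per-class view is more informative. A hedged sketch (not in the original notebook) that holds out a stratified test split and prints a classification report for the best pipeline:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hold out 20% of the emails, stratified so the rare categories appear in both splits
X_train, X_test, y_train, y_test = train_test_split(
    np.array(em['Subject']), np.array(em['Category']),
    test_size=0.2, stratify=em['Category'], random_state=42)

best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)  # refit on the training split only
print(classification_report(y_test, best_model.predict(X_test)))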