Creating Word Vectors with word2vec, Reducing Dimensionality with t-SNE, and Visualizing with Bokeh


Word embeddings are a modern approach for representing text in natural language processing. Embedding algorithms like word2vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems.

In this blog, I will show how to train and load word embedding models for natural language processing applications in Python using Gensim, reduce their dimensionality with t-SNE, and visualize the result with Bokeh.

Let's code now 🙂

In this notebook, we create word vectors from a corpus of public-domain books, a selection from Project Gutenberg.

Load dependencies

In [ ]:
import nltk
from nltk import word_tokenize, sent_tokenize
import gensim
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE
import pandas as pd
from bokeh.io import output_notebook
from bokeh.plotting import show, figure
%matplotlib inline
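
If you have not used these NLTK resources before, you may need to download them first (a one-time step; both are standard NLTK resource identifiers):

In [ ]:
nltk.download('gutenberg')  # the Project Gutenberg book selection used below
nltk.download('punkt')      # tokenizer models behind word_tokenize and sent_tokenize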

Load data

In [ ]:
from nltk.corpus import gutenberg
In [ ]:
len(gutenberg.fileids())
In [ ]:
gutenberg.fileids()

Tokenize text

In [ ]:
# a convenient method that handles newlines, as well as tokenizing sentences and words in one shot
gberg_sents = gutenberg.sents()
In [ ]:
gberg_sents[4]
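
gutenberg.sents() does this work for us, but the same result can be approximated manually with the tokenizers imported above. A minimal sketch using 'austen-emma.txt', one of the corpus fileids (the two tokenizations may differ slightly):

In [ ]:
# read one book's raw text, split it into sentences, then into word tokens
raw = gutenberg.raw('austen-emma.txt')
sentences = [word_tokenize(sent) for sent in sent_tokenize(raw)]
sentences[4]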

Run word2vec

There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word).

window: (default 5) The maximum distance between a target word and the words around it.

min_count: (default 5) The minimum word count required for a word to be included in training; words that occur fewer times than this are ignored.

workers: (default 3) The number of threads to use while training.

sg: (default 0, i.e. CBOW) The training algorithm: either CBOW (0) or skip-gram (1).

In [ ]:
model = Word2Vec(sentences=gberg_sents, size=64, sg=1, window=10, min_count=5, seed=42)
In [ ]:
model.save('raw_gutenberg_model.w2v')
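
If you would rather train with CBOW, the only change is sg=0; as a rough rule of thumb, CBOW trains faster, while skip-gram tends to represent rarer words better. A minimal sketch (the filename here is just an example):

In [ ]:
model_cbow = Word2Vec(sentences=gberg_sents, size=64, sg=0, window=10, min_count=5, seed=42)
model_cbow.save('raw_gutenberg_cbow_model.w2v')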

Explore model

In [ ]:
# to skip re-training, load the saved model instead:
model = gensim.models.Word2Vec.load('raw_gutenberg_model.w2v')
In [ ]:
model['dog']
In [ ]:
len(model['dog'])
In [ ]:
model.most_similar('dog') # words ranked by cosine similarity
In [ ]:
model.similarity('father', 'dog')
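
Word vectors also support the classic analogy-style arithmetic, asking for words close to some vectors and far from others; on a corpus this small, expect noisier results than from models trained on web-scale text:

In [ ]:
model.most_similar(positive=['father', 'woman'], negative=['man'])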

Reduce word vector dimensionality with t-SNE

In [ ]:
model.wv.vocab
In [ ]:
len(model.wv.vocab)
In [ ]:
X = model[model.wv.vocab]
In [ ]:
tsne = TSNE(n_components=2, n_iter=1000) # scikit-learn requires n_iter >= 250; the default is 1000
In [ ]:
X_2d = tsne.fit_transform(X)
In [ ]:
X_2d[0:5]
In [ ]:
# create DataFrame for storing results and plotting
coords_df = pd.DataFrame(X_2d, columns=['x','y'])
coords_df['token'] = list(model.wv.vocab.keys())
In [ ]:
coords_df.head()
In [ ]:
coords_df.to_csv('raw_gutenberg_tsne.csv', index=False)

Visualize 2D representation of word vectors

In [ ]:
coords_df = pd.read_csv('raw_gutenberg_tsne.csv')
In [ ]:
_ = coords_df.plot.scatter('x', 'y', figsize=(12,12), marker='.', s=10, alpha=0.2)
In [ ]:
output_notebook() # output bokeh plots inline in notebook
In [ ]:
subset_df = coords_df.sample(n=5000)
In [ ]:
p = figure(plot_width=800, plot_height=800)
_ = p.text(x=subset_df.x, y=subset_df.y, text=subset_df.token)
In [ ]:
show(p)
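
Rendering every token as text can get crowded; one optional refinement is to plot points from a ColumnDataSource and attach a HoverTool so each token appears on mouse-over (a sketch reusing the same subset_df):

In [ ]:
from bokeh.models import ColumnDataSource, HoverTool

source = ColumnDataSource(subset_df)
p2 = figure(plot_width=800, plot_height=800)
_ = p2.circle(x='x', y='y', source=source, size=3, alpha=0.3)
p2.add_tools(HoverTool(tooltips=[('token', '@token')]))
show(p2)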

Please comment if there is any issue with the code or if you need any help understanding the flow.

The complete notebook is also available on my GitHub repository at:

https://github.com/abhibisht89/word2vec_notebook

Happy reading 🙂

 
