AN INTRODUCTION TO NLP IN PYTHON WITH SPACY, “Feel the power of C”

Natural Language Processing (NLP) is one of the most interesting sub-fields of data science, and data scientists are increasingly expected to be able to whip up NLP solutions.

spaCy, you say?

spaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal at Explosion AI. It’s implemented in Cython. If you are familiar with the Python data science stack, spaCy is your numpy for NLP: it’s reasonably low-level, but very intuitive and performant.

So, what can it do?

spaCy provides a one-stop-shop for tasks commonly used in any NLP project, including:

Tokenisation
Lemmatisation
Part-of-speech tagging
Entity recognition
Dependency parsing
Sentence recognition
Word-to-vector transformations
Many convenience methods for cleaning and normalising text

Let’s get started!

Installation:

pip install spacy

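To download all the data and models, run the following command after the installation:

python -m spacy.en.download all

Note that this is the spaCy 1.x form of the command. On newer releases, models are downloaded by name instead, for example:

python -m spacy download en_core_web_sm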

You are now all set to explore and use spaCy.

Let’s see some code in action:

First, we load spaCy’s pipeline, which by convention is stored in a variable named nlp. Declaring this variable will take a couple of seconds, as spaCy loads its models and data up-front to save time later. Note that here I am using the English language model.

We will load the default English model, which is english-core-web.

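import spacy

# "en" is the shortcut used at the time of writing; on spaCy 3.x, load the
# model by its full name instead, e.g. spacy.load("en_core_web_sm")
nlp = spacy.load("en")
doc = nlp("The big grey dog ate all of the chocolate,but fortunately he wasn't sick!")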

The text is now wrapped in a spaCy Doc object created by the English pipeline, and it is associated with a number of properties. The Doc object is now a vessel for NLP tasks on the text itself.

Tokenisation:

Tokenisation is a foundational step in many NLP tasks. Tokenising text is the process of splitting a piece of text into words, symbols, punctuation, spaces and other elements, thereby creating “tokens”.

[token.orth_ for token in doc]

Here we access each token’s .orth_ attribute, which returns a string representation of the token rather than a spaCy Token object.
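To get a feel for what "splitting into tokens" means without loading any model, here is a toy sketch in plain Python. It is not how spaCy tokenises text (spaCy applies language-specific rules and exceptions, and would, for example, split "wasn't" into "was" and "n't"). The name toy_tokenize and the regular expression are my own assumptions, for illustration only.

```python
import re

# Hypothetical pattern: a run of letters/digits (optionally followed by an
# apostrophe suffix such as 't), or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def toy_tokenize(text):
    """Split text into word-like and punctuation tokens (toy version)."""
    return TOKEN_PATTERN.findall(text)


sentence = "The big grey dog ate all of the chocolate,but fortunately he wasn't sick!"
tokens = toy_tokenize(sentence)
print(tokens)
```

Notice that the comma in "chocolate,but" and the final "!" come out as tokens of their own, even though there is no space around them in the text.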

Lemmatisation:

A related task to tokenisation is lemmatisation. Lemmatisation is the process of reducing a word to its base form, its mother word if you like.

practice = "practice practiced practicing"
nlp_practice = nlp(practice)
[word.lemma_ for word in nlp_practice]
out:
 ['practice', 'practice', 'practice']
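The idea of mapping every inflected form back to one base form can be sketched with a simple lookup table. This is only an illustration of the concept, not spaCy's lemmatiser: the table LEMMA_TABLE and the helper toy_lemma are hypothetical and hand-written, whereas a real lemmatiser relies on much larger tables and rules.

```python
# Hypothetical, hand-written lookup table (illustration only).
LEMMA_TABLE = {
    "practiced": "practice",
    "practicing": "practice",
    "ate": "eat",
    "dogs": "dog",
}


def toy_lemma(word):
    """Return the base form from the table, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


lemmas = [toy_lemma(w) for w in "practice practiced practicing".split()]
print(lemmas)
```

Words that are missing from the table simply fall through unchanged, which is exactly where a toy approach like this stops being useful.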

POS Tagging:

Part-of-speech tagging is the process of assigning grammatical properties (e.g. noun, verb, adverb, adjective etc.) to words. Words that share the same POS tag tend to follow a similar syntactic structure and are useful in rule-based processes.

doc2 = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house")
pos_tags = [(i, i.tag_) for i in doc2]
out:
  [(Conor, 'NNP'),('s, 'POS'),(dog, 'NN'),('s, 'POS'),(toy, 'NN'),(was, 'VBD'),(hidden, 'VBN'),(under, 'IN'),(the, 'DT'),(man, 'NN'),('s, 'POS'),(sofa, 'NN'),(in, 'IN'),(the, 'DT'),(woman, 'NN'),('s, 'POS'),(house, 'NN')]
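As a small example of the kind of rule-based processing that tags make possible, the sketch below keeps only the nouns and counts how often each tag occurs. To keep it runnable without a model, I have copied the (word, tag) pairs from the output above as plain strings. With spaCy you could build the same kind of list with (token.text, token.tag_). The variable names are my own.

```python
from collections import Counter

# (word, tag) pairs copied from the output above, as plain strings.
tagged = [
    ("Conor", "NNP"), ("'s", "POS"), ("dog", "NN"), ("'s", "POS"),
    ("toy", "NN"), ("was", "VBD"), ("hidden", "VBN"), ("under", "IN"),
    ("the", "DT"), ("man", "NN"), ("'s", "POS"), ("sofa", "NN"),
    ("in", "IN"), ("the", "DT"), ("woman", "NN"), ("'s", "POS"),
    ("house", "NN"),
]

# Simple rule: every tag that starts with "NN" marks some kind of noun.
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# How often does each tag occur in the sentence?
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(3))
```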

Entity recognition:

Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc.

wiki_obama = ("Barack Obama is an American politician who served as "
              "the 44th President of the United States from 2009 to 2017. "
              "He is the first African American to have served as president, "
              "as well as the first born outside the contiguous United States.")
nlp_obama = nlp(wiki_obama)
[(i, i.label_, i.label) for i in nlp_obama.ents]
out:
    [(Barack Obama, 'PERSON', 346),(American, 'NORP', 347),(the United States, 'GPE', 350),(2009 to 2017, 'DATE', 356),(first, 'ORDINAL', 361),(African, 'NORP', 347),(American, 'NORP', 347),(first, 'ORDINAL', 361),(United States, 'GPE', 350)]
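Once you have the entities, a common next step is to group them by category. Here is a minimal sketch that does this with the (text, label) pairs from the output above, copied as plain strings so that it runs without a model. With spaCy, the pairs could come from (ent.text, ent.label_). The variable names are my own.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above, as plain strings.
entities = [
    ("Barack Obama", "PERSON"), ("American", "NORP"),
    ("the United States", "GPE"), ("2009 to 2017", "DATE"),
    ("first", "ORDINAL"), ("African", "NORP"), ("American", "NORP"),
    ("first", "ORDINAL"), ("United States", "GPE"),
]

# Collect the entity texts under their label, keeping the original order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```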

Sentence identifier:

We can also detect sentence boundaries with spaCy, which is really helpful in many NLP tasks.

for ix, sent in enumerate(nlp_obama.sents, 1):
    print("Sentence number {}: {}".format(ix, sent))
out:
    Sentence number 1: Barack Obama is an American politician who served as the 44th President of the United States from 2009 to 2017.
    Sentence number 2: He is the first African American to have served as president, as well as the first born outside the contiguous United States.
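To see why it is nice to have a library do this for you, compare it with a naive splitter that simply breaks after ".", "!" or "?". This is a toy sketch with a hypothetical helper, naive_sentences, and it is not how spaCy finds sentences (by default spaCy uses its dependency parse). It copes with the Obama text but trips over an abbreviation.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (toy version)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


wiki_text = (
    "Barack Obama is an American politician who served as "
    "the 44th President of the United States from 2009 to 2017. "
    "He is the first African American to have served as president, "
    "as well as the first born outside the contiguous United States."
)

sents = naive_sentences(wiki_text)
for ix, sent in enumerate(sents, 1):
    print("Sentence number {}: {}".format(ix, sent))

# The naive rule wrongly splits after the abbreviation "Dr."
broken = naive_sentences("Dr. Smith arrived. He sat down.")
print(broken)
```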

There is a lot more that we can do with spaCy, and I will post about that in upcoming blogs. I hope this helps you get started with NLP using spaCy.
