Thursday, January 9, 2025

Level Up: spaCy NLP for the Win

Kimberly is a speaker for ODSC East 2020! Be sure to check out her talk, “Level Up: Fancy NLP with Straightforward Tools,” at this upcoming event!


Natural language processing (NLP) is a branch of artificial intelligence in which computers extract information from written or spoken human language.  This field has experienced a massive rise in popularity over the years, not only among academic communities but also in industry settings. Because unstructured text makes up so much of the data we collect today (e.g. emails, text messages, and even this blog post), many practitioners regularly use NLP at the workplace and require straightforward tools to reliably parse through substantial amounts of documents.  The open-source library spaCy meets these exact demands by processing text quickly and accurately, all within a simplified framework.

Released in 2015, spaCy was initially created to help small businesses better leverage NLP. Its practical design offers users a streamlined approach for accomplishing necessary NLP tasks, and it assumes a more pragmatic stance toward NLP than traditional libraries like NLTK, which were developed with a more research-focused, exploratory intention. spaCy can be quite flexible, however, as it allows more experienced users the option of customizing just about any of its tools. spaCy is a Python package, but the “Cy” in spaCy indicates that Cython powers many of the underlying computations. This makes spaCy incredibly fast, even for more complicated processes. I will illustrate a selection of spaCy’s core functionality in this post and will end by implementing these techniques on sample restaurant reviews.

spaCy Basics

 

Installation

To begin using spaCy, first install it from the command line with pip:

pip install spacy

You will also need access to at least one of spaCy’s language models. spaCy can analyze text in various languages including English, German, Spanish, and French, each with its own model. We’ll be working with English text for this simple analysis, so go ahead and grab spaCy’s small English language model, again through the command line:

python -m spacy download en_core_web_sm

Tokenization

Now processing text boils down to loading your language model and passing strings to it directly.  Working within a Python or a Jupyter Notebook interface, let’s see what spaCy makes of one example review:

import spacy

nlp = spacy.load('en_core_web_sm')

review = "I’m so happy I went to this awesome Vegas buffet!"

doc = nlp(review)

The resulting spaCy document is a rich collection of tokens that have been annotated with many attributes including parts of speech, lemmas, dependencies, and named entities. To see this in action, loop over each token in the document and print out the part of speech, lemma, and whether or not this token is a so-called stop word.

for token in doc:

    print(token.text, token.pos_, token.lemma_, token.is_stop)


We see in the sixth line that spaCy successfully identified “went” as a verb, that “go” is the correct lemma for this word, and that “went” is not traditionally considered a stop word.  

Before moving on to dependency parsing, note that spaCy tokenizes text in an entirely nondestructive manner; that is, each spaCy document contains a collection of token objects, which have their own attributes.  The underlying text does not change, and calling doc.text allows us to reconstruct the original content exactly. 
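As a quick check of this nondestructive behavior, here is a minimal sketch. It assumes a recent spaCy release and uses a blank English pipeline (spacy.blank), which contains only the tokenizer, so that it runs without the downloaded model; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this
# check, so no trained model needs to be downloaded.
nlp = spacy.blank("en")

review = "I'm so happy I went to this awesome Vegas buffet!"
doc = nlp(review)

# The document keeps the original string unchanged.
print(doc.text == review)

# Every token stores its trailing whitespace, so the pieces rejoin losslessly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == review)

# token.idx is the character offset of the token in the original text.
for token in doc:
    start = token.idx
    end = start + len(token.text)
    print(token.text, (start, end), review[start:end] == token.text)
```

Because each token points back into the original string through its offset, annotations can always be mapped onto the source text.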

spaCy does not explicitly break the original text into a list, but tokens may be accessed by index span:

print(doc[:5])
print(doc[-5:-1])

spaCy also performs automatic sentence detection.  Iterating over the generator doc.sents yields each recognized sentence.
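With the full language model, sentence boundaries come from the dependency parse. The sketch below swaps in spaCy's rule-based sentencizer component on a blank pipeline instead, so that it runs without a model download. It assumes spaCy v3, where components are added by name, and the example text is made up.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline plus the rule-based "sentencizer"
# stands in here for the parser-based sentence detection of the full model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "The buffet was awesome. I would happily go back!"
doc = nlp(text)

# doc.sents is a generator of Span objects, one per detected sentence.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

With en_core_web_sm loaded as shown earlier, the same loop over doc.sents applies unchanged.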

Dependencies

Natural language processing presents a host of unique challenges, with syntactic and semantic issues certainly among them.  Consider when I write the string “bear” – am I talking about a furry animal or the struggles I need to endure? Part-of-speech tagging might help in this case, but what about “bat” – time to play baseball or watch out for that nocturnal animal?  To further delineate these tricky cases, spaCy provides syntactic parsing to show word usage, thus creating a dependency tree.  spaCy identifies each token’s dependencies when text passes through the language model, so let’s check the dependencies in our restaurant review:

for token in doc:

  print(token.text, token.dep_)

This seems somewhat interesting, but visualizing these relationships reveals an even more comprehensive story.  First load a submodule called displaCy to help with the visualization:

from spacy import displacy

Then ask displaCy to render the dependency tree of our spaCy document:

displacy.render(doc)


Now, that’s impressive!  Not only has spaCy picked up on the parts of speech, but it has also determined which words modify each other and how they do so.  You can even traverse this parse tree by using properties like token.children, token.head, token.lefts, token.rights, etc.  
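To make these traversal properties concrete, here is a small sketch. It assumes spaCy v3 and builds a hand-annotated document rather than parsing one with the model, so the heads and dependency labels below are supplied by hand for illustration (they are not model output).

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated fragment of the review (an assumption for illustration).
# heads holds the absolute index of each word's syntactic head.
words = ["I", "went", "to", "this", "awesome", "Vegas", "buffet"]
heads = [1, 1, 1, 6, 6, 6, 2]
deps = ["nsubj", "ROOT", "prep", "det", "amod", "compound", "pobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

buffet = doc[6]
print("head:", buffet.head.text)
print("children:", [t.text for t in buffet.children])
print("lefts:", [t.text for t in buffet.lefts])
print("rights:", [t.text for t in buffet.rights])
print("ancestors:", [t.text for t in buffet.ancestors])
```

On a document produced by nlp(review), the same attributes are available on every token without any manual annotation.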

This navigation is particularly useful for assigning adjectives to the nouns they modify.  In our example sentence, the word “awesome” describes “buffet.” spaCy accurately labels “awesome” as an adjectival modifier (amod) and also detects its relationship to “buffet”:

for token in doc:

  if token.dep_ == 'amod':

    print(f"ADJ MODIFIER: {token.text} --> NOUN: {token.head}")


Named Entity Recognition

NLP practitioners often seek to identify key items and individuals in unstructured text.  This task, known as named entity recognition, executes automatically when text funnels through the language model.  To see which tokens spaCy identifies as named entities in our restaurant review, simply cycle through doc.ents:

for ent in doc.ents:

    print(ent.text, ent.label_)


spaCy recognizes “Vegas” as a named entity, but what does the label “GPE” mean? If you are ever unsure what one of the abbreviations means, just ask spaCy to explain it to you:

spacy.explain("GPE")


Excellent – spaCy correctly identifies “Vegas” as a city because this word is a shortened form of Las Vegas, Nevada, USA.

Furthermore, displaCy’s render method can highlight named entities if the style argument is specified:

displacy.render(doc, style='ent')


displaCy color codes the named entities by type.  Consider this more complicated example with four different kinds of entities; displaCy provides unique colors to each:

document = nlp(
    "One year ago, I visited the Eiffel Tower with Jeff in Paris, France."
)

displacy.render(document, style='ent')


Here, spaCy explains “FAC” as “Buildings, airports, highways, bridges, etc.”

 

Case Study: Restaurant Reviews

Let’s now leverage the techniques discussed thus far in a small case study.  Continuing on the theme of restaurant reviews, we will examine this Kaggle dataset, consisting of 1,000 reviews labeled by sentiment.  Exactly 500 positive and 500 negative reviews make up this perfectly balanced set, and these short reviews typically consist of just one sentence each.  

Begin by loading this file into a pandas dataframe called df and rename the columns; the first five rows should appear as:

The text to be processed lives in the “text” column, so pass this entire pandas series into spaCy’s small English language model.  

 

Pipelines

Previously, we submitted a single string of text to the language model. We will now use spaCy’s pipe method in order to process multiple documents in one go.  Note that pipe returns a generator that must be converted to a list before storing in the dataframe.

df['spacy_doc'] = list(nlp.pipe(df.text))

spaCy’s default pipeline includes a tokenizer, a tagger to assign parts of speech and lemmas to each token, a parser to detect syntactic dependencies, and a named entity recognizer.  You may customize or remove each of these components, and you can also add extra steps to the pipeline as needed. See the spaCy documentation for more details.
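As a rough illustration of adding an extra step, the sketch below registers a tiny custom component and appends it to a pipeline. It assumes spaCy v3 (the decorator-based component API) and uses a blank pipeline so it runs without a model; the component name exclaim_counter and the attribute n_exclaims are hypothetical names chosen for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute, stored under doc._.n_exclaims.
# force=True lets the cell be re-run without a "already exists" error.
Doc.set_extension("n_exclaims", default=0, force=True)


@Language.component("exclaim_counter")
def exclaim_counter(doc):
    """Count exclamation marks and store the result on the document."""
    doc._.n_exclaims = sum(token.text == "!" for token in doc)
    return doc  # a component must return the Doc it received


# Assumption: a blank pipeline stands in for the loaded language model.
nlp = spacy.blank("en")
nlp.add_pipe("exclaim_counter", last=True)
print(nlp.pipe_names)

docs = list(nlp.pipe(["Great food!", "Slow service."]))
print([doc._.n_exclaims for doc in docs])
```

The same add_pipe call works on a loaded model, and components that are not needed can be left out, for instance by passing a disable list to spacy.load or by calling nlp.remove_pipe with the component's name.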

 

Parts of Speech by Sentiment

A parsed version of each review now exists within the dataframe, so let’s do a quick warm-up exercise to explore the richness of these spaCy documents.  Grouping the information by sentiment label, what are the most common adjectives used in positive versus negative reviews? (A double list comprehension followed by a counter works well for this task.)

from collections import Counter

# positive_reviews and negative_reviews are assumed to be the subsets of df
# holding the reviews with each sentiment label
pos_adj = [token.text.lower() for doc in positive_reviews.spacy_doc
           for token in doc if token.pos_ == 'ADJ']

neg_adj = [token.text.lower() for doc in negative_reviews.spacy_doc
           for token in doc if token.pos_ == 'ADJ']

Counter(pos_adj).most_common(10)

Counter(neg_adj).most_common(10)


Nice!  This appears about as expected.  The word “good” tops both lists, but it seems likely that several negative reviews might mention something that was “not good.”  We have not incorporated negations yet, but spaCy certainly provides means for doing so through dependency parsing, customized tokenization, or with negspaCy.
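One possible way to fold in negation through the dependency parse is sketched below. This is a simple heuristic of my own for illustration, not the approach used to produce the lists in this post: an adjective is treated as negated when it, or the word it attaches to, has a child labeled neg (the label spaCy's English models use for negation modifiers). The function names are hypothetical.

```python
def is_negated(token):
    """Heuristic: True if the token, or its head, has a child labeled 'neg'."""
    candidates = [token, token.head]
    return any(child.dep_ == "neg" for t in candidates for child in t.children)


def adjectives_with_negation(doc):
    """Return lowercase adjectives, prefixing 'not ' to those that look negated."""
    results = []
    for token in doc:
        if token.pos_ == "ADJ":
            word = token.text.lower()
            results.append(f"not {word}" if is_negated(token) else word)
    return results
```

Applied to each parsed review in df.spacy_doc, this would count “not good” separately from “good”. It will miss more complex constructions, which is where a dedicated tool like negspaCy may do better.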

It seems customers use adjectives to describe the service they received slightly more frequently than the restaurant’s food: “friendly” vs “delicious” in the positive reviews, “slow” vs “bland” in the negative ones.  Does this mean reviewers mostly talk about the waitstaff? Let’s check the nouns. With similar code we get:
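The “similar code” can be generalized into a small helper that works for any part-of-speech tag. This is a sketch mirroring the adjective comprehension above; the function name most_common_by_pos is hypothetical.

```python
from collections import Counter


def most_common_by_pos(docs, pos, n=10):
    """Count lowercase token texts with the given coarse POS tag across docs."""
    words = [
        token.text.lower()
        for doc in docs
        for token in doc
        if token.pos_ == pos
    ]
    return Counter(words).most_common(n)
```

For the nouns, this would be called as most_common_by_pos(positive_reviews.spacy_doc, "NOUN") and again with negative_reviews.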

Nope – good or bad, people overwhelmingly care about the food. Amazingly, both lists match exactly on the top four nouns, but after that customers more often mention the “staff” or the “menu” when they are pleased and tend to focus on “minutes” spent and lack of “flavor” when disappointed.

 

Dependency Parsing

Comparing the nouns and adjectives by sentiment certainly adds insight, but we probably care about the specific adjectives used to characterize each noun even more.  That is, how do people describe the food? What are they saying about the service? Navigating spaCy’s dependency tree provides these valuable details.

For a given noun of interest, extract each of the adjectival modifiers that are among its children tokens.  Consider an example pandas series of spaCy documents (ser) along with a particular noun string (noun_str).  A simple way to produce a list of adjectives modifying this noun follows:

amod_list = []

for doc in ser:
    for token in doc:
        if token.text == noun_str:
            for child in token.children:
                # dep_ (with the underscore) is the string label; 'amod' must be quoted
                if child.dep_ == 'amod':
                    amod_list.append(child.text.lower())

 

Since sentiment labels are supplied in this dataset, we collect these adjectives from positive and negative reviews separately.  
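To run this for each sentiment group, the loop can be wrapped in a reusable function. The sketch below is one way to do that; the function names are hypothetical, and unlike the loop above it matches the noun case-insensitively, which is an assumption on my part.

```python
from collections import Counter


def adjectives_for(docs, noun_str):
    """Collect lowercase adjectival modifiers (amod) of each noun_str token."""
    target = noun_str.lower()
    return [
        child.text.lower()
        for doc in docs
        for token in doc
        if token.text.lower() == target  # case-insensitive match (assumption)
        for child in token.children
        if child.dep_ == "amod"
    ]


def top_adjectives(docs, noun_str, n=10):
    """Return the n most frequent adjectival modifiers of noun_str."""
    return Counter(adjectives_for(docs, noun_str)).most_common(n)
```

Calling top_adjectives(positive_reviews.spacy_doc, "food") and top_adjectives(negative_reviews.spacy_doc, "food") then gives one ranked list per sentiment.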

So, what are customers saying about the “food”?

A few negation issues crop up again, but overall this provides a great perspective into what customers like and dislike about “food” in these reviews. Let’s check “service” as well:

Adjectives like “good” or “poor” may not give us much insight, but we can certainly imagine what went wrong when a customer describes their service as “rude” or “slow.”  This dataset contains a mere 1,000 reviews from various establishments. Given a larger corpus of feedback from a single place of business, spaCy could be applied to home in on competitive edge or customer pain points, all with a highly automated approach.

 

Conclusion

spaCy provides an easy-to-use framework for getting started with NLP.  Tokenization, lemmatization, dependency parsing, and named entity recognition occur by simply passing text to spaCy’s basic language model, and you can leverage the resulting token attributes to more broadly understand any set of documents.  spaCy is quite fast due to its Cython infrastructure and often proves performant in terms of accuracy.

spaCy encapsulates many more tools that were not covered in this post including techniques to: 

You can also customize just about every one of spaCy’s components, from how it performs tokenization to the steps included in its processing pipeline.

Hopefully this introduction has inspired you to give spaCy a try. You can check out the code that powers each example in this post on Google Colab. If you enjoyed this topic, learn more about spaCy and other exciting natural language processing tools in my upcoming talk at ODSC East, “Level Up: Fancy NLP with Straightforward Tools”!



Kimberly Fessel is a Senior Data Scientist and Instructor at Metis’s immersive data science bootcamp in New York City.  Prior to joining Metis, Kimberly worked in digital advertising where she focused on helping clients understand their customers by leveraging unstructured data with modern NLP techniques.   Her website is: http://kimberlyfessel.com/ and her LinkedIn is https://www.linkedin.com/in/kimberlyfessel/.
