Friday, January 10, 2025

Implementing Transformers in NLP Under 5 Lines of Code

This article was published as a part of the Data Science Blogathon

Introduction

Today, we will see a gentle introduction to the transformers library for executing state-of-the-art models for complex NLP tasks.

Applying state-of-the-art Natural Language Processing models has never been more straightforward. Hugging Face has released a powerful library called transformers that allows us to run a broad class of state-of-the-art NLP models in a straightforward way.

Today we are going to install and use the transformers library for diverse tasks such as:

  • Text Classification
  • Question-Answering
  • Masked Language Modeling
  • Text Generation
  • Named Entity Recognition
  • Text Summarization
  • Translation

So before we start looking at each of the implementations for the various tasks, let’s install the transformers library. In my case, I am working on macOS; when attempting to install directly with pip, I got an error, which I resolved by first installing the Rust compiler as follows:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Following that, I installed transformers directly with pip as follows:

$ pip install transformers

Great, with the two preceding steps, the library should now be installed correctly on your machine. So let’s start with the individual implementations; let’s go for it!
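
Before moving on, it can help to confirm that the installation is visible to the Python interpreter you are using. The following is a minimal sketch, not part of the original setup: the helper name installed_version is hypothetical, and checking for torch is an assumption, since the pipelines need a deep learning backend and PyTorch is one common choice.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "torch" is an assumed backend; replace it with the framework you actually use.
for name in ("transformers", "torch"):
    version = installed_version(name)
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

The check reads package metadata only, so it does not import the library itself and runs quickly even in a fresh environment.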

Text Classification

The text classification task consists of assigning a given text to a particular class from a predefined set of classes. Sentiment analysis is the most commonly addressed problem in text classification.

To run a text classification example with the transformers library, we only require two arguments, task and model, which specify the type of problem to be addressed and the model to be applied.

Given the great variety of models hosted in the Hugging Face repository, we can start playing with any of them. The Hugging Face model hub lists the models available for text classification tasks.

We can see the implementation of the bert-base-multilingual-uncased-sentiment model for sentiment analysis in the following code:

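from transformers import pipeline

# Sentence to classify (a plain string; no f-string is needed here)
st = "I do not like horror movies"
# Build a text-classification pipeline backed by a sentiment model from the Hugging Face repository
seq = pipeline(task="text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
print(f"Result: {seq(st)}")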

The output is:

[Figure: transformers pipeline output]

It is necessary to examine each model’s documentation to know what datasets it was trained on and what kind of classification it performs. Another great benefit of transformers is that if you have your own model hosted in the Hugging Face repository, you can use it through this library as well.
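
As an illustration of why the model documentation matters: the sentiment model used above is described as predicting a review rating in stars, so its labels are assumed here to look like "1 star" through "5 stars". The sketch below turns such a label into a number and a coarse sentiment. The helper names and the thresholds are hypothetical choices, and the example prediction is a placeholder, not real model output.

```python
def stars_from_label(label):
    """Convert a label such as '2 stars' into the integer 2."""
    return int(label.split()[0])


def to_sentiment(stars):
    """Map a 1-5 star rating onto a coarse sentiment (thresholds are an assumption)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Placeholder prediction in the list-of-dicts shape returned by the pipeline;
# the score is invented for illustration only.
example = [{"label": "2 stars", "score": 0.5}]
for prediction in example:
    stars = stars_from_label(prediction["label"])
    print(stars, to_sentiment(stars))
```

With a different model the label set will differ, so a mapping like this has to be adapted to whatever its documentation specifies.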

Question-Answering

Extractive Question Answering is about finding the answer to a question within a given context. One of the most representative datasets for this task is the Stanford Question Answering Dataset (SQuAD) [2].

The transformers pipeline needs a context and a question. In the code below, the context is a paragraph from the Alice in Wonderland book; the question refers to an event described in that paragraph.

from transformers import pipeline
sentence = r"""
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations 
in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and 
stupid), whether the pleasure of making a daisy chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
"""
# Build the question-answering pipeline, then query it with a question and the context
qa = pipeline("question-answering", model="csarron/roberta-base-squad-v1")
result = qa(question="Who was reading a book?", context=sentence)
print(f"Answer: {result['answer']}")

The output is:

[Figure: question answering output]

For this task, we chose the roberta-base-squad-v1 model; nevertheless, many alternative models are available for this task, and it is worth exploring some of them.
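
Because the task is extractive, the answer is a span of the context. Besides answer, the question-answering pipeline result also carries score, start, and end fields, where start and end are character offsets into the context. The sketch below shows one way to use those offsets; the helper names are hypothetical, and the result dictionary is built by hand for illustration rather than produced by a model.

```python
def span_matches(context, result):
    """Check that the reported offsets really point at the answer text."""
    return context[result["start"]:result["end"]] == result["answer"]


def answer_with_window(context, result, margin=30):
    """Return the answer together with some surrounding characters for inspection."""
    start = max(result["start"] - margin, 0)
    end = min(result["end"] + margin, len(context))
    return context[start:end]


context = "Alice was beginning to get very tired of sitting by her sister on the bank."
answer = "her sister"
begin = context.index(answer)
# Hand-built result in the shape the pipeline returns; the score is a placeholder.
result = {"answer": answer, "start": begin, "end": begin + len(answer), "score": 0.5}

print(span_matches(context, result))
print(answer_with_window(context, result, margin=10))
```

Looking at the text around the span is a quick way to judge whether a short answer was taken from the right part of the passage.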

Masked Language Modeling

The Masked Language Modeling task is about masking tokens of a given text sentence with a masking token, where the model is required to fill each mask with a suitable token.

For a task of this type, the transformers pipeline only needs the name of the task (in this example, fill-mask) and then the text sequence in which the token to be masked is specified.

In the following code, we can see the implementation:

from transformers import pipeline
nlp = pipeline("fill-mask")
nlp(f"{nlp.tokenizer.mask_token} movies are often very scary to people")

The output is:

[Figure: masked language modelling output]

The result is presented as a list of tokens with their corresponding attributes. In this case, the token with the highest score is Horror, and the token with the lowest score is Action.
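
Each candidate returned by the fill-mask pipeline is a dictionary that includes score, token_str, and sequence fields. A small helper can rank the candidates and keep only the fields of interest. This is a sketch under assumptions: the function name rank_candidates is hypothetical, and the scores in the example are placeholders chosen for illustration, not the values from the run above.

```python
def rank_candidates(predictions, top_k=3):
    """Sort fill-mask candidates by score (highest first) and keep the top_k."""
    ordered = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # strip() removes the leading space that some tokenizers attach to tokens
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in ordered[:top_k]]


# Placeholder candidates in the pipeline's output shape; scores are invented.
example = [
    {"token_str": " Action", "score": 0.1, "sequence": "Action movies are often very scary to people"},
    {"token_str": " Horror", "score": 0.3, "sequence": "Horror movies are often very scary to people"},
]

for token, score in rank_candidates(example):
    print(token, score)
```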

Text Generation

Text generation refers to producing a syntactically and semantically correct piece of text given a defined context.

The pipeline initialization requires the type of task and the model to be used, as in the earlier tasks.

Lastly, the pipeline instance needs two parameters: the context (or seed) and the length of the sequence to be generated, max_length. The number of sequences to generate, num_return_sequences, is an optional parameter.

The following code shows the implementation of the GPT-2 model for the generation of 5 text sequences:

from transformers import pipeline
nlp = pipeline(task='text-generation', model='gpt2')
nlp("My name is Fernando, I am from Mexico and", max_length=30, num_return_sequences=5)

The output is:

[Figure: text generation output]
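
The text-generation pipeline returns a list of dictionaries, one per generated sequence, each with a generated_text field that by default begins with the seed text. To look only at what the model added, the seed can be removed. The following is a minimal sketch; the helper name continuations is hypothetical, and the example outputs are invented strings, not GPT-2 output.

```python
def continuations(prompt, outputs):
    """Return only the newly generated part of each sequence."""
    texts = []
    for out in outputs:
        text = out["generated_text"]
        # Drop the seed only when the sequence really starts with it
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text.strip())
    return texts


prompt = "My name is Fernando, I am from Mexico and"
# Invented outputs in the pipeline's list-of-dicts shape, for illustration only.
outputs = [
    {"generated_text": prompt + " I like to read."},
    {"generated_text": prompt + " I enjoy cooking."},
]

for text in continuations(prompt, outputs):
    print(text)
```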

Named Entity Recognition

The Named Entity Recognition task refers to the assignment of a class to each token of a given text sequence.

For the implementation, it is only necessary to pass the task identifier to the pipeline initialization. Afterward, the object receives only a text sequence.

In the following code, we can see the execution:

from transformers import pipeline
seq = r"""
I am Fernando, and I live in Mexico. I am a Machine Learning Engineer, and I work at Hitch.
"""
nlp = pipeline(task='ner')
for item in nlp(seq):
    print(f"{item['word'], item['entity']}")

The output is:

[Figure: named entity recognition output]

For this case, the classes are:

  • I-MISC, Miscellaneous entity
  • I-PER, Person’s name
  • I-ORG, Organisation
  • I-LOC, Location
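
Since the pipeline labels tokens one by one, it is often handy to collect the tokens that share a class. The sketch below groups items by their entity field, using the same word and entity keys printed in the code above. The function name group_by_entity is hypothetical, and the sample items are written by hand to mirror the example sentence; they are not actual model output.

```python
from collections import defaultdict


def group_by_entity(items):
    """Collect the words of token-level NER items under their entity label."""
    groups = defaultdict(list)
    for item in items:
        groups[item["entity"]].append(item["word"])
    return dict(groups)


# Hand-written items in the pipeline's output shape, for illustration only.
sample = [
    {"word": "Fernando", "entity": "I-PER"},
    {"word": "Mexico", "entity": "I-LOC"},
    {"word": "Hitch", "entity": "I-ORG"},
]

for label, words in group_by_entity(sample).items():
    print(label, words)
```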

Text Summarization

The Text Summarization task refers to the extraction of a summary from a given text. To initialize the pipeline, the task identifier, summarization, is required.

For the implementation, only the text and the maximum and minimum lengths of the sequence to be generated are required as arguments.

The piece of text used to run the code is taken from the Wikipedia article on machine learning.

In the following code, we can see the execution:

from transformers import pipeline
txt = r'''
Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning is an important component of the growing field of data 
science . Machine learning, deep learning, and neural networks are all sub-fields of artificial intelligence . As big data continues to grow, the market demand for data scientists will increase, requiring them to assist in the identification of 
the most relevant business questions. Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make 
decisions with minimal human intervention.
'''
nlp = pipeline(task='summarization')
nlp(txt, max_length=130, min_length=30)

The output is:

[Figure: text summarization output]

The summary produced by the model is accurate with respect to the input text. As with the earlier tasks, we can use several models for text summarization, such as BART, DistilBart, and Pegasus.
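
When comparing summarization models, a rough measure of how much a summary shortens its input can be useful. Note that max_length and min_length are measured in tokens, not words, so the word-based ratio below is only an approximation. The helper name compression_ratio is hypothetical, and the strings in the example are toy inputs.

```python
def compression_ratio(source, summary):
    """Return the summary length divided by the source length, counted in words."""
    source_words = len(source.split())
    if source_words == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / source_words


# Toy strings: a ten-word source and a two-word summary.
source = "one two three four five six seven eight nine ten"
summary = "one two"
print(f"{compression_ratio(source, summary):.2f}")
```

With real pipeline output, the summary string would come from the summary_text field of the first returned item.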

Translation

The translation task refers to translating a text written in a given language into a different language. The transformers library allows the use of state-of-the-art models for translation, such as T5, in a simple way.

The pipeline is initialized with the identifier of the task to be solved, which refers to the source language and the target language; for example, to translate from English to French, the identifier is translation_en_to_fr.

Finally, the resulting object receives the text to be translated as an argument.

In the following code, we can see the implementation of the translator from English to French:

from transformers import pipeline
txt = r'''
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy
'''
nlp = pipeline(task='translation_en_to_fr')
print(f"{nlp(txt)[0]['translation_text']}")

The output is:

[Figure: translation output]
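
The task identifier follows the pattern translation_xx_to_yy, with the source language code first and the target language code second. A small helper makes the order explicit and avoids mixing the two up. This is a sketch: the function name translation_task is hypothetical, and whether a given language pair actually works depends on the model behind the pipeline.

```python
def translation_task(source, target):
    """Build a task identifier such as 'translation_en_to_fr' from two language codes."""
    for code in (source, target):
        if not code.isalpha():
            raise ValueError(f"unexpected language code: {code!r}")
    return f"translation_{source.lower()}_to_{target.lower()}"


# English as the source language, French as the target language
print(translation_task("en", "fr"))
```

The resulting string can be passed as the task argument when the pipeline is created.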

Conclusion

In this tutorial blog, we saw how to use the transformers library to run state-of-the-art NLP models in a straightforward way.

We also covered the implementation of some of the most popular tasks. It is important to mention, however, that the examples shown in this blog are solely for inference. One of the great attributes of the transformers library is that it also provides the methods to fine-tune our own models based on these pre-trained ones, which would be a suitable subject for the next blog.

If you enjoyed reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via LinkedIn and GitHub. Please do not hesitate to send a contact request!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
