Generative NLP Models in Customer Service: Evaluating Them, Challenges, and Lessons Learned in Banking

16 June 2025


Editor’s note: The authors are speakers for ODSC Europe this June. Be sure to check out their talk, “Generative NLP models in customer service. How to evaluate them? Challenges and lessons learned in a real use case in banking,” there!

With the increasing use of digital communication, daily interactions between customer service agents and clients have shifted from traditional phone calls to chat and text messages. The banking industry is no exception, as customers reach out to agents for various reasons, such as reporting a lost card, seeking advice on investment plans, or requesting clarification on account details.

To save time for our financial advisors, our team decided to experiment with generative natural language processing (NLP) models to assist them in their daily conversations with clients. These models would allow us to automate simple and repetitive tasks, freeing up our agents to focus on more complex and valuable interactions with customers.

Although natural language generation (NLG) has gained widespread popularity in recent months due to the public release of ChatGPT, it has long been used for tasks such as machine translation, question answering, and summarization. While the accuracy of NLG models has improved, challenges remain regardless of how sophisticated the technology is.

The challenge of evaluation: the need for human criteria

When using generative methods to build systems that respond to clients’ requests, one of the biggest challenges is the evaluation phase. This stage raises many questions, such as: What does “correct” mean? Can an answer be syntactically and grammatically correct, but not meet the customer’s needs? What about a suggestion that saves time for the manager but requires editing? How can the system’s overall performance be assessed? How can we prevent the model from hallucinating? Which metric is best for optimizing the network?

Various automatic metrics, such as BLEU, ROUGE, accuracy, and precision, exist in the literature for measuring and optimizing system performance. However, whichever metric is chosen should align with human evaluation criteria. Designing this evaluation process is not straightforward, and it needs to be adapted to each use case.
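
To make the idea of an overlap metric concrete, here is a minimal sketch of a ROUGE-1-style unigram score in plain Python. It is an illustration only, not the metric implementation we used: the function name `unigram_overlap`, the whitespace tokenization, and the example sentences are all assumptions made for this post.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """ROUGE-1-style unigram precision, recall, and F1 (illustrative sketch)."""
    # Assumption: lowercase whitespace tokenization; real systems use a proper tokenizer.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical agent reply (reference) and model suggestion (candidate).
scores = unigram_overlap(
    "Your card will arrive soon",
    "Your new card will arrive in five days",
)
print(scores)
```

In this made-up example, most of the suggestion's words appear in the reference, yet the suggestion drops the delivery time the customer actually asked about. That gap between word overlap and usefulness is why such scores need to be checked against human judgment.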

Our use case within the banking industry

To assist financial managers in responding to customer requests, we trained a sequence-to-sequence deep learning neural network with more than one million query-answer pairs. The network’s encoder and decoder were implemented using two LSTMs.

Our results show that the network reached an accuracy above 75% on automatic NLG metrics. However, automatic metrics alone do not always reflect the quality of the generated text, so we realized that the evaluation phase needed human-centered annotations to align the automatic metrics with human criteria.

To achieve this, we invested significant time and thought in designing a strategy for our specific use case. This strategy involved several stages, such as understanding the problem, categorizing the landscape of questions, and designing clear guidelines for annotators.

So we did! As a result, we found correlations between these metrics and human evaluation, along with many other insights that are independent of the neural network architecture or NLG technique employed.
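
One common way to check whether an automatic metric tracks human judgment is a rank correlation between the two sets of scores. The sketch below computes Spearman's rank correlation in plain Python. It is an assumption about how such a check could look, not a description of our analysis: the helper names, the metric scores, and the annotator ratings are all hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one automatic score and one human rating (1-5) per suggestion.
metric_scores = [0.62, 0.15, 0.80, 0.40, 0.55]
human_ratings = [4, 1, 5, 3, 3]
print(spearman(metric_scores, human_ratings))
```

A value close to 1 would suggest that the metric orders suggestions much as annotators do, while a value near 0 would suggest that it captures something else. With real annotations, a sample this small would not be enough to draw conclusions.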

If you want to learn more, we will be presenting our findings at ODSC Europe 2023. We will also share other valuable lessons learned from using this type of model in customer service, based on our specific banking use case.

About the authors:

Clara Higuera Cabañes, PhD, is a senior data scientist at BBVA AI Factory. She has worked in the data science field for many years, applying NLP techniques to sectors such as media and banking. At the BBC in London, she built recommender systems for BBC News and developed several tools to help editors understand audience feedback. In the banking sector at BBVA, she has worked on building data products to help financial advisors better manage customers’ queries. She currently leads the collections data science team at BBVA AI Factory. Prior to her industry experience, she completed a PhD in artificial intelligence and bioinformatics, and she holds a degree in computer science. Clara advocates for the responsible use of technology and is actively involved in activities that encourage women and girls to pursue a career in technology and science to help bridge the gender gap in these disciplines.

María Hernandez Rubio is a Senior Data Scientist and Data Product Owner at BBVA AI Factory. With ten years of experience in the data science field, she was one of the first data scientists at BBVA and took part in setting up the bank’s Big Data ecosystems. She graduated in Mathematics and Computer Engineering and holds an MSc in Computational Intelligence from Universidad Autónoma de Madrid (UAM), specializing in aspect-based sentiment analysis and item recommendation.

She has worked in several analytical domains, ranging from Retail and Urban Analysis to Customer Intelligence. Now, she is trying to enhance the customers’ relationship with the bank through Natural Language Processing and Text Analytics. María focuses on understanding business challenges and developing the best analytical solution for each problem.

The authors would like to thank Mariana Bercowsky for her contribution to the writing of this post.
