This article was published as a part of the Data Science Blogathon
Introduction
In this article, I will use the YouTube trending videos dataset and the Python programming language to train a language model that generates text with deep learning. The model can then be used to generate titles for your YouTube videos or for your blog posts.
Title generation is a Natural Language Processing task and is central to several machine learning applications, including text summarization, speech-to-text, and conversational systems.
To build a title generator or any text generator, the model must be trained to learn the probability of a word occurring, using the words that have already appeared in the sequence as context.
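To make "the probability of a word occurring given its context" concrete, here is a minimal sketch that estimates it from raw counts, using only the previous word as context. It is not the neural model built later in this article, only a simpler stand-in for the same idea. The titles and the `bigram_probabilities` helper are hypothetical and exist purely for illustration.

```python
from collections import Counter, defaultdict

def bigram_probabilities(titles):
    """Estimate P(next_word | previous_word) from raw counts."""
    counts = defaultdict(Counter)
    for title in titles:
        words = title.lower().split()
        # count every (previous word, next word) pair in the title
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(nxts.values()) for word, c in nxts.items()}
        for prev, nxts in counts.items()
    }

# Hypothetical titles, not taken from the dataset
toy_titles = ["the secret of cooking", "the secret life", "the best of 2017"]
probs = bigram_probabilities(toy_titles)

# "secret" follows "the" in two of three cases, "best" in one
print(probs["the"])
```

The LSTM model trained below plays the same role, but it conditions on the whole preceding sequence instead of a single word.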
What is Natural Language Processing
Natural Language Processing (NLP) is often used for text classification tasks such as spam detection and sentiment analysis, as well as for text generation and language translation. Text data can be viewed as a sequence of characters, a sequence of words, or a sequence of sentences. In most problems, text data is treated as a sequence of words. In this article, we will walk through the process using simple sample data. However, the steps discussed here apply to any NLP task. In particular, we will use TensorFlow 2 and Keras for text preprocessing, which includes:
- Tokenization
- Sequencing
- Padding
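The three steps can be sketched in plain Python before bringing in Keras. This is only an illustration that loosely imitates what the Keras utilities do (word ids start at 1, most frequent word first, and 0 is kept for padding). The sample titles are made up.

```python
from collections import Counter

# Hypothetical sample titles, used only for illustration
titles = ["the best day ever", "the best of the year", "best day"]

# 1. Tokenization: give every distinct word an integer id.
#    Ids start at 1 (most frequent word first); 0 is reserved for padding.
counts = Counter(word for title in titles for word in title.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# 2. Sequencing: replace each word by its id.
sequences = [[word_index[w] for w in title.split()] for title in titles]

# 3. Padding: left-pad with 0 so that every row has the same length.
max_len = max(len(seq) for seq in sequences)
padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]

print(word_index)
print(sequences)
print(padded)
```

After the last step, every title is a row of the same length, which is the shape a neural network expects.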
Building the Machine Learning Model for Title Generation
I will start this project of building a title generator with Python and machine learning by importing libraries and reading data sets. The data sets I use for this project can be downloaded from here.
Importing the necessary libraries
We import the libraries before we start working with them. Here, I use Keras and TensorFlow as the main libraries for the model, as Keras is a highly productive interface for solving such problems with a deep learning approach.
import pandas as pd
import string
import numpy as np
import json

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku

import tensorflow as tf
tf.random.set_seed(2)
from numpy.random import seed
seed(1)
Loading the dataset
#load all the datasets
df1 = pd.read_csv('USvideos.csv')
df2 = pd.read_csv('CAvideos.csv')
df3 = pd.read_csv('GBvideos.csv')

#load the datasets containing the category names
data1 = json.load(open('US_category_id.json'))
data2 = json.load(open('CA_category_id.json'))
data3 = json.load(open('GB_category_id.json'))
Now we need to process the data so that we can use it to train our machine learning model for the task of generating titles. Here are the data cleaning and processing steps we need to follow:
def category_extractor(data):
    i_d = [data['items'][i]['id'] for i in range(len(data['items']))]
    title = [data['items'][i]['snippet']["title"] for i in range(len(data['items']))]
    i_d = list(map(int, i_d))
    category = zip(i_d, title)
    category = dict(category)
    return category

#create a new category column by mapping the category names to their id
df1['category_title'] = df1['category_id'].map(category_extractor(data1))
df2['category_title'] = df2['category_id'].map(category_extractor(data2))
df3['category_title'] = df3['category_id'].map(category_extractor(data3))

#join the dataframes
df = pd.concat([df1, df2, df3], ignore_index=True)

#drop rows based on duplicate videos
df = df.drop_duplicates('video_id')

#collect only titles of entertainment videos
#feel free to use any category of video that you want
entertainment = df[df['category_title'] == 'Entertainment']['title']
entertainment = entertainment.tolist()
#remove punctuations and convert text to lowercase
def clean_text(text):
    text = ''.join(e for e in text if e not in string.punctuation).lower()
    text = text.encode('utf8').decode('ascii', 'ignore')
    return text

corpus = [clean_text(e) for e in entertainment]
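To see what the cleaning step does to a single title, the function can be tried on its own. The sample title below is made up for illustration. Note that non-ASCII characters are dropped rather than transliterated, so accented letters simply disappear.

```python
import string

def clean_text(text):
    # strip ASCII punctuation and lowercase the text
    text = ''.join(e for e in text if e not in string.punctuation).lower()
    # drop every character that cannot be represented in ASCII
    text = text.encode('utf8').decode('ascii', 'ignore')
    return text

# Hypothetical title, not taken from the dataset
sample = "Café Tour: Paris, 2017!"
cleaned = clean_text(sample)
print(cleaned)  # caf tour paris 2017
```

If keeping accented words intact matters for your category of videos, the second line of the function is the one to adjust.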
Generating sequences
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    #get tokens
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    #convert to sequence of tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
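The inner loop above turns each title into a set of growing prefixes, so that one title yields several training examples. A minimal sketch of just that step, with a made-up list of token ids and a hypothetical helper name:

```python
def prefix_sequences(token_list):
    """Return every prefix of length 2 or more, as the inner loop above does."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Hypothetical token ids for one four-word title
tokens = [12, 7, 45, 3]
for seq in prefix_sequences(tokens):
    print(seq)
# [12, 7]
# [12, 7, 45]
# [12, 7, 45, 3]
```

In each prefix, the last id will later serve as the word to predict and the ids before it as the context. A one-word title produces no training examples at all.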
Padding the sequences
In any raw text data, there will naturally be sentences of different lengths. However, neural networks require all inputs to be the same size. Padding is used for this purpose. Whether to use 'pre' or 'post' padding depends on the analysis. In some cases, padding at the beginning is appropriate, while in others it is not. For example, if we use a Recurrent Neural Network (RNN) for spam detection, then padding at the beginning may be appropriate, as a plain RNN struggles to learn long-distance patterns. Padding at the beginning keeps the actual tokens at the end of the sequence, so the RNN can use them to predict the next word. However, any padding choice should be made after careful consideration and with domain knowledge.
Since sequences can vary in length, their lengths must be made equal. When using neural networks, we usually feed an input into the network and wait for the output. In practice, it is better to process data in batches than one item at a time. pad_sequences() is a function in the Keras deep learning library that can be used to pad variable-length sequences.
Batching is done using matrices of shape [batch size x sequence length], where the sequence length corresponds to the longest sequence. In this case, we fill the shorter sequences with a padding token (index 0) to match the size of the matrix. This process of filling the token sequences is called padding. To feed the data into the model for training, I need to create predictors and labels.
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    predictors, label = input_sequences[:,:-1], input_sequences[:, -1]
    label = ku.to_categorical(label, num_classes = total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
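One reason 'pre' padding fits this particular setup: generate_padded_sequences takes the last column of every row as the label. The plain-Python sketch below (not the Keras implementation) compares both padding modes on made-up sequences. The `pad` and `one_hot` helpers are hypothetical stand-ins for pad_sequences and to_categorical.

```python
def pad(seq, max_len, padding="pre"):
    """Pad one sequence with zeros on the left ('pre') or on the right ('post')."""
    zeros = [0] * (max_len - len(seq))
    return zeros + seq if padding == "pre" else seq + zeros

def one_hot(index, num_classes):
    """One-hot encode a single label as a plain list."""
    return [1 if i == index else 0 for i in range(num_classes)]

# Hypothetical n-gram sequences built from one title
sequences = [[4, 2], [4, 2, 9], [4, 2, 9, 1]]
max_len = max(len(s) for s in sequences)

pre = [pad(s, max_len, "pre") for s in sequences]
post = [pad(s, max_len, "post") for s in sequences]

# Last column = label, everything before it = predictors
pre_labels = [row[-1] for row in pre]
post_labels = [row[-1] for row in post]
print(pre_labels)   # [2, 9, 1] -> always the real next word
print(post_labels)  # [0, 0, 1] -> mostly the padding token

predictors_0, label_0 = pre[0][:-1], one_hot(pre[0][-1], num_classes=10)
print(predictors_0, label_0)
```

With 'post' padding, most labels would be the padding token, so the model would mainly learn to predict index 0.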
LSTM Model for Title Generation
In recurrent neural networks, when learning across a large number of layers, it becomes very difficult for the network to learn and tune the parameters of the earlier layers. To address this problem, a type of RNN called LSTM (Long Short-Term Memory) was developed.
LSTM model
- Input layer: takes the sequence of words as input
- LSTM layer: computes the output using LSTM units
- Dropout layer: a regularization layer to avoid overfitting
- Output layer: computes the probability of each possible next word
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()

    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))

    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))

    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = create_model(max_sequence_len, total_words)
model.fit(predictors, label, epochs=20, verbose=5)
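For a sense of scale, the number of trainable parameters implied by the layer sizes in create_model can be worked out by hand. The sketch below uses the standard counting formulas for an embedding, an LSTM, and a dense layer. The vocabulary size of 5000 is an assumption for illustration only, since the real value of total_words depends on the corpus.

```python
def count_parameters(total_words, embed_dim=10, lstm_units=100):
    """Parameter counts for an Embedding -> LSTM -> Dropout -> Dense stack."""
    # one embed_dim-sized vector per word id
    embedding = total_words * embed_dim
    # 4 gates, each with input weights, recurrent weights and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # dropout has no trainable parameters
    # one weight per (LSTM unit, word) pair plus one bias per word
    dense = lstm_units * total_words + total_words
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

# Assumed vocabulary size, for illustration only
params = count_parameters(total_words=5000)
print(params)
# {'embedding': 50000, 'lstm': 44400, 'dense': 505000, 'total': 599400}
```

Under this assumption, the output layer holds most of the parameters, because it produces one score per word in the vocabulary.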
Now that our title generator model is ready and has been trained on the data, it is time to generate a title from an input word. The input word is first tokenized, and the resulting sequence is padded before being passed to the trained model, which returns the predicted next word:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes() is no longer available in recent TensorFlow releases;
        # taking the argmax of predict() gives the same class index
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]

        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text.title()
print(generate_text("HAPPY", 5, model, max_sequence_len))
Output: The Secret Of HAPPY
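The generation loop itself is greedy decoding: at every step, the most likely next word is appended and the longer text is fed back in. The sketch below isolates that loop from Keras by using a hard-coded lookup table in place of the trained model, so the vocabulary, the `toy_transitions` table, and the `predict_next` helper are all hypothetical. It also uses a reversed index-to-word dictionary instead of scanning word_index on every step.

```python
def generate(seed_words, next_words, predict_next, index_word):
    """Greedy generation: repeatedly append the predicted next word."""
    words = list(seed_words)
    for _ in range(next_words):
        idx = predict_next(words)        # stands in for the model prediction
        word = index_word.get(idx, "")
        if not word:                     # unknown index: stop early
            break
        words.append(word)
    return " ".join(words).title()

# Hypothetical vocabulary and a fake "model" that only looks at the last word
word_index = {"the": 1, "secret": 2, "of": 3, "happy": 4, "life": 5}
index_word = {i: w for w, i in word_index.items()}
toy_transitions = {"happy": 5, "life": 3, "of": 1, "the": 2, "secret": 5}

def predict_next(words):
    return toy_transitions[words[-1]]

result = generate(["happy"], 3, predict_next, index_word)
print(result)  # Happy Life Of The
```

Because the choice at each step is always the single most likely word, the same seed always produces the same title.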
Thanks For Reading!
About Me:
Hey, I am Sharvari Raut. I love to write!
Technical Writer 👩💻 | AI Developer😎| | Avid Reader 📖 | Data Science ❤️ | Open Source Contributor 🌍
Connect with me on:
Twitter: https://twitter.com/aree_yarr_sharu
LinkedIn: https://t.co/g0A8rcvcYo?amp=1
Github: https://github.com/sharur7