
2017 Data Science in Review, Topic Modeling

This blogpost is about topic modeling using data from this blog, geeksforgeeks.org. Combining the extracted topics with the most visited articles of the year, we will surface the most popular topics of 2017. Last year we did something similar with popular articles streamed through Twitter, using Non-Negative Matrix Factorization to determine the topics (article here, example visual below). Feature image snapped from our introductory page tree map.

 

 

Find the code and the data to start your own project. We love feedback, so don’t forget to send us your comments and results.


0. Goal

This project is about identifying industry interest through the lens of the open data science blog. To do so, we collect all the articles and, well, analyze them. It both sounds fun and was. Because we presume no one person has read them all, there exists no anecdotal distillation; thanks to topic modeling, though, we aimed to generate unbiased industry insight for 2017 with our blog, 467 articles in all, as our sample.

1. Topic Modeling is NOT Text Classification

Let us explain through examples. If you want to discover what topics a corpus of documents covers, you use Topic Modeling, an unsupervised technique that extracts topics from the documents without any predefined labels.

On the other hand, if you have predefined tags and you want to assign them to new documents, you can train a model on the labeled examples and then apply it to the new documents. That is Text Classification, a supervised technique. The preprocessing of the documents is similar for both techniques.
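
For contrast, here is a minimal text-classification sketch with scikit-learn; the documents and tags are made up, but it shows the supervised side of the distinction: the model learns from labeled examples and then predicts a tag for a new document.

# Hypothetical labeled corpus, just to illustrate supervised text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["intro to neural networks", "docker for data science", "word embeddings in nlp"]
tags = ["deep-learning", "tools", "nlp"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, tags)                                   # learn from the labeled examples
print(clf.predict(["convolutional networks for images"]))  # predict a tag for a new document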

2. Python Libraries

There are many tools and libraries to choose from. We tried three Python libraries.

gensim for topic modeling (a minimal sketch follows below).
nltk, the Natural Language Toolkit, with multiple functions and applications for NLP.
sklearn (scikit-learn), which can work with text documents too.
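
We mostly use scikit-learn below, but since gensim is listed first for topic modeling, here is a minimal gensim sketch on a toy corpus (the real article texts are collected later in this post):

# Minimal gensim LDA sketch on a toy corpus.
from gensim import corpora, models

docs = ["neural networks for image classification",
        "scraping blog articles with python",
        "topic modeling with latent dirichlet allocation"]
texts = [doc.lower().split() for doc in docs]

dictionary = corpora.Dictionary(texts)                 # maps each word to an id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words representation

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic in lda.print_topics(num_words=5):
    print(topic)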

3. Data Collection with Selenium

Collecting the data is always an important part of the process. There are a number of different methods and tools for collecting documents on the web. Sometimes a website uses JavaScript that won't show you all the content at once; for those cases we can rely on the Selenium library.

…or… you can try using the Python 3 library Newspaper as we did in Using The Newspaper Library to Scrape News Articles.
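
As a rough sketch of that alternative (not the route we took here), Newspaper downloads and parses a single article in a few lines; the URL below is a placeholder, so substitute one of the collected article links:

from newspaper import Article

url = 'https://geeksforgeeks.org/'   # placeholder; use one of the collected article links
article = Article(url)
article.download()                # fetch the page
article.parse()                   # extract title, text, publish date, etc.
print(article.title)
print(article.publish_date)
print(article.text[:200])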

Selenium automates web browser interaction from Python. To use it, download the driver for the browser you want to automate. In this case, I'm using the Google Chrome driver on a MacBook Pro; the driver lives in the driver folder and is called chromedriver.

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def saveLinksToFile(links):
    # Saves the URLs of the articles to a file so we can process them later
    with open('01_data/odsc_links.csv', 'w') as f:
        f.write('link\n')
        for link in links:
            f.write('{}\n'.format(link.get_attribute('href')))

# Make sure you download the driver you need and it is accessible to selenium
path = 'driver/chromedriver'
driver = webdriver.Chrome(executable_path = path)
driver.get('https://geeksforgeeks.org/')
time.sleep(3)

# This for loop hits the END key 150 times to scroll all the way down the blog,
# waiting 2 seconds each time so the content can load before hitting it again.
# It prints the iteration number so we can follow the progress.
for i in range(150):
    driver.find_element_by_css_selector('body').send_keys(Keys.END)
    time.sleep(2)
    print(i, end=' ')

links = driver.find_elements_by_class_name('sing_up')
saveLinksToFile(links)
driver.close()

If you run the code, you should see an instance of the browser performing the programmed tasks.

Next step: download the content from the links we just collected.

import time
from datetime import date
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

file_path = '01_data/odsc_links.csv'
# We read the links and create columns to store the title, the text of the article,
# the tags, and the date of publication.
blogposts = pd.read_csv(file_path)
blogposts['title'] = ''
blogposts['text'] = ''
blogposts['tags'] = ''
blogposts['date'] = ''

# MAX_WAIT caps how many times we retry a link before jumping to the next one.
MAX_WAIT = 6
INCREMENT = 1

def get_text(soup):
    """
    Returns the text part of a BeautifulSoup object.
    """
    for d in soup.find_all("div", class_='article content single-article lang-en'):
        return d.get_text()

def get_posting_date(soup):
    """
    Returns the date of publication of the article.
    """
    for d in soup.find_all("div", class_='entry-meta'):
        d1 = str(d).split('|')
        d2 = d1[1].split('/')
    return date(int(d2[2].split(' ')[0]), int(d2[0].split('>')[1]), int(d2[1]))

def get_content(link, delay = 3):
    """
    Returns the tags, the title, the text, and the posting date of the article.
    """
    tags = []
    title = ''
    text = ''
    posting_date = ''

    if "?p=" not in link and delay <= MAX_WAIT:
        time.sleep(delay)
        try:
            r = urlopen(link).read()
            soup = BeautifulSoup(r, 'html.parser')
            text = get_text(soup)
            title = soup.title.text.split('|')[0].strip()
            tags = []

            for tag in soup.find_all("p", class_='tags_st'):
                for a in tag.find_all("a"):
                    tags.append(a.get_text())

            posting_date = get_posting_date(soup)

        except Exception:
            # Retry with a longer delay; give up once delay exceeds MAX_WAIT.
            print("({})".format(delay + INCREMENT), end=' ')
            return get_content(link, delay + INCREMENT)

        return tags, title, text, posting_date
    else:
        return None, None, None, None

# This for loop gets the data for each blogpost by calling get_content() on every url.
for i in range(len(blogposts)):
    print(i, end=' ')
    tags, title, text, posting_date = get_content(blogposts.iloc[i]['link'], 0)
    blogposts.at[i, 'tags'] = tags
    blogposts.at[i, 'title'] = title
    blogposts.at[i, 'text'] = text
    blogposts.at[i, 'date'] = posting_date

# We can save the content to a json file for future use.
blogposts.to_json(path_or_buf = '01_data/data_posts_v2.json', orient='records', lines = True)

def readBlogposts(file_path):
    return pd.read_json(file_path, orient='records', lines = True)

file_path = '01_data/data_posts_v2.json'
blogposts = readBlogposts(file_path)

# We keep only the articles published in 2017.
blogposts_2017 = blogposts[blogposts.date.dt.year == 2017]

4. Topic Modeling

Using scikit-learn and some code from here on topic modeling, we can extract the topics from the documents with LDA (Latent Dirichlet Allocation).

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def print_top_words(model, feature_names, no_top_words):
    # Prints the highest-weighted words for every topic in the fitted model.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

documents = [doc for doc in blogposts_2017['text'] if doc is not None]

no_features = 2000

tf_vectorizer = CountVectorizer(max_df = 0.95, min_df = 2, max_features = no_features, stop_words = 'english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()

no_topics = 20

lda = LatentDirichletAllocation(n_components = no_topics, max_iter = 5, learning_method = 'online', learning_offset = 50., random_state = 0).fit(tf)

no_top_words = 10
print_top_words(lda, tf_feature_names, no_top_words)

Using Latent Dirichlet Allocation we get 20 topics from 2,000 features. I picked the number of topics by hand in this case, but here is a way to determine the right number of topics for yourself.
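
As a quick sketch of that idea, one common heuristic is to fit models for several candidate topic counts and compare their perplexity on the document-term matrix built above (lower is better); the candidate counts here are arbitrary.

from sklearn.decomposition import LatentDirichletAllocation

# Rough sketch: compare perplexity for a few candidate topic counts, reusing `tf` from above.
for k in [5, 10, 20, 30]:
    lda_k = LatentDirichletAllocation(n_components=k, max_iter=5, learning_method='online',
                                      learning_offset=50., random_state=0).fit(tf)
    print(k, lda_k.perplexity(tf))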

These are the 20 topics.

  • Topic 0: data learning model using time different way values machine deep
  • Topic 1: data learning machine model use like time new ai science
  • Topic 2: word learning et al arxiv words language 2016 model neural
  • Topic 3: learning model training network image function data deep images used
  • Topic 4: jobs job data figure science software trends counts 2017 learning
  • Topic 5: data use function using model dataset 10 import values let
  • Topic 6: amp quot pca data principal components com wp component eigenvectors
  • Topic 7: tree decision forest random loan data julia water charts algorithm
  • Topic 8: conversion neural travel rate art ai networks learning acquisition rates
  • Topic 9: tag countries world scraping text requests html data page self
  • Topic 10: data learning use models using new like used example different
  • Topic 11: women ai data saw work services used like numbers government
  • Topic 12: model data different learning new neural word example just image
  • Topic 13: effects learning entities nlp dask data network deep entity use
  • Topic 14: learning network data neural words rate model use example vectors
  • Topic 15: random reduce bag edge tree car collect task problem nn
  • Topic 16: date student year percent data title loan people countries jobs
  • Topic 17: docker container run image kafka root s3 command jupyter directory
  • Topic 18: model data set models machine test learning user engine values
  • Topic 19: tidy dbl data country year amp squared frame function statistic

5. Most read articles of 2017

These are the top 10 most read articles of 2017. Extracting topics from the most read articles with the code above gives the topics listed further below.

  1. /blog/jupyter-zeppelin-beaker-the-rise-of-the-notebooks/
  2. /blog/riding-on-large-data-with-scikit-learn/
  3. /blog/implementing-a-neural-network-from-scratch-in-python-an-introduction/
  4. /blog/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai
  5. /blog/how-to-build-a-fake-news-classification-model/
  6. /blog/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
  7. /blog/what-is-the-blockchain-and-why-is-it-so-important/
  8. /blog/an-introduction-to-object-oriented-data-science-in-python/
  9. /blog/r-or-python-for-data-science/
  10. /blog/implementing-a-cnn-for-text-classification-in-tensorflow/
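
To reproduce the topics below, a rough sketch (the top_urls list is hypothetical and truncated; in practice it would hold all of the paths above) filters the 2017 posts down to the most read links and refits the vectorizer and LDA with 10 topics:

# Hypothetical, truncated list of the most read article paths (see the list above; in
# practice it would contain all of them so the vectorizer has enough documents).
top_urls = ['/blog/jupyter-zeppelin-beaker-the-rise-of-the-notebooks/',
            '/blog/riding-on-large-data-with-scikit-learn/']

top_posts = blogposts_2017[blogposts_2017['link'].str.contains('|'.join(top_urls))]
top_docs = [doc for doc in top_posts['text'] if doc is not None]

tf_top = tf_vectorizer.fit_transform(top_docs)
lda_top = LatentDirichletAllocation(n_components=10, max_iter=5, learning_method='online',
                                    learning_offset=50., random_state=0).fit(tf_top)
print_top_words(lda_top, tf_vectorizer.get_feature_names_out(), 10)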

If we use the top 30 articles of 2017 and set the number of topics to 10 (as sketched above), we get:

  • Topic 0: python used learning set community classification machine trained use make
  • Topic 1: network neural representation learning layer model size training input use
  • Topic 2: feature time dataset pooling network learning convolutional missing level performance
  • Topic 3: learning deep neural model network ai layer machine classification feature
  • Topic 4: learning models network systems labeled representation like generative based responses
  • Topic 5: learning time python like ai model machine real work used
  • Topic 6: probability rain ai bayesian answer interpretation algorithms problem bayes inference
  • Topic 7: network neural model layer input language use output training convolutional
  • Topic 8: learning feature network deep image layer figure neural pooling ai
  • Topic 9: python notebook class self code science scala object language notebooks

 

In the next blogpost we will explore dynamic topic modeling. Exciting, eh? Stay tuned.

©ODSC2018
