This post is the first of a two-part series in which we apply NLP techniques to analyze articles about big data, data science, and AI.
If you are tired of the hassles of web scraping, then this post might be just for you. I occasionally scrape news articles for NLP and data science projects, such as the one behind my fake news classifier article. Even though analyzing trends in the news is one of my favorite applications of NLP, it irks me that I have to spend a considerable amount of time and effort crafting scripts to sift through piles of HTML code. So when I came across the Python 3 library Newspaper, I was overcome with joy.
In this post, I’ll demonstrate how to use Newspaper to download valuable information from multiple articles and how to put that data into a data frame.
First up, installing the library is simple. Here’s the pip command to do that.
pip3 install newspaper3k
The 3k suffix is there so that you install the Python 3 version of the library rather than the Python 2 one.
The following code demonstrates how to use the library to download the information of a single article.
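A minimal sketch of that workflow, assuming newspaper3k's `Article` class with its `download()` and `parse()` methods. The helper names `article_to_record` and `fetch_article` are hypothetical, chosen only for illustration, and you would pass in the link of whichever article you want.

```python
def article_to_record(article):
    """Collect the fields of an already-parsed article into a plain dict."""
    return {
        "url": article.url,
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }


def fetch_article(url):
    """Download and parse one article, then return its fields as a dict."""
    # Imported here so the helper above works without the package installed.
    from newspaper import Article  # provided by `pip3 install newspaper3k`

    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract the title, authors, date, and body text
    return article_to_record(article)
```

Calling `fetch_article` with the URL of a real article returns a dictionary whose `title` and `text` entries can be printed directly.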
As you can see from the code, this process was incredibly simple. The best part is that the information we want from the article comes out quite clean: we didn’t have to write any regex to extract the article title or text.
Now let’s demonstrate how to use this library on 50 links and put the downloaded information into a pandas data frame. This time, we’re also going to profile our code (developer jargon for measuring how long a script takes to complete).
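One way to sketch this, under a few assumptions: the list of links is something you supply, failed downloads are skipped rather than retried, and timing uses `time.perf_counter` from the standard library. The function names `download_articles`, `fetch_with_newspaper`, and `build_frame` are hypothetical and are not taken from the original script.

```python
import time


def download_articles(urls, fetch):
    """Fetch each URL with `fetch`; return the collected records and elapsed seconds."""
    start = time.perf_counter()
    records = []
    for url in urls:
        try:
            records.append(fetch(url))
        except Exception as exc:  # broad on purpose: one bad link should not stop the run
            print(f"Skipping {url}: {exc}")
    elapsed = time.perf_counter() - start
    return records, elapsed


def fetch_with_newspaper(url):
    """Download and parse one article with Newspaper, returning a dict of its fields."""
    from newspaper import Article  # provided by `pip3 install newspaper3k`

    article = Article(url)
    article.download()
    article.parse()
    return {
        "url": url,
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }


def build_frame(urls):
    """Turn a list of article links into a pandas data frame, printing the elapsed time."""
    import pandas as pd

    records, elapsed = download_articles(urls, fetch_with_newspaper)
    print(f"Processed {len(records)} of {len(urls)} links in {elapsed:.1f} seconds")
    return pd.DataFrame(records)
```

Passing a list of article links to `build_frame` gives back one row per successfully parsed article. Keeping the timing loop in `download_articles` separate from the Newspaper call means the same timer can wrap any other fetch function.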
I downloaded and parsed 50 links to New York Times articles, and turned the resulting information into a pandas data frame. The whole process took about 39 seconds, or roughly 0.78 seconds per link. At that rate, if I were to build a corpus of 1000 documents, I should expect the script to take about 13 minutes to finish.
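The extrapolation above is simple proportional arithmetic, and it assumes the per-link time stays constant as the corpus grows. A small sketch, using the hypothetical helper `estimate_minutes`:

```python
def estimate_minutes(sample_seconds, sample_size, target_size):
    """Scale a measured run time linearly to a larger number of documents."""
    seconds_per_doc = sample_seconds / sample_size
    return seconds_per_doc * target_size / 60


# Figures from the run described above: 50 links in about 39 seconds.
per_link = 39 / 50
minutes = estimate_minutes(39, 50, 1000)
print(f"{per_link:.2f} s per link, about {minutes:.0f} minutes for 1000 documents")
```

Network speed and site response times vary, so treat the result as a rough planning figure rather than a guarantee.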
In the next article in this series, I’ll be analyzing those 50 links from above along with hundreds of other articles on data science, AI, big data, and more.
© ODSC 2017