How to Extract Wikipedia Data in Python?

27 July 2024

In this article, we will learn how to extract Wikipedia data using Python. We will use two methods to extract the data.

Method 1: Using Wikipedia module

In this method, we will use the wikipedia module, a Python library that wraps the MediaWiki API, to extract data. Wikipedia itself is a multilingual online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors using a wiki-based editing system.

To install the module, run this command in your terminal.

pip install wikipedia

We will extract the following data from Wikipedia:

Summary and title
Page content
Image sources and page URL
Categories
Links
Content in other languages

We will extract each of these in turn:

1. Extracting the summary and title

Syntax: wikipedia.summary("Enter Query")

wikipedia.page("Enter Query").title

Python3

import wikipedia
 
wikipedia.summary("Python (programming language)")

Output: the introductory section of the article as a plain-text string.

2. Page Content:

To extract the content of an article, we will use the page() method and its content property.

Syntax: wikipedia.page("Enter Query").content

Python3

wikipedia.page("Python (programming language)").content

Output: the plain-text content of the article.

3. Extract images from Wikipedia.

Syntax: wikipedia.page("Enter Query").images

Python3

wikipedia.page("Python (programming language)").images

Output: a list of image URLs.
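
The images list often mixes photographs with icons and logos in formats such as SVG. The sketch below keeps only URLs with chosen file extensions, using the standard library. The helper name filter_images and the sample URLs are hypothetical and only illustrate the shape of the data.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions to keep; adjust as needed
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def filter_images(urls, extensions=PHOTO_EXTENSIONS):
    """Return only the URLs whose path ends in one of `extensions`."""
    kept = []
    for image_url in urls:
        # Look at the path only, so query strings do not affect the check
        suffix = PurePosixPath(urlparse(image_url).path).suffix.lower()
        if suffix in extensions:
            kept.append(image_url)
    return kept


# Hypothetical URLs standing in for the result of wikipedia.page(...).images
sample = [
    "https://upload.wikimedia.org/example/logo.svg",
    "https://upload.wikimedia.org/example/photo.JPG",
]
print(filter_images(sample))
# prints ['https://upload.wikimedia.org/example/photo.JPG']
```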

4. Extract the page URL:

Use the page() method and its url property.

Syntax: wikipedia.page("Enter Query").url

Python3

wikipedia.page('"Hello, World!" program').url

Output:

'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'

5. Get the list of categories of articles.

Use the page() method and its categories property.

Syntax: wikipedia.page("Enter Query").categories

Python3

wikipedia.page('"Hello, World!" program').categories

Output:

['Articles with example code',
 'Articles with short description',
 'Commons category link is on Wikidata',
 'Computer programming folklore',
 'Short description is different from Wikidata',
 'Test items in computer languages',
 'Webarchive template wayback links']

6. Get the list of all links in an article

Syntax: wikipedia.page("Enter Query").links

Python3

wikipedia.page('"Hello, World!" program').links

Output: a list of the titles of pages linked from the article.

7. Get data in different languages.

To fetch data from another language edition of Wikipedia, call the set_lang() method with the language code before making the query.

Syntax: wikipedia.set_lang("Enter Language Code")

Python3

wikipedia.set_lang("hi")
wikipedia.summary('"Hello, World!" program')

Output: the summary from the Hindi edition of Wikipedia, if a matching page exists there.

Method 2: Using Requests, BeautifulSoup

In this method, we will use Web Scraping.

For scraping in Python we will use two modules:

bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, run the command below in the terminal.

pip install bs4

requests: Requests lets you send HTTP/1.1 requests easily. It is also not part of the Python standard library. To install it, run the command below in the terminal.

pip install requests
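
Before parsing anything, it helps to fetch the page defensively. The sketch below sets a timeout, fails on HTTP error codes, and sends a descriptive User-Agent header, which Wikimedia asks automated clients to provide. The helper name fetch_html, the optional session parameter, and the header value are assumptions for illustration.

```python
import requests

# Placeholder identifier; replace it with your own script name and contact
HEADERS = {"User-Agent": "wiki-extract-demo/0.1 (contact: you@example.com)"}


def fetch_html(url, session=None, timeout=10):
    """Return the HTML of `url`, raising an exception on an HTTP error.

    `session` may be a requests.Session to reuse connections; if omitted,
    the module-level requests.get is used.
    """
    http = session or requests
    response = http.get(url, headers=HEADERS, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

For example, fetch_html("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)") returns the page source as a string when the request succeeds.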

We will extract the following data:

Paragraphs
Images
Links
Headings
Unwanted Content (Remaining Content)

Approach:

Get HTML Code
From the HTML code, get the content inside the body tag
Iterate through the body content and fetch the above data

Below is the full implementation:

Python3

# Import modules
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Given URL
url = "https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)"

# Wikimedia asks clients to send a descriptive User-Agent;
# this value is a placeholder
headers = {"User-Agent": "wiki-extract-demo/0.1 (contact: you@example.com)"}

# Fetch URL content and stop early on an HTTP error
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()

# Get body content
soup = BeautifulSoup(r.text, "html.parser").body

# Initialize variables
paragraphs = []
images = []
link = []
heading = []
remaining_content = []

# The six heading tags
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Iterate through all tags
for tag in soup.find_all():

    # For paragraphs use the p tag
    if tag.name == "p":
        # text returns the content inside the p tag
        paragraphs.append(tag.text)

    # For images use the img tag
    elif tag.name == "img":
        # src is often relative or protocol-relative,
        # so resolve it against the page URL
        if tag.has_attr("src"):
            images.append(urljoin(url, tag["src"]))

    # For anchors use the a tag
    elif tag.name == "a":
        # Not every anchor has an href; resolve relative ones
        if tag.has_attr("href"):
            link.append(urljoin(url, tag["href"]))

    # Match heading tags exactly; a substring test on "h"
    # would also match tags such as th
    elif tag.name in HEADING_TAGS:
        heading.append(tag.text)

    # Remaining content is stored here
    else:
        remaining_content.append(tag.text)

print(paragraphs, images, link, heading, remaining_content)
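
The src and href values in Wikipedia's HTML are frequently relative (/wiki/...), protocol-relative (//upload.wikimedia.org/...), or fragment-only (#...), so joining them to the page URL by plain string concatenation does not yield a valid address. A minimal sketch of resolving them with urljoin from the standard library follows. The attribute values are illustrative and not taken from a real page.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)"

# Illustrative attribute values of the kinds found in src and href
raw_values = [
    "/wiki/Python_(programming_language)",          # root-relative path
    "//upload.wikimedia.org/example.png",           # protocol-relative URL
    "#History",                                     # fragment on the same page
    "https://www.crummy.com/software/BeautifulSoup/",  # already absolute
]

# urljoin resolves each value the way a browser would
resolved = [urljoin(page_url, value) for value in raw_values]

for value, absolute in zip(raw_values, resolved):
    print(f"{value} -> {absolute}")
```

Absolute URLs pass through unchanged, so the same call can be applied to every attribute value without checking its form first.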
