Navigation with BeautifulSoup

26 July 2024

BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree from the parsed page, which makes it useful for web scraping, and it works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying that tree.

Installation

BeautifulSoup does not ship with Python. To install it, run the command below in a terminal.

pip install bs4

Navigation With BeautifulSoup

The code snippet below defines the HTML document that the examples in this article navigate with BeautifulSoup.

Python3

ht_doc = """
  
<html><head><title>Geeks For Geeks</title></head>
  
<body>
  
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
 
<p class ="prog">Top 5 Popular Programming Languages</p>
 
<a href="https://www.geeksforgeeks.org/java-programming-examples/" \
class="prog" id="link1">Java</a>
<a href="https://www.geeksforgeeks.org/cc-programs/" class="prog" \
id="link2">c/c++</a>
<a href="https://www.geeksforgeeks.org/python-programming-examples/"\
class="prog" id="link3">Python</a>
<a href="https://https://www.geeksforgeeks.org/introduction-to-javascript/"\
class="prog" id="link4">Javascript</a>
<a href="https://www.geeksforgeeks.org/ruby-programming-language/" \
class="prog" id="link5">Ruby</a>
  
<p>according to an online survey. </p>
 
<p class="prog"> Programming Languages</p>
 
</body></html>
  
"""

Now let us navigate the snippet above with BeautifulSoup. The most important components of an HTML document are its tags, which may contain other tags and strings (the tag's children). BeautifulSoup provides several ways to iterate over these children; the sections below go through them.

Navigating Downwards

Navigating Using Tag Names :

Example 1: To get Head Tag.

Access .head on the BeautifulSoup object to get the head tag of the HTML document.

Syntax : (BeautifulSoup Variable).head

Example 2: To get Title Tag

Access .title to retrieve the title tag of the HTML document held in the BeautifulSoup object.

Syntax : (BeautifulSoup Variable).title

Code:

Python3

from bs4 import BeautifulSoup

soup = BeautifulSoup(ht_doc, 'html.parser')
print(soup.head)
print(soup.title)

Output:

<head><title>Geeks For Geeks</title></head>
<title>Geeks For Geeks</title>

Example 3: To get a specific tag.

We can retrieve some specific tags like the first <b> tag in the body tag

Syntax : (BeautifulSoup Variable).body.b

Using a tag name as an attribute returns only the first tag with that name.

Syntax: (BeautifulSoup Variable).(tag name)

To get every tag with a given name, use find_all.

Syntax: (BeautifulSoup Variable).find_all(tag name)

Code:

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
 
# retrieving the first b tag inside the body tag
print(soup.body.b)

# retrieving the first a tag in the document
print(soup.a)

# retrieving all a tags in ht_doc
print(soup.find_all("a"))

Output:

<b>most viewed courses in GFG,its all free</b>
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
[<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>,
<a class="prog" href="https://www.geeksforgeeks.org/cc-programs/" id="link2">c/c++</a>,
<a class="prog" href="https://www.geeksforgeeks.org/python-programming-examples/" id="link3">Python</a>,
<a class="prog" href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" id="link4">Javascript</a>,
<a class="prog" href="https://www.geeksforgeeks.org/ruby-programming-language/" id="link5">Ruby</a>]
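
The difference between attribute access and find_all can be checked directly. The sketch below is a minimal illustration on a small stand-in document (the html string and its links are made up for this example, not taken from ht_doc). It assumes only the standard find and find_all methods, and it also shows that asking for a tag that does not exist returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# A small stand-in document (hypothetical, not the article's ht_doc).
html = (
    '<html><body>'
    '<a href="/java" id="link1">Java</a>'
    '<a href="/python" id="link2">Python</a>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # first <a> in document order
same_link = soup.find("a")      # find() returns that same first match
all_links = soup.find_all("a")  # list-like result holding every <a>

print(first_link is same_link)       # True
print([a["id"] for a in all_links])  # ['link1', 'link2']
print(soup.table)                    # None, there is no <table> here
```

Because a missing tag yields None, chained access such as soup.table.tr would fail with an AttributeError, so it is worth checking for None when a tag may be absent.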

Example 4: .contents and .children

A tag's children are available as a list through .contents. The .children attribute yields the same children as an iterator instead of a list.

Syntax: (BeautifulSoup Variable).contents

Code:

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assigning the head tag to a variable
hTag = soup.head
print(hTag)

# retrieving the children of the head tag as a list
print(hTag.contents)

Output:

<head><title>Geeks For Geeks</title></head>
[<title>Geeks For Geeks</title>]
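
The example above only exercises .contents. As a rough sketch of .children, the snippet below iterates over the direct children of a tag that mixes tags and text. The one-line paragraph used here is a made-up stand-in, not part of ht_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document with two tags and one bare string.
soup = BeautifulSoup("<p><b>one</b>two<i>three</i></p>", "html.parser")
p_tag = soup.p

# .contents is a plain list of the direct children.
print(p_tag.contents)  # [<b>one</b>, 'two', <i>three</i>]

# .children iterates over the same direct children without building a list.
kinds = [type(child).__name__ for child in p_tag.children]
print(kinds)  # ['Tag', 'NavigableString', 'Tag']
```

Note that both attributes cover direct children only; the text inside the b and i tags is not included.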

Example 5: .descendants

The .descendants attribute lets you iterate over all of a tag's children recursively: its direct children, the children of its direct children, and so on.

Syntax: (Variable assigned from BeautifulSoup Variable).descendants

Code:

Python3

# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')

# assigning the head tag to a variable
htag = soup.head

# iterating over every descendant of the head tag
for child in htag.descendants:
    print(child)

Output :

<title>Geeks For Geeks</title>
Geeks For Geeks
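
To make the contrast with direct children concrete, the following sketch counts both for the same object. It uses a trimmed, single-line stand-in for the head of the document (an assumption made to keep whitespace strings out of the counts).

```python
from bs4 import BeautifulSoup

# Trimmed stand-in document without whitespace between tags.
html = "<html><head><title>Geeks For Geeks</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object has a single direct child: the <html> tag.
direct = list(soup.children)
print(len(direct))  # 1

# Its descendants are <html>, <head>, <title>, and the title string.
everything = list(soup.descendants)
print(len(everything))  # 4
```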

Example 6: .string

If a tag has only one child and that child is a NavigableString, the child is available as .string. If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string; this is why .string on the head tag below returns the title text.

However, if a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None.

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
htag = soup.head
print(htag.string)

Output:

Geeks For Geeks
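
The None case is not shown in the output above, so here is a minimal sketch of it. The paragraph is a made-up stand-in that holds both a tag and a bare string.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a <b> tag and a string.
soup = BeautifulSoup("<p><b>bold</b> and plain</p>", "html.parser")

# <b> has exactly one child, a string, so .string returns it.
print(soup.b.string)  # bold

# <p> has two children, so .string is None.
print(soup.p.string)  # None

# The individual strings are still reachable through .strings.
parts = list(soup.p.strings)
print(parts)  # ['bold', ' and plain']
```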

Example 7: .strings and .stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))

Output :

'\n'
'Geeks For Geeks'
'\n'
'\n'
'most viewed courses in GFG,its all free'
'\n'
'Top 5 Popular Programming Languages'
'\n'
'Java'
'\n'
'c/c++'
'\n'
'Python'
'\n'
'Javascript'
'\n'
'Ruby'
'\naccording to an online survey. '
'\n'
' Programming Languages'
'\n'

To skip whitespace-only strings and strip leading and trailing whitespace from the rest, use the .stripped_strings generator:

Python3

# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')

# iterating over every string in the document,
# with surrounding whitespace stripped
for string in soup.stripped_strings:
    print(repr(string))

Output:

'Geeks For Geeks'
'most viewed courses in GFG,its all free'
'Top 5 Popular Programming Languages'
'Java'
'c/c++'
'Python'
'Javascript'
'Ruby'
'according to an online survey.'
'Programming Languages'
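
A common follow-up is to combine the stripped strings into a single line of text. The sketch below shows two ways to do that on a small made-up document: joining .stripped_strings by hand, and calling get_text with a separator and strip=True. The " | " separator is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document with padding around each string.
soup = BeautifulSoup("<p> Java </p>\n<p> Python </p>", "html.parser")

# Joining the generator output manually.
joined = " | ".join(soup.stripped_strings)
print(joined)  # Java | Python

# get_text() with strip=True strips each string and skips empty ones.
text = soup.get_text(" | ", strip=True)
print(text)  # Java | Python
```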

Navigating Upwards Through BeautifulSoup :

Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

Example 1: .parent

The .parent attribute retrieves an element's parent element.

Syntax : (BeautifulSoup Variable).parent

Code:

Python3

ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class="prog">Top 5 Popular Programming Languages</p>
<a href="https://www.geeksforgeeks.org/java-programming-examples/" \
class="prog" id="link1">Java</a>
<a href="https://www.geeksforgeeks.org/cc-programs/" class="prog" \
id="link2">c/c++</a>
<a href="https://www.geeksforgeeks.org/python-programming-examples/" \
class="prog" id="link3">Python</a>
<a href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" \
class="prog" id="link4">Javascript</a>
<a href="https://www.geeksforgeeks.org/ruby-programming-language/" \
class="prog" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""
from bs4 import BeautifulSoup

# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')

# assigning the title tag and printing its parent, the head tag
Itag = soup.title
print(Itag.parent)

# the parent of the top-level html tag is the BeautifulSoup object itself
htmlTag = soup.html
print(type(htmlTag.parent))

# the BeautifulSoup object has no parent, so .parent is None
print(soup.parent)

Output:

<head><title>Geeks For Geeks</title></head>
<class 'bs4.BeautifulSoup'>
None

Example 2: .parents

To iterate over all of an element's parents, use the .parents attribute:

Syntax: (BeautifulSoup Variable).parents

Python3

# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')

# assigning the first a tag to the link variable
link = soup.a
print(link)

# iterating over every ancestor of the link, nearest first
for parent in link.parents:
    # the outermost parent is the BeautifulSoup object,
    # whose name is '[document]'
    print(parent.name)

Output:

<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
body
html
[document]
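
When only one particular ancestor is needed, walking .parents by hand can be replaced with find_parent, which searches upwards by tag name. The sketch below uses a small stand-in document (not ht_doc) and shows both approaches side by side.

```python
from bs4 import BeautifulSoup

# Hypothetical nested document: html > body > p > b.
html = '<html><body><p class="title"><b>text</b></p></body></html>'
soup = BeautifulSoup(html, "html.parser")
b_tag = soup.b

# Every ancestor, nearest first; the last one is the BeautifulSoup object.
ancestor_names = [parent.name for parent in b_tag.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# find_parent() jumps straight to the nearest ancestor with a given name.
body = b_tag.find_parent("body")
print(body.name)  # body
```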

Navigating Sideways With BeautifulSoup

.next_sibling and .previous_sibling are attributes used for navigating between page elements that are on the same level of the parse tree.

Syntax:

(BeautifulSoup Variable).(tag name).next_sibling

(BeautifulSoup Variable).(tag name).previous_sibling

Code:

Python3

from bs4 import BeautifulSoup

# the b and c tags share the same parent, so they are siblings
sibling_soup = BeautifulSoup(
    "<a><b>Geeks For Geeks</b><c><strong>The Biggest Online Tutorials "
    "Library, It's all Free</strong></c></a>", 'html.parser')

# to retrieve next sibling of b tag
print(sibling_soup.b.next_sibling)

# for retrieving previous sibling of c tag
print(sibling_soup.c.previous_sibling)

Output:

<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
<b>Geeks For Geeks</b>
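
In real documents, tags are usually separated by line breaks, so the immediate sibling of a tag is often a whitespace string rather than the next tag. The sketch below illustrates this on a small made-up document and uses find_next_sibling to skip ahead to the next tag of a given name.

```python
from bs4 import BeautifulSoup

# Hypothetical document where the links are separated by newlines.
html = '<p>\n<a id="link1">Java</a>\n<a id="link2">Python</a>\n</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the newline between the two links.
print(repr(first.next_sibling))  # '\n'

# find_next_sibling() skips strings and returns the next matching tag.
second = first.find_next_sibling("a")
print(second)  # <a id="link2">Python</a>
```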
