Sunday, November 17, 2024

Navigation with BeautifulSoup

BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree from the parsed page, which can then be used for web scraping: it pulls data out of HTML and XML files and works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Installation

BeautifulSoup is not part of the Python standard library. To install it, run the command below in the terminal.

pip install beautifulsoup4

Navigation With BeautifulSoup

The code snippet below defines the HTML document that the rest of this article uses as a reference for navigating with BeautifulSoup.

Python3




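ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class="prog">Top 5 Popular Programming Languages</p>
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
<a class="prog" href="https://www.geeksforgeeks.org/cc-programs/" id="link2">c/c++</a>
<a class="prog" href="https://www.geeksforgeeks.org/python-programming-examples/" id="link3">Python</a>
<a class="prog" href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" id="link4">Javascript</a>
<a class="prog" href="https://www.geeksforgeeks.org/ruby-programming-language/" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""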


Now let us navigate the above snippet in several ways using BeautifulSoup in Python. The most important components of an HTML document are its tags, which may contain other tags and strings (the tag’s children). BeautifulSoup provides different ways to iterate over these children; let us look at the main cases.

Navigating Downwards

Navigating Using Tag Names :

Example 1: To get Head Tag.

Use .head on the BeautifulSoup object to get the head tag of the HTML document.

Syntax : (BeautifulSoup Variable).head

Example 2: To get Title Tag 

Use the .title attribute to retrieve the title tag of the HTML document held in the BeautifulSoup variable.

Syntax : (BeautifulSoup Variable).title

Code: 

Python3




from bs4 import BeautifulSoup

# parsing the ht_doc string defined above
soup = BeautifulSoup(ht_doc, 'html.parser')
print(soup.head)
print(soup.title)


 
Output: 

<head><title>Geeks For Geeks</title></head>
<title>Geeks For Geeks</title>

Example 3: To get a specific tag.

We can retrieve a specific tag, such as the first <b> tag inside the body tag.

Syntax : (BeautifulSoup Variable).body.b

Using a tag name as an attribute returns only the first tag with that name.

Syntax: (BeautifulSoup Variable).(tag name)

By using find_all, we can get every tag with a given name.

Syntax: (BeautifulSoup Variable).find_all(tag name)

Code:

Python3




soup = BeautifulSoup(ht_doc, 'html.parser')
 
# retrieving the first b tag inside the body tag
print(soup.body.b)
 
# retrieving the first a tag in the document
print(soup.a)
 
# retrieving all a tags in ht_doc
print(soup.find_all("a"))


Output:

<b>most viewed courses in GFG,its all free</b>
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
[<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>,
<a class="prog" href="https://www.geeksforgeeks.org/cc-programs/" id="link2">c/c++</a>,
<a class="prog" href="https://www.geeksforgeeks.org/python-programming-examples/" id="link3">Python</a>,
<a class="prog" href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" id="link4">Javascript</a>,
<a class="prog" href="https://www.geeksforgeeks.org/ruby-programming-language/" id="link5">Ruby</a>]
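The difference between attribute-style access and find_all can be sketched on a smaller document. The snippet below is only an illustration: mini_doc and its example.com links are hypothetical stand-ins, not part of ht_doc.

```python
from bs4 import BeautifulSoup

# mini_doc is a hypothetical stand-in for ht_doc, kept short for illustration
mini_doc = (
    '<html><body>'
    '<p><b>first bold</b> and <b>second bold</b></p>'
    '<a href="https://example.com/one" id="link1">One</a>'
    '<a href="https://example.com/two" id="link2">Two</a>'
    '</body></html>'
)
soup = BeautifulSoup(mini_doc, "html.parser")

# attribute-style access returns only the first matching tag
first_bold = soup.body.b
print(first_bold.string)

# find_all returns every matching tag; attributes are read like dict keys
links = soup.find_all("a")
hrefs = [link["href"] for link in links]
print(hrefs)

# a tag name that does not occur in the document gives None
missing = soup.table
print(missing)
```

Because a missing tag comes back as None, chained access such as soup.table.tr raises an AttributeError on such a document, so it is usually worth checking for None first.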

Example 4: .contents and .children

We can get a tag’s direct children as a list by using .contents. The .children attribute gives the same children as an iterator instead of a list.

Syntax: (BeautifulSoup Variable).contents

Code:

Python3




soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assigning the head tag of the BeautifulSoup variable
hTag = soup.head
print(hTag)
 
# retrieving the direct children of the head tag as a list
print(hTag.contents)


Output:

<head><title>Geeks For Geeks</title></head>
[<title>Geeks For Geeks</title>]
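Since the example above only uses .contents, here is a minimal sketch of how .contents and .children relate. The small div document is an assumption made for illustration and is not part of ht_doc.

```python
from bs4 import BeautifulSoup

# a small hypothetical document with two direct children under <div>
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
div_tag = soup.div

# .contents is a plain list, so it supports len() and indexing
child_count = len(div_tag.contents)
first_child_text = div_tag.contents[0].string
print(child_count, first_child_text)

# .children walks the same direct children one at a time
child_texts = [child.string for child in div_tag.children]
print(child_texts)
```

Use .contents when you need indexing or a length, and .children when you only need to loop once.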

Example 5: .descendants

The .descendants attribute allows you to iterate over all of a tag’s children recursively: its direct children, the children of its direct children, and so on.

Syntax: (Variable assigned from BeautifulSoup Variable).descendants

Code:

Python3




# embedding html document into BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assigning the head tag of the BeautifulSoup variable
htag = soup.head
 
# iterating through every descendant of the htag variable
for child in htag.descendants:
    print(child)


Output :

<title>Geeks For Geeks</title>
Geeks For Geeks
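To make the contrast with direct children concrete, the following sketch counts both for the same head tag. It re-creates only the head portion of the document so that it runs on its own.

```python
from bs4 import BeautifulSoup

# only the head portion of the article's document, so the sketch is self-contained
soup = BeautifulSoup("<head><title>Geeks For Geeks</title></head>", "html.parser")
head_tag = soup.head

# direct children only: the <title> tag
direct = list(head_tag.children)

# every nested node: the <title> tag and the string inside it
nested = list(head_tag.descendants)

print(len(direct), len(nested))
```

The head tag has one child but two descendants, because the title string is a child of the title tag, not of the head tag.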

Example 6: .string

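If a tag has only one child, and that child is a NavigableString, the child is made available as .string. If a tag’s only child is another tag that itself has a .string, the parent tag shares that .string, which is why .string on the head tag below returns the title text.

However, if a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None. The code below shows the single-child case.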

Python3




soup = BeautifulSoup(ht_doc, 'html.parser')
htag = soup.head
print(htag.string)


Output:

Geeks For Geeks
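The None case is not shown by the output above, so here is a minimal sketch of it. The one-line paragraph is a hypothetical document chosen so that the p tag has two children.

```python
from bs4 import BeautifulSoup

# hypothetical document: <p> holds a <b> tag plus a trailing string
soup = BeautifulSoup("<p><b>bold</b> plain</p>", "html.parser")

# <b> has exactly one string child, so .string returns it
bold_text = soup.b.string
print(bold_text)

# <p> has two children, so .string is None
para_string = soup.p.string
print(para_string)

# get_text() still concatenates all the text inside <p>
para_text = soup.p.get_text()
print(para_text)
```

When .string comes back as None, get_text() or the .strings generator described next are the usual alternatives.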

Example 7: .strings and .stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.

Python3




soup = BeautifulSoup(ht_doc, 'html.parser')
for string in soup.strings :
    print(repr(string))


Output :

'\n'
'Geeks For Geeks'
'\n'
'\n'
'most viewed courses in GFG,its all free'
'\n'
'Top 5 Popular Programming Languages'
'\n'
'Java'
'\n'
'c/c++'
'\n'
'Python'
'\n'
'Javascript'
'\n'
'Ruby'
'\n'
'according to an online survey. '
'\n'
' Programming Languages'
'\n'

To skip whitespace-only strings and strip leading and trailing whitespace from the rest, use the .stripped_strings generator:

Python3




# embedding HTML document in BeautifulSoup-assigned variable
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# iterating through string in stripped_strings of
# BeautifulSoup assigned variable
for string in soup.stripped_strings :
    print(repr(string))


Output:

'Geeks For Geeks'
'most viewed courses in GFG,its all free'
'Top 5 Popular Programming Languages'
'Java'
'c/c++'
'Python'
'Javascript'
'Ruby'
'according to an online survey.'
'Programming Languages'
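A common follow-up is to collect the cleaned strings into a list or a single line of text. The sketch below assumes a tiny stand-in document rather than ht_doc, so the exact strings are illustrative only.

```python
from bs4 import BeautifulSoup

# hypothetical snippet with padding and newlines between tags
soup = BeautifulSoup("<p>  Top 5  </p>\n<a>Java</a>\n<a> Python </a>", "html.parser")

# .strings keeps whitespace exactly as it appears in the markup
raw = list(soup.strings)

# .stripped_strings drops whitespace-only strings and trims the rest
clean = list(soup.stripped_strings)

# joining the cleaned strings gives one readable line
joined = ", ".join(clean)
print(raw)
print(clean)
print(joined)
```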

Navigating Upwards Through BeautifulSoup :

If we consider a “family tree” analogy, every tag and every string has a parent: the tag that contains it.

Example 1: .parent

The .parent attribute is used for retrieving an element’s parent element.

Syntax : (BeautifulSoup Variable).parent

Code:

Python3




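from bs4 import BeautifulSoup

ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class="prog">Top 5 Popular Programming Languages</p>
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
<a class="prog" href="https://www.geeksforgeeks.org/cc-programs/" id="link2">c/c++</a>
<a class="prog" href="https://www.geeksforgeeks.org/python-programming-examples/" id="link3">Python</a>
<a class="prog" href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" id="link4">Javascript</a>
<a class="prog" href="https://www.geeksforgeeks.org/ruby-programming-language/" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""

# embedding html document
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assigning the title tag to Itag and printing its parent element
Itag = soup.title
print(Itag.parent)
 
# the parent of the html tag is the BeautifulSoup object itself
htmlTag = soup.html
print(type(htmlTag.parent))
 
# the BeautifulSoup object has no parent
print(soup.parent)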


Output:

<head><title>Geeks For Geeks</title></head>
<class 'bs4.BeautifulSoup'>
None
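Strings have parents too, which the example above does not show. The following sketch walks up from the title string one step at a time, using only the head portion of the document so that it runs on its own.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Geeks For Geeks</title></head></html>", "html.parser"
)

# the string inside <title> is a NavigableString whose parent is the <title> tag
title_text = soup.title.string
string_parent = title_text.parent.name

# the <title> tag's parent is <head>
title_parent = soup.title.parent.name

# the top-level <html> tag's parent is the BeautifulSoup object, named [document]
html_parent = soup.html.parent.name

print(string_parent, title_parent, html_parent)
```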

Example 2: .parents

To iterate over all of an element’s ancestors, from its direct parent up to the top of the document, use the .parents attribute:

Syntax : (BeautifulSoup Variable).parents

Python3




# embedding html doc into BeautifulSoup
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assigning the first a tag to the link variable
link = soup.a
print(link)
 
# iterating through every ancestor of the link variable;
# the last one is the BeautifulSoup object, named [document]
for parent in link.parents:
    print(parent.name)


Output: 

<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
body
html
[document]
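Because .parents is a generator, its results can also be collected into a list, for instance to build a path from a tag up to the document root. This is a minimal sketch on a hypothetical nested document, not on ht_doc.

```python
from bs4 import BeautifulSoup

# hypothetical document in which the link is nested inside a paragraph
soup = BeautifulSoup(
    "<html><body><p><a href='#'>Java</a></p></body></html>", "html.parser"
)

# collecting the name of every ancestor, nearest first
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# reversing the list gives a root-to-tag path
path = " > ".join(reversed(ancestor_names))
print(path)
```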

Navigating Sideways With BeautifulSoup

.next_sibling and .previous_sibling are the attributes used for navigating between page elements that are on the same level of the parse tree.

Syntax:

(BeautifulSoup Variable).(tag name).next_sibling

(BeautifulSoup Variable).(tag name).previous_sibling

Code: 

Python3




from bs4 import BeautifulSoup

# the c tag is closed with </c>, and the parser is named explicitly
sibling_soup = BeautifulSoup(
    "<a><b>Geeks For Geeks</b><c><strong>The "
    "Biggest Online Tutorials Library, It's all Free</strong></c></a>",
    'html.parser')
 
# to retrieve next sibling of b tag
print(sibling_soup.b.next_sibling)
 
# for retrieving previous sibling of c tag
print(sibling_soup.c.previous_sibling)


Output:

<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
<b>Geeks For Geeks</b>
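In the example above the tags sit directly next to each other. In real documents there is usually whitespace between tags, and that whitespace is itself a sibling. The sketch below, on a hypothetical div document, shows this and one way around it with find_next_sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# hypothetical document with newlines between the two links
soup = BeautifulSoup(
    "<div>\n<a id='link1'>Java</a>\n<a id='link2'>Python</a>\n</div>",
    "html.parser",
)
first = soup.a

# the immediate next sibling is the newline string, not the second link
immediate = first.next_sibling
print(repr(immediate))

# find_next_sibling skips strings and returns the next matching tag
second = first.find_next_sibling("a")
print(second["id"])

# .next_siblings yields everything that follows; keep only the tags
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print(len(tag_siblings))
```

The same applies in the other direction with .previous_sibling, find_previous_sibling, and .previous_siblings.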

Dominic Rubhabha-Wardslaus