BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree from the parsed page, which makes it useful for web scraping, and it works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying that tree.
Installation
This module does not come built in with Python. To install it, type the command below in the terminal.
pip install bs4
Navigation With BeautifulSoup
The code snippet below is the HTML document that we will use as a reference while navigating with BeautifulSoup.
Python3
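ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class ="prog">Top 5 Popular Programming Languages</p>
<a href="https://www.geeksforgeeks.org/java-programming-examples/" \
class="prog" id="link1">Java</a>
<a href="https://www.geeksforgeeks.org/cc-programs/" class="prog" \
id="link2">c/c++</a>
<a href="https://www.geeksforgeeks.org/python-programming-examples/" \
class="prog" id="link3">Python</a>
<a href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" \
class="prog" id="link4">Javascript</a>
<a href="https://www.geeksforgeeks.org/ruby-programming-language/" \
class="prog" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""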
Now let us navigate this snippet in every way BeautifulSoup allows. The most important components of an HTML document are its tags, which may contain other tags or strings (the tag's children). BeautifulSoup provides several ways to iterate over these children. Let us look at each case.
Navigating Downwards
Navigating Using Tag Names:
Example 1: To get Head Tag.
Use .head on the BeautifulSoup object to get the head tag of the HTML document.
Syntax : (BeautifulSoup Variable).head
Example 2: To get Title Tag
Use .title to retrieve the title tag of the HTML document held in the BeautifulSoup variable.
Syntax : (BeautifulSoup Variable).title
Code:
Python3
from bs4 import BeautifulSoup

# parse the HTML document with Python's built-in parser
soup = BeautifulSoup(ht_doc, 'html.parser')
print(soup.head)
print(soup.title)
Output:
<head><title>Geeks For Geeks</title></head>
<title>Geeks For Geeks</title>
Example 3: To get a specific tag.
We can retrieve a specific tag, such as the first <b> tag inside the body tag.
Syntax : (BeautifulSoup Variable).body.b
Using a tag name as an attribute returns only the first tag with that name.
Syntax: (BeautifulSoup Variable).(tag name)
To get every tag with a given name, use find_all, which returns them in a list.
Syntax: (BeautifulSoup Variable).find_all(tag name)
Code:
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')

# retrieving the first b tag inside body
print(soup.body.b)

# retrieving the first a tag from the BeautifulSoup variable
print(soup.a)

# retrieving all a tags in ht_doc
print(soup.find_all("a"))
Output:
<b>most viewed courses in GFG,its all free</b>
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
[<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>,
<a class="prog" href="https://www.geeksforgeeks.org/cc-programs/" id="link2">c/c++</a>,
<a class="prog" href="https://www.geeksforgeeks.org/python-programming-examples/" id="link3">Python</a>,
<a class="prog" href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" id="link4">Javascript</a>,
<a class="prog" href="https://www.geeksforgeeks.org/ruby-programming-language/" id="link5">Ruby</a>]
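As a rough illustration of the difference between attribute access and find_all, the sketch below parses a small hypothetical fragment (the example.com URLs and the names snippet, first_link, and all_links are placeholders, not part of the document above) and reads an attribute and the text from each matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link fragment; the example.com URLs are placeholders
snippet = """
<a href="https://example.com/java" class="prog" id="link1">Java</a>
<a href="https://example.com/python" class="prog" id="link2">Python</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# soup.a is only the first match; find_all returns every match in a list
first_link = soup.a
all_links = soup.find_all("a")

print(first_link["id"])   # link1
print(len(all_links))     # 2

# Read the text and one attribute from each tag
for link in all_links:
    print(link.get_text(), link.get("href"))
```

Using link.get("href") rather than link["href"] returns None instead of raising KeyError when a tag lacks the attribute, which tends to be safer on real pages.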
Example 4: .contents and .children
A tag's direct children are available as a list through .contents. The .children attribute yields the same children one at a time as an iterator.
Syntax: (BeautifulSoup Variable).contents
Code:
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')

# assigning the head tag of the BeautifulSoup variable
hTag = soup.head
print(hTag)

# retrieving the children of the head tag as a list
print(hTag.contents)
Output:
<head><title>Geeks For Geeks</title></head>
[<title>Geeks For Geeks</title>]
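The example above only uses .contents. A minimal sketch of how .contents and .children relate, using a small hypothetical list fragment rather than ht_doc (the names snippet, ul_tag, and child_names are illustrative), might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, chosen so the tag has exactly two direct children
snippet = "<ul><li>Java</li><li>Python</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")
ul_tag = soup.ul

# .contents is a real list, so it supports len() and indexing
contents = ul_tag.contents
print(len(contents))   # 2
print(contents[0])     # <li>Java</li>

# .children iterates over the same direct children without building a list
child_names = [child.name for child in ul_tag.children]
print(child_names)     # ['li', 'li']
```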
Example 5: .descendants
The .descendants attribute lets you iterate over all of a tag's children recursively: its direct children, the children of its direct children, and so on.
Syntax: (Variable assigned from BeautifulSoup Variable).descendants
Code:
Python3
# embedding html document into BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')

# assigning head element of the BeautifulSoup variable
htag = soup.head

# iterating through each child in the descendants of htag
for child in htag.descendants:
    print(child)
Output :
<title>Geeks For Geeks</title>
Geeks For Geeks
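To make the difference between direct children and descendants concrete, here is a small sketch on a hypothetical nested fragment (snippet, div_tag, and the list names are illustrative assumptions). It counts both and separates tags from strings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical nested fragment: div > p > (text, b > text)
snippet = "<div><p>Top <b>languages</b></p></div>"
soup = BeautifulSoup(snippet, "html.parser")
div_tag = soup.div

direct_children = list(div_tag.children)
all_descendants = list(div_tag.descendants)

print(len(direct_children))   # 1  (only the <p> tag)
print(len(all_descendants))   # 4  (<p>, 'Top ', <b>, 'languages')

# Descendants mix tags and strings; isinstance separates them
tag_names = [node.name for node in all_descendants if isinstance(node, Tag)]
print(tag_names)              # ['p', 'b']
```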
Example 6: .string
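If a tag has only one child, and that child is a NavigableString, the child is made available as .string. If the tag's only child is another tag that itself has a .string, the parent tag is considered to have the same .string.
However, if a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None. In the code below, head contains only the title tag, so its .string is the title's text.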
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
htag = soup.head
print(htag.string)
Output:
Geeks For Geeks
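The case where .string is None is not exercised by the example above. A minimal sketch, assuming a hypothetical paragraph with mixed content (the name snippet is illustrative), could be:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with three children: 'Top ', <b>5</b>, ' languages'
snippet = "<p>Top <b>5</b> languages</p>"
soup = BeautifulSoup(snippet, "html.parser")

# <b> has a single string child, so .string returns it
print(soup.b.string)      # 5

# <p> has several children, so .string is None
print(soup.p.string)      # None

# get_text() concatenates every string under the tag instead
print(soup.p.get_text())  # Top 5 languages
```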
Example 7: .strings and .stripped_strings
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))
Output :
'\n'
'Geeks For Geeks'
'\n'
'\n'
'most viewed courses in GFG,its all free'
'\n'
'Top 5 Popular Programming Languages'
'\n'
'Java'
'\n'
'c/c++'
'\n'
'Python'
'\n'
'Javascript'
'\n'
'Ruby'
'\n'
'according to an online survey. '
'\n'
' Programming Languages'
'\n'
'\n'
To drop whitespace-only strings and strip leading and trailing whitespace from the rest, use the .stripped_strings generator:
Python3
# embedding HTML document in the BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')

# iterating through each string in stripped_strings
# of the BeautifulSoup variable
for string in soup.stripped_strings:
    print(repr(string))
Output:
'Geeks For Geeks'
'most viewed courses in GFG,its all free'
'Top 5 Popular Programming Languages'
'Java'
'c/c++'
'Python'
'Javascript'
'Ruby'
'according to an online survey.'
'Programming Languages'
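One common use of .stripped_strings is collapsing a tag's text into a single clean line. The sketch below assumes a small hypothetical fragment (snippet, pieces, and joined are illustrative names) and compares joining the generator by hand with get_text(" ", strip=True), which should give the same result here.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace around each string
snippet = "<div>\n  <p> Java </p>\n  <p> Python </p>\n</div>"
soup = BeautifulSoup(snippet, "html.parser")

# Whitespace-only strings are skipped and the rest are stripped
pieces = list(soup.stripped_strings)
print(pieces)   # ['Java', 'Python']

# Join the cleaned strings into one line
joined = " ".join(pieces)
print(joined)   # Java Python

# get_text with a separator and strip=True produces the same line
print(soup.get_text(" ", strip=True))   # Java Python
```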
Navigating Upwards Through BeautifulSoup:
In the “family tree” analogy, every tag and every string has a parent: the tag that contains it.
Example 1: .parent
The .parent attribute retrieves an element's parent element.
Syntax : (BeautifulSoup Variable).parent
Code:
Python3
from bs4 import BeautifulSoup

ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class ="prog">Top 5 Popular Programming Languages</p>
<a href="https://www.geeksforgeeks.org/java-programming-examples/" \
class="prog" id="link1">Java</a>
<a href="https://www.geeksforgeeks.org/cc-programs/" class="prog" \
id="link2">c/c++</a>
<a href="https://www.geeksforgeeks.org/python-programming-examples/" \
class="prog" id="link3">Python</a>
<a href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" \
class="prog" id="link4">Javascript</a>
<a href="https://www.geeksforgeeks.org/ruby-programming-language/" \
class="prog" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""

# embedding html document
soup = BeautifulSoup(ht_doc, 'html.parser')

# assigning title tag of the BeautifulSoup variable
Itag = soup.title

# printing the parent element of Itag
print(Itag.parent)

# the parent of the top-level html tag is the BeautifulSoup object
htmlTag = soup.html
print(type(htmlTag.parent))

# the BeautifulSoup object itself has no parent
print(soup.parent)
Output:
<head><title>Geeks For Geeks</title></head>
<class 'bs4.BeautifulSoup'>
None
Example 2: .parents
To iterate over all of an element's parents, use the .parents attribute:
Syntax : (BeautifulSoup Variable).parents
Python3
# embedding html doc into BeautifulSoup
soup = BeautifulSoup(ht_doc, 'html.parser')

# storing the first a tag in the link variable
link = soup.a
print(link)

# iterating through each parent of the link variable
for parent in link.parents:
    # guard for the case where a parent is None
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
body
html
[document]
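Because .parents walks from the element up to the document root, it can be used to reconstruct where an element sits in the tree. A rough sketch, assuming a small hypothetical document (snippet, ancestor_names, and path are illustrative names):

```python
from bs4 import BeautifulSoup

# Hypothetical document with a link nested three levels deep
snippet = "<html><body><p><a href='#'>Java</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")
link = soup.a

# .parents yields the nearest ancestor first and the document object last
ancestor_names = [parent.name for parent in link.parents]
print(ancestor_names)   # ['p', 'body', 'html', '[document]']

# Reverse the ancestors to get a root-to-element path
path = " > ".join(list(reversed(ancestor_names)) + [link.name])
print(path)             # [document] > html > body > p > a
```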
Navigating Sideways With BeautifulSoup
.next_sibling and .previous_sibling are the attributes used for navigating between page elements that are on the same level of the parse tree.
Syntax:
(BeautifulSoup Variable).(tag attribute).next_sibling
(BeautifulSoup Variable).(tag attribute).previous_sibling
Code:
Python3
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup(
    "<a><b>Geeks For Geeks</b><c><strong>The Biggest Online Tutorials "
    "Library, It's all Free</strong></c></a>",
    'html.parser')

# to retrieve next sibling of b tag
print(sibling_soup.b.next_sibling)

# for retrieving previous sibling of c tag
print(sibling_soup.c.previous_sibling)
Output:
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
<b>Geeks For Geeks</b>
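In the example above the tags sit directly next to each other. In documents formatted with line breaks, the immediate sibling of a tag is often a whitespace string rather than the next tag. The sketch below, on a hypothetical list (snippet, first_item, and following are illustrative names), shows that behavior along with find_next_sibling and .next_siblings, which help skip over such strings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical list where each item is on its own line
snippet = """<ul>
<li>Java</li>
<li>Python</li>
<li>Ruby</li>
</ul>"""
soup = BeautifulSoup(snippet, "html.parser")
first_item = soup.li

# The direct next sibling is the newline between the two tags
print(repr(first_item.next_sibling))        # '\n'

# find_next_sibling skips strings and returns the next matching tag
print(first_item.find_next_sibling("li"))   # <li>Python</li>

# .next_siblings iterates over everything that follows on the same level
following = [s.get_text() for s in first_item.next_siblings
             if isinstance(s, Tag)]
print(following)                            # ['Python', 'Ruby']
```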