Saturday, November 16, 2024

Encoding in BeautifulSoup

Character encoding plays a major role in how the content of an HTML or XML document is interpreted. A document may contain not only English characters but also characters from other scripts such as Hebrew and Greek, as well as accented Latin letters. To tell the parser which encoding method to use, a document carries a dedicated tag and attribute. For example:

In HTML documents

<meta charset="--encoding method name--" content="text/html">

In XML documents

<?xml version="1.0" encoding="--encoding method name--"?>

These declarations tell the browser or parser which encoding method to use when reading the document. If the proper encoding method is not specified, the content is either rendered incorrectly or shown with the replacement character ‘�’.
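The effect of a wrong encoding can be reproduced with plain Python, without any parser. The sketch below is only an illustration: the sample word and the variable names are assumptions chosen for this example and do not come from a real document.

```python
# A Hebrew word stored as ISO-8859-8 bytes (sample text chosen for illustration)
text = "שלום"
raw = text.encode("iso-8859-8")

# Decoding with the right encoding restores the original text
right = raw.decode("iso-8859-8")

# Decoding with a wrong single-byte encoding silently gives wrong characters
wrong = raw.decode("latin-1")

# The bytes are not valid UTF-8; errors="replace" substitutes U+FFFD for them
replaced = raw.decode("utf-8", errors="replace")

print(right)
print(wrong)
print(replaced)
```

The second and third results correspond to the two failure modes described above: incorrectly rendered content and the replacement character.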

XML encoding methods 

The XML documents can be encoded in one of the formats listed below. 

  • UTF-8 
  • UTF-16
  • Latin1
  • US-ASCII
  • ISO-8859-1 to ISO-8859-10

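Among these methods, UTF-8 is the most commonly found. UTF-16 uses two or four bytes for each character, and documents encoded this way usually start with a byte order mark (the bytes 0xFE 0xFF or 0xFF 0xFE, depending on byte order). Latin1 is another name for ISO-8859-1 and covers Western European characters.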
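To get a feel for how these encodings differ, the byte length of the same character can be compared with Python's built-in codecs. This is a small sketch; the sample characters are arbitrary choices made for the illustration.

```python
import codecs

sample = "é"  # U+00E9, a character that exists in Latin-1

# Number of bytes the same character needs in each encoding
sizes = {
    "utf-8": len(sample.encode("utf-8")),
    "utf-16-le": len(sample.encode("utf-16-le")),
    "latin-1": len(sample.encode("latin-1")),
}
print(sizes)

# A character outside the Basic Multilingual Plane needs four bytes in UTF-16
emoji_size = len("\N{GRINNING FACE}".encode("utf-16-le"))
print(emoji_size)

# The generic "utf-16" codec prepends a byte order mark (BOM)
with_bom = "A".encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)

# US-ASCII cannot represent the character at all
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```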

HTML encoding methods

The HTML and HTML5 documents can be encoded by any one of the methods below.

  • UTF-8
  • UTF-16
  • ISO-8859-1
  • UTF-16BE (Big Endian)
  • UTF-16LE (Little Endian)
  • WINDOWS-874
  • WINDOWS-1250 to WINDOWS-1258

For HTML5 documents, UTF-8 is the recommended encoding. ISO-8859-1 is mostly used with XHTML documents. Some methods such as UTF-7, UTF-32, BOCU-1 and CESU-8 are explicitly discouraged, because a browser that does not support them may replace most of the characters with the replacement character ‘�’.
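The difference between the big-endian and little-endian variants of UTF-16 is only the order of the two bytes in each code unit. A minimal sketch with Python's built-in codecs (the sample letter is an arbitrary choice):

```python
text = "A"  # U+0041

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

print(big)
print(little)

# Decoding with the wrong byte order does not raise an error here,
# it simply yields a different code point
misread = big.decode("utf-16-le")
print(hex(ord(misread)))
```

This is why a byte order mark or an explicit declaration matters for UTF-16 documents.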

BeautifulSoup and encoding

The BeautifulSoup module, imported from the bs4 package, makes HTML/XML parsing much easier. It offers many methods: some select content by tag name or by an attribute present in the tag, some extract content based on the hierarchy, and some print the content with the indentation expected for HTML. The bs4 module auto-detects the encoding method used in a document and converts the content to Unicode. The returned BeautifulSoup object has various attributes that give more information about the result. However, the detection sometimes predicts the wrong encoding method. Thus, if the encoding method is known by the user, it is good to pass it as an argument. This article describes the various ways in which encoding methods can be specified in the bs4 module.
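The kinds of operations mentioned above can be sketched as follows. The markup is a small hypothetical document made up for this illustration; only standard bs4 calls are used.

```python
from bs4 import BeautifulSoup

# Small hypothetical document used only for this illustration
markup = '<div id="main"><p class="intro">Hi <a href="/x">link</a></p></div>'
soup = BeautifulSoup(markup, "html.parser")

# Select by tag name
first_paragraph = soup.p

# Select by attribute
main_div = soup.find(id="main")
link_target = soup.find("a", href=True)["href"]

# Navigate the hierarchy
parent_name = soup.a.parent.name

print(first_paragraph.get_text())
print(main_div.name, link_target, parent_name)

# Print with indentation
print(soup.prettify())
```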

original_encoding

The bs4 module has a sub-library called Unicode, Dammit that detects the encoding method of a document and uses it to convert the content to Unicode characters. The original_encoding attribute returns the detected encoding method.
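Unicode, Dammit can also be used on its own through the UnicodeDammit class. The sketch below uses sample bytes chosen for illustration and supplies the candidate encodings by hand instead of relying on automatic detection.

```python
from bs4.dammit import UnicodeDammit

# Bytes in a Western European single-byte encoding (sample data)
data = b"Sacr\xe9 bleu!"

# The second argument lists encodings to try, in order
dammit = UnicodeDammit(data, ["latin-1", "iso-8859-1"])

# The decoded text and the encoding that was used to decode it
print(dammit.unicode_markup)
print(dammit.original_encoding)
```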

Example 1 :

Given an HTML element parse it and find the encoding method used.

Python3




from bs4 import BeautifulSoup
 
# HTML element with content
h1 = b"<h1>Hello world!!</h1>"
 
# parsing with html parser
parsed = BeautifulSoup(h1, "html.parser")
 
# tag found
print("Tag found :", parsed.h1.name)
 
# the content inside the tag
print("Content :", parsed.h1.string)
 
# the encoded method
print("Encoding method :", parsed.original_encoding)


Output:

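Here, the HTML element is prefixed with ‘b’, which makes it a bytes literal. The markup contains only ASCII characters, so the ASCII encoding method is detected and used by the parser (the exact name reported can depend on whether a character detection library such as chardet is installed). In real-world situations, the original encoding will usually be the one declared in the HTML document.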
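Two related cases can be sketched with made-up input: markup that is already a str, where there is nothing to detect, and bytes that declare their own charset. The sample documents and variable names below are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

# Already-decoded text (str): no encoding detection takes place
from_text = BeautifulSoup("<h1>Hello world!!</h1>", "html.parser")
print(from_text.original_encoding)

# Bytes with a declared charset: the declaration is taken into account
declared = (
    b'<html><head><meta charset="iso-8859-8"></head>'
    b"<body><h1>\xf9\xec\xe5\xed</h1></body></html>"
)
from_bytes = BeautifulSoup(declared, "html.parser")
print(from_bytes.original_encoding)
print(from_bytes.h1.string)
```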

Example 2: 

Given a URL, parse the contents and find the original encoding method.

Python3




from bs4 import BeautifulSoup
import requests
 
 
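# placeholder address (an assumption, not from the original article);
# replace it with the page you want to inspect
URL = "https://www.example.com/"
 
# request the page from the server
page = requests.get(URL)
 
# parse the contents of the page
soup = BeautifulSoup(page.content, "html.parser")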
 
# encoded method
print("Encoded method :", soup.original_encoding)


Output

Encoded method : utf-8

Verifying the output :

Python3




from bs4 import BeautifulSoup
 
# `page` is the response fetched in Example 2
soup = BeautifulSoup(page.content, "html.parser")
 
# fetch the first <meta> tag that has
# a charset attribute
tag = soup.find("meta", charset=True)
 
print("Encoding method :", tag["charset"])


Output

Encoding method : UTF-8

from_encoding

This is a parameter that can be passed to the BeautifulSoup() constructor. It tells the bs4 module explicitly which encoding method to use. This saves detection time and avoids incorrect parsing caused by a wrong guess.

Example :

Python3




from bs4 import BeautifulSoup
 
# HTML element (named markup to avoid shadowing the built-in input())
markup = b"<h1>\xa2\xf6`\xe0</h1>"
 
# parsing content (no parser or encoding is specified here)
soup = BeautifulSoup(markup)
 
print("Content :", soup.h1.string)
 
print("Encoding method :", soup.original_encoding)


Running this code may generate the warning and error below:

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html5lib")

  markup_type=markup_type))

Traceback (most recent call last):
  File "/home/98e5f50281480cda5f5e31e3bcafb085.py", line 9, in <module>
    print("Content :",soup.h1.string)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

The editor in Lazyroar tried to print the result with the ASCII codec and ended up with an error. On a local machine the same code runs without this error, but the content actually corresponds to "ISO-8859-8" and the characters guessed by bs4 are not the desired ones. Thus, by explicitly mentioning the encoding method when it is known, the correct output will be given.

Python3




from bs4 import BeautifulSoup
 
# HTML element (named markup to avoid shadowing the built-in input())
markup = b"<h1>\xa2\xf6`\xe0</h1>"
 
# parsing content with the known encoding
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
 
print("Content :", soup.h1.string)
 
print("Encoding method :", soup.original_encoding)


Output:
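As a cross-check that does not depend on detection, the same bytes can be decoded directly with Python's built-in ISO-8859-8 codec and compared with what bs4 returns when from_encoding is given. This is a minimal sketch; the variable names are chosen for the illustration.

```python
from bs4 import BeautifulSoup

raw = b"\xa2\xf6`\xe0"

# Plain Python decoding with the known encoding
expected = raw.decode("iso-8859-8")

# The same bytes parsed by bs4 with from_encoding
soup = BeautifulSoup(
    b"<h1>" + raw + b"</h1>", "html.parser", from_encoding="iso-8859-8"
)

print(expected)
print(soup.h1.string == expected)
print(soup.original_encoding)
```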

Output encoding

When the parsed HTML content has to be written out, the bs4 module delivers it as a UTF-8 encoded document by default, even if the original document used another encoding or the encoding was mispredicted. If you want the document to be encoded by another method without passing it to the constructor, the following can be done:

  • prettify() : This method returns the HTML content with proper indentation. The encoding method to be used can be passed as a parameter, in which case the output is encoded with that method and returned as bytes.

Example :

Python3




# import module
from bs4 import BeautifulSoup
 
# HTML document declared as ISO-8859-8
# (named markup to avoid shadowing the built-in input())
markup = b'''<html>
<meta charset="iso-8859-8"/>
<body>
<h1>\xa2\xf6`\xe0</h1>
</body>
</html>'''
 
# parsing content
soup = BeautifulSoup(markup, "html.parser")
 
print(soup.prettify())


 Output:

 

Here, the charset in the <meta> tag of the output is set to UTF-8, although the document was declared as ISO-8859-8. To keep the original encoding, pass it to prettify() as below.

Python3




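from bs4 import BeautifulSoup
 
# HTML document declared as ISO-8859-8
# (named markup to avoid shadowing the built-in input())
markup = b'''<html>
<meta charset="iso-8859-8"/>
<body>
<h1>\xa2\xf6`\xe0</h1>
</body>
</html>'''
 
# parsing content
soup = BeautifulSoup(markup, "html.parser")
 
# passing an encoding makes prettify() return bytes
print(soup.prettify("iso-8859-8"))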


Output:

b'<html>\n <meta charset="iso-8859-8"/>\n <body>\n  <h1>\n   \xa2\xf6`\xe0\n  </h1>\n </body>\n</html>'
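The difference between the two prettify() calls can be checked programmatically. The sketch below reuses the same kind of document in a compact one-line form; the variable names are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

markup = (
    b'<html><head><meta charset="iso-8859-8"/></head>'
    b"<body><h1>\xa2\xf6`\xe0</h1></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

as_text = soup.prettify()               # no encoding: returns str
as_bytes = soup.prettify("iso-8859-8")  # with encoding: returns bytes

print(type(as_text).__name__, type(as_bytes).__name__)

# The charset written into the <meta> tag follows the output encoding
print('charset="utf-8"' in as_text)
print(b'charset="iso-8859-8"' in as_bytes)

# The encoded output still contains the original bytes of the heading
print(b"\xa2\xf6`\xe0" in as_bytes)
```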

  • encode() : This method returns the document or a tag as bytes in the encoding method passed to it. Characters that cannot be represented in that encoding are replaced with the corresponding XML character references.

Example :

Python3




from bs4 import BeautifulSoup
 
# HTML document (named markup to avoid shadowing the built-in input())
markup = b"<html><head></head><body><h1>\xa2\xf6`\xe0</h1></body></html>"
 
# parsing content with an explicit parser; the encoding is auto-detected
soup = BeautifulSoup(markup, "html.parser")
 
print("Content :", soup.h1.string)
 
print("Encoding method :", soup.original_encoding)
 
print("After explicit encoding :", soup.html.encode("iso-8859-8"))


Output:
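The replacement with XML character references is easiest to see with a character that the target encoding cannot represent. The sketch below uses a snowman character as sample input, which is an arbitrary choice for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")

# UTF-8 can represent the snowman directly
utf8_bytes = soup.b.encode("utf-8")

# Latin-1 cannot, so an XML character reference is written instead
latin1_bytes = soup.b.encode("latin-1")

print(utf8_bytes)
print(latin1_bytes)
```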
