Thursday, December 26, 2024

How to resolve a UnicodeDecodeError for a CSV file in Python?

Several errors can arise when a byte string is decoded with a particular encoding scheme, because not every encoding scheme can represent all code points and not every byte sequence is valid in every scheme. One of the most common is the UnicodeDecodeError, which occurs when a byte string is decoded with the wrong encoding scheme. This article explains how to resolve a UnicodeDecodeError for a CSV file in Python.

Why does the UnicodeDecodeError arise?

The error occurs when the byte string contains a byte or byte sequence that is not valid in the encoding used to decode it. To solve the issue, the byte string should be decoded using the same encoding with which it was encoded, i.e., the encoding scheme must be the same on both the encoding and the decoding side.
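The rule above can be illustrated with a short round trip. This is a minimal sketch; the sample string and the variable names are chosen only for illustration.

```python
# Encode a string once, then decode the bytes with the same codec
# and with a different one.
text = "año"

encoded = text.encode("utf-8")  # the letter ñ becomes the two bytes C3 B1

# Same codec on both sides: the round trip restores the original string.
round_trip = encoded.decode("utf-8")

# Different codec: ASCII has no meaning for bytes above 0x7F, so decoding fails.
try:
    encoded.decode("ascii")
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

print(round_trip == text)
print(mismatch_error)
```

The first print reports whether the round trip preserved the string; the second shows the exception raised by the mismatched codec.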

For demonstration, the error is first reproduced and then fixed. In the code below, the byte string b"a" is decoded with the ASCII codec successfully. The attempt to decode the byte string b"a\xf1" then fails, because ASCII only defines the byte values 0 to 127 and 0xf1 (241) lies outside that range. Decoding any byte outside this range with the ASCII codec raises the "ordinal not in range(128)" error.

Python3




t = b"a".decode("ascii")
  
# Produces error
t1 = b"a\xf1".decode("ascii")


Output:

Traceback (most recent call last):
 File "C:/Users/Sauleyayan/PycharmProjects/untitled1/venv/mad philes.py", line 5, in <module>
   t1 = b"a\xf1".decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 1: ordinal not in range(128)

To rectify the error, a codec that can interpret the byte 0xf1 must be used. In this case, the unicode_escape codec is used; it decodes the bytes as Latin-1 (while also interpreting backslash escape sequences), so 0xf1 becomes the character ñ:

Python3




t1 = b"a\xf1".decode("unicode_escape")
  
print(t1)


Output:

añ
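Whether unicode_escape is the right choice depends on how the bytes were produced. The sketch below assumes the bytes came from a Latin-1 (ISO 8859-1) encoder, which the original example does not state, and compares three options: decoding as Latin-1, decoding with unicode_escape, and a lossy fallback using the errors="replace" handler.

```python
raw = b"a\xf1"

# Assumption: the bytes were produced by a Latin-1 (ISO 8859-1) encoder.
as_latin1 = raw.decode("latin-1")

# unicode_escape yields the same string for this input...
as_escape = raw.decode("unicode_escape")

# ...but it also interprets backslash sequences found in the data.
literal = rb"a\n"  # three bytes: a, backslash, n
kept = literal.decode("latin-1")                # backslash and n are kept
interpreted = literal.decode("unicode_escape")  # backslash + n becomes a newline

# Lossy fallback: undecodable bytes become U+FFFD instead of raising.
as_replaced = raw.decode("ascii", errors="replace")

print(as_latin1 == as_escape)
print(len(kept), len(interpreted))
print(as_replaced == "a\ufffd")
```

If the data can contain backslashes (Windows paths, for example), decoding with the actual encoding of the data is the safer choice.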

How to Resolve a UnicodeDecodeError for a CSV file 

It is common to encounter the error described above when processing a CSV file, because the file may have been saved with a different encoding than the one the Python program uses to read it. To fix the error, specify the file's encoding when opening it. This method is only usable if the encoding of the CSV file is known.
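The same idea applies when the file is read with Python's built-in csv module instead of pandas. The following is a minimal sketch under stated assumptions: the file name sample.csv and its rows are hypothetical, and the file is written inside a temporary directory only so that the example is self-contained.

```python
import csv
import os
import tempfile

# Hypothetical sample data, written as UTF-16 so the example can run anywhere.
rows = [["name", "city"], ["José", "Málaga"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")

    with open(path, "w", encoding="utf-16", newline="") as f:
        csv.writer(f).writerows(rows)

    # Pass the known encoding explicitly instead of relying on the default.
    with open(path, "r", encoding="utf-16", newline="") as f:
        loaded = list(csv.reader(f))

print(loaded == rows)
```

The encoding argument of open() plays the same role here as the encoding argument of read_csv in the pandas examples.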

To demonstrate the occurrence of the error, a CSV file (test.csv) encoded in UTF-16 will be used.

Generating UnicodeDecodeError for a CSV file 

The following code attempts to read the CSV file without specifying its encoding. Upon execution, it fails with a UnicodeDecodeError:

Python3




import pandas as pd
  
path = "test.csv"
  
# Read the CSV file at the given path. No encoding is passed,
# so pandas decodes the contents with its default, UTF-8.
file = pd.read_csv(path)
  
print(file.head())


Output:

[Screenshot of the UnicodeDecodeError traceback raised by read_csv]

 

Understanding the Problem

The error occurred because read_csv could not decode the contents of the CSV file with its default encoding, UTF-8, while the file is actually encoded in UTF-16. To fix the error and allow the file to be processed, the encoding of the CSV file must be passed to read_csv when the file is opened.
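The mismatch can be reproduced without any file on disk. The sketch below uses a hypothetical one-line header as stand-in data: it encodes the text as UTF-16 and then tries to decode the resulting bytes as UTF-8, as read_csv would do by default.

```python
# Bytes as they would appear on disk for a UTF-16 file with one header line.
data = "id,name\n".encode("utf-16")

# The utf-16 codec prepends a byte order mark (BOM): FF FE or FE FF,
# depending on the byte order of the machine.
has_bom = data[:2] in (b"\xff\xfe", b"\xfe\xff")

# Neither BOM byte can start a valid UTF-8 sequence, so decoding fails at once.
try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(has_bom)
print(error)

# Decoding with the encoding that produced the bytes succeeds.
print(data.decode("utf-16") == "id,name\n")
```

The exception is raised at position 0, which is why the error appears as soon as the file is opened rather than partway through parsing.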

Solution

First, the pandas library is imported and the path to the CSV file is specified. The program then calls the read_csv function to read the file at that path and passes the encoding with which the file must be decoded (UTF-16 in this case). Since the encoding passed as the argument is the one with which the CSV file was originally encoded, the file is decoded successfully.

Python3




import pandas as pd
  
path = "test.csv"
  
# Read the CSV file at the given path, decoding its contents
# as UTF-16, the encoding with which the file was saved.
file = pd.read_csv(path, encoding="utf-16")
  
# Displaying the contents
print(file.head())


Output:

[Screenshot of the first rows of the CSV file, as printed by file.head()]
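When the encoding is not known in advance, a byte order mark (BOM) at the start of the file can sometimes reveal it. The following is a minimal sketch and not part of the solution above: sniff_bom_encoding is a hypothetical helper name, and the fallback to UTF-8 for files without a BOM is an assumption that will not hold for every file.

```python
import codecs


def sniff_bom_encoding(raw, default="utf-8"):
    """Guess an encoding from a leading byte order mark (hypothetical helper)."""
    # UTF-32 is checked first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Assumption: a file without a BOM is treated as UTF-8.
    return default


# In-memory stand-ins for the first bytes of three different files.
guess_utf16 = sniff_bom_encoding("id,name\n".encode("utf-16"))
guess_utf8_bom = sniff_bom_encoding("id,name\n".encode("utf-8-sig"))
guess_plain = sniff_bom_encoding(b"id,name\n")

print(guess_utf16, guess_utf8_bom, guess_plain)
```

For a real file, the first four bytes can be read in binary mode with open(path, "rb") and passed to the helper; the returned name can then be supplied as the encoding argument of read_csv.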

 

Alternate Method to Solve UnicodeDecodeError

Another way of resolving the issue is to change the encoding of the CSV file itself. First, open the CSV file as a text file (using Notepad or WordPad).

Next, go to File and select Save As.

In the dialog that appears, change the Encoding option to UTF-8 (the default encoding used by pandas' read_csv) and select Save.

Now the following code runs without errors. This is because the encoding of the CSV file was changed to UTF-8 before opening it with pandas, and UTF-8 is the default encoding used by read_csv.

Python3




import pandas as pd
  
path = "test.csv"
  
# Read the CSV file at the given path. No encoding is passed,
# so pandas decodes the contents with its default, UTF-8.
file = pd.read_csv(path)
  
print(file.head())


Output:

[Screenshot of the first rows of the CSV file, as printed by file.head()]
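The manual re-save in a text editor can also be scripted. The following is a minimal sketch, assuming the source encoding is known; convert_to_utf8 is a hypothetical helper, and the file names and contents are placeholders created in a temporary directory so the example is self-contained.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Rewrite a text file as UTF-8 (hypothetical helper; src_encoding must be known)."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        content = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(content)


sample = "id,name\r\n1,Zoë\r\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test_utf16.csv")
    dst = os.path.join(tmp, "test_utf8.csv")

    # Create a UTF-16 file to stand in for the original CSV.
    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write(sample)

    convert_to_utf8(src, dst, "utf-16")

    with open(dst, "rb") as f:
        converted = f.read()

print(converted == sample.encode("utf-8"))
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.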

 
