Several errors can arise when decoding a byte string with a particular encoding scheme, because not every byte sequence is valid in every encoding. One of the most common is UnicodeDecodeError, which occurs when a byte string is decoded with the wrong encoding. This article explains how to resolve a UnicodeDecodeError for a CSV file in Python.
Why does the UnicodeDecodeError arise?
The error occurs when the byte string contains bytes that are not valid in the encoding used to decode it, for example a byte outside the range that the encoding covers. To solve the issue, decode the byte string with the same encoding that was used to encode it; that is, the encoding scheme must be the same for encoding and decoding.
For demonstration, the error is first reproduced and then fixed. In the code below, the byte string b"a" is first decoded successfully using the ASCII encoding. Then an attempt is made to decode the byte string b"a\xf1", which raises an error. This is because ASCII only represents values in the range 0 to 127; decoding any byte outside this range leads to the "ordinal not in range" error.
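As a minimal sketch of this rule, the snippet below encodes a string with Latin-1 and then decodes the resulting bytes with the matching codec and with two mismatched ones. The sample string and the helper name `try_decode` are illustrative assumptions, not part of the original article.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning the text or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


text = "añ"
encoded = text.encode("latin-1")  # the ñ becomes the single byte 0xf1

# Decoding with the codec that was used for encoding restores the text.
print(try_decode(encoded, "latin-1"))

# Decoding with a different codec fails here, because the byte 0xf1
# is not valid on its own in ASCII or UTF-8.
print(try_decode(encoded, "ascii"))
print(try_decode(encoded, "utf-8"))
```

Catching the exception inside a helper is only a convenience for the demonstration; in real code the fix is to pass the correct encoding rather than to swallow the error.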
Python3
t = b"a".decode("ascii")

# Produces error
t1 = b"a\xf1".decode("ascii")
Output:
Traceback (most recent call last):
  File "C:/Users/Sauleyayan/PycharmProjects/untitled1/venv/mad philes.py", line 5, in <module>
    t1 = b"a\xf1".decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 1: ordinal not in range(128)
To rectify the error, use an encoding that can represent the byte \xf1. In this case, the unicode_escape codec is used, which decodes such bytes as Latin-1 (where 0xf1 is ñ) and additionally interprets backslash escape sequences:
Python3
t1 = b"a\xf1".decode("unicode_escape")
print(t1)
Output:
añ
How to Resolve a UnicodeDecodeError for a CSV file
It is common to encounter this error when processing a CSV file, because the file may have been saved in a different encoding than the one the Python program uses to read it. To fix the error, specify the file's encoding when opening it. This method only works if the encoding of the CSV file is known.
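The idea of naming the encoding at open time can be sketched with the standard library alone. The example below first writes a small UTF-16 CSV file to a temporary folder so that there is something to read back; the sample rows, the file name, and the helper `read_csv_rows` are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical sample data, written as UTF-16 so there is a file to read back.
rows = [["name", "city"], ["José", "Málaga"]]
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.csv")
with open(path, "w", encoding="utf-16", newline="") as f:
    csv.writer(f).writerows(rows)


def read_csv_rows(csv_path, encoding):
    """Read every row of a CSV file, decoding it with the given encoding."""
    # newline="" lets the csv module handle line endings itself.
    with open(csv_path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Passing the encoding the file was written in decodes it correctly.
print(read_csv_rows(path, "utf-16"))
```

Calling `read_csv_rows(path, "utf-8")` on the same file would raise a UnicodeDecodeError, which is the situation the rest of this article reproduces with pandas.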
To demonstrate the occurrence of the error, the following CSV file will be used:
Generating UnicodeDecodeError for a CSV file
The following code attempts to open the CSV file for processing. Upon execution, it fails with a UnicodeDecodeError:
Python3
import pandas as pd

path = "test.csv"

# Read the CSV file at the given path,
# decoding its contents as UTF-8 (the default)
file = pd.read_csv(path)

print(file.head())
Output:
Understanding the Problem
The error occurred because read_csv could not decode the contents of the CSV file using its default encoding, UTF-8, since the file is actually encoded in UTF-16. To fix the error and allow the file to be processed, its encoding must be specified when opening it.
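The mismatch can be reproduced without pandas, and a UTF-16 file that starts with a byte order mark (BOM) can often be recognized from its first bytes. The following is a rough sketch under that assumption; the helper `guess_encoding_from_bom` and the sample content are hypothetical, and the helper deliberately ignores UTF-32 and files without a BOM.

```python
import codecs


def guess_encoding_from_bom(raw: bytes, default: str = "utf-8") -> str:
    """Return an encoding name based on a leading byte order mark, if any."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return default


# Stand-in for the bytes of a UTF-16 CSV file; the "utf-16" codec prepends a BOM.
raw = "id,name\n1,José\n".encode("utf-16")

# Decoding as UTF-8 fails on the very first byte of the BOM.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# Decoding with the encoding suggested by the BOM succeeds.
encoding = guess_encoding_from_bom(raw)
print(encoding)
print(raw.decode(encoding))
```

A BOM check like this is only a heuristic. When the encoding is documented or known from the program that produced the file, passing that encoding directly is more reliable.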
Solution
First, the pandas library is imported and the path to the CSV file is specified. The program then calls the read_csv function to read the contents of the CSV file at that path, passing the encoding with which the file must be decoded (UTF-16 in this case). Since the encoding given in the argument is the one with which the CSV file was originally encoded, the file is decoded successfully.
Python3
import pandas as pd

path = "test.csv"

# Read the CSV file at the given path,
# decoding its contents as UTF-16
file = pd.read_csv(path, encoding="utf-16")

# Displaying the contents
print(file.head())
Output:
Alternate Method to Solve UnicodeDecodeError
Another way of resolving the issue is to change the encoding of the CSV file itself. To do so, first open the CSV file as a text file (using Notepad or WordPad):
Now go to File and select Save As:
In the dialog that appears, change the Encoding option to UTF-8 (the default encoding used by pandas) and select Save.
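The same conversion can also be done in Python instead of a text editor. The sketch below assumes the source encoding is known (UTF-16 here) and writes a UTF-8 copy to a separate file; the helper `convert_to_utf8` and the file names are hypothetical, and the demo file is created in a temporary folder only so the example is self-contained.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Rewrite a text file as UTF-8, decoding it with src_encoding first."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Hypothetical demo file written as UTF-16.
folder = tempfile.mkdtemp()
utf16_path = os.path.join(folder, "test_utf16.csv")
utf8_path = os.path.join(folder, "test_utf8.csv")
with open(utf16_path, "w", encoding="utf-16", newline="") as f:
    f.write("id,name\r\n1,José\r\n")

convert_to_utf8(utf16_path, utf8_path, "utf-16")

# The converted file now contains plain UTF-8 bytes.
with open(utf8_path, "rb") as f:
    converted_bytes = f.read()
print(converted_bytes.decode("utf-8"))
```

Writing to a new path rather than overwriting the original keeps the source file intact in case the assumed encoding turns out to be wrong.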
After the file has been re-saved as UTF-8, the following code runs without errors. Since the default encoding used by pandas is UTF-8, the CSV file now opens without any encoding argument.
Python3
import pandas as pd

path = "test.csv"

# Read the CSV file at the given path,
# decoding its contents as UTF-8 (the default)
file = pd.read_csv(path)

print(file.head())
Output: