Pandas is a Python library for data analysis. One of the most common ways to load data into Pandas is from a CSV file, but the file must be consistently formatted; otherwise the parser fails with an "Error tokenizing data" message. In this article, we will discuss several ways to fix that error.
What is Python Pandas Error Tokenizing Data?
The “Error tokenizing data” message, raised as pandas.errors.ParserError, typically occurs when the pandas.read_csv() function cannot parse the contents of a CSV file. Tokenization is the process of splitting each line into smaller units (tokens) based on a delimiter, which for CSV files is typically a comma. The error most often appears when a line yields a different number of fields than the parser expects.
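To make the error concrete, here is a minimal sketch that triggers it on purpose. The in-memory CSV text and its column names are made up for illustration; the only assumption is that one data line has more fields than the header.

```python
import io

import pandas as pd

# Hypothetical CSV text: the header has 3 fields, but the last line has 4.
raw = "name,age,city\nAsha,21,Pune\nRavi,22,Delhi,extra\n"

try:
    pd.read_csv(io.StringIO(raw))
    message = ""
except pd.errors.ParserError as exc:
    # The message names the line that did not match the expected field count.
    message = str(exc)

print(message)
```

Catching pd.errors.ParserError rather than a bare Exception keeps unrelated failures, such as a missing file, visible.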
Fixing Python Pandas Error Tokenizing Data
- Check the CSV file
- Specify the delimiter
- Use the correct encoding
- Skip rows with errors
- Fix unbalanced quotes
Check the CSV File
Because the data comes from a CSV file, first check the file itself for errors. Open it in Excel or any text editor and look for rows with missing or extra delimiters, stray quotes, or extra lines before the header or after the data. If you find a problem, correct it, save the file, and read it again.
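For large files, checking by eye is impractical. The following is a rough sketch of the same check done with Python's built-in csv module: it reports every row whose field count differs from the header's. The helper name find_bad_lines and the sample data are hypothetical, not part of Pandas.

```python
import csv
import io


def find_bad_lines(handle, delimiter=","):
    """Return (line_number, field_count) for rows that do not match the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    expected = len(next(reader))  # the header row sets the expected field count
    bad = []
    for row in reader:
        # Blank lines come back as empty lists; ignore them.
        if row and len(row) != expected:
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample: line 3 has one field too many.
sample = io.StringIO("name,age,city\nAsha,21,Pune\nRavi,22,Delhi,extra\n")
bad_rows = find_bad_lines(sample)
print(bad_rows)
```

For a file on disk, pass an open handle instead, for example open('student_data1.csv', newline='').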
Specify the Delimiter
By default, read_csv() assumes the fields are separated by a comma ( , ). If the CSV file uses a different delimiter, you must specify it when reading the file; otherwise Pandas will either read the file incorrectly or raise the tokenizing error. You can specify the delimiter with the sep parameter.
Example: In this example, the CSV file has its data separated by semicolons, so we pass the semicolon ( ; ) as the delimiter while reading the file:
Python3
import pandas as pd

# Read a semicolon-separated file by naming the delimiter explicitly
df = pd.read_csv('student_data1.csv', sep=';')
df
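If you do not know the delimiter in advance, one option is to let Pandas guess it. This is a small sketch, assuming a semicolon-separated sample built in memory: passing sep=None together with engine="python" asks the Python parsing engine to detect the separator with csv.Sniffer. Detection is a heuristic, so it is worth checking the resulting columns.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for a real file.
raw = "name;age;city\nAsha;21;Pune\nRavi;22;Delhi\n"

# sep=None needs the Python engine, which sniffs the delimiter itself.
df = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
print(df.columns.tolist())
```

If the guess is wrong for your file, fall back to passing sep explicitly.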
Use the Correct Encoding
By default, read_csv() decodes the file as utf-8. If the file was saved in a different encoding, which usually becomes visible when it contains special characters, reading it either fails (typically with a UnicodeDecodeError) or produces garbled text. In that case, pass the encoding the file was actually saved in through the encoding parameter.
Example: In this example, the CSV file contains special characters and is assumed to have been saved as Latin-1, so we pass that encoding while reading it. Note that the ascii encoding cannot represent special characters at all, so substitute whichever encoding your file really uses:
Python3
import pandas as pd

# 'latin-1' is an assumption about how the file was saved
df = pd.read_csv('student_data1.csv', encoding='latin-1')
df
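When the encoding is unknown, a common workaround is to try a short list of candidates in order. The sketch below assumes a hypothetical helper, read_with_fallback, and uses in-memory bytes in place of a real file. Keep in mind that latin-1 can decode any byte sequence, so it never fails but may show the wrong characters; treat it as a last resort.

```python
import io

import pandas as pd

# Hypothetical bytes standing in for a file saved as Latin-1.
raw = "name,city\nRené,Orléans\n".encode("latin-1")


def read_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in turn; return the frame and the one that worked."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


df, used = read_with_fallback(raw)
print(used)
print(df)
```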
Skip Rows with Errors
By default, read_csv() tries to read every row and raises an error as soon as it meets a malformed one, for example a row with more fields than expected. If you know that some rows may be malformed and you can afford to lose them, tell Pandas to skip them with the on_bad_lines parameter (available in Pandas 1.3 and later; older versions used error_bad_lines=False instead). Keep in mind that skipped rows are silently dropped from the resulting data frame.
Example: In this example, the CSV file has some malformed rows, so we skip them while reading the file:
Python3
import pandas as pd

# Drop any row that cannot be parsed instead of raising an error
df = pd.read_csv('student_data1.csv', on_bad_lines='skip')
df
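Because skipping discards data, it can help to record what was dropped. The sketch below is one way to do that, assuming Pandas 1.4 or later, where on_bad_lines also accepts a function when engine="python" is used. The function name remember and the sample data are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data: the second data line has one field too many.
raw = "name,age,city\nAsha,21,Pune\nRavi,22,Delhi,extra\nMeera,23,Kochi\n"

skipped = []


def remember(bad_line):
    """Record the malformed row, then return None so Pandas drops it."""
    skipped.append(bad_line)
    return None


df = pd.read_csv(io.StringIO(raw), on_bad_lines=remember, engine="python")
print(df)
print(skipped)
```

Afterwards the skipped list holds the raw fields of each dropped row, so they can be reviewed or repaired by hand.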
Fix Unbalanced Quotes
Sometimes the CSV file contains unbalanced quotes, for example a field that opens with a double quote that is never closed. The parser then treats everything up to the next quote as part of one field, which can trigger the tokenizing error. If you cannot repair the file itself, you can tell Pandas to ignore quoting while reading it.
Example: In this example, the CSV file has some unbalanced double quotes, so we pass quoting=csv.QUOTE_NONE, which makes the parser treat the quote character as ordinary text rather than as a field wrapper:
Python3
import csv

import pandas as pd

# QUOTE_NONE disables special handling of the quote character
df = pd.read_csv('student_data1.csv', quoting=csv.QUOTE_NONE, quotechar='"')
df
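The effect of QUOTE_NONE is easier to see on a small sample. In this sketch the data and column names are hypothetical: one field starts with a double quote that is never closed. With quoting disabled the stray quote is kept as a literal character, which can then be stripped from the column. One caveat: with quoting disabled, a comma inside a properly quoted field is also treated as a delimiter, so this approach suits files where quotes carry no meaning.

```python
import csv
import io

import pandas as pd

# Hypothetical data: the quote before "good" is never closed.
raw = 'name,remark\nAsha,"good\nRavi,fine\n'

# With QUOTE_NONE the quote is read as an ordinary character.
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)
with_quote = df.loc[0, "remark"]

# Remove the leftover quote characters after loading.
df["remark"] = df["remark"].str.replace('"', "", regex=False)
print(df)
```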
Conclusion:
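Reading a badly formatted CSV file in Pandas can raise the "Error tokenizing data" message, but the approaches described in this article (checking the file, specifying the delimiter, using the correct encoding, skipping bad rows, and handling unbalanced quotes) will help you resolve the error and parse the CSV file properly.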