Saturday, November 16, 2024

Python Encode Unicode and non-ASCII characters into JSON

This article explains how to work with Unicode and non-ASCII characters in Python when generating and parsing JSON data. It covers the following topics related to encoding and serializing such characters:

  1. How to encode Unicode and non-ASCII characters into JSON in Python.
  2. How to save non-ASCII or Unicode data as-is, without converting it to a \u escape sequence, in JSON.
  3. How to serialize Unicode data and write it into a file.
  4. How to serialize Unicode objects into UTF-8 JSON strings, instead of \u escape sequences.
  5. How to escape non-ASCII characters while encoding them into JSON in Python.

What is a UTF-8 Character?

Unicode is a character standard that covers most of the world’s written languages and assigns each character a unique code point. It includes characters from many scripts, such as Latin, Greek, and Chinese, along with a wide range of symbols. Non-ASCII characters are those outside the ASCII (American Standard Code for Information Interchange) character set, which consists of only 128 characters.

UTF-8 is a character encoding that represents each Unicode code point using one to four bytes. It is the most widely used character encoding for the Web and is supported by all modern web browsers and most other applications. UTF-8 is also backward-compatible with ASCII, so any ASCII text is also a valid UTF-8 text.
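
As a rough illustration of the one-to-four-byte range described above, the sketch below encodes four sample characters and counts the resulting bytes. The sample characters and the variable names are chosen here for illustration only.

```python
# Sample characters written as escapes so the code points are unambiguous:
# "A" (U+0041), "é" (U+00E9), "€" (U+20AC), and an emoji (U+1F600)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    # str.isascii() is available from Python 3.7
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s), ASCII: {ch.isascii()}")
```

Only the first character is ASCII, and it is also the only one that fits in a single byte, which is what makes ASCII text valid UTF-8.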

What is JSON?

The json module is a built-in Python module for working with JSON (JavaScript Object Notation) data. It provides functions for encoding Python objects to JSON and for decoding JSON back into Python objects. The json.dumps() function serializes an object (e.g. a Python dictionary or list) to a JSON-formatted string. This string can then be saved to a file, sent over a network connection, or used anywhere the data must be represented as text.

Example 

Here is how you could use the json.dumps() method to encode a Python dictionary as a JSON string.

Python3




import json
  
# Define a dictionary
my_dict = {
    "name": "John Doe",
    "age": 35,
    "email": "john.doe@example.com"
}
  
# Use the json.dumps() method to encode 
# the dictionary as a JSON string
json_str = json.dumps(my_dict)
  
# Print the JSON string
print(json_str)


Output:

{"name": "John Doe", "age": 35, "email": "john.doe@example.com"}
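
Decoding goes the other way. As a small sketch, json.loads() can parse the string shown above back into a Python dictionary; the variable name parsed is used here only for illustration.

```python
import json

json_str = '{"name": "John Doe", "age": 35, "email": "john.doe@example.com"}'

# json.loads() parses a JSON string back into Python objects
parsed = json.loads(json_str)

print(parsed["name"])
# JSON numbers without a fractional part are decoded as int
print(type(parsed["age"]))
```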

Save non-ASCII or Unicode data as-is, not as \u escape sequence in JSON

By default, Python’s json module escapes every non-ASCII character as a \u sequence when encoding data into JSON. The escape consists of a backslash, a u, and four hexadecimal digits giving the character’s Unicode code point; characters above U+FFFF are written as a pair of such escapes (a UTF-16 surrogate pair). To keep non-ASCII characters as-is, call json.dumps() with the ensure_ascii parameter set to False. The returned string then contains the characters themselves rather than their escapes.

Python3




import json
  
data = {'name': 'école'}
  
# Encode the data with the ensure_ascii parameter set to False
json_data = json.dumps(data, ensure_ascii=False)
  
print(json_data)


Output:

{"name": "école"}
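
To see the difference side by side, the following sketch serializes the same dictionary with and without ensure_ascii=False and checks that both results decode to the same data. The variable names escaped and readable are illustrative.

```python
import json

data = {'name': 'école'}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keeps é as-is

print(escaped)
print(readable)

# Both strings are valid JSON and decode to the same dictionary
print(json.loads(escaped) == json.loads(readable) == data)
```

The choice affects only how the text is written, not what a JSON parser reads back.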

JSON Serialize Unicode Data and Write it into a file

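To serialize Unicode data and write it to a file in JSON format, you can use the json.dump() function. This function takes a Python object and a file object, encodes the object into JSON data, and writes it to the file. Open the file with encoding='utf-8' so that the non-ASCII characters are written as UTF-8 regardless of the platform’s default encoding.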

Python3




import json
  
data = {'name': 'école'}
  
# Open a file for writing
with open('data.json', 'w', encoding='utf-8') as f:
    # Serialize the data and write it to the file
    json.dump(data, f, ensure_ascii=False)


Output (contents of data.json):

{"name": "école"}
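
The script above prints nothing to the console; its result is the file on disk. A minimal sketch of checking that result is to read the file back with json.load(). The temporary directory below is an assumption made so the example cleans up after itself; it stands in for wherever data.json actually lives.

```python
import json
import os
import tempfile

data = {'name': 'école'}

# A temporary directory stands in for the real location of data.json
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.json')

    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)

    # Read with the same encoding that was used for writing
    with open(path, 'r', encoding='utf-8') as f:
        loaded = json.load(f)

    # In the raw bytes on disk, é is stored as the two UTF-8 bytes C3 A9
    with open(path, 'rb') as f:
        raw = f.read()

print(loaded)
print(raw)
```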

 

Serialize Unicode objects into UTF-8 JSON strings instead of \u escape sequence

By default, the json module writes non-ASCII characters in str objects as \u escape sequences. In Python 3, json.dumps() always returns a str and does not accept an encoding parameter (that parameter existed only in Python 2), so passing encoding='utf-8' raises a TypeError. To get UTF-8 encoded JSON, call json.dumps() with ensure_ascii set to False and encode the resulting string with str.encode('utf-8').

Python3




import json

data = {'name': 'école'}

# json.dumps() returns a str; encode it to get UTF-8 bytes
json_bytes = json.dumps(data, ensure_ascii=False).encode('utf-8')

# Decode the bytes again so the text prints as-is
print(json_bytes.decode('utf-8'))


Output: 

{"name": "école"}
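
As a hedged sketch of what UTF-8 output means at the byte level, the example below compares the encoded size of the escaped and unescaped forms and then decodes the UTF-8 bytes directly. It relies on json.loads() accepting bytes, which is the case from Python 3.6 onward; the variable names are illustrative.

```python
import json

data = {'name': 'école'}

# Escaped form: ASCII only, é takes the six characters \u00e9
escaped_bytes = json.dumps(data).encode('utf-8')
# Unescaped form: é takes two bytes in UTF-8
utf8_bytes = json.dumps(data, ensure_ascii=False).encode('utf-8')

print(len(escaped_bytes), len(utf8_bytes))

# json.loads() accepts UTF-8 encoded bytes as well as str
print(json.loads(utf8_bytes) == data)
```

For text with many non-ASCII characters, the unescaped UTF-8 form is noticeably more compact.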

Encode both Unicode and ASCII into JSON

Here is an example of how to encode both Unicode and ASCII characters into JSON data using the json.dumps() function.

Python3




import json
  
data = {'name': 'école',
        'location': 'New York'}
  
# Encode the data with the ensure_ascii 
# parameter set to False
json_data = json.dumps(data, 
                       ensure_ascii=False)
  
print(json_data)


Output: 

{"name": "école", "location": "New York"}
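
ASCII characters are written the same way under either setting; only non-ASCII characters are affected by ensure_ascii. The short sketch below illustrates this with names (ascii_only, mixed, same) chosen for the example.

```python
import json

ascii_only = {'location': 'New York'}
mixed = {'name': 'école', 'location': 'New York'}

# ASCII-only data serializes identically under both settings
same = json.dumps(ascii_only) == json.dumps(ascii_only, ensure_ascii=False)
print(same)

# With mixed data, only the non-ASCII character is escaped by default
mixed_default = json.dumps(mixed)
print(mixed_default)
```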

Python Escape non-ASCII characters while encoding them into JSON

To escape non-ASCII characters while encoding them into JSON data, use the json.dumps() function with the ensure_ascii parameter set to True. This is also the default, so omitting the parameter has the same effect. The json module then converts every non-ASCII character into a \u escape sequence, and the output contains only ASCII characters.

Python3




import json
  
data = {'name': 'école'}
  
# Encode the data with the ensure_ascii 
# parameter set to True
json_data = json.dumps(data, 
                       ensure_ascii=True)
  
print(json_data)


Output:

{"name": "\u00e9cole"}
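
Characters above U+FFFF do not fit in four hexadecimal digits, so they are escaped as two \u sequences (a surrogate pair). The following sketch uses an emoji as an illustrative example and shows that decoding restores the single original character.

```python
import json

text = "\U0001F600"  # an emoji, outside the Basic Multilingual Plane

# The default ensure_ascii=True writes it as a surrogate pair
escaped = json.dumps(text)
print(escaped)

# Decoding recombines the pair into one character
decoded = json.loads(escaped)
print(decoded == text, len(decoded))
```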
