Friday, December 27, 2024
Google search engine
HomeLanguagesPython Pandas – Flatten nested JSON

Python Pandas – Flatten nested JSON

Most of the data extracted from the web through scraping are in the form of JSON datatype because JSON is a preferred datatype for transmitting data in web applications. The reason JSON is preferred is that it’s extremely lightweight to send back and forth in HTTP requests and responses due to the small file size. 

Such a JSON file can sometimes be clumsy having different levels and hierarchy. Such raw formats of data cannot be used for further processing. It is general practice to convert the JSON data structure to a Pandas Dataframe as it can help to manipulate and visualize the data more conveniently. In this article, let us consider different nested JSON data structures and flatten them using inbuilt and custom-defined functions. 

Pandas have a nice inbuilt function called json_normalize() to flatten the simple to moderately semi-structured nested JSON structures to flat tables.

Syntax: pandas.json_normalize(data, errors=’raise’, sep=’.’, max_level=None)

Parameters:

  • data – dict or list of dicts
  • errors – {‘raise’, ‘ignore’}, default ‘raise’
  • sep – str, default ‘.’ nested records will generate names separated by a specified separator.
  • max_level – int, default None. Max number of levels(depth of dict) to normalize.

Example 1:

Consider a list of nested dictionaries that contains details about the students and their marks as shown. Use pandas json_normalize on this JSON data structure to flatten it to a flat table as shown

Python3




import pandas as pd
  
data = [
    {"Roll no": 1,
     "student": {"first_name": "Ram", "last_name": "kumar"}
     },
    {"student": {"English": "95", "Math": "88"}
     },
    {"Roll no": 2,
     "student": {"first_name": "Joseph", "English": "90", "Science": "82"}
     },
    {"Roll no": 3,
     "student": {"first_name": "abinaya", "last_name": "devi"},
     "student": {"English": "91", "Math": "98"}
     },
]
  
df = pd.json_normalize(data)
print(df)


Output:

Example 2:

Now let us make use of the max_level option to flatten a slightly complicated JSON structure to a flat table. For this example, we have considered the max_level of 0, which means flattening only the first level of JSON and can experiment with the results.

Here, we have considered an example of the health records of different individuals in JSON format. 

Python3




import pandas as pd
  
data = [
    {
        "id": 1,
        "candidate": "Roberto mathews",
        "health_index": {"bmi": 22, "blood_pressure": 130},
    },
    {"candidate": "Shane wade", "health_index": {"bmi": 28, "blood_pressure": 160}},
    {
        "id": 2,
        "candidate": "Bruce tommy",
        "health_index": {"bmi": 31, "blood_pressure": 190},
    },
]
pd.json_normalize(data, max_level=0)


Output:

As we have used only 1 level of flattening, The second level is retained as a key-value pair as shown

Example 3:

Now let us use the same JSON data structure as above, with max_level of 1, which means flattening the first two levels of JSON and can experiment with the results.

Python3




import pandas as pd
  
data = [
    {
        "id": 1,
        "candidate": "Roberto mathews",
        "health_index": {"bmi": 22, "blood_pressure": 130},
    },
    {"candidate": "Shane wade", "health_index": {"bmi": 28, "blood_pressure": 160}},
    {
        "id": 2,
        "candidate": "Bruce tommy",
        "health_index": {"bmi": 31, "blood_pressure": 190},
    },
]
pd.json_normalize(data, max_level=1)


Output:

Here, unlike the before example we flatten the first two rows of the JSON which gives a complete structure for the flat table.

Example 4:

Finally, let us consider a deeply nested JSON structure that can be converted to a flat table by passing the meta arguments to the json_normalize function as shown below.

Here, in the below code, we have passed the chronological order in which the JSON has to be parsed to a flat table. In the below code, we first suggest the department key to be parsed followed by the company and tagline, then, we pass the management and CEO key as a nested list indicating that they have to be parsed as a single field.

Python3




import pandas as pd
data = [
    {
        "company": "Google",
        "tagline": "Dont be evil",
        "management": {"CEO": "Sundar Pichai"},
        "department": [
            {"name": "Gmail", "revenue (bn)": 123},
            {"name": "GCP", "revenue (bn)": 400},
            {"name": "Google drive", "revenue (bn)": 600},
        ],
    },
    {
        "company": "Microsoft",
        "tagline": "Be What's Next",
        "management": {"CEO": "Satya Nadella"},
        "department": [
            {"name": "Onedrive", "revenue (bn)": 13},
            {"name": "Azure", "revenue (bn)": 300},
            {"name": "Microsoft 365", "revenue (bn)": 300},
        ],
    },
  
]
result = pd.json_normalize(
    data, "department", ["company", "tagline", ["management", "CEO"]]
)
result


Output:

RELATED ARTICLES

Most Popular

Recent Comments