Parquet is a columnar storage format optimized for the distributed processing of large datasets. It is widely used in big data processing systems such as Hadoop and Apache Spark.
A partitioned Parquet dataset is one that is split into multiple smaller files based on the values of one or more columns. Partitioning can significantly improve query performance by letting the processing engine read only the files it needs.
Concepts
- Parquet format: A columnar storage format that is optimized for distributed processing of large datasets.
- Partitioning: Dividing a dataset into smaller parts based on the values of one or more columns.
- Pandas DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
- pyarrow: A Python package that provides a Python interface to the Arrow C++ library for working with columnar data.
Dataset link: [https://www.kaggle.com/datasets/pawankumargunjan/weather]
Example 1
Install the pyarrow package:
The pyarrow package provides a Python interface to the Arrow C++ library for working with columnar data. Install it with:
pip install pyarrow
Import the pyarrow.parquet module:
Python3
import pyarrow.parquet as pa
Reading parquet data using pyarrow.parquet
Python3
table = pa.read_table('weather.2016.parquet')
table
Output:
pyarrow.Table
ForecastSiteCode: int64
ObservationTime: int64
ObservationDate: timestamp[ms]
WindDirection: int64
WindSpeed: int64
WindGust: double
Visibility: double
ScreenTemperature: double
Pressure: double
SignificantWeatherCode: int64
SiteName: string
Latitude: double
Longitude: double
Region: string
Country: string
----
ForecastSiteCode: [[3002,3005,3008,3017,3023,...,3882,3002,3005,3008,3017],[3023,3026,3031,3034,3037,...,3797,3866,3872,3876,3882]]
ObservationTime: [[0,0,0,0,0,...,12,13,13,13,13],[13,13,13,13,13,...,23,23,23,23,23]]
ObservationDate: [[2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,...,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000],[2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,...,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000]]
WindDirection: [[12,10,8,6,10,...,4,8,8,8,8],[9,9,10,8,8,...,1,0,1,1,1]]
WindSpeed: [[8,2,6,8,30,...,5,19,18,19,19],[25,26,24,23,23,...,5,10,2,3,2]]
WindGust: [[null,null,null,null,37,...,null,null,null,null,29],[36,41,37,34,37,...,null,null,null,null,null]]
Visibility: [[30000,35000,50000,40000,2600,...,4000,8000,3500,11000,28000],[4600,9000,30000,10000,2900,...,22000,null,50000,null,35000]]
ScreenTemperature: [[2.1,0.1,2.8,1.6,9.8,...,10,-99,7.4,8.1,9.2],[9.1,9.5,10.2,9.7,9.9,...,4.9,8.4,3.5,6.1,3.7]]
Pressure: [[997,997,997,996,991,...,1030,null,1019,1020,1019],[1019,1018,1020,1020,1021,...,1019,1018,1019,1019,1019]]
SignificantWeatherCode: [[8,7,-99,8,11,...,1,5,15,12,7],[15,12,12,12,15,...,0,-99,0,-99,0]]
...
Print the shape of the dataset
Python3
table.shape
Output:
(194697, 15)
Convert the pyarrow Table into a pandas DataFrame.
Python3
df = table.to_pandas()
# Take the transpose so the DataFrame prints more readably.
df.head().T
Output:
| 0 | 1 | 2 | 3 | 4 |
ForecastSiteCode | 3002 | 3005 | 3008 | 3017 | 3023 |
ObservationTime | 0 | 0 | 0 | 0 | 0 |
ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
WindDirection | 12 | 10 | 8 | 6 | 10 |
WindSpeed | 8 | 2 | 6 | 8 | 30 |
WindGust | NaN | NaN | NaN | NaN | 37.0 |
Visibility | 30000.0 | 35000.0 | 50000.0 | 40000.0 | 2600.0 |
ScreenTemperature | 2.1 | 0.1 | 2.8 | 1.6 | 9.8 |
Pressure | 997.0 | 997.0 | 997.0 | 996.0 | 991.0 |
SignificantWeatherCode | 8 | 7 | -99 | 8 | 11 |
SiteName | BALTASOUND (3002) | LERWICK (S. SCREEN) (3005) | FAIR ISLE (3008) | KIRKWALL (3017) | SOUTH UIST RANGE (3023) |
Latitude | 60.749 | 60.139 | 59.53 | 58.954 | 57.358 |
Longitude | -0.854 | -1.183 | -1.63 | -2.9 | -7.397 |
Region | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Highland & Eilean Siar |
Country | SCOTLAND | SCOTLAND | None | SCOTLAND | SCOTLAND |
Example 2
Reading parquet data using pandas.read_parquet
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet')
df.head()
Output:
| ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
0 | 3002 | 0 | 2016-02-01 | 12 | 8 | NaN | 30000.0 | 2.1 | 997.0 | 8 | BALTASOUND (3002) | 60.749 | -0.854 | Orkney & Shetland | SCOTLAND |
1 | 3005 | 0 | 2016-02-01 | 10 | 2 | NaN | 35000.0 | 0.1 | 997.0 | 7 | LERWICK (S. SCREEN) (3005) | 60.139 | -1.183 | Orkney & Shetland | SCOTLAND |
2 | 3008 | 0 | 2016-02-01 | 8 | 6 | NaN | 50000.0 | 2.8 | 997.0 | -99 | FAIR ISLE (3008) | 59.530 | -1.630 | Orkney & Shetland | None |
3 | 3017 | 0 | 2016-02-01 | 6 | 8 | NaN | 40000.0 | 1.6 | 996.0 | 8 | KIRKWALL (3017) | 58.954 | -2.900 | Orkney & Shetland | SCOTLAND |
4 | 3023 | 0 | 2016-02-01 | 10 | 30 | 37.0 | 2600.0 | 9.8 | 991.0 | 11 | SOUTH UIST RANGE (3023) | 57.358 | -7.397 | Highland & Eilean Siar | SCOTLAND |
Example 3
Filtering the parquet data
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND')])
df.head()
Output:
| ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
0 | 3134 | 0 | 2016-02-01 | 1 | 1 | NaN | 2100.0 | 4.3 | 999.0 | 15 | GLASGOW/BISHOPTON (3134) | 55.907 | -4.533 | Strathclyde | ENGLAND |
1 | 3210 | 0 | 2016-02-01 | 8 | 11 | 33.0 | 3100.0 | 8.8 | 1005.0 | 7 | ST. BEES HEAD (3210) | 54.518 | -3.615 | North West England | ENGLAND |
2 | 3212 | 0 | 2016-02-01 | 11 | 16 | NaN | 4800.0 | 11.6 | 1004.0 | 12 | KESWICK (3212) | 54.614 | -3.157 | North West England | ENGLAND |
3 | 3214 | 0 | 2016-02-01 | 11 | 24 | 34.0 | 10000.0 | 10.0 | 1005.0 | 8 | WALNEY ISLAND (3214) | 54.125 | -3.257 | North West England | ENGLAND |
4 | 3220 | 0 | 2016-02-01 | 16 | -99 | NaN | 25000.0 | 11.1 | 1002.0 | 7 | CARLISLE (3220) | 54.933 | -2.963 | North West England | ENGLAND |
This will filter the parquet data by Country = 'ENGLAND' and print the first five rows of the DataFrame.
Aggregating the parquet data
Python3
grouped = df.groupby(['Country']).mean(numeric_only=True)
# Take the transpose so the result prints more readably.
print(grouped.T)
Output:
Country                      ENGLAND
ForecastSiteCode         3560.622936
ObservationTime            11.517152
WindDirection               8.534412
WindSpeed                   7.770786
WindGust                   36.035424
Visibility              22431.530727
ScreenTemperature           5.336209
Pressure                 1011.335307
SignificantWeatherCode     -3.614757
Latitude                   52.354470
Longitude                  -1.586393
This groups the data by Country, calculates the mean of each numeric column for each group, and prints the resulting DataFrame.
Example 4
Multiple filters
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND'),
                              ('WindSpeed', '<', 7)])
# Take the transpose so the DataFrame prints more readably.
df.head().T
Output:
| 0 | 1 | 2 | 3 | 4 |
ForecastSiteCode | 3134 | 3220 | 3839 | 3220 | 3839 |
ObservationTime | 0 | 0 | 0 | 1 | 1 |
ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
WindDirection | 1 | 16 | 16 | 16 | 16 |
WindSpeed | 1 | -99 | -99 | -99 | -99 |
WindGust | NaN | NaN | NaN | NaN | NaN |
Visibility | 2100.0 | 25000.0 | NaN | 15000.0 | NaN |
ScreenTemperature | 4.3 | 11.1 | 12.1 | 11.2 | 12.3 |
Pressure | 999.0 | 1002.0 | NaN | 1003.0 | NaN |
SignificantWeatherCode | 15 | 7 | -99 | 8 | -99 |
SiteName | GLASGOW/BISHOPTON (3134) | CARLISLE (3220) | EXETER AIRPORT (3839) | CARLISLE (3220) | EXETER AIRPORT (3839) |
Latitude | 55.907 | 54.933 | 50.737 | 54.933 | 50.737 |
Longitude | -4.533 | -2.963 | -3.405 | -2.963 | -3.405 |
Region | Strathclyde | North West England | South West England | North West England | South West England |
Country | ENGLAND | ENGLAND | ENGLAND | ENGLAND | ENGLAND |
This will filter the parquet data by Country = 'ENGLAND' and WindSpeed < 7, then print the first five rows of the DataFrame. Note that rows where WindSpeed holds the sentinel value -99 also satisfy WindSpeed < 7, which is why they appear in the output above.
Conclusion
In conclusion, partitioning parquet files is a powerful way to optimize data storage and querying performance. By partitioning data based on one or more columns, you can easily filter, sort, and aggregate data within a subset of partitions, rather than having to scan the entire dataset.
In this article, we covered two methods for reading parquet files in Python: pandas' read_parquet() function and pyarrow.parquet's read_table() function. We also provided several examples of how to read and filter parquet data using these methods with real-world weather data.
Overall, partitioning parquet files is an effective technique for optimizing data storage and retrieval. Whether you’re dealing with big data or just trying to improve query performance, partitioning can help you get the most out of your data.