Parquet is a columnar storage format optimized for the distributed processing of large datasets. It is widely used in big data processing systems such as Hadoop and Apache Spark.
A partitioned Parquet dataset is one that is split into multiple smaller files based on the values of one or more columns. Partitioning can significantly improve query performance by letting the processing engine read only the files it needs.
Concepts
- Parquet format: A columnar storage format that is optimized for distributed processing of large datasets.
- Partitioning: Dividing a dataset into smaller parts based on the values of one or more columns.
- Pandas DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
- pyarrow: A Python package that provides a Python interface to the Arrow C++ library for working with columnar data.
Dataset link: [https://www.kaggle.com/datasets/pawankumargunjan/weather]
Example 1
Install the pyarrow package:
The pyarrow package provides a Python interface to the Arrow C++ library for working with columnar data. Install it with:
pip install pyarrow
Import the pyarrow.parquet module:
Python3
import pyarrow.parquet as pa
Reading parquet data using pyarrow.parquet
Python3
table = pa.read_table('weather.2016.parquet')
table
Output:
pyarrow.Table
ForecastSiteCode: int64
ObservationTime: int64
ObservationDate: timestamp[ms]
WindDirection: int64
WindSpeed: int64
WindGust: double
Visibility: double
ScreenTemperature: double
Pressure: double
SignificantWeatherCode: int64
SiteName: string
Latitude: double
Longitude: double
Region: string
Country: string
----
ForecastSiteCode: [[3002,3005,3008,3017,3023,...,3882,3002,3005,3008,3017],[3023,3026,3031,3034,3037,...,3797,3866,3872,3876,3882]]
ObservationTime: [[0,0,0,0,0,...,12,13,13,13,13],[13,13,13,13,13,...,23,23,23,23,23]]
ObservationDate: [[2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,...,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000],[2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,...,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000]]
WindDirection: [[12,10,8,6,10,...,4,8,8,8,8],[9,9,10,8,8,...,1,0,1,1,1]]
WindSpeed: [[8,2,6,8,30,...,5,19,18,19,19],[25,26,24,23,23,...,5,10,2,3,2]]
WindGust: [[null,null,null,null,37,...,null,null,null,null,29],[36,41,37,34,37,...,null,null,null,null,null]]
Visibility: [[30000,35000,50000,40000,2600,...,4000,8000,3500,11000,28000],[4600,9000,30000,10000,2900,...,22000,null,50000,null,35000]]
ScreenTemperature: [[2.1,0.1,2.8,1.6,9.8,...,10,-99,7.4,8.1,9.2],[9.1,9.5,10.2,9.7,9.9,...,4.9,8.4,3.5,6.1,3.7]]
Pressure: [[997,997,997,996,991,...,1030,null,1019,1020,1019],[1019,1018,1020,1020,1021,...,1019,1018,1019,1019,1019]]
SignificantWeatherCode: [[8,7,-99,8,11,...,1,5,15,12,7],[15,12,12,12,15,...,0,-99,0,-99,0]]
...
Print the shape of the dataset
Python3
table.shape
Output:
(194697, 15)
Convert the pyarrow Table into a pandas DataFrame.
Python3
df = table.to_pandas()
# Take the transpose so the DataFrame prints more readably.
df.head().T
Output:
| 0 | 1 | 2 | 3 | 4 |
ForecastSiteCode | 3002 | 3005 | 3008 | 3017 | 3023 |
ObservationTime | 0 | 0 | 0 | 0 | 0 |
ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
WindDirection | 12 | 10 | 8 | 6 | 10 |
WindSpeed | 8 | 2 | 6 | 8 | 30 |
WindGust | NaN | NaN | NaN | NaN | 37.0 |
Visibility | 30000.0 | 35000.0 | 50000.0 | 40000.0 | 2600.0 |
ScreenTemperature | 2.1 | 0.1 | 2.8 | 1.6 | 9.8 |
Pressure | 997.0 | 997.0 | 997.0 | 996.0 | 991.0 |
SignificantWeatherCode | 8 | 7 | -99 | 8 | 11 |
SiteName | BALTASOUND (3002) | LERWICK (S. SCREEN) (3005) | FAIR ISLE (3008) | KIRKWALL (3017) | SOUTH UIST RANGE (3023) |
Latitude | 60.749 | 60.139 | 59.53 | 58.954 | 57.358 |
Longitude | -0.854 | -1.183 | -1.63 | -2.9 | -7.397 |
Region | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Highland & Eilean Siar |
Country | SCOTLAND | SCOTLAND | None | SCOTLAND | SCOTLAND |
Example 2
Reading parquet data using pandas.read_parquet
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet')
df.head()
Output:
| ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
0 | 3002 | 0 | 2016-02-01 | 12 | 8 | NaN | 30000.0 | 2.1 | 997.0 | 8 | BALTASOUND (3002) | 60.749 | -0.854 | Orkney & Shetland | SCOTLAND |
1 | 3005 | 0 | 2016-02-01 | 10 | 2 | NaN | 35000.0 | 0.1 | 997.0 | 7 | LERWICK (S. SCREEN) (3005) | 60.139 | -1.183 | Orkney & Shetland | SCOTLAND |
2 | 3008 | 0 | 2016-02-01 | 8 | 6 | NaN | 50000.0 | 2.8 | 997.0 | -99 | FAIR ISLE (3008) | 59.530 | -1.630 | Orkney & Shetland | None |
3 | 3017 | 0 | 2016-02-01 | 6 | 8 | NaN | 40000.0 | 1.6 | 996.0 | 8 | KIRKWALL (3017) | 58.954 | -2.900 | Orkney & Shetland | SCOTLAND |
4 | 3023 | 0 | 2016-02-01 | 10 | 30 | 37.0 | 2600.0 | 9.8 | 991.0 | 11 | SOUTH UIST RANGE (3023) | 57.358 | -7.397 | Highland & Eilean Siar | SCOTLAND |
Example 3
Filtering the parquet data
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND')])
df.head()
Output:
| ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
0 | 3134 | 0 | 2016-02-01 | 1 | 1 | NaN | 2100.0 | 4.3 | 999.0 | 15 | GLASGOW/BISHOPTON (3134) | 55.907 | -4.533 | Strathclyde | ENGLAND |
1 | 3210 | 0 | 2016-02-01 | 8 | 11 | 33.0 | 3100.0 | 8.8 | 1005.0 | 7 | ST. BEES HEAD (3210) | 54.518 | -3.615 | North West England | ENGLAND |
2 | 3212 | 0 | 2016-02-01 | 11 | 16 | NaN | 4800.0 | 11.6 | 1004.0 | 12 | KESWICK (3212) | 54.614 | -3.157 | North West England | ENGLAND |
3 | 3214 | 0 | 2016-02-01 | 11 | 24 | 34.0 | 10000.0 | 10.0 | 1005.0 | 8 | WALNEY ISLAND (3214) | 54.125 | -3.257 | North West England | ENGLAND |
4 | 3220 | 0 | 2016-02-01 | 16 | -99 | NaN | 25000.0 | 11.1 | 1002.0 | 7 | CARLISLE (3220) | 54.933 | -2.963 | North West England | ENGLAND |
This will filter the parquet data by Country = 'ENGLAND' and print the first five rows of the DataFrame.
Aggregating the parquet data
Python3
grouped = df.groupby(['Country']).mean(numeric_only=True)
# Take the transpose so the result prints more readably.
print(grouped.T)
Output:
Country                      ENGLAND
ForecastSiteCode         3560.622936
ObservationTime            11.517152
WindDirection               8.534412
WindSpeed                   7.770786
WindGust                   36.035424
Visibility              22431.530727
ScreenTemperature           5.336209
Pressure                 1011.335307
SignificantWeatherCode     -3.614757
Latitude                   52.354470
Longitude                  -1.586393
This groups the data by Country, calculates the mean of each numeric column for each group, and prints the resulting DataFrame.
Example 4
Multiple filters
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND'),
                              ('WindSpeed', '<', 7)])
# Take the transpose so the DataFrame prints more readably.
df.head().T
Output:
| 0 | 1 | 2 | 3 | 4 |
ForecastSiteCode | 3134 | 3220 | 3839 | 3220 | 3839 |
ObservationTime | 0 | 0 | 0 | 1 | 1 |
ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
WindDirection | 1 | 16 | 16 | 16 | 16 |
WindSpeed | 1 | -99 | -99 | -99 | -99 |
WindGust | NaN | NaN | NaN | NaN | NaN |
Visibility | 2100.0 | 25000.0 | NaN | 15000.0 | NaN |
ScreenTemperature | 4.3 | 11.1 | 12.1 | 11.2 | 12.3 |
Pressure | 999.0 | 1002.0 | NaN | 1003.0 | NaN |
SignificantWeatherCode | 15 | 7 | -99 | 8 | -99 |
SiteName | GLASGOW/BISHOPTON (3134) | CARLISLE (3220) | EXETER AIRPORT (3839) | CARLISLE (3220) | EXETER AIRPORT (3839) |
Latitude | 55.907 | 54.933 | 50.737 | 54.933 | 50.737 |
Longitude | -4.533 | -2.963 | -3.405 | -2.963 | -3.405 |
Region | Strathclyde | North West England | South West England | North West England | South West England |
Country | ENGLAND | ENGLAND | ENGLAND | ENGLAND | ENGLAND |
This will filter the parquet data by Country = 'ENGLAND' and WindSpeed < 7, then print the first five rows of the DataFrame. Note that rows where WindSpeed holds the sentinel value -99 also satisfy WindSpeed < 7, which is why they appear in the output above.
Conclusion
In conclusion, partitioning parquet files is a powerful way to optimize data storage and querying performance. By partitioning data based on one or more columns, you can easily filter, sort, and aggregate data within a subset of partitions, rather than having to scan the entire dataset.
In this article, we covered two methods for reading parquet files in Python: pandas' read_parquet() function and pyarrow.parquet's read_table() function. We also provided several examples of how to read and filter parquet data using these methods with real-world weather data.
Overall, partitioning parquet files is an effective technique for optimizing data storage and retrieval. Whether you’re dealing with big data or just trying to improve query performance, partitioning can help you get the most out of your data.