
Read a Parquet File Using Pandas

Parquet is a columnar storage format optimized for the distributed processing of large datasets. It is widely used in big data processing systems like Hadoop and Apache Spark.

A partitioned parquet file is a parquet file that is partitioned into multiple smaller files based on the values of one or more columns. Partitioning can significantly improve query performance by allowing the processing system to read only the necessary files.
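
As a quick illustration, pandas can write a DataFrame as a partitioned dataset and read it back from the resulting directory. The sketch below uses a made-up DataFrame and output path purely for illustration, and assumes the pyarrow engine is installed:

Python3

import pandas as pd

# A made-up DataFrame and output path, purely for illustration.
df = pd.DataFrame({
    'Country': ['SCOTLAND', 'ENGLAND', 'ENGLAND'],
    'WindSpeed': [8, 2, 6],
})

# Write one sub-directory per distinct Country value (pyarrow engine).
df.to_parquet('weather_partitioned', partition_cols=['Country'])

# Reading the directory stitches the partitions back into one DataFrame.
df2 = pd.read_parquet('weather_partitioned')
print(df2)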

Concepts

  • Parquet format: A columnar storage format that is optimized for distributed processing of large datasets.
  • Partitioning: Dividing a dataset into smaller parts based on the values of one or more columns.
  • Pandas DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
  • pyarrow: A Python package that provides a Python interface to the Arrow C++ library for working with columnar data.

Dataset link: https://www.kaggle.com/datasets/pawankumargunjan/weather

Example 1

Install the pyarrow package:

The pyarrow package provides a Python interface to the Arrow C++ library for working with columnar data. Install it with the following command:

pip install pyarrow

Import the pyarrow.parquet module:

Python3

import pyarrow.parquet as pq

Reading Parquet data using pyarrow.parquet:

Python

table = pq.read_table('weather.2016.parquet')
table

Output:

pyarrow.Table
ForecastSiteCode: int64
ObservationTime: int64
ObservationDate: timestamp[ms]
WindDirection: int64
WindSpeed: int64
WindGust: double
Visibility: double
ScreenTemperature: double
Pressure: double
SignificantWeatherCode: int64
SiteName: string
Latitude: double
Longitude: double
Region: string
Country: string
----
ForecastSiteCode: [[3002,3005,3008,3017,3023,...,3882,3002,3005,3008,3017],[3023,3026,3031,3034,3037,...,3797,3866,3872,3876,3882]]
ObservationTime: [[0,0,0,0,0,...,12,13,13,13,13],[13,13,13,13,13,...,23,23,23,23,23]]
ObservationDate: [[2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,...,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000],[2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,...,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000]]
WindDirection: [[12,10,8,6,10,...,4,8,8,8,8],[9,9,10,8,8,...,1,0,1,1,1]]
WindSpeed: [[8,2,6,8,30,...,5,19,18,19,19],[25,26,24,23,23,...,5,10,2,3,2]]
WindGust: [[null,null,null,null,37,...,null,null,null,null,29],[36,41,37,34,37,...,null,null,null,null,null]]
Visibility: [[30000,35000,50000,40000,2600,...,4000,8000,3500,11000,28000],[4600,9000,30000,10000,2900,...,22000,null,50000,null,35000]]
ScreenTemperature: [[2.1,0.1,2.8,1.6,9.8,...,10,-99,7.4,8.1,9.2],[9.1,9.5,10.2,9.7,9.9,...,4.9,8.4,3.5,6.1,3.7]]
Pressure: [[997,997,997,996,991,...,1030,null,1019,1020,1019],[1019,1018,1020,1020,1021,...,1019,1018,1019,1019,1019]]
SignificantWeatherCode: [[8,7,-99,8,11,...,1,5,15,12,7],[15,12,12,12,15,...,0,-99,0,-99,0]]
...

Print the shape of the dataset:

Python3

table.shape

Output:

(194697, 15)
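
Before converting to pandas, the Arrow table itself exposes a few useful inspection attributes. A small sketch, building on the table object read above (exact output depends on your copy of the dataset):

Python3

import pyarrow.parquet as pq

# `table` is the Arrow table returned by pq.read_table() above.
print(table.num_rows)        # number of rows (194697 for this dataset)
print(table.column_names)    # list of the 15 column names
print(table.schema)          # column names together with their Arrow types

# File-level and row-group metadata can be read without loading the data.
parquet_file = pq.ParquetFile('weather.2016.parquet')
print(parquet_file.metadata)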

Convert the pyarrow Table into a pandas DataFrame.

Python3

df = table.to_pandas()
# Take the transpose so the wide DataFrame is easier to print.
df.head().T

Output:

                                          0                           1                    2                    3                        4
ForecastSiteCode                       3002                        3005                 3008                 3017                     3023
ObservationTime                           0                           0                    0                    0                        0
ObservationDate         2016-02-01 00:00:00         2016-02-01 00:00:00  2016-02-01 00:00:00  2016-02-01 00:00:00      2016-02-01 00:00:00
WindDirection                            12                          10                    8                    6                       10
WindSpeed                                 8                           2                    6                    8                       30
WindGust                                NaN                         NaN                  NaN                  NaN                     37.0
Visibility                          30000.0                     35000.0              50000.0              40000.0                   2600.0
ScreenTemperature                       2.1                         0.1                  2.8                  1.6                      9.8
Pressure                              997.0                       997.0                997.0                996.0                    991.0
SignificantWeatherCode                    8                           7                  -99                    8                       11
SiteName                  BALTASOUND (3002)  LERWICK (S. SCREEN) (3005)     FAIR ISLE (3008)      KIRKWALL (3017)  SOUTH UIST RANGE (3023)
Latitude                             60.749                      60.139                59.53               58.954                   57.358
Longitude                            -0.854                      -1.183                -1.63                 -2.9                   -7.397
Region                    Orkney & Shetland           Orkney & Shetland    Orkney & Shetland    Orkney & Shetland   Highland & Eilean Siar
Country                            SCOTLAND                    SCOTLAND                 None             SCOTLAND                 SCOTLAND
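
The conversion also works in the other direction: a DataFrame can be turned back into an Arrow table and written out as a new Parquet file. A minimal sketch; the output file name here is only an example:

Python3

import pyarrow as pa
import pyarrow.parquet as pq

# Convert the DataFrame back into an Arrow table and write it to disk.
# 'weather_copy.parquet' is just an illustrative output path.
table_again = pa.Table.from_pandas(df)
pq.write_table(table_again, 'weather_copy.parquet')

# Equivalently, pandas can write Parquet directly.
df.to_parquet('weather_copy.parquet')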

Example 2

Reading Parquet data using pandas.read_parquet:

Python

import pandas as pd

df = pd.read_parquet('weather.2016.parquet')
df.head()

Output:

  ForecastSiteCode ObservationTime ObservationDate WindDirection WindSpeed WindGust Visibility ScreenTemperature Pressure SignificantWeatherCode SiteName Latitude Longitude Region Country
0 3002 0 2016-02-01 12 8 NaN 30000.0 2.1 997.0 8 BALTASOUND (3002) 60.749 -0.854 Orkney & Shetland SCOTLAND
1 3005 0 2016-02-01 10 2 NaN 35000.0 0.1 997.0 7 LERWICK (S. SCREEN) (3005) 60.139 -1.183 Orkney & Shetland SCOTLAND
2 3008 0 2016-02-01 8 6 NaN 50000.0 2.8 997.0 -99 FAIR ISLE (3008) 59.530 -1.630 Orkney & Shetland None
3 3017 0 2016-02-01 6 8 NaN 40000.0 1.6 996.0 8 KIRKWALL (3017) 58.954 -2.900 Orkney & Shetland SCOTLAND
4 3023 0 2016-02-01 10 30 37.0 2600.0 9.8 991.0 11 SOUTH UIST RANGE (3023) 57.358 -7.397 Highland & Eilean Siar SCOTLAND
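
Because Parquet stores data by column, pandas can also read just the columns you need via the columns parameter of read_parquet(). A small sketch, with the column selection chosen only for illustration:

Python

import pandas as pd

# Read only the columns that are needed; with a columnar format the
# remaining columns are never scanned.
df = pd.read_parquet('weather.2016.parquet',
                     columns=['ObservationDate', 'ScreenTemperature', 'Country'])
df.head()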

Example 3

Filtering the Parquet data:

Python

import pandas as pd

df = pd.read_parquet('weather.2016.parquet', filters=[('Country', '=', 'ENGLAND')])
df.head()

Output:

  ForecastSiteCode ObservationTime ObservationDate WindDirection WindSpeed WindGust Visibility ScreenTemperature Pressure SignificantWeatherCode SiteName Latitude Longitude Region Country
0 3134 0 2016-02-01 1 1 NaN 2100.0 4.3 999.0 15 GLASGOW/BISHOPTON (3134) 55.907 -4.533 Strathclyde ENGLAND
1 3210 0 2016-02-01 8 11 33.0 3100.0 8.8 1005.0 7 ST. BEES HEAD (3210) 54.518 -3.615 North West England ENGLAND
2 3212 0 2016-02-01 11 16 NaN 4800.0 11.6 1004.0 12 KESWICK (3212) 54.614 -3.157 North West England ENGLAND
3 3214 0 2016-02-01 11 24 34.0 10000.0 10.0 1005.0 8 WALNEY ISLAND (3214) 54.125 -3.257 North West England ENGLAND
4 3220 0 2016-02-01 16 -99 NaN 25000.0 11.1 1002.0 7 CARLISLE (3220) 54.933 -2.963 North West England ENGLAND

This filters the Parquet data to rows where Country equals 'ENGLAND' and prints the first 5 rows of the DataFrame.
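
For comparison, the same filter can be expressed with pyarrow directly; this sketch assumes a reasonably recent pyarrow version in which read_table() accepts a filters argument:

Python

import pyarrow.parquet as pq

# Recent pyarrow versions accept `filters` in read_table() as well,
# so the rows are filtered while the file is being read.
table = pq.read_table('weather.2016.parquet',
                      filters=[('Country', '=', 'ENGLAND')])
df = table.to_pandas()
df.head()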

Aggregating the Parquet data:

Python

grouped = df.groupby(['Country']).mean(numeric_only=True)
# Take the transpose so the wide DataFrame is easier to print.
print(grouped.T)

Output:

Country                      ENGLAND
ForecastSiteCode         3560.622936
ObservationTime            11.517152
WindDirection               8.534412
WindSpeed                   7.770786
WindGust                   36.035424
Visibility              22431.530727
ScreenTemperature           5.336209
Pressure                 1011.335307
SignificantWeatherCode     -3.614757
Latitude                   52.354470
Longitude                  -1.586393

This groups the data by Country, calculates the mean of each numeric column for every group, and prints the resulting DataFrame.
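
If you need more than one statistic per group, groupby() can be combined with agg(). A small sketch, with the columns and aggregation functions chosen only for illustration:

Python

# The columns and aggregation functions here are chosen only as an example.
summary = df.groupby('Country').agg(
    mean_temperature=('ScreenTemperature', 'mean'),
    max_wind_speed=('WindSpeed', 'max'),
    observation_count=('ObservationDate', 'count'),
)
print(summary)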

Example 4

Applying multiple filters:

Python

import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND'),
                              ('WindSpeed', '<', 7)])
# Take the transpose so the wide DataFrame is easier to print.
df.head().T

Output:

                                               0                    1                      2                    3                      4
ForecastSiteCode                            3134                 3220                   3839                 3220                   3839
ObservationTime                                0                    0                      0                    1                      1
ObservationDate              2016-02-01 00:00:00  2016-02-01 00:00:00    2016-02-01 00:00:00  2016-02-01 00:00:00    2016-02-01 00:00:00
WindDirection                                  1                   16                     16                   16                     16
WindSpeed                                      1                  -99                    -99                  -99                    -99
WindGust                                     NaN                  NaN                    NaN                  NaN                    NaN
Visibility                                2100.0              25000.0                    NaN              15000.0                    NaN
ScreenTemperature                            4.3                 11.1                   12.1                 11.2                   12.3
Pressure                                   999.0               1002.0                    NaN               1003.0                    NaN
SignificantWeatherCode                        15                    7                    -99                    8                    -99
SiteName                GLASGOW/BISHOPTON (3134)      CARLISLE (3220)  EXETER AIRPORT (3839)      CARLISLE (3220)  EXETER AIRPORT (3839)
Latitude                                  55.907               54.933                 50.737               54.933                 50.737
Longitude                                 -4.533               -2.963                 -3.405               -2.963                 -3.405
Region                               Strathclyde   North West England     South West England   North West England     South West England
Country                                  ENGLAND              ENGLAND                ENGLAND              ENGLAND                ENGLAND

This filters the Parquet data to rows where Country equals 'ENGLAND' and WindSpeed is less than 7, then prints the first 5 rows of the DataFrame.
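
A flat list of filter tuples, as above, is combined with AND. To express OR conditions, the pyarrow engine also accepts a list of lists in disjunctive normal form: each inner list is ANDed and the inner lists are ORed together. A sketch, with the conditions chosen only for illustration:

Python

import pandas as pd

# Each inner list is ANDed; the inner lists are ORed together.
# The conditions here are chosen only for illustration.
df = pd.read_parquet('weather.2016.parquet',
                     filters=[[('Country', '=', 'ENGLAND'), ('WindSpeed', '<', 7)],
                              [('Country', '=', 'SCOTLAND'), ('WindSpeed', '>', 30)]])
df.head()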

Conclusion

In conclusion, partitioning parquet files is a powerful way to optimize data storage and querying performance. By partitioning data based on one or more columns, you can easily filter, sort, and aggregate data within a subset of partitions, rather than having to scan the entire dataset.

In this article, we covered two methods for reading Parquet files in Python: pandas' read_parquet() function and pyarrow.parquet's read_table() function. We also provided several examples of how to read, filter, and aggregate Parquet data using these methods with real-world weather data.

Overall, partitioning parquet files is an effective technique for optimizing data storage and retrieval. Whether you’re dealing with big data or just trying to improve query performance, partitioning can help you get the most out of your data.
