In this article, we will discuss how to select columns by type in PySpark using Python.
Let’s create a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import StringType, DoubleType, IntegerType, StructType, StructField, FloatType

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
We can match columns of a particular type using the following type keywords:
- Integer: int
- String: string
- Float: float
- Double: double
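These keywords are exactly what the dtypes attribute reports. As a quick check (a minimal sketch, assuming the dataframe created above), printing dtypes shows the (column name, keyword) pairs:
Python3
# inspect the (column name, type keyword) pairs of the dataframe created above
print(dataframe.dtypes)

# expected output:
# [('ID', 'int'), ('NAME', 'string'), ('GPA', 'float'), ('FEE', 'double')]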
Method 1: Using dtypes
Here we use the dtypes attribute, which returns a list of (column name, type) tuples, and filter it with the startswith() string method to keep only the columns of a particular type.
Syntax: dataframe[[item[0] for item in dataframe.dtypes if item[1].startswith('datatype')]]
where,
- dataframe is the input dataframe
- datatype is one of the type keywords listed above
- item is a (column name, type) pair returned by dtypes
Finally, we use the collect() method to display the column data.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType, IntegerType,
                               StructType, StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('int')]].collect())

# select columns that are string type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('string')]].collect())

# select columns that are float type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('float')]].collect())

# select columns that are double type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('double')]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
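The same filtering can also be passed to select(), which reads a little more explicitly. A minimal sketch, assuming the dataframe from the example above (the variable int_cols is introduced here purely for illustration):
Python3
# collect the names of all integer-typed columns using dtypes
int_cols = [name for name, dtype in dataframe.dtypes if dtype.startswith('int')]

# select only those columns and display them
dataframe.select(int_cols).show()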
Method 2: Using schema.fields
Here we use the schema.fields attribute to get each column's data type, and check it against the type classes available in the pyspark.sql.types module.
Let’s check one by one:
- Integer – IntegerType
- Float – FloatType
- Double – DoubleType
- String – StringType
We use the isinstance() function to check a field's dataType against these classes.
Syntax: dataframe[[f.name for f in dataframe.schema.fields if isinstance(f.dataType, datatype)]]
where,
- dataframe is the input dataframe
- name is the column name of each field
- datatype is one of the type classes listed above
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import StringType, DoubleType, IntegerType, StructType, StructField, FloatType

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, IntegerType)]].collect())

# select columns that are string type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, StringType)]].collect())

# select columns that are float type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, FloatType)]].collect())

# select columns that are double type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, DoubleType)]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
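If this check is needed in several places, the comprehension can be wrapped in a small helper. The function below, columns_of_type, is a hypothetical helper written only for illustration (it is not part of PySpark), and it assumes the dataframe and type imports from the example above:
Python3
def columns_of_type(df, datatype):
    # return the names of all columns whose Spark type is an instance of datatype
    return [f.name for f in df.schema.fields if isinstance(f.dataType, datatype)]

# select the double-typed columns from the example dataframe
print(dataframe.select(columns_of_type(dataframe, DoubleType)).collect())
# expected: [Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]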