In this article, we will discuss how to select only numeric or string column names from a Spark DataFrame.
Methods Used:
- createDataFrame: This method is used to create a Spark DataFrame.
- isinstance: A built-in Python function that checks whether an object is an instance of a given class (or of one of its subclasses); see the short sketch after this list.
- dtypes: A DataFrame attribute that returns a list of (columnName, type) tuples, one for every column in the DataFrame.
- schema.fields: Accesses the DataFrame's schema as a list of StructField objects, which hold each column's name, data type, and nullability metadata.
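Because Spark's type objects form a class hierarchy, isinstance matches both the exact class and its base classes. A quick standalone sketch of this behavior (it needs no Spark session, since the type classes are plain Python objects):
Python3
from pyspark.sql.types import DataType, LongType, StringType

# isinstance matches the exact class and any base class
print(isinstance(StringType(), StringType))  # True
print(isinstance(StringType(), DataType))    # True: DataType is the base class
print(isinstance(LongType(), StringType))    # False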
Method #1:
In this method, the dtypes attribute is used to get a list of (columnName, type) tuples, and the column names are then filtered on their type strings.
Python3
from pyspark.sql import Row
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creating DataFrame from a list of Row
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])

# Printing DataFrame structure
print("DataFrame structure:", df)

# Getting the list of (columnName, type) tuples and printing it
dt = df.dtypes
print("dtypes result:", dt)

# Getting the names of columns whose type is string or bigint.
# The comprehension loops over every tuple in dt: item[0] is the
# column name and item[1] is the column type.
columnList = [item[0] for item in dt
              if item[1].startswith('string') or item[1].startswith('bigint')]
print("Result:", columnList)
Output:
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
dtypes result: [('a', 'bigint'), ('b', 'string'), ('c', 'date')]
Result: ['a', 'b']
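If the goal is to go on and work with only the matched columns, the resulting list can be passed directly to select. A minimal continuation of the example above:
Python3
# Keeping only the string and bigint columns found above
df.select(columnList).show()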
Method #2:
In this method, schema.fields is used to get the field metadata; each field's data type is then extracted and compared against the desired types using isinstance.
Python3
from pyspark.sql.types import StringType, LongType
from pyspark.sql import Row
from datetime import date
from pyspark.sql import SparkSession

# Initializing Spark session
spark = SparkSession.builder.getOrCreate()

# Creating DataFrame from a list of Row
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])

# Printing DataFrame structure
print("DataFrame structure:", df)

# Getting and printing the schema metadata
meta = df.schema.fields
print("Metadata:", meta)

# Getting the names of columns whose type is StringType or LongType
# (bigint). The comprehension loops over every field: field.name is
# the column name and field.dataType is the column type.
columnList = [field.name for field in df.schema.fields
              if isinstance(field.dataType, StringType)
              or isinstance(field.dataType, LongType)]
print("Result:", columnList)
Output:
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
Metadata: [StructField(a,LongType,true), StructField(b,StringType,true), StructField(c,DateType,true)]
Result: ['a', 'b']
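As a variation not shown in the original example, isinstance also accepts a tuple of classes, and Spark's NumericType base class matches every numeric column type (LongType, DoubleType, DecimalType, and so on), so the same check can be written more generally:
Python3
from pyspark.sql.types import NumericType, StringType

# NumericType is the base class of all numeric Spark types, so a
# single isinstance check covers bigint, double, decimal, etc.
columnList = [field.name for field in df.schema.fields
              if isinstance(field.dataType, (StringType, NumericType))]
print("Result:", columnList)  # ['a', 'b'] for the DataFrame above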