Thursday, December 26, 2024

How to check if something is an RDD or a DataFrame in PySpark?

In this article, we will check whether an object is an RDD or a DataFrame using three approaches: isinstance(), type(), and single dispatch.

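Method 1: Using isinstance() function

isinstance() checks whether an object is an instance of a given class (or of one of its subclasses), so it can tell us whether our data is an RDD or a DataFrame. It returns a boolean value.

Syntax: isinstance(data, DataFrame) or isinstance(data, RDD)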

where

  • data is our input data
  • DataFrame is the class from the pyspark.sql module
  • RDD is the class from the pyspark.rdd module

Example program to check whether our data is a DataFrame:

Python3




# importing module
import pyspark
 
#import DataFrame
from pyspark.sql import DataFrame
 
# importing sparksession
# from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]
 
# specify column names
columns = ['ID', 'NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# check if it is dataframe or not
print(isinstance(dataframe, DataFrame))


Output:

True

Check whether the data is an RDD:

We can use the isinstance() function for this check as well.

Syntax: isinstance(data, RDD)

where

  1. data is our input data
  2. RDD is the class from the pyspark.rdd module

Example:

Python3




# import DataFrame
from pyspark.sql import DataFrame
 
# import RDD
from pyspark.rdd import RDD
 
# need to import for session creation
from pyspark.sql import SparkSession
 
# creating the  spark session
spark = SparkSession.builder.getOrCreate()
 
# create an rdd with some data
data = spark.sparkContext.parallelize([("1", "sravan", "vignan", 67, 89),
                                       ("2", "ojaswi", "vvit", 78, 89),
                                       ("3", "rohith", "vvit", 100, 80),
                                       ("4", "sridevi", "vignan", 78, 80),
                                       ("1", "sravan", "vignan", 89, 98),
                                       ("5", "gnanesh", "iit", 94, 98)])
 
# check the data is  rdd or not
print(isinstance(data, RDD))


Output:

True

Convert the RDD into a DataFrame and check the type

Here we will create an RDD, convert it to a DataFrame using the toDF() method, and check the type before and after the conversion.

Python3




# import DataFrame
from pyspark.sql import DataFrame
 
# import RDD
from pyspark.rdd import RDD
 
# need to import for session creation
from pyspark.sql import SparkSession
 
# creating the  spark session
spark = SparkSession.builder.getOrCreate()
 
# create an rdd with some data
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])
 
# check if it is an RDD
print("RDD : ", isinstance(rdd, RDD))

# check if it is a DataFrame
print("Dataframe : ", isinstance(rdd, DataFrame))

# display data of rdd
print("Rdd Data : \n", rdd.collect())

# convert rdd to dataframe
data = rdd.toDF()

# check if the converted data is an RDD
print("RDD : ", isinstance(data, RDD))

# check if the converted data is a DataFrame
print("Dataframe : ", isinstance(data, DataFrame))

# display dataframe
data.show()



Method 2: Using type() function

The type() function returns the exact class of the given object.

Syntax: type(data_object)

Here, data_object is the RDD or DataFrame.

Example 1: Python program to create an RDD and check its type

Python3




# need to import for session creation
from pyspark.sql import SparkSession
 
# creating the  spark session
spark = SparkSession.builder.getOrCreate()
 
# create an rdd with some data
rdd = spark.sparkContext.parallelize([(1, "Sravan","vignan",98),
                                      (2, "bobby","bsc",87)])
 
# check the type using the type() function
print(type(rdd))


Output:

<class 'pyspark.rdd.RDD'>

Example 2: Python program to create a DataFrame and check its type

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data =[[1,"sravan","company 1"],
       [2,"ojaswi","company 1"],
       [3,"rohith","company 2"],
       [4,"sridevi","company 1"],
       [1,"sravan","company 1"],
       [4,"sridevi","company 1"]]
 
# specify column names
columns=['ID','NAME','Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
 
# check the type of the data
# with the type() function
print(type(dataframe))


Output:

<class 'pyspark.sql.dataframe.DataFrame'>

Method 3: Using Dispatch

The singledispatch decorator from the functools module turns a function into a generic function: a dispatcher object that stores one implementation per registered type and, when called, picks the implementation that matches the type of the first argument. Here we register one implementation for RDD and one for DataFrame, so calling the function tells us which of the two our data is.

Example 1: Python code to create a single-dispatch function, pass the data, and check whether it is an RDD
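To make the mechanism concrete before applying it to Spark objects, here is a minimal sketch using only built-in types. The function name describe() and the returned labels are illustrative assumptions.

```python
from functools import singledispatch


@singledispatch
def describe(value):
    # fallback implementation, used when no registered type matches
    return "something else"


@describe.register(int)
def _(value):
    return "an integer"


@describe.register(list)
def _(value):
    return "a list"


print(describe(5))       # an integer
print(describe([1, 2]))  # a list
print(describe("text"))  # something else
print(describe(True))    # an integer, because bool is a subclass of int
```

The last call shows that dispatch follows the class hierarchy: an argument with no implementation of its own is handled by the implementation registered for its closest parent class.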

Python3




# importing the RDD and DataFrame classes
from pyspark.rdd import RDD
from pyspark.sql import DataFrame

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import singledispatch
from functools import singledispatch

# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# get the spark context from the session
sc = spark.sparkContext
# create the generic function; this fallback runs
# for unregistered types and returns None
@singledispatch
def check(x):
    pass

# this implementation returns "RDD"
# if the given input is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# this implementation returns "DataFrame"
# if the given input is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# create a pyspark RDD
# and check whether it is an RDD or not
print(check(sc.parallelize([("1", "sravan", "vignan", 67, 89)])))


Output:

RDD

Example 2: Python code to check whether the data is a DataFrame

Python3




# importing the RDD and DataFrame classes
from pyspark.rdd import RDD
from pyspark.sql import DataFrame

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import singledispatch
from functools import singledispatch

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create the generic function; this fallback runs
# for unregistered types and returns None
@singledispatch
def check(x):
    pass

# this implementation returns "RDD"
# if the given input is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# this implementation returns "DataFrame"
# if the given input is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# create a pyspark dataframe and
# check whether it is a dataframe or not
print(check(spark.createDataFrame([("1", "sravan",
                                    "vignan", 67, 89)])))


Output:

DataFrame
