In this article, we will check whether our data is an RDD or a DataFrame using the isinstance(), type(), and dispatch methods.
Method 1: Using isinstance() method
It is used to check whether the given data is an RDD or a DataFrame. It returns a boolean value.
Syntax: isinstance(data, DataFrame) or isinstance(data, RDD)
where
- data is our input data
- DataFrame is the class from the pyspark.sql module
- RDD is the class from the pyspark.rdd module
Example program to check whether our data is a dataframe or not:
Python3
# importing module
import pyspark

# import DataFrame
from pyspark.sql import DataFrame

# importing sparksession
# from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# check if it is a dataframe or not
print(isinstance(dataframe, DataFrame))
Output:
True
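For contrast, isinstance() returns False when the object is not of the given type. A minimal sketch, reusing the dataframe and the DataFrame import from the example above (plain_list is a hypothetical name introduced here for illustration):
Python3
# a plain Python list is not a DataFrame,
# so isinstance() returns False
plain_list = [[1, "sravan", "company 1"]]
print(isinstance(plain_list, DataFrame))  # False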
Check whether the data is an RDD or not:
We can check this with the isinstance() method as well.
Syntax: isinstance(data, RDD)
where
- data is our input data
- RDD is the class from the pyspark.rdd module
Example:
Python3
# import DataFrame
from pyspark.sql import DataFrame

# import RDD
from pyspark.rdd import RDD

# need to import for session creation
from pyspark.sql import SparkSession

# creating the spark session
spark = SparkSession.builder.getOrCreate()

# create an rdd with some data
data = spark.sparkContext.parallelize([("1", "sravan", "vignan", 67, 89),
                                       ("2", "ojaswi", "vvit", 78, 89),
                                       ("3", "rohith", "vvit", 100, 80),
                                       ("4", "sridevi", "vignan", 78, 80),
                                       ("1", "sravan", "vignan", 89, 98),
                                       ("5", "gnanesh", "iit", 94, 98)])

# check whether the data is an RDD or not
print(isinstance(data, RDD))
Output:
True
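isinstance() also accepts a tuple of types, so a single call can test for either kind of object. A minimal sketch, reusing the data RDD and the imports from the example above:
Python3
# one call checks whether data is an RDD or a DataFrame
print(isinstance(data, (RDD, DataFrame)))  # True, since data is an RDD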
Convert the RDD into a DataFrame and check the type
Here we will create an RDD, convert it to a dataframe using the toDF() method, and check the type before and after the conversion.
Python3
# import DataFrame
from pyspark.sql import DataFrame

# import RDD
from pyspark.rdd import RDD

# need to import for session creation
from pyspark.sql import SparkSession

# creating the spark session
spark = SparkSession.builder.getOrCreate()

# create an rdd with some data
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])

# check if it is an RDD
print("RDD : ", isinstance(rdd, RDD))

# check if it is a DataFrame
print("Dataframe : ", isinstance(rdd, DataFrame))

# display data of rdd
print("Rdd Data : \n", rdd.collect())

# convert rdd to dataframe
data = rdd.toDF()

# check if the converted data is an RDD
print("RDD : ", isinstance(data, RDD))

# check if the converted data is a DataFrame
print("Dataframe : ", isinstance(data, DataFrame))

# display dataframe data
print(data.collect())
Output:
RDD :  True
Dataframe :  False
Rdd Data : 
 [(1, 'Sravan', 'vignan', 98), (2, 'bobby', 'bsc', 87)]
RDD :  False
Dataframe :  True
[Row(_1=1, _2='Sravan', _3='vignan', _4=98), Row(_1=2, _2='bobby', _3='bsc', _4=87)]
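The conversion also works in the other direction: a DataFrame exposes its underlying RDD of Row objects through its rdd attribute. A minimal sketch, reusing the data dataframe and the imports from the example above (back_to_rdd is a hypothetical name introduced here):
Python3
# the rdd attribute returns the DataFrame's
# underlying RDD of Row objects
back_to_rdd = data.rdd

# check the types after the reverse conversion
print("RDD : ", isinstance(back_to_rdd, RDD))              # True
print("Dataframe : ", isinstance(back_to_rdd, DataFrame))  # False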
Method 2: Using type() function
The type() function returns the type (class) of the given object.
Syntax: type(data_object)
Here, data_object is the rdd or dataframe data.
Example 1: Python program to create an RDD and check its type
Python3
# need to import for session creation
from pyspark.sql import SparkSession

# creating the spark session
spark = SparkSession.builder.getOrCreate()

# create an rdd with some data
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])

# check the type using the type() function
print(type(rdd))
Output:
<class 'pyspark.rdd.RDD'>
Example 2: Python program to create a dataframe and check the type.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# check the type of the data with the type() function
print(type(dataframe))
Output:
<class 'pyspark.sql.dataframe.DataFrame'>
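Because type() returns the class itself, its result can also be compared directly with the imported class. A minimal sketch, reusing the dataframe from the example above; note that, unlike isinstance(), an exact type comparison does not match subclasses:
Python3
from pyspark.sql import DataFrame

# type() returns the class, so it can be
# compared with the `is` operator
print(type(dataframe) is DataFrame)  # True

# __name__ gives just the class name
print(type(dataframe).__name__)      # DataFrame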
Method 3: Using Dispatch
The functools.singledispatch decorator turns a function into a generic function: a dispatcher object that selects an implementation based on the type of its first argument. We register one implementation per type, and the dispatcher calls the right one at call time. Here we create such a dispatcher to check whether our data is an RDD or a DataFrame.
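Before applying this to Spark types, here is a minimal, self-contained sketch of how singledispatch works: the undecorated base function acts as the fallback, and each register() call attaches an implementation for one type (describe is a hypothetical name used only for this illustration):
Python3
from functools import singledispatch

# the base function is the fallback used when
# no registered type matches the first argument
@singledispatch
def describe(x):
    return "unknown type"

# implementation chosen when the argument is an int
@describe.register(int)
def _(arg):
    return "int"

# implementation chosen when the argument is a list
@describe.register(list)
def _(arg):
    return "list"

print(describe(10))      # int
print(describe([1, 2]))  # list
print(describe(3.5))     # unknown type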
Example 1: Python code to create a single dispatcher, pass the data, and check whether the data is an RDD or not
Python3
# importing modules
from pyspark.rdd import RDD
from pyspark.sql import DataFrame

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import singledispatch
from functools import singledispatch

# import spark context
from pyspark import SparkContext

# create an object for spark
# context with local and name GFG
sc = SparkContext("local", "GFG")

# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create a generic function to dispatch on
@singledispatch
def check(x):
    pass

# this implementation is chosen and returns
# "RDD" when the given input is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# this implementation is chosen and returns
# "DataFrame" when the given input is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# create a pyspark RDD and
# check whether it is an RDD or not
print(check(sc.parallelize([("1", "sravan", "vignan", 67, 89)])))
Output:
RDD
Example 2: Python code to check whether the data is a dataframe or not
Python3
# importing modules
from pyspark.rdd import RDD
from pyspark.sql import DataFrame

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import singledispatch
from functools import singledispatch

# import spark context
from pyspark import SparkContext

# create an object for spark
# context with local and name GFG
sc = SparkContext("local", "GFG")

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create a generic function to dispatch on
@singledispatch
def check(x):
    pass

# this implementation is chosen and returns
# "RDD" when the given input is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# this implementation is chosen and returns
# "DataFrame" when the given input is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# create a pyspark dataframe and
# check whether it is a dataframe or not
print(check(spark.createDataFrame([("1", "sravan", "vignan", 67, 89)])))
Output:
DataFrame
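One natural extension (an assumption on our part, not part of the examples above) is to give the base function a fallback return value, so that check() reports unrecognised objects instead of silently returning None. A minimal sketch, reusing the RDD and DataFrame registrations from above:
Python3
# same dispatcher as before, but the base
# function now acts as an explicit fallback
@singledispatch
def check(x):
    return "Neither RDD nor DataFrame"

@check.register(RDD)
def _(arg):
    return "RDD"

@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# a plain list matches no registered type,
# so the fallback implementation runs
print(check([1, 2, 3]))  # Neither RDD nor DataFrame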