Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
Pandas str.get_dummies() splits each string in the caller series at the passed separator and returns a data frame with one column for every distinct value produced by the split. If the text value at a given index of the original series contains the column's value as one of its split pieces, the value at that position is 1; otherwise it is 0.
Since this is a string operation, it has to be called through the .str accessor every time, as in Series.str.get_dummies(). A Series has no get_dummies() method of its own, so calling it without .str raises an AttributeError.
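The point about the .str prefix can be checked with a short sketch. The series below is made-up sample data, not part of the article's data set, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical sample data (an assumption for illustration only)
s = pd.Series(["a|b", "b|c"])

# Calling get_dummies() directly on the Series fails,
# because the method lives on the .str accessor
try:
    s.get_dummies()
    raised = False
except AttributeError:
    raised = True

# Going through the .str accessor works
dummies = s.str.get_dummies()

print(raised)
print(dummies)
```

Note that pandas also has a separate top-level pd.get_dummies() function, which encodes whole values and does not split strings at a separator.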
Syntax: Series.str.get_dummies(sep='|')
Parameters:
sep: String value, separator to split strings at (default '|')
Return type: Data frame with binary values only
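As a rough illustration of the sep parameter and the return type, the sketch below uses two small made-up series (the colour names are assumptions, not data from the article). One relies on the default separator and the other passes a comma.

```python
import pandas as pd

# Hypothetical sample data for illustration
tags = pd.Series(["red|green", "green", "blue|red"])

# sep defaults to "|", so no argument is needed here
default_split = tags.str.get_dummies()

# Any other separator has to be passed explicitly
comma_split = pd.Series(["red,green", "green"]).str.get_dummies(sep=",")

print(type(default_split))   # a DataFrame is returned
print(default_split)
print(comma_split)
```

The columns of the returned data frame are the distinct split values in sorted order, and the index matches the index of the caller series.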
To download the data set used in the following examples, click here.
In the following examples, the data frame used contains data about some employees. The image of the data frame before any operations is attached below.
Example #1: Separating different strings on whitespace.
In this example, the strings in the Team column are split at " " (whitespace) and a data frame is returned with all possible values after splitting. The value in the returned data frame is 1 if the string (column name) exists in the text value at the same index in the old data frame; otherwise it is 0.
Python3
# importing pandas
import pandas as pd

# making data frame from csv at url
# (placeholder: put the location of the data set here)
data = pd.read_csv("<url-of-data-set>")

# making dataframe using get_dummies()
dummies = data["Team"].str.get_dummies(" ")

# display
dummies.head(10)
Output:
As shown in the output image, the result can be compared with the original image of the data frame. If the string (column name) exists in the Team value at the same index, the value is 1; otherwise it is 0.
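To see the same behaviour without the CSV file, here is a minimal sketch on a small hand-made Team column. The team names and the data frame are assumptions for illustration and are not taken from the article's data set.

```python
import pandas as pd

# Hypothetical stand-in for the Team column of the data set
data = pd.DataFrame(
    {"Team": ["Boston Celtics", "Brooklyn Nets", "New York Knicks"]}
)

# Split every team name at whitespace and build indicator columns
dummies = data["Team"].str.get_dummies(" ")

print(dummies)
```

Every word becomes its own column, so a name made of three words, such as "New York Knicks", sets three columns to 1 in its row. This is worth keeping in mind when the separator also occurs inside a value that is meant to stay whole.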
Example #2: Splitting at multiple points/Static value column
In this example, a static value ("Hello gfg family") is used for every row of a new column. Then the get_dummies() method is applied and the string is separated at "g". Since "g" occurs more than once, there will be more than one column, and since the string is the same for all rows, the values in each column will also be the same for all rows.
Python3
# importing pandas
import pandas as pd

# making data frame from csv at url
# (placeholder: put the location of the data set here)
data = pd.read_csv("<url-of-data-set>")

# string for new column
string = "Hello gfg family"

# creating new column
data["New_column"] = string

# creating dummies
df = data["New_column"].str.get_dummies("g")

# display
df.head(10)
Output:
As shown in the output image, the new data frame has 3 columns and every row has the same values.
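The three columns can be reproduced with a self-contained sketch. The small data frame below is a made-up stand-in for the article's data set, and its Name column is only an assumption to give the frame some rows.

```python
import pandas as pd

# Hypothetical stand-in data frame (not the article's data set)
data = pd.DataFrame({"Name": ["A", "B", "C"]})

# Same static string for every row
data["New_column"] = "Hello gfg family"

# Splitting at "g" gives the pieces "Hello ", "f" and " family"
df = data["New_column"].str.get_dummies("g")

print(list(df.columns))   # [' family', 'Hello ', 'f']
print(df)
```

The pieces keep their surrounding spaces, because the string is split only at "g" and nothing is stripped. The columns come back in sorted order, which is why " family" (starting with a space) is listed first.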