Prerequisite: What is Unstructured Data?
Sometimes machine generates data in an unstructured way which is less interpretable. For example, Biometric Data, where an employee does Punch – IN or OUT several times with mistakes. We can not analyze the data and identify the mistakes unless it’s in a tabular form. In this article, we will take unstructured biometric data and convert it into useful information in terms of a table.
Dataset:
Here we will work with Daily Punch – In Report. Data is given below. Punch records captured for Main Door and Second Door. Main Door is Outdoor Gate and Second Door is Project Room Gate. We need to identify which employee spent how much time in the project room or second door. The dataset we’re going to use is Bio.xlsx:
Here is the Biometric Data for John Sherrif where punch in and punch out records has given for Main Door and Second Door.
Department: | ||
---|---|---|
Emp Code: COMP123:John Sherrif | ||
Att. Date | Status | Punch Records |
02-Mar-2021 | Present |
08:33:in(Second Door),08:35:(Second Door),08:37:(Main Door),09:04:out(Second Door),09:09:in(Second Door), 09:15:out(Second Door),09:15:(Second Door),09:18:(Second Door),09:52:in(Second Door),09:54:(Second Door), 10:00:out(Main Door),10:17:in(Main Door),10:53:out(Second Door),11:47:in(Second Door),11:47:(Second Door), 11:49:(Second Door),11:50:(Second Door),13:08:out(Second Door),13:09:(Second Door),13:12:(Second Door), 13:14:in(Second Door),13:36:out(Second Door),13:36:(Second Door),14:27:in(Second Door),14:32:out(Main Door), 14:48:in(Second Door),14:48:(Second Door),14:49:(Second Door),14:52:(Main Door),14:56:out(Second Door), 14:57:(Second Door),14:59:(Second Door),15:04:in(Second Door),16:22:out(Second Door),16:34:in(Second Door), 19:58:out(Main Door), |
The above Data is not insightful to analyze. Our desired output is:
Emp Code | Punch – IN | Punch – OUT |
---|---|---|
COMP123:John Sherrif | 08:33:in(Second Door) | 09:04:out(Second Door) |
COMP123:John Sherrif | 09:09:in(Second Door) | 09:15:out(Second Door) |
COMP123:John Sherrif | 09:52:in(Second Door) | 10:53:out(Second Door) |
COMP123:John Sherrif | 11:47:in(Second Door) | 13:08:out(Second Door) |
COMP123:John Sherrif | 13:14:in(Second Door) | 13:36:out(Second Door) |
COMP123:John Sherrif | 14:27:in(Second Door) | out |
COMP123:John Sherrif | 14:48:in(Second Door) | 14:56:out(Second Door) |
COMP123:John Sherrif | 15:04:in(Second Door) | 16:22:out(Second Door) |
COMP123:John Sherrif | 16:34:in(Second Door) | out |
Understanding the Data: John Sherrif did Punch – IN at 08:33 for the first time and Punch – OUT at 09:04 for the first time. John did Punch – IN at 14:27 but forgot to do Punch – OUT. The ‘in’ signifies he/she forgot to do Punch IN and ‘out’ signifies vice versa.
Implementation:
- Data Cleaning & Creating a table for status, Punch Code, and Emp Code.
Python3
import pandas as pd # load data df = pd.read_excel( 'bio.xlsx' ) # removing NA values from the # dataframe df df = df.fillna("") # removing all the blank rows df1 = df.dropna(how = 'all' ) # picking the rows where present # or absent values are there from # 14 no column df1 = df1[df1[ 'Unnamed: 14' ]. str .contains( 'sent' )] # Extracting only the Employee # Names df_name = df.dropna(how = 'all' ) # from column no 3 we are picking # Employee names df_name = df_name[df_name[ 'Unnamed: 3' ]. str .contains( 'Employee' )] # creating a new dataframe for Status, # Punch Records and Employee Codes zippedList = list ( zip (df1[ 'Unnamed: 14' ], df1[ 'Unnamed: 15' ], df_name[ 'Unnamed: 7' ])) abc = pd.DataFrame(zippedList) abc.head() |
Output:
- Extracting Data for Second Door only.
Python3
# Splitting the values by comma in 1 # no column (punch records) for i in range ( len (abc)): abc[ 1 ][i] = abc[ 1 ][i].split( "," ) second_door = [] for i in range ( len (abc)): s_d = [] # Extracting all the values which contains # only :in(Second Door) or :out(Second Dorr) for j in range ( len (abc[ 1 ][i])): if ':in(Second Door)' in abc[ 1 ][i][j]: s_d.append(abc[ 1 ][i][j]) if 'out(Second Door)' in abc[ 1 ][i][j]: s_d.append(abc[ 1 ][i][j]) second_door.append(s_d) (second_door[ 0 ]) |
Output:
- The punch record should start with ‘IN’ and end with ‘OUT’. Creating the pattern if it doesn’t follow.
Python3
# Punch Records should start with # the keyword 'in'. If it doesn't # follow then we will add 'in' and it # significants that the employee forgot # to do punch in in_time = [] for i in range ( len (second_door)): try : if ':in(Second Door)' not in second_door[i][ 0 ]: second_door[i].insert( 0 , 'in' ) except : pass # Punch Records should end with the keyword # 'out'. If it doesn't follow then we will # add 'out' and it significants that the # employee forgot to do punch out out_time = [] for i in range ( len (second_door)): try : if ':out(Second Door)' not in second_door[i][( len (second_door[i])) - 1 ]: second_door[i].insert((( len (second_door[i]))), 'out' ) except : pass second_door[ 0 ] |
Output:
- Creating the pattern ‘IN – OUT – IN – …..- OUT’. If someone forgot to do Punch – IN then we will put ‘IN’ & if someone forgot to do Punch – OUT then we will put ‘OUT’.
Python3
# final_in contains PUNCH - IN # records for all employees final_in = [] # final_out contains PUNCH - OUT # records for all employees final_out = [] for k in range ( len (second_door)): in_gate = [] out_gate = [] # even position should be for Punch- # IN and odd position should be for # Punch - OUT if it doesn't follow # then we will create the pattern by # putting 'in' or 'out' for i in range ( len (second_door[k])): if i % 2 = = 0 and 'in' in second_door[k][i]: in_gate.append(second_door[k][i]) try : if 'out' not in second_door[k][i + 1 ]: out_gate.append( 'out' ) except : pass if i % 2 ! = 0 and 'out' in second_door[k][i]: out_gate.append(second_door[k][i]) try : if 'in' not in second_door[k][i + 1 ]: in_gate.append( 'in' ) except : pass if i % 2 ! = 0 and 'in' in second_door[k][i]: in_gate.append(second_door[k][i]) try : if 'out' not in second_door[k][i + 1 ]: out_gate.append( 'out' ) except : pass if i % 2 = = 0 and 'out' in second_door[k][i]: out_gate.append(second_door[k][i]) try : if 'in' not in second_door[k][i + 1 ]: in_gate.append( 'in' ) except : pass final_in.append(in_gate) final_out.append(out_gate) # final_in or final_out keep the # records as a list under list form. # to solve the problem we will merge the list # aa contains merged list of Punch - IN aa = final_in[ 0 ] for i in range ( len (final_in) - 1 ): aa = aa + final_in[i + 1 ] # bb contains merged list of Punch - OUT bb = final_out[ 0 ] for i in range ( len (final_out) - 1 ): bb = bb + final_out[i + 1 ] for i in range ( len (final_in[ 0 ])): print (final_in[ 0 ][i], ' ' , final_out[ 0 ][i]) |
Output:
- Creating the final table.
Python
# Creating a dataframe called df_final df_final = [] df_final = pd.DataFrame(df_final) # Merging the Employee Names Name = [] for i in range ( len (abc)): for j in range ( len (final_in[i])): Name.append(abc[ 2 ][i]) df_final[ 'Name' ] = Name # Zipping the Employee Name, Punch -IN # records and Punch - OUT records zippedList2 = list ( zip (df_final[ 'Name' ], aa, bb)) abc2 = pd.DataFrame(zippedList2) # Renaming the dataframe abc2.columns = [ 'Emp Code' , 'Punch - IN' , 'Punch - OUT' ] abc2.to_excel( 'output.xlsx' , index = False ) # Print the table display(abc2) |
Output:
Hence, the raw biometric data has been structured and is converted to useful information.