Introduction
Are you a data scientist curious about how far ChatGPT can take you? Over the weekend, I ran a few experiments in which I challenged ChatGPT to generate the solution to a Data Science problem automatically. In this post, I walk through the prompts I used, the code ChatGPT produced, and how accurate that code turned out to be. Come, let’s find out how to use ChatGPT prompts as a Data Scientist!
From code to completion, ChatGPT makes Data Science projects a breeze!
Overview of the Experiments
I will run through two experiments. In the first, I want to see whether ChatGPT can write the code for building a machine learning model on a specific dataset. We will also run that code in a Jupyter notebook to check whether it works. In the second experiment, we will apply what we learned in experiment 1 and redesign the prompts to get the desired outcomes. Broadly, we will evaluate the following points:
- Can ChatGPT create glitch-free, flawless AI content?
- Can ChatGPT automate your coding with dataset-specific code generation?
- How can you write precise prompts to get the desired outcomes from ChatGPT?
Experiment 1: ChatGPT for Data Science!
Let’s start the first experiment now.
I will consider the Black Friday Sales dataset. You can download the dataset from here. It contains the customer transactions of a retail store, including customer demographics, product details, and the total purchase amount. The company wants to understand customer purchase behavior for personalization. So, the task is to build a machine learning model that predicts the purchase amount based on the customer demographics and past products purchased.
In the first prompt, I am going to tell ChatGPT about the dataset and what it is about.
Prompt 1
You are provided with the dataset of the retail store containing customer transactions. Each row contains customer demographics, product details, and the total purchase amount from last month. The sample dataset is given below.
ChatGPT responds by requesting the dataset. In the next prompt, I will provide a sample of the Black Friday sales dataset.
Note: You can neither upload the datasets directly to ChatGPT nor copy-paste the entire dataset.
So, we will copy and paste around 100-150 rows from the dataset.
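Copying the rows by hand works, but the sample can also be cut programmatically. The snippet below is only a sketch: the helper name `sample_rows_for_prompt` is made up for illustration, and the inline `raw` string is a shortened stand-in for the real file, which you would read from disk instead.

```python
import csv
import io
from itertools import islice


def sample_rows_for_prompt(csv_text: str, n_rows: int = 100) -> str:
    """Return the header plus the first n_rows data rows as CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)               # keep the column names
    rows = list(islice(reader, n_rows))  # take at most n_rows data rows

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Shortened stand-in for the dataset file (assumption: fewer columns, 3 rows).
raw = """User_ID,Product_ID,Gender,Age,Purchase
1005915,P00372445,M,18-25,371
1005916,P00370853,M,51-55,24
1005918,P00370853,M,26-35,12
"""

snippet = sample_rows_for_prompt(raw, n_rows=2)
print(snippet)  # paste this text into the chat as the sample dataset
```

With the real dataset, `n_rows` would be somewhere between 100 and 150, as described above.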
Prompt 2
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364
Now, let’s ask ChatGPT to write code for building a model to predict the target variable “Purchase”.
Prompt 3
I want you to act as a data scientist and write code for me. Please build a machine learning model to predict the Purchase variable from the above dataset.
As you can see, ChatGPT provided us with the code for building the machine learning model. We will run the code in a Jupyter notebook and see whether it works.
The above code throws an error.
ChatGPT missed a few data preprocessing steps:
- There are categorical variables in the dataset. ChatGPT didn’t include code for dealing with them.
- ChatGPT failed to handle the missing values present in the dataset.
- ChatGPT didn’t drop the unnecessary columns like User ID and Product ID.
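For reference, the three missing steps could look roughly like the sketch below. This is not the code ChatGPT produced; it is one possible way to do it with pandas, using a few rows from the sample above. Filling the empty product categories with 0 and one-hot encoding the categorical columns are my assumptions, and other choices are equally reasonable.

```python
import io

import pandas as pd

# A few rows from the sample dataset shown above.
sample = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Occupation,City_Category,"
    "Stay_In_Current_City_Years,Marital_Status,Product_Category_1,"
    "Product_Category_2,Product_Category_3,Purchase\n"
    "1005915,P00372445,M,18-25,4,C,0,0,20,,,371\n"
    "1005916,P00370853,M,51-55,20,B,1,1,19,,,24\n"
    "1005918,P00370853,M,26-35,12,A,3,1,19,,,12\n"
    "1005927,P00372445,M,36-45,14,B,4+,1,20,,,358\n"
    "1005929,P00370853,F,36-45,0,C,2,0,19,,,50\n"
)
df = pd.read_csv(sample)

# 1. Drop the identifier columns.
df = df.drop(columns=["User_ID", "Product_ID"])

# 2. Handle missing values. Assumption: 0 stands for "no category".
missing_cols = ["Product_Category_2", "Product_Category_3"]
df[missing_cols] = df[missing_cols].fillna(0)

# 3. One-hot encode the categorical columns ("4+" makes the stay column text).
categorical = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
X = pd.get_dummies(df.drop(columns="Purchase"), columns=categorical)
y = df["Purchase"]

print(X.dtypes)
```

After these steps every feature column is numeric or boolean, which is what most scikit-learn models expect.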
Now, in the next prompt, let me ask ChatGPT to update the data preprocessing steps in the code without explicitly mentioning the kind of steps to perform. Let’s find out if it can do it.
Prompt 4
The above code is incomplete. Update the above code with the necessary data preprocessing steps depending on the provided dataset.
The above code throws an error.
As expected, it included the code for missing value imputation and handling categorical variables, but it missed encoding the product ID and user ID columns.
Let’s ask ChatGPT to encode the product ID and user ID columns in the next prompt.
Prompt 5
The above code gives an error. You missed encoding the user id and product id columns.
The above code throws an error. It encoded the product ID and user ID into new columns but didn’t drop the original columns. As you can see, ChatGPT can generate glitchy content.
Let’s prompt ChatGPT to revise the code.
Prompt 6
You are wrong. The above code still throws an error.
ChatGPT responds by asking for the error. Let’s copy and paste the error we got when running the code. This will be our next prompt.
Prompt 7
ValueError: could not convert string to float: ‘P00233842’.
Is anything wrong with the code? Now you can see that ChatGPT missed encoding the rest of the categorical columns. This is glitchy and flawed content. It should have kept the encoding for those columns, since it had encoded them earlier. While fixing the encoding of the product ID and user ID, it left out the other columns.
Now, let’s ask ChatGPT to encode the rest of the categorical variables.
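A "could not convert string to float" error usually means that a text column is still among the features. A quick check before fitting can reveal which ones. The sketch below uses a small made-up frame (the column names follow the dataset, the values are illustrative) and lists every column that is not numeric.

```python
import pandas as pd

# Illustrative feature frame with two text columns left unencoded.
X = pd.DataFrame(
    {
        "Product_ID": ["P00233842", "P00370853"],
        "Gender": ["M", "F"],
        "Occupation": [4, 20],
        "Product_Category_1": [20, 19],
    }
)

# Columns that are not numeric still need encoding (or dropping).
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['Product_ID', 'Gender']
```

Running this kind of check after each revision from ChatGPT makes it easier to say exactly which columns it forgot.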
Prompt 8
You missed encoding the rest of the categorical columns. Update the code.
This time, it provided all the required data preprocessing steps. Let’s run it in the notebook. It still throws an error. Let’s ask ChatGPT to fix it. Hopefully, this is our last prompt.
Prompt 9
Update the code. The code throws TypeError: Feature names are only supported if all input features have string names, but your input has [‘int’, ‘str’] as feature name / column name types
Finally, we have error-free code.
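The TypeError from Prompt 9 is raised when a DataFrame's column names mix integers and strings. One common way to end up there is appending encoded features as an unnamed DataFrame, which gets integer column labels. The sketch below reproduces that situation on made-up values and applies a typical fix, converting all column names to strings. Whether ChatGPT fixed it this way is not shown here, so treat this as one possible remedy.

```python
import pandas as pd

# Named numeric features.
numeric = pd.DataFrame({"Occupation": [4, 20], "Marital_Status": [0, 1]})

# Encoded features wrapped in a DataFrame without names get labels 0, 1, ...
encoded = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])

X = pd.concat([numeric, encoded], axis=1)

# True when some column names are strings and others are not.
has_mixed_names = len({isinstance(c, str) for c in X.columns}) > 1
print(has_mixed_names)

# Fix: make every column name a string before passing X to the model.
X.columns = X.columns.astype(str)
print(list(X.columns))
```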
Experiment 2: Data Science Prompts for ChatGPT
A couple of learnings from the first experiment:
- Always provide detailed prompts to achieve the desired outcomes.
- Tell ChatGPT to fix the code if it’s wrong. It can fix its own code.
Now, we will start experiment 2 with our learnings.
Prompt 1
You are provided with the dataset of the retail store containing customer transactions. Each row contains customer demographics, product details, and the total purchase amount from last month. The sample dataset is given below.
Prompt 2
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364
Prompt 3
I want you to act as a data scientist and write code for me. Please build a machine learning model to predict the Purchase variable from the above dataset. Include data preprocessing steps like dropping unnecessary ID columns, encoding categorical variables, handling missing values, and so on.
Prompt 4
Update the code that includes model evaluation.
Another glitch from ChatGPT! It generated evaluation code for a classification problem, even though this dataset poses a regression problem.
Prompt 5
The above code is incorrect. The given dataset is a regression problem.
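Since Purchase is a continuous amount, the evaluation should use regression metrics rather than accuracy. The sketch below computes three common ones by hand: MAE, RMSE, and R². The function name `regression_report` and the predicted values are made up for illustration; scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` would be the usual choice in a real notebook.

```python
import math


def regression_report(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Purchase values from the sample above; the predictions are invented.
y_true = [371, 24, 12, 48]
y_pred = [361, 34, 22, 38]

report = regression_report(y_true, y_pred)
print(report)
```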
Prompt 6
Update the code that includes feature engineering. Keep the rest of the steps the same.
Prompt 7
Write a code to tune the hyperparameters of the random forest. Use the smartest hyper-tuning technique to achieve the best results in less time.
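ChatGPT's answer is not reproduced here. As a rough idea of what such tuning code can look like, the sketch below uses scikit-learn's `RandomizedSearchCV`, which samples a fixed number of parameter combinations instead of trying them all. Picking randomized search as the "smart" technique is my assumption, and both the synthetic data and the small search space are placeholders for the preprocessed Black Friday features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): 120 rows, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Small illustrative search space.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # try only 5 sampled combinations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # regression metric, higher is better
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

Raising `n_iter` trades more search time for a better chance of finding a good combination.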
Prompt 8
Write a code to visualize the most important features.
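Again as a sketch rather than ChatGPT's actual output: a fitted random forest exposes `feature_importances_`, which can be ranked and then plotted. The feature names and the synthetic data below are placeholders, and the example stops at the ranked table so that it runs without a plotting library.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): only feat_a drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"]
)
y = 5 * X["feat_a"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Rank the features by their impurity-based importance.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

If matplotlib is installed, `importances.plot.barh()` turns the same ranking into a horizontal bar chart.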
Prompt 9
I would like to explain the model results. Please write a code to interpret the model results.
Prompt 10
Please write a code to interpret the model results using lime.
Incredible! Very little hand-written programming was required. Coding just got a whole lot easier with ChatGPT.
Conclusion
In this article, we have seen how to use ChatGPT for Data Science. You can automate much of your coding with ChatGPT, specific to your dataset. But sometimes ChatGPT produces glitchy and flawed content. Those are the times when you need to explicitly tell ChatGPT to fix and regenerate the content. It can correct its own errors when you point them out.
Finally, we understood the importance of the right prompts for getting the desired outcomes from ChatGPT as a data scientist. We have also seen some useful Data Science prompts.
That’s all for today. See you in the next blog.