
The Hackathon Practice Guide by Analytics Vidhya

Introduction

A hackathon is a platform where you get the chance to apply your data science and machine learning knowledge and techniques. It is a place where you can evaluate yourself by competing against, and learning from, fellow data science experts.

Here is an exclusive guide to help you prepare for participating in hackathons. This guide lists the important techniques which you should practice before stepping into the playing ground.

We'll keep building this guide into a one-stop, exhaustive resource for data science techniques and algorithms.

1. Framework of the Model Building Process

This is how the framework for model building works: you get data from multiple sources and then perform the extraction and transformation operations. Once your data has been transformed, you apply your knowledge of predictive modeling and business understanding to build predictive models.

[Figure: model building process]

2. Hypothesis Generation

  • In your groups, list down all possible variables which might influence the dependent variable (the variable to be predicted)
  • Download the dataset provided by Analytics Vidhya
  • Next, look at the dataset and see which variables are available

Make sure you always do this in this order.

3. Data Exploration and Feature Engineering

  • Import the data set
  • Variable identification
  • Univariate, Bivariate and Multivariate analysis
  • Identify and Treat missing and outlier values
  • Create new variables or transform existing variables
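As a minimal sketch of the missing-value and outlier steps above, the snippet below uses only the Python standard library. The `ages` column, the median imputation and the 1.5 × IQR capping rule are illustrative assumptions, not a method prescribed by this guide.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def cap_outliers(values, k=1.5):
    """Cap values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

# Hypothetical toy column: one missing value and one extreme value.
ages = [23, 25, None, 31, 29, 27, 120]
filled = impute_median(ages)
cleaned = cap_outliers(filled)
print(filled)
print(cleaned)
```

Here the missing entry is filled with the median of the observed ages, and the extreme value 120 is pulled back to the upper fence while the other values are left unchanged. Whether capping, removal or a transformation is appropriate depends on the dataset.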


Modelling Techniques

1) Logistic Regression

  • Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous
  • Used when the focus is on whether or not an event occurred, rather than when it occurred
  • Here, instead of modelling the outcome Y directly, the method models the log odds of Y as a linear function of X; the logistic function converts the log odds back into a probability
  • Analysis of variance (ANOVA) and logistic regression are both special cases of the Generalized Linear Model (GLM)
  • The probability of success falls between 0 and 1 for all possible values of X


a) Logit Transformation

[Figure: logit transformation]

b) Logit is directly related to Odds

  • The logistic model can be written as:

[Equation image: the logistic model]

  • This implies that the odds for success can be expressed as:

[Equation image: the odds for success]

  • This relationship is the key to interpreting the coefficients in a logistic regression model
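The relationship between the logit, the odds and the probability can be checked numerically. The sketch below assumes the standard single-predictor form, log(p / (1 - p)) = b0 + b1 * X, which implies odds of exp(b0 + b1 * X). The coefficient values `b0` and `b1` are made up for illustration.

```python
import math

def probability(x, b0, b1):
    """Logistic function: maps the linear predictor b0 + b1*x to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds of success for a probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log odds of success."""
    return math.log(odds(p))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -1.5, 0.8

p_at_2 = probability(2.0, b0, b1)
p_at_3 = probability(3.0, b0, b1)

# The logit recovers the linear predictor ...
print(logit(p_at_2), b0 + b1 * 2.0)
# ... and a one-unit increase in X multiplies the odds by exp(b1).
odds_ratio = odds(p_at_3) / odds(p_at_2)
print(odds_ratio, math.exp(b1))
```

This is why a coefficient is usually read through its exponential: exp(b1) is the factor by which the odds change for each one-unit increase in X.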


2) Decision Tree

  • A decision tree is a type of supervised learning algorithm
  • It works for both categorical and continuous input and output variables
  • It is mostly used for classification: it splits the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables
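One common way to score a candidate splitter is Gini impurity, as used by CART. The sketch below is an illustration under that assumption (the guide does not say which criterion it has in mind), and the toy student records are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after splitting rows on a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_splitter(rows, features, target):
    """Return the feature whose split gives the most homogeneous subsets."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Hypothetical toy data: does a student play cricket?
students = [
    {"gender": "M", "class": "IX", "plays": "yes"},
    {"gender": "M", "class": "X",  "plays": "yes"},
    {"gender": "M", "class": "IX", "plays": "yes"},
    {"gender": "F", "class": "X",  "plays": "no"},
    {"gender": "F", "class": "IX", "plays": "no"},
    {"gender": "F", "class": "X",  "plays": "yes"},
]

for feature in ("gender", "class"):
    print(feature, round(split_impurity(students, feature, "plays"), 3))
print("best splitter:", best_splitter(students, ["gender", "class"], "plays"))
```

In this toy data, splitting on `gender` leaves purer groups than splitting on `class`, so it is chosen as the first splitter. A full tree would repeat the same search inside each resulting subset.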

Decision Tree – Example

[Figure: decision tree example]

Types of Decision Trees

  • Binary Variable Decision Tree: a decision tree with a binary target variable. Example: in the student scenario above, the target variable was "Student will play cricket or not", i.e. YES or NO.
  • Continuous Variable Decision Tree: a decision tree with a continuous target variable.

Decision Tree – Terminology

[Figure: decision tree terminology]

Decision Tree – Advantages and Disadvantages

Advantages:

  • Easy to understand
  • Useful in data exploration
  • Less data cleaning required
  • Data type is not a constraint
  • Not sensitive to skewed distributions

Disadvantages:

  • Prone to overfitting
  • Less suited to continuous variables, since splitting on thresholds discards some of the information in the variable


3) Random Forest

  • Random Forest is a computationally intensive ensemble algorithm that builds many decision trees.
  • It combines bootstrapping (sampling the training data with replacement) with decision tree (CART) models.
  • Random forest gives much more accurate predictions than simple CART/CHAID or regression models in many scenarios.
  • It captures the variance of several input variables at the same time and enables a high number of observations to participate in the prediction.
  • A different subset of the training data and a different subset of variables are selected for each tree.
  • The remaining training data are used to estimate error and variable importance.
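The per-tree sampling described above can be sketched with the standard library alone. This snippet only draws the bootstrap rows, the variable subset and the left-out rows for each tree; it does not grow the trees. The row count, the variable names and the square-root rule for the subset size are assumptions for illustration, and many implementations re-draw the variable subset at every split rather than once per tree.

```python
import math
import random

def draw_tree_sample(n_rows, variables, rng):
    """Draw the training rows and candidate variables for a single tree."""
    # Bootstrap: sample row indices with replacement, same size as the data.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn are "out-of-bag" and can be used to estimate error.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    # Random subset of variables (the square-root rule is a common default).
    k = max(1, int(math.sqrt(len(variables))))
    chosen = rng.sample(variables, k)
    return in_bag, out_of_bag, chosen

rng = random.Random(42)  # fixed seed so the sketch is reproducible
variables = ["age", "income", "tenure", "region"]  # hypothetical names
n_rows = 1000

for tree_id in range(3):
    in_bag, out_of_bag, chosen = draw_tree_sample(n_rows, variables, rng)
    share = len(out_of_bag) / n_rows
    print(tree_id, chosen, f"out-of-bag share: {share:.2f}")
```

Because rows are drawn with replacement, roughly 37% of them are left out of each tree's sample on average. Those left-out rows are what makes an error estimate possible without a separate validation set.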

Random Forest – Advantages and Disadvantages

Advantages:

  • No need for pruning trees
  • Accuracy and variable importance generated automatically
  • Overfitting is much less of a problem than with a single decision tree
  • Not very sensitive to outliers in training data
  • Easy to set parameters

Disadvantages:

  • It is a black box: the rules behind the model cannot easily be explained


4) Support Vector Machine (SVM)

  • It is primarily a classification technique.
  • Support vectors are the coordinates of the individual observations that lie closest to the frontier
  • A Support Vector Machine finds the frontier (hyperplane) that best segregates one class from the other
  • Solving SVMs is a quadratic programming problem
  • Seen by many as one of the most successful text classification methods
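To make the idea of a separating frontier concrete, here is a minimal sketch of a linear SVM. Note the swap: instead of the quadratic programming formulation mentioned above, it minimises the regularised hinge loss with plain sub-gradient descent, which is simpler to write but is not how dedicated SVM solvers work. The data points, the learning rate and the regularisation strength are all hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by sub-gradient descent on the regularised hinge loss."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify by the side of the frontier w.x + b = 0 the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, already standardised features: (height, hair length).
points = [(1.0, -1.0), (1.5, -0.5), (0.8, -1.2), (1.2, -0.8),   # class +1
          (-1.0, 1.0), (-1.5, 0.5), (-0.8, 1.2), (-1.2, 0.8)]   # class -1
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print("w =", w, "b =", b)
print([predict(w, b, x) for x in points])
```

Only the points with a margin below 1 contribute to each update, which mirrors the idea that the frontier is determined by the observations nearest to it.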


Case Study 1

We have a population of 50% males and 50% females. Here, we want to create a set of rules which will predict the gender class for the rest of the population.

[Figure: plot of the population]

The blue circles in the plot represent females and the green squares represent males.

Males in our population have a higher average height.

Females in our population have longer scalp hair.

Case Study 2

[Figure: Case Study 2]


Text Mining:

Text mining is the analysis of data contained in natural language text. It works by converting the words and phrases of unstructured data into numerical values, which can then be linked with structured data in a database and analyzed with traditional data mining techniques.
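A very small illustration of turning text into numerical values is a bag-of-words count matrix. The sketch below assumes a simple tokenisation (lower-cased alphabetic words) and uses made-up documents. Real projects typically add steps such as stop-word removal or TF-IDF weighting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical documents, for illustration only.
docs = [
    "The service was good, really good.",
    "The delivery was late.",
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` is a numeric representation of one document, so it can be joined to structured columns and passed to any of the modelling techniques above.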


End Notes

In this guide we covered various modelling techniques, text analytics and the stages necessary for building a sound model.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.

avcontentteam

26 Apr 2018
