Dummy Variables in Machine Learning
Machine learning models work very well with numerical data. However, data does not always come in that format. Sometimes it comes as strings, and this information still has to be incorporated into model training. Take the data set below as an example.
In the dummy data set above, the car model column holds string data (data in text format). The car model has to be incorporated into the model because it helps determine the price of the car. Because machine learning models only take in numerical data, the car model information will be incorporated through dummy variables.
So, how exactly do dummy variables work?
Dummy variables are numerical variables that stand in for the original categorical data. For instance, given a gender column with the values male and female, you could use 0 to represent male and 1 to represent female. The information is preserved, but it is now in a numerical format that a machine learning model can use. In this example, the dummy variable column is used in training in place of the gender column. This is the whole idea of dummy variables.
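The gender example can be sketched in a few lines of pandas. This is a minimal illustration only: the `people` data frame and the `gender_dummy` column name are made up for this sketch and are not part of the car data set.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
people = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Replace each text label with a number: 0 for male, 1 for female.
people["gender_dummy"] = people["gender"].map({"male": 0, "female": 1})

print(people)
```

The `gender_dummy` column carries the same information as `gender`, but in a form a model can consume.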
In pandas, we use the get_dummies function to achieve this, as shown below.
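Here is a minimal sketch of that call. The rows, values, and column names (`Car Model`, `Mileage`, `Sell Price`, `Age`) are assumptions made up for illustration, since the original data set is not reproduced here. Passing `dtype=int` is also a choice made for this sketch so that the output shows 1 and 0 rather than True and False.

```python
import pandas as pd

# Hypothetical car data; values are invented for illustration.
cars = pd.DataFrame({
    "Car Model": ["BMW X5", "BMW X5", "Audi A5", "Audi A5",
                  "Mercedes Benz", "Mercedes Benz"],
    "Mileage": [70000, 35000, 60000, 50000, 65000, 45000],
    "Sell Price": [18000, 34000, 29000, 32000, 22000, 31000],
    "Age": [6, 3, 5, 4, 6, 4],
})

# One new column per distinct car model; 1 marks the model of that row.
dummies = pd.get_dummies(cars["Car Model"], dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column for that row's car model.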
In the snapshot above, each car model gets its own column. A value of 1 means the row belongs to that car model (True), and 0 means it does not (False).
Now that we have created the dummy variables, the next step is to join the dummy columns to the original data set.
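One way to do the join is with `pd.concat` along the column axis. The sketch below uses the same kind of made-up data as before; the variable name `merged` is an assumption of this example.

```python
import pandas as pd

# Hypothetical car data; values are invented for illustration.
cars = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedes Benz", "Audi A5"],
    "Mileage": [70000, 60000, 65000, 50000],
    "Sell Price": [18000, 29000, 22000, 32000],
    "Age": [6, 5, 6, 4],
})
dummies = pd.get_dummies(cars["Car Model"], dtype=int)

# Place the dummy columns side by side with the original columns.
merged = pd.concat([cars, dummies], axis="columns")

print(merged)
```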
The result will be as shown below. Take the first row as an example: even without looking at the car model column, you can tell that it belongs to the BMW X5, because that is the dummy column with a value of 1. That is how to interpret the table.
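Now that we no longer need the Car Model column, it can be dropped from the data frame. In addition, to avoid the dummy variable trap, we will also drop one of the dummy variable columns. The trap arises because the dummy columns always sum to 1, so any one of them can be derived from the others, and that redundancy (perfect multicollinearity) can cause problems for a linear model.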
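A sketch of both drops in one call follows, again on made-up data. The names `final` and `is_audi` are assumptions of this example; the last line only demonstrates that the dropped Audi A5 category can still be recovered from the remaining dummy columns.

```python
import pandas as pd

# Hypothetical car data; values are invented for illustration.
cars = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedes Benz", "Audi A5"],
    "Mileage": [70000, 60000, 65000, 50000],
    "Sell Price": [18000, 29000, 22000, 32000],
    "Age": [6, 5, 6, 4],
})
dummies = pd.get_dummies(cars["Car Model"], dtype=int)
merged = pd.concat([cars, dummies], axis="columns")

# Drop the original text column and one dummy column (here Audi A5).
final = merged.drop(["Car Model", "Audi A5"], axis="columns")

# A row with 0 in both remaining dummy columns must be an Audi A5.
is_audi = (final["BMW X5"] == 0) & (final["Mercedes Benz"] == 0)

print(final)
```

As an alternative, `pd.get_dummies` accepts `drop_first=True`, which omits the first category's column at creation time.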
Although one of the dummy variables has been dropped from the data above, we can still tell which row represents which car. For instance, in row 6 the BMW and Mercedes columns both have a value of 0. This means the row belongs to neither of them, and so it must belong to the category we dropped (Audi A5). As you can see, although this column is dropped, its information is still accounted for.
With that out of the way, we can now proceed to the usual exploratory data analysis (EDA). The point of EDA is to get an idea of the distribution of the data. It can also inform the choice of model to train.
From the visualization, one can see that as the mileage increases, the price of the car decreases, which is what we would expect.
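The same relationship can be checked numerically as a complement to the plot. The sketch below computes the Pearson correlation between mileage and price on made-up values; it is not the visualization itself, and the numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical car data; values are invented for illustration.
cars = pd.DataFrame({
    "Mileage": [70000, 35000, 55000, 60000, 50000, 75000, 65000, 80000, 45000],
    "Sell Price": [18000, 34000, 26000, 29000, 32000, 19000, 22000, 20000, 31000],
})

# A negative correlation means price tends to fall as mileage rises.
mileage_price_corr = cars["Mileage"].corr(cars["Sell Price"])

print(mileage_price_corr)
```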
The next step is to train a linear regression model on the data. Besides what the visualizations show, the value we are trying to predict (price) is continuous, which makes a linear model well suited to this problem.
As usual, drop the target variable (price) from the data frame to create the training features X. The target column itself becomes y, the value being predicted.
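A minimal sketch of the split and the fit, assuming scikit-learn's `LinearRegression` is the linear model in use. The data is made up and already encoded, with the Audi A5 dummy column dropped.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical, already-encoded data (Audi A5 is the dropped category).
final = pd.DataFrame({
    "Mileage": [70000, 35000, 55000, 60000, 50000, 75000, 65000, 80000, 45000],
    "Sell Price": [18000, 34000, 26000, 29000, 32000, 19000, 22000, 20000, 31000],
    "Age": [6, 3, 5, 5, 4, 7, 6, 7, 4],
    "BMW X5": [1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Mercedes Benz": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Features: everything except the target. Target: the price column.
X = final.drop("Sell Price", axis="columns")
y = final["Sell Price"]

model = LinearRegression()
model.fit(X, y)

# R^2 on the training data, as a rough sanity check only.
r_squared = model.score(X, y)
print(r_squared)
```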
Use the trained model to make predictions.
To make a prediction, you have to supply a value for each feature in the training data set. For instance, to predict the price of an Audi, you need data for each of the columns below:
As per the question, the mileage is 50000 and the age is 3 years. Since this is an Audi, the BMW column will be 0 and the Benz column will also be 0. The two pairs of square brackets mean the input is a two-dimensional array: the outer list holds the rows and the inner list holds the feature values for one car.
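Putting this together, the prediction can be sketched as follows. The training data is made up, so the predicted number is illustrative only. Wrapping the input in a data frame with the training column names is a choice made for this sketch so the feature order is explicit.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical, already-encoded data (Audi A5 is the dropped category).
final = pd.DataFrame({
    "Mileage": [70000, 35000, 55000, 60000, 50000, 75000, 65000, 80000, 45000],
    "Sell Price": [18000, 34000, 26000, 29000, 32000, 19000, 22000, 20000, 31000],
    "Age": [6, 3, 5, 5, 4, 7, 6, 7, 4],
    "BMW X5": [1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Mercedes Benz": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})
X = final.drop("Sell Price", axis="columns")
y = final["Sell Price"]
model = LinearRegression().fit(X, y)

# One row, four features: mileage, age, BMW dummy, Mercedes dummy.
# Both dummies are 0, which encodes the dropped category (Audi A5).
audi = pd.DataFrame([[50000, 3, 0, 0]], columns=X.columns)
predicted_price = model.predict(audi)

print(predicted_price)
```

The order of the values in the inner list must match the order of the columns in X.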