One Hot Encoding and Hazard Prediction

One hot encoding is a process where categorical variables are converted into a form that is fed to a machine learning algorithm for more accurate prediction.

I demonstrated the concept on a dummy data curated by me. Screen Shot 2021-04-03 at 6.32.26 PM.png

In this sample dataset, 'ascites', 'edema', and 'stage' are categorical variables while 'cholesterol' is a continuous variable, since it can be any decimal value greater than zero.

In this dataset, I applied one hot encoding to the edema column because it had three categories. One hot encoding onto this column will create feature columns for each of the possible outcome. I also applied this technique to the stage column as they are not in form of 0 and 1.

Hot encoding 'stage'

Screen Shot 2021-04-03 at 6.40.46 PM.png Looking at the above table, it shows that one of the features is redundant, so to avoid multicollinearity, I dropped on of the columns.

Screen Shot 2021-04-03 at 6.43.48 PM.png

I converted the one hot encoded values to decimals since the model expects the values to be fed to it in a certain data type. I renamed the 'stage_4' column to 'stage', then converted the values to a float64 data type using numpy.

Screen Shot 2021-04-03 at 6.47.31 PM.png

Prediction:

Using the above dataset, I predicted a new patient's hazard using the hazard function

Screen Shot 2021-04-03 at 6.51.10 PM.png

where:

Theta is the coefficient of Xi features lambda(t, x) is the patient's hazard.

I multiplied the coefficient with the features. Screen Shot 2021-04-03 at 6.55.34 PM.png

Screen Shot 2021-04-03 at 6.55.46 PM.png

Screen Shot 2021-04-03 at 6.55.54 PM.png

Screen Shot 2021-04-03 at 6.56.00 PM.png

The coefficient is a 1D array, so I transposed the features which is the X

Screen Shot 2021-04-03 at 6.58.02 PM.png

Finally, I predicted the hazards of three patients.

Screen Shot 2021-04-03 at 6.59.08 PM.png

Conclusion:

I believe I have given an understanding of what one hot encoding means and how to go about it. This is the code and you can connect with me on LinkedIn . Thank you for reading.