Introduction to Machine Learning (ML) - Classification Tree
What is Machine Learning?
Machine learning involves building and training models to recognise patterns in datasets so that they can be used for prediction. For instance, if a dataset says that student A has a height of 1.5m and a weight of 50kg, and student B a height of 1.75m and a weight of 75kg, a linear model might predict that student C, who has a height of 2m, will weigh 100kg. The actual weight of a 2m-tall student might not be exactly 100kg, but it should be close if the model is accurate. This linear regression model is a simple example of supervised machine learning, where the y-values (here, the weights) are known in the training dataset.
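To make the arithmetic concrete, here is a minimal sketch that fits a straight line through the two students above and predicts the weight of a 2m-tall student. The function and variable names are my own, and with only two points the "fit" is simply the line passing through both of them; a real regression would use many more records.

```python
# Training data from the example: (height in m, weight in kg)
student_a = (1.5, 50.0)
student_b = (1.75, 75.0)

# Slope and intercept of the line through the two points
slope = (student_b[1] - student_a[1]) / (student_b[0] - student_a[0])
intercept = student_a[1] - slope * student_a[0]

def predict_weight(height_m):
    """Predict weight (kg) from height (m) using the fitted line."""
    return slope * height_m + intercept

print(predict_weight(2.0))
```

Each extra 0.25m of height adds 25kg in this toy dataset, which is why the prediction for 2m comes out at 100kg.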
Another example of supervised learning is the classification tree, which we can traverse from the root downwards to find out the important factors that affect the classification.
For instance, suppose we wish to find out whether a member of a popular gym, sweat45, will renew his or her gym membership. Based on the simplified example below, we can see that those who are young and have been a member for >= 1 year all renewed their membership, while those who are young but have NOT been a member for >= 1 year all did not renew. As such, the next time the model encounters a record for someone who is young and a new member, it will predict that he or she will not renew the membership.
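The traversal described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the field names (`age_group`, `member_at_least_1_year`) and the leaf labels are assumptions of mine, and only the "Young" branch is modelled because that is the only branch spelled out in the text.

```python
# Hypothetical tree covering only the "Young" branch described above.
# Internal nodes are dicts (a question plus branches); leaves are strings.
tree = {
    "question": "age_group",
    "branches": {
        "young": {
            "question": "member_at_least_1_year",
            "branches": {True: "renews", False: "does not renew"},
        },
    },
}

def predict(node, record):
    """Walk from the root to a leaf, following the record's answers."""
    while isinstance(node, dict):
        answer = record[node["question"]]
        node = node["branches"].get(answer)
    return node  # None if the branch is not modelled in this sketch

new_member = {"age_group": "young", "member_at_least_1_year": False}
print(predict(tree, new_member))
```

Each step of the loop answers one question and moves one level down the tree, so a prediction needs at most as many comparisons as the tree is deep.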
Examining the tree further, we can see the important factors affecting a member's decision to renew an annual membership across the three age groups.
Additionally, based on the data dictionary below, we can see that "Renewed membership" is the dependent variable, with the other decision nodes being the independent variables.
In deciding which variable to split a node on, the algorithm relies on the concept of node purity: it favours the split whose resulting child nodes contain (ideally) data points that all share the same value of the categorical y-variable. This is why the node "Young" was split on "Has been a member for >=1 year" rather than "Has a full-time job".
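One common way to put a number on node purity is Gini impurity, where 0 means every data point in the node has the same label. The post does not say which measure the tree uses, so treat Gini here as an assumption. The four records below are invented purely for illustration and are not the gym's actual data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels: 0.0 means perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(rows, feature, target="renewed"):
    """Average impurity of the child nodes after splitting on `feature`,
    weighted by how many rows fall into each child."""
    n = len(rows)
    total = 0.0
    for value in {r[feature] for r in rows}:
        child = [r[target] for r in rows if r[feature] == value]
        total += len(child) / n * gini(child)
    return total

# Hypothetical "Young" members (made-up records for illustration only)
young = [
    {"member_1yr": True,  "full_time_job": True,  "renewed": True},
    {"member_1yr": True,  "full_time_job": False, "renewed": True},
    {"member_1yr": False, "full_time_job": True,  "renewed": False},
    {"member_1yr": False, "full_time_job": False, "renewed": False},
]

for feature in ("member_1yr", "full_time_job"):
    print(feature, weighted_gini(young, feature))
```

In this made-up sample, splitting on `member_1yr` leaves both child nodes pure (impurity 0), while splitting on `full_time_job` leaves each child half renewed and half not (impurity 0.5), so the algorithm would pick the membership-length split.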
Once the tree has been trained, it is tested against a fresh dataset that it has not seen before. If the prediction results are satisfactory, it can then be used as a trained model on future datasets to aid in business decisions. Otherwise, feature selection (deciding which x-variables to use and keep) is revisited and the training process restarts.
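The test step can be sketched as a simple accuracy check on held-out records. Everything here is a stand-in: `predict_renewal` imitates a trained tree using a single rule, the test records are invented, and the 0.8 threshold for "satisfactory" is an arbitrary assumption that would in practice depend on the business.

```python
def predict_renewal(record):
    """Stand-in for a trained tree: predicts renewal for members of >= 1 year."""
    return record["member_1yr"]

def accuracy(predict, rows, target="renewed"):
    """Fraction of rows where the prediction matches the known outcome."""
    correct = sum(1 for r in rows if predict(r) == r[target])
    return correct / len(rows)

# Hypothetical held-out records that were not used for training
test_rows = [
    {"member_1yr": True,  "renewed": True},
    {"member_1yr": False, "renewed": False},
    {"member_1yr": True,  "renewed": False},  # the rule gets this one wrong
    {"member_1yr": False, "renewed": False},
]

THRESHOLD = 0.8  # assumed cut-off for "satisfactory"
score = accuracy(predict_renewal, test_rows)
if score >= THRESHOLD:
    print(f"accuracy {score:.2f}: keep the model")
else:
    print(f"accuracy {score:.2f}: revisit feature selection and retrain")
```

With three of the four test records predicted correctly, the score falls short of the assumed threshold, which is the situation where one would go back to feature selection.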
Classification trees are simple to understand ("white box"), and it is easy to explain the model's predictions to stakeholders. They also allow for multi-variate analysis, with the ability to handle both categorical and numerical data. Prediction is efficient as well: classifying a record only requires following one path from the root to a leaf, so the time taken is proportional to the depth of the tree, which is roughly O(log n) for a reasonably balanced tree trained on n records.
Now that we understand how a classification tree works, stay tuned for the next post, where I will be explaining clustering, which is a form of unsupervised machine learning.