Machine Learning Algorithms in Python

What is a Decision Tree Classifier?

How it works and how to implement one from scratch

Serina Grill

--

How It Works

Data

Let’s imagine you’re a matchmaker focused on helping down-on-their-luck men and women get together with the opposite sex. As you had few clients in your first year open for business, you’ve been giving them paper surveys rather than creating an online form. And to be honest, you kind of liked the whole, ‘won’t you sit down, Miss Wonderly’ style of inviting clients into your office on a given evening to aid them in their woes.

But lately, your services have become popular! So you decide to move your forms online and into the twenty-first century.

Forms, glorious forms.

Now unfortunately, you’re a forgetful sort when it comes to those glaringly obvious details, such as including a field for ‘gender’ on the form. You have an inbox of Sams and Alexes that you don’t dare send on a blind date just yet! But you keep your cool, right? After all — you know.. Python.

Armed with your paper surveys from a bygone era, you create a spreadsheet and draw up a few graphs from the data. You find some significant gender differences in how your existing clients responded to the survey. Pacing back and forth in your office and wanting to avoid looking like an idiot, you realize you can save face and just use these graphs to sort out your new clients’ genders on your own. Besides, it’s early enough in the evening that you’re prepared to spend hours falling down a rabbit hole.

Your survey asks clients to weigh the importance of six traits (attractiveness, intelligence, ambition, shared interests, ‘fun-ness’, and sincerity). Each client is given 100 points to divvy up among these traits; inevitably, clients value some traits more than others.

You take a look at the graph and try to place a vertical line through it that will separate the men from the women as best as possible, according to how much they value attractiveness. Where do you think you should put the vertical line on this graph?

If your line is something like this, you’re doing well! This vertical line does a relatively good job of separating the men and women based on how important attractiveness is to them.
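If you want to try this at home, a quick sketch along these lines will do. The numbers below are toy filler rather than real survey responses, and the dashed line at 17 is just one guess at a threshold:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the survey data (illustrative values only).
df = pd.DataFrame({
    "attractiveness": [25, 30, 10, 12, 28, 8, 22, 9],
    "sincerity":      [10, 12, 22, 25, 11, 24, 13, 20],
    "gender":         ["male", "male", "female", "female",
                       "male", "female", "male", "female"],
})

fig, ax = plt.subplots()
for gender, marker in [("male", "o"), ("female", "^")]:
    subset = df[df["gender"] == gender]
    ax.scatter(subset["attractiveness"], subset["sincerity"],
               marker=marker, label=gender)

ax.axvline(x=17, linestyle="--", color="gray")  # one guess at a vertical split
ax.set_xlabel("importance of attractiveness")
ax.set_ylabel("importance of sincerity")
ax.legend()
plt.show()
```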

Now, can you put in a horizontal line to divide the data points by gender based on the importance of sincerity? You might find that this is a lot more difficult to do.

So, of the two questions, which one makes it easier to divide the data in a way that is useful? I hope you’re thinking what I’m thinking.

Algorithm

A decision tree classifier is essentially a process for deciding the order in which to ask questions of your data, so that each question splits (or partitions) it as usefully as possible and you end up with the highest classification accuracy. How do we decide on the order in which to ask questions? You and I can just sense that the first question has more use to us. But why is that?

Unlike us, decision trees lack a gut feeling, or eyes. They have.. math. So, a decision tree must calculate which question leads to a less chaotic and noisy outcome. A word for that is entropy, or the amount of disorder left when the data is split based on a question.

When I asked you to separate the gender labels using a horizontal line, it probably felt chaotic because no matter where you put the line there are many men and women above and beneath it, meaning more entropy, or a high mix of class labels. With the first question about attractiveness, the calculated entropy resulting from both sides of the vertical line is far less.

In fact, the attractiveness question is actually the first one that the decision tree uses to separate the data. It has determined that this question reduces entropy the most and leads to the greatest information gain. And then it moves on to the next best question. And then the next.
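If you like to see the math behind the gut feeling: the entropy of a set of class labels is −Σ pᵢ·log₂(pᵢ), where pᵢ is the proportion of each class in the subset. A tiny sketch (the label lists are made up, and calculate_entropy is just an illustrative name):

```python
import numpy as np
from scipy.stats import entropy

def calculate_entropy(labels):
    """Shannon entropy (in bits) of the class labels in a subset."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=2)  # scipy normalizes the counts for us

print(calculate_entropy(["m", "f", "m", "f"]))  # 1.0 -> a 50/50 mix, maximum disorder
print(calculate_entropy(["m", "m", "m", "m"]))  # 0.0 -> a pure subset, no disorder
```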

A decision tree with a maximum depth of 3 (it can ask 3 questions before taking a majority vote on how to classify the data point.. er, human). Next time you play ‘20 questions’, you can think of it more lovingly as ‘Questions with a Maximum Depth of 20’.

The first question the decision tree above asks is whether attractiveness (a feature of the dataset) is less than or equal to a certain threshold (16.835). Depending on the answer, it branches left or right to further partition the data with more questions. We may not want our tree to keep asking questions until every training label is classified correctly, as this can overfit the training data and leave the model unable to generalize to unseen data. So we employ stopping criteria such as a maximum depth and a minimum number of samples.

With a machine learning library such as scikit-learn, we can fit a decision tree classifier to our data in just a few lines of code. Implementing one from scratch, however, is another matter.
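Something along these lines, for instance. The data below is random filler standing in for the real survey responses and the column names are invented, so treat this as a sketch of the API rather than the project’s actual code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the survey responses.
rng = np.random.default_rng(42)
traits = ["attractiveness", "intelligence", "ambition",
          "shared_interests", "funness", "sincerity"]
X = pd.DataFrame(rng.integers(0, 40, size=(200, 6)), columns=traits)
y = np.where(X["attractiveness"] > 17, "male", "female")  # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```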

If you aren’t interested in even thinking about how to build a classifier with only Python’s built-in methods, numpy, scipy, and a shovel to get in over your own head, feel free to stop reading and go on with your life. For everyone else, godspeed.

Implementation

Building Blocks

Creating a classifier from scratch involves putting a number of functions within a class. When run, these functions perform actions on our data, getting it closer to being fit-able. Something else we should understand is binary recursive partitioning, which for our algorithm is a fancy way of saying that we will

1. Split the data into left & right branches based on whether the answer to our question is ‘yes’ or ‘no’. If you’re wondering what split_feature and split_threshold are, stay tuned!

2. Work downwards (ask more questions) until the remaining subset is pure (has homogeneous class labels) or some other hyperparameter that we have set has been hit. If the subset is still impure when we stop, a majority vote is taken on it to assign a class; a rough skeleton of this recursion follows below.
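In rough, hypothetical code, those two steps look something like this. The helpers it leans on (check_purity, make_classification, get_potential_splits, determine_best_split, split_data) are sketched further down; the names and signatures here are just shorthand for the idea, and the full implementation is linked at the end of the article.

```python
def decisiontree(data, labels, depth=0, max_depth=3, min_samples=5):
    # Stop when the node is pure or a hyperparameter limit has been hit,
    # then classify whatever is left by majority vote.
    if check_purity(labels) or depth == max_depth or len(labels) < min_samples:
        return make_classification(labels)

    # 1. Find the best question, then split into 'yes' (left) and 'no' (right).
    split_feature, split_threshold = determine_best_split(
        data, labels, get_potential_splits(data))
    (left_data, left_labels), (right_data, right_labels) = split_data(
        data, labels, split_feature, split_threshold)

    # 2. Work downwards: ask more questions of each branch.
    return {
        "question": (split_feature, split_threshold),
        "yes": decisiontree(left_data, left_labels, depth + 1, max_depth, min_samples),
        "no": decisiontree(right_data, right_labels, depth + 1, max_depth, min_samples),
    }
```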

In order to start, we will create a dictionary of the potential places where we can split our data. Think of all of the values a feature takes on as an ordered list, and then find the midpoint between each pair of adjacent unique values.
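In code, building that dictionary might look something like this (get_potential_splits is an illustrative name, and I’m assuming the features sit in the columns of a NumPy array):

```python
import numpy as np

def get_potential_splits(data):
    """Candidate thresholds: the midpoints between adjacent unique values
    of every feature (column) in the data."""
    potential_splits = {}
    for feature in range(data.shape[1]):
        unique_values = np.unique(data[:, feature])  # sorted, duplicates removed
        midpoints = (unique_values[:-1] + unique_values[1:]) / 2
        potential_splits[feature] = midpoints
    return potential_splits

# e.g. a feature with the values 10, 15 and 20 yields the thresholds 12.5 and 17.5
data = np.array([[10.0, 40.0], [15.0, 30.0], [20.0, 30.0]])
print(get_potential_splits(data))
```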

These potential splits are the candidate thresholds at which we can cleanly split the data. So we have our split_feature (the feature a question asks about) and our split_threshold (one of the values in the potential splits dictionary).

This allows us to ask our first question,

“Is the importance of attractiveness to you ≤ 16.835?”

First, always begin conversations this way with people when you want them to walk away.

Second, in order to make a practical function that can ask every question possible, we need to ask something more open-ended that can iterate through all of the possibilities. Something like,

“Is the importance of split_feature to you ≤ split_threshold?”

So we have a lot of the questions we could ask of our data. We can divide our data into left and right branch nodes with our split_data() function. Once we’ve reached a certain point, we can use our make_classification() function to predict a label for some observations (data points). But how do we know where the best split is?
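Here are hypothetical sketches of those two helpers. split_data and make_classification are the names used in this project, but the exact signatures below are guesses for the sake of illustration:

```python
import numpy as np

def split_data(data, labels, split_feature, split_threshold):
    """Send rows at or below the threshold (and their labels) to the left
    branch, and rows above the threshold to the right branch."""
    mask = data[:, split_feature] <= split_threshold
    return (data[mask], labels[mask]), (data[~mask], labels[~mask])

def make_classification(labels):
    """Majority vote over whatever labels are left in a node."""
    classes, counts = np.unique(labels, return_counts=True)
    return classes[np.argmax(counts)]

# Toy check: split on feature 0 at the threshold 16.835.
data = np.array([[10.0], [20.0], [30.0]])
labels = np.array(["f", "m", "m"])
(left_data, left_labels), _ = split_data(data, labels, 0, 16.835)
print(left_labels)                  # ['f'] -> only one point fell below 16.835
print(make_classification(labels))  # 'm'  -> the majority class
```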

I think we all know.

Optimization

We can use something called information gain to determine which question is the most helpful to us. Information gain is based on entropy.

  1. We ask ourselves, “What is the current level of entropy in the data?”
  2. We ask the data, “Is the importance of split_feature to you ≤ split_threshold?”
  3. If the answer is yes, the data point goes left (its value is at or below the threshold). What is the resulting entropy on the left side?
  4. If the answer is no, the data point goes right (its value is above the threshold). What is the resulting entropy on the right side?
  5. Now, before we get too excited, what is the weighted sum of the left entropy and the right entropy resulting from this question (weighted by how many data points land on each side)?
  6. Most importantly, how much less entropy is that than what the data has right now? That difference is the information gain.
  7. Before deciding, though, you have to work through these steps for every feature and threshold (every possible question); a sketch of that search follows below. Either your mind was just blown or you’re like me and the concept is strangely exciting.
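In hypothetical code, that search might look like this. calculate_entropy mirrors the earlier entropy sketch, and determine_best_split simply tries every candidate question and keeps the one with the highest information gain (the names are illustrative, not necessarily the ones in the real code):

```python
import numpy as np
from scipy.stats import entropy

def calculate_entropy(labels):
    """Shannon entropy (in bits) of the class labels in a subset."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=2)

def information_gain(labels, left_labels, right_labels):
    """Parent entropy minus the weighted entropy of the two child branches."""
    n = len(labels)
    children = (len(left_labels) / n * calculate_entropy(left_labels)
                + len(right_labels) / n * calculate_entropy(right_labels))
    return calculate_entropy(labels) - children

def determine_best_split(data, labels, potential_splits):
    """Try every feature/threshold pair and keep the most informative question."""
    best_gain, best_feature, best_threshold = -1.0, None, None
    for feature, thresholds in potential_splits.items():
        for threshold in thresholds:
            mask = data[:, feature] <= threshold
            gain = information_gain(labels, labels[mask], labels[~mask])
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, feature, threshold
    return best_feature, best_threshold

# Toy check: the split at 22.5 separates the classes perfectly.
data = np.array([[10.0], [15.0], [30.0], [35.0]])
labels = np.array(["f", "f", "m", "m"])
print(determine_best_split(data, labels, {0: [12.5, 22.5, 32.5]}))  # feature 0, threshold 22.5
```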
What if you could apply these functions in real life? I know what I want for Christmas!

Model

After that, we put all of these functions together into a larger decisiontree() function. We then add additional helper functions to calculate entropy and purity. Finally, we fit this tree to our data and use it to predict labels. With this homemade decision tree classifier and the speed dating dataset referenced above, we get 88.2% accuracy from a baseline of 50%. With the official decision tree classifier from sklearn we see an accuracy of 89.85%.
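For completeness, here is a hypothetical sketch of the purity helper and of the accuracy calculation behind that final score (the 88.2% and 89.85% figures above come from the real dataset, not from this toy example):

```python
import numpy as np

def check_purity(labels):
    """A node is pure when only one class label is left in it."""
    return len(np.unique(labels)) == 1

def accuracy(y_true, y_pred):
    """The fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(check_purity(["m", "m", "m"]))               # True
print(accuracy(["m", "f", "m"], ["m", "f", "f"]))  # two out of three correct
```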

Conclusion

Well, matchmaker, how are you doing? You spent all night on the floor of your office, wept just a little (internally), and switched from coffee to whiskey. On the other hand, you built a machine learning algorithm from scratch to classify new data to help out your business. You can now successfully send a client like Alex or Sam off to meet with their potential schmoopy, with only a small chance of giving them an unasked-for surprise. Kudos.

Code

All code used for this project can be found on GitHub.

Data

The dataset used for this project can be found on Kaggle.
