A central concept in building a decision tree is information gain, which is based on entropy. The entropy \(H(S)\) of a set of examples \(S\) is defined as:
\begin{equation} H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i) \end{equation}
where \( p_i \) is the probability of class \( i \) in \( S \). Higher entropy indicates greater uncertainty in the class distribution.
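To make the formula concrete, here is a minimal sketch of the entropy computation in NumPy (the function name and the example label arrays are illustrative, not part of any library API):

import numpy as np

def entropy(labels):
    # Shannon entropy H(S) of an array of class labels, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()       # class probabilities p_i
    return -np.sum(p * np.log2(p))  # H(S) = -sum_i p_i * log2(p_i)

print(entropy(np.array([0, 1, 0, 1])))  # evenly mixed binary set -> 1.0 bit
print(entropy(np.array([1, 1, 1, 1])))  # pure set -> 0 bits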
The information gain achieved by splitting on an attribute \(A\) is the difference between the entropy before the split and the weighted entropy after it:
\begin{equation} IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \end{equation}
Here, \(S_v\) denotes the subset of \(S\) in which attribute \(A\) takes the value \(v\).
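Building on the entropy helper sketched above, information gain can be computed by grouping a pandas DataFrame on the splitting attribute. This is a sketch, not a library function; the column names passed in are placeholders:

import pandas as pd

def information_gain(df, attribute, target):
    # IG(S, A): entropy before the split minus the weighted entropy of the subsets S_v.
    h_before = entropy(df[target].values)
    h_after = sum(
        len(subset) / len(df) * entropy(subset[target].values)  # |S_v|/|S| * H(S_v)
        for _, subset in df.groupby(attribute)                  # one subset per value v of A
    )
    return h_before - h_after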
Python Implementation of a Decision Tree
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
Data Engineering
# 1. Read in Data
df = pd.read_csv('data_set.csv')
# 2. Data Exploration
print(df.head())
print(df.describe())
print(df.shape)
# 3. Data Preparation
y = df['target_column']
X = df.drop(columns=['target_column'])
# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Model Creation & Validation
# Standardization of Features (optional for decision trees, which split on
# thresholds and are insensitive to feature scaling, but harmless here)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 4. Decision Tree Model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
# 5. Model Testing (Predictions & Accuracy)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Plotting
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()
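Beyond the confusion matrix, the fitted tree itself can be visualized with sklearn.tree.plot_tree. A minimal sketch (feature names are omitted because the scaler above returns a plain NumPy array; max_depth=2 is an arbitrary choice to keep the plot readable):

from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(model, filled=True, max_depth=2)  # show only the top levels of the tree
plt.show()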