Decision Trees

A central concept in building a decision tree is information gain, which is based on entropy. The entropy \(H(S)\) of a dataset \(S\) is defined as:

\begin{equation} H(S) = - \sum_{i=1}^{n} p_i \log_2(p_i) \end{equation}

where \( p_i \) is the probability of class \( i \). Higher entropy indicates greater uncertainty in the class distribution.
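As a minimal sketch, the entropy of a label array can be computed directly from the definition (a standalone example, not part of the pipeline below):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# A 50/50 two-class split has the maximal entropy of 1 bit
print(entropy([0, 0, 1, 1]))  # 1.0
```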

The information gain when splitting on attribute \( A \) is the difference in entropy before and after the split:

\begin{equation} IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \end{equation}

Here, \( S_v \) denotes the subset of \( S \) for which attribute \( A \) takes the value \( v \).
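This formula can likewise be sketched in code; the entropy helper from above is repeated so the snippet stands alone:

```python
import numpy as np

def entropy(labels):
    """H(S) as defined above."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(attribute, labels):
    """IG(S, A): entropy of S minus the weighted entropy of the subsets S_v."""
    attribute = np.asarray(attribute)
    labels = np.asarray(labels)
    weighted = sum(
        (labels[attribute == v].size / labels.size) * entropy(labels[attribute == v])
        for v in np.unique(attribute)
    )
    return entropy(labels) - weighted

# An attribute that perfectly separates the classes recovers the full entropy
print(information_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0
```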

Python implementation of a Decision Tree

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

Data Engineering

# 1. Read in Data
df = pd.read_csv('data_set.csv')

# 2. Data Exploration
print(df.head())
print(df.describe())
print(df.shape)

# 3. Data Preparation
y = df['target_column'] 
X = df.drop(columns=['target_column'])

# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Creation & Validation

# Standardization of features
# (Decision trees split on thresholds, so they are insensitive to feature
# scale; this step is optional here and mainly kept for consistency with
# pipelines that also train scale-sensitive models.)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Decision Tree Model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 5. Model Testing (Predictions & Accuracy)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Plotting

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()
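Beyond the confusion matrix, the fitted tree itself can be visualized with scikit-learn's plot_tree. A minimal sketch on the built-in Iris dataset (the data_set.csv pipeline above would pass model and its own feature names instead):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Fit a small tree on a toy dataset so the plot stays readable
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

# Each node shows the split condition, impurity, and class counts
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, ax=ax)
plt.show()
```

Limiting max_depth keeps the rendered tree small enough to read; without it, a fully grown tree on real data is usually too dense to plot usefully.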