Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem: it computes the conditional probability of each class given the observed features. It assumes that the features are independent of each other given the class, hence "naive". Naive Bayes algorithms address classification problems and are often used for tasks such as fraud detection, spam filtering, and medical diagnosis.
To illustrate the theorem, we consider a medical diagnosis problem: interpreting the result of a COVID-19 test.
Bayes' theorem, on which naive Bayes is built, reads in its general form:
\begin{equation}
P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}
\end{equation}
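The "naive" independence assumption can be written out as a short sketch in the notation of the equation above. The feature symbols \(x_1, \dots, x_n\) are introduced here purely for illustration. If \(X = (x_1, \dots, x_n)\) and the features are assumed independent given the class, the likelihood factorizes into one term per feature:
\begin{equation}
P(X|C) = \prod_{i=1}^{n} P(x_i|C)
\quad \Rightarrow \quad
P(C|X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i|C)
\end{equation}
The denominator \(P(X)\) is the same for every class, so it can be ignored when only the most probable class is needed. In the example that follows there is a single feature, the test result, so the product has just one factor.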
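Given example probabilities
Here \(C\) denotes the event that a person has COVID-19 and \(T\) the event that the test is positive.
\(P(C) = 0.01\) → 1% of the population has COVID-19 (prevalence).
\(P(\neg C) = 0.99\) → 99% of the population does not have COVID-19.
\(P(T | C) = 0.95\) → If a person has COVID-19, the test is positive in 95% of cases (sensitivity).
\(P(T | \neg C) = 0.05\) → If a person does not have COVID-19, the test is still positive in 5% of cases (false-positive rate).
The quantity of interest is the probability that a person is infected given a positive test: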
\begin{equation}
P(C | T) = \frac{P(T | C) \cdot P(C)}{P(T)}
\end{equation}
The probability that the test is positive, whether for an infected or a healthy person, is:
\begin{equation}
P(T) = P(T | C) \cdot P(C) + P(T | \neg C) \cdot P(\neg C)
\end{equation}
Substituting the values:
\( P(T) = (0.95 \cdot 0.01) + (0.05 \cdot 0.99) \)
\( P(T) = 0.0095 + 0.0495 = 0.059 \)
Now, applying Bayes' formula:
\begin{equation}
P(C | T) = \frac{0.95 \cdot 0.01}{0.059} = \frac{0.0095}{0.059} \approx 0.161
\end{equation}
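What does this mean in concrete terms?
The probability that a person actually has COVID-19, given that their test is positive, is only about 16.1%.
But how is this possible?
The disease is rare in the general population (only 1% prevalence). The healthy 99% therefore produce more false positives (0.0495) than the infected 1% produce true positives (0.0095), so a single positive result is weak evidence and a second test may be necessary.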
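The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `posterior` is made up for illustration, and the second-test step assumes that two tests on the same person are conditionally independent, which the example itself does not state and which may not hold for real tests.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(C | T) for a positive test, using Bayes' theorem."""
    # P(T): true positives plus false positives (law of total probability).
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# Values from the example: 1% prevalence, 95% sensitivity, 5% false positives.
p_first = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(C | T) after one positive test:  {p_first:.3f}")   # 0.161

# Assumption: the second test is conditionally independent of the first,
# so the posterior of the first test becomes the prior of the second.
p_second = posterior(prior=p_first, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(C | T) after two positive tests: {p_second:.3f}")  # 0.785
```

Under that independence assumption, a second positive result raises the probability of infection from about 16% to about 78%, which is why a confirmation test is so informative when the disease is rare.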
Naive Bayes Classifier in Python
Imports for the libraries used
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# 1. Load the data ('daten.csv' is a placeholder; use your own file)
df = pd.read_csv('daten.csv')
# 2. Data exploration
print(df.head())
print(df.describe())
print(df.shape)
# 3. Data preparation ('Zielspalte' is a placeholder for the target column)
y = df['Zielspalte']
X = df.drop(columns=['Zielspalte'])
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (fit on the training data only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 4. Create model & train
model = GaussianNB()
model.fit(X_train, y_train)
# 5. Model testing (predictions & accuracy)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()
# 6. Hyperparameter-Tuning with GridSearchCV
param_grid = {'var_smoothing': np.logspace(-9, 0, 10)}
grid_search = GridSearchCV(GaussianNB(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
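# Best model found by the grid search, evaluated on the held-out test data
best_model = grid_search.best_estimator_
print(f'Best model: {best_model}')
print(f'Test accuracy of best model: {best_model.score(X_test, y_test):.2f}')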
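To make it clearer what a Gaussian naive Bayes model estimates during training, here is a small from-scratch sketch that uses only the standard library. It is not scikit-learn's implementation. The function names `fit_gaussian_nb` and `predict` and the toy data are invented for illustration, and the smoothing term only loosely follows the documented meaning of `var_smoothing` (a fraction of the largest feature variance added to every variance).

```python
import math
from collections import defaultdict


def variance(values, mean):
    """Population variance of a list of numbers."""
    return sum((v - mean) ** 2 for v in values) / len(values)


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the class prior and per-feature (mean, variance) for each class."""
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    # Small stabilizing term: a fraction of the largest overall feature variance.
    epsilon = var_smoothing * max(variance(c, sum(c) / len(c)) for c in columns)

    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for j in range(n_features):
            values = [r[j] for r in rows]
            mean = sum(values) / len(values)
            stats.append((mean, variance(values, mean) + epsilon))
        model[label] = (prior, stats)
    return model


def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        # log P(C) + sum over features of log N(x_i | mean, var):
        # summing per-feature terms is the naive independence assumption.
        score = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up data set: two features, two well-separated classes.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict(toy_model, [1.1, 2.0]))  # 0
print(predict(toy_model, [3.1, 0.5]))  # 1
```

Working with log probabilities avoids numerical underflow when many per-feature likelihoods are multiplied, and the smoothing term keeps a feature with near-zero variance within a class from causing a division by zero.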