Naive Bayes

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem: it computes the conditional probability of each class given the observed features. It assumes that the features are independent of each other given the class, hence “naive”. Naive Bayes algorithms address classification problems and are often used for tasks such as fraud detection, spam filtering, or medical diagnosis.

We therefore consider a medical diagnosis problem, the interpretation of a COVID-19 test result, starting from the general form of Bayes' theorem, where C is the class and X the observed evidence:
\begin{equation}
P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}
\end{equation}
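
The independence assumption behind the word “naive” can be written out explicitly. As a sketch, with the features denoted X = (x_1, ..., x_n) (a notation introduced here for illustration), the likelihood factorizes into one term per feature:

\begin{equation}
P(X|C) = \prod_{i=1}^{n} P(x_i|C)
\end{equation}

The classifier then chooses the class with the largest posterior. Since P(X) is the same for every class, it can be dropped from the comparison:

\begin{equation}
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i|C)
\end{equation}

In the single-test example that follows there is only one feature (the test result), so the product has a single factor.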

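Example probabilities, where C denotes “the person has COVID-19” and T denotes “the test is positive”:

\(P(T | \neg C) = 0.05\) → If a person does not have COVID-19, the test is still positive in 5% of cases (false positive rate).

\(P(C) = 0.01\) → 1% of the population has COVID-19.

\(P(\neg C) = 0.99\) → 99% of the population is healthy.

\(P(T | C) = 0.95\) → If a person has COVID-19, the test is positive in 95% of cases (sensitivity).

We are looking for the probability that a person is infected given a positive test: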

\begin{equation}
P(C | T) = \frac{P(T | C) \cdot P(C)}{P(T)}
\end{equation}

The probability that the test is positive, whether for an infected or a healthy person, is:

\begin{equation}
P(T) = P(T | C) \cdot P(C) + P(T | \neg C) \cdot P(\neg C)
\end{equation}

Substituting the values:

\( P(T) = (0.95 \cdot 0.01) + (0.05 \cdot 0.99) \)
\( P(T) = 0.0095 + 0.0495 = 0.059 \)

Now, applying Bayes' formula:

\begin{equation}
P(C | T) = \frac{0.95 \cdot 0.01}{0.059} = \frac{0.0095}{0.059} \approx 0.161
\end{equation}
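
The calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the function name `posterior_positive` and its parameter names are chosen here for illustration and do not come from any library.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """Return P(C | T), the probability of disease given a positive test."""
    # Law of total probability: P(T) = P(T|C) * P(C) + P(T|not C) * P(not C)
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Bayes' theorem: P(C|T) = P(T|C) * P(C) / P(T)
    return sensitivity * prior / p_positive


p_c_given_t = posterior_positive(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(C | T) = {p_c_given_t:.3f}")  # prints: P(C | T) = 0.161
```

Changing `prior` shows how strongly the result depends on the prevalence of the disease.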

What does this mean concretely?

The probability that a person actually has COVID-19, given that their test is positive, is only about 16.1%.

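But how is this possible?

Because the disease is rare in the general population (only 1% prevalence), the false positives among the many healthy people (0.0495 of everyone tested) outnumber the true positives among the few infected people (0.0095 of everyone tested). A positive result can therefore make a second test necessary.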
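
The same effect is easier to see with absolute counts. The sketch below applies the probabilities from the example to a hypothetical population of 100,000 people (a number chosen here purely for illustration). It also estimates the effect of a second positive test, under the added assumption that the two test results are independent given the person's true status, which real tests may not satisfy.

```python
population = 100_000          # hypothetical population size (assumption)
prevalence = 0.01             # P(C)
sensitivity = 0.95            # P(T | C)
false_positive_rate = 0.05    # P(T | not C)

infected = population * prevalence
healthy = population - infected
true_positives = infected * sensitivity
false_positives = healthy * false_positive_rate

# Share of all positive tests that belong to infected people
share_infected = true_positives / (true_positives + false_positives)
print(f"{true_positives:.0f} true positives vs. {false_positives:.0f} false positives")
print(f"Share of positives who are infected: {share_infected:.3f}")

# Two positive tests in a row, ASSUMING the results are conditionally independent
p_two_positives = (sensitivity**2 * prevalence) / (
    sensitivity**2 * prevalence + false_positive_rate**2 * (1 - prevalence)
)
print(f"P(C | two positive tests) = {p_two_positives:.3f}")
```

Under these assumptions, roughly 950 true positives stand against 4,950 false positives, and a second positive test raises the probability of infection to about 78%.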

Naive Bayes Classifier in Python

Imports for the libraries used

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt


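# 1. Load the data ('daten.csv' is a placeholder file name)
df = pd.read_csv('daten.csv')

# 2. Data exploration
print(df.head())      # first rows
print(df.describe())  # summary statistics
print(df.shape)       # (number of rows, number of columns)

# 3. Data preparation ('Zielspalte' is the name of the target column)
y = df['Zielspalte']
X = df.drop(columns=['Zielspalte'])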

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (the scaler is fitted on the training data only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


# 4. Create and train the model
model = GaussianNB()
model.fit(X_train, y_train)

# 5. Test the model (predictions & accuracy)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()

# 6. Hyperparameter tuning with GridSearchCV
# Try 10 values of var_smoothing, log-spaced between 1e-9 and 1, with 5-fold cross-validation
param_grid = {'var_smoothing': np.logspace(-9, 0, 10)}
grid_search = GridSearchCV(GaussianNB(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best model found by the grid search
best_model = grid_search.best_estimator_
print(f'Best model: {best_model}')
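
To make clearer what a Gaussian naive Bayes classifier computes, here is a minimal from-scratch sketch that uses only the standard library. It is not scikit-learn's implementation. The function names `fit_gaussian_nb` and `predict` and the toy data are invented for illustration, and the `var_smoothing` handling only loosely mirrors the idea of adding a small fraction of the largest feature variance to every variance.

```python
import math
from collections import defaultdict


def mean_and_variance(values):
    """Return the mean and the (population) variance of a list of numbers."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the prior, feature means and feature variances for each class."""
    n_features = len(X[0])
    # Smoothing term: a small fraction of the largest overall feature variance
    epsilon = var_smoothing * max(
        mean_and_variance([row[j] for row in X])[1] for j in range(n_features)
    )
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        stats = [mean_and_variance([row[j] for row in rows]) for j in range(n_features)]
        means = [m for m, _ in stats]
        variances = [v + epsilon for _, v in stats]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior for a single sample x."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        # log P(C) + sum of log Gaussian likelihoods (naive independence assumption)
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: two features, two well-separated classes
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict(toy_model, [1.1, 2.0]))  # prints 0
print(predict(toy_model, [3.1, 4.0]))  # prints 1
```

Working with log-probabilities instead of multiplying raw probabilities avoids numerical underflow when there are many features.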
