Neural networks are highly flexible mathematical models whose structure is loosely inspired by biological nerve cells. In practice, such models consist of multiple layers of interconnected processing units. These units, often referred to as “neurons,” receive numerical inputs, multiply them by learnable parameters (weights), add offsets (biases), and then apply a nonlinear activation. In this way, a wide range of relationships in data can be modeled, from simple classification tasks to highly complex pattern recognition problems.
To understand the mathematical essence of such a network, it helps to first look at the process within a single layer. Consider an input vector \( \mathbf{x} \in \mathbb{R}^d \) and a weight matrix \( W \in \mathbb{R}^{m \times d} \).
By multiplying these two quantities, one obtains an output signal
\( \mathbf{z} = W \mathbf{x} \in \mathbb{R}^m \), which can be interpreted row by row: each entry of \( \mathbf{z} \) is a linear combination of the elements of \( \mathbf{x} \), weighted by the corresponding row of \( W \). Since neurons typically also include a bias, a bias vector \( \mathbf{b} \in \mathbb{R}^m \) is added, so that \( \mathbf{z} \leftarrow W \mathbf{x} + \mathbf{b} \).
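As a concrete illustration of the affine step \( W \mathbf{x} + \mathbf{b} \), here is a minimal NumPy sketch. The sizes (\( d = 3 \), \( m = 2 \)) and all numeric values are made up for the example.

```python
import numpy as np

# Hypothetical sizes for illustration: d = 3 inputs, m = 2 neurons.
x = np.array([1.0, 2.0, 3.0])             # input vector, shape (d,)
W = np.array([[0.5, -1.0, 0.0],
              [1.0,  0.0, 2.0]])          # weight matrix, shape (m, d)
b = np.array([0.1, -0.2])                 # bias vector, shape (m,)

# Each entry of z is one row of W dotted with x, plus that neuron's bias.
z = W @ x + b                             # approximately [-1.4, 6.8]
print(z)
```

The first entry is \( 0.5 \cdot 1 - 1 \cdot 2 + 0 \cdot 3 + 0.1 = -1.4 \), which is the row-by-row reading described above.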
Because purely linear combinations alone cannot represent complex nonlinearities such as decision boundaries in classification tasks, a so-called activation function is introduced. This function \( f \) is usually applied componentwise:
\( \mathbf{a} = f(\mathbf{z}) \). Common activations found in neural networks include the Sigmoid function
\( \sigma(z) = \frac{1}{1 + e^{-z}} \), the hyperbolic tangent \( \tanh(z) \), or the increasingly popular ReLU function \( \mathrm{ReLU}(z) = \max(0, z) \).
Whereas sigmoid and \( \tanh \) produce values between 0 and 1 and between -1 and 1, respectively, ReLU clips negative values to 0 and leaves positive values unchanged. These nonlinearities give networks their expressive power: without them, any stack of layers would collapse into a single affine transformation (one matrix multiplication plus one bias), no matter how many layers it has.
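The following sketch evaluates the three activations on a few sample values and then checks numerically that two layers without an activation are equivalent to one affine map. The layer sizes and the random seed are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
s = sigmoid(z)    # all values lie strictly between 0 and 1
t = np.tanh(z)    # all values lie strictly between -1 and 1
r = relu(z)       # negatives become 0, positives are unchanged

# Two layers with no activation in between (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
# The same map written as a single layer with merged weights and bias.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True
```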
By stacking several such layers, one obtains a multilayer or “deep” neural network. Each layer passes its output \( \mathbf{a}^{(l)} \) as input to the next layer, until finally a vector \( \hat{\mathbf{y}} = \mathbf{a}^{(L)} \)
emerges at the end as the final output (prediction). In classification tasks, \( \hat{\mathbf{y}} \)
usually represents the estimated probabilities for the various classes, while in regression tasks it holds one or more real-valued predictions.
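Stacking layers amounts to a loop in which each activation becomes the next layer's input. The sketch below assumes \( \tanh \) hidden layers and a softmax output layer to turn the final values into class probabilities. Softmax is not discussed in the text above and is introduced here only as one common choice, and the layer sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum avoids overflow and does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, layers):
    """Propagate x through a list of (W, b) pairs, one pair per layer."""
    a = x
    for W, b in layers[:-1]:
        a = np.tanh(W @ a + b)        # hidden layers: a^(l) = f(z^(l))
    W, b = layers[-1]
    return softmax(W @ a + b)         # output layer: class probabilities

# Hypothetical architecture: 4 inputs -> 5 hidden units -> 3 classes.
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
layers = [(rng.normal(size=(m, d)), np.zeros(m))
          for d, m in zip(sizes[:-1], sizes[1:])]

y_hat = forward(rng.normal(size=4), layers)
print(y_hat.shape, y_hat.sum())       # three probabilities summing to 1
```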
What is truly fascinating about neural networks is their ability to learn the weights \( W \) and biases \( \mathbf{b} \) automatically from data. This process can be framed as an optimization problem: One defines a loss function \( \mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) \), which measures the error between the predicted value \( \hat{\mathbf{y}} \) and the true value \( \mathbf{y} \).
Typical examples include the Mean Squared Error for regression or Cross-Entropy for classification.
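Both losses fit in a few lines of NumPy. This sketch assumes one-hot encoded labels for the cross-entropy and clips the predicted probabilities to avoid \( \log 0 \). The function names and sample values are illustrative.

```python
import numpy as np

def mean_squared_error(y_hat, y):
    # Average of the squared differences between prediction and target.
    return np.mean((y_hat - y) ** 2)

def cross_entropy(y_hat, y, eps=1e-12):
    # y is assumed one-hot; y_hat holds predicted class probabilities.
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

# Regression example: errors of -0.5 and 1.0 give (0.25 + 1.0) / 2 = 0.625.
mse = mean_squared_error(np.array([2.5, 0.0]), np.array([3.0, -1.0]))

# Classification example: the true class received probability 0.7,
# so the loss is -ln(0.7), roughly 0.357.
ce = cross_entropy(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0]))

print(mse, ce)
```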
To solve this optimization problem, a variant of gradient descent is often used. In this procedure, the network’s parameters are updated in the direction of the negative gradient of the loss function. The equation
\( W \leftarrow W - \eta \, \nabla_W \mathcal{L} \) illustrates the principle, where \( \eta \) is the learning rate. Modern methods like Adam or RMSProp build upon this idea by assigning an adaptively adjusted update step to each parameter component.
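To see the update rule in isolation, the sketch below applies plain gradient descent to a toy quadratic loss whose minimizer is known in advance. The loss, the `target` vector, and the step count are hypothetical stand-ins for a real network loss.

```python
import numpy as np

target = np.array([3.0, -1.0])        # hypothetical minimizer of the toy loss

def loss(w):
    return np.sum((w - target) ** 2)

def grad(w):
    # Gradient of the quadratic loss above with respect to w.
    return 2.0 * (w - target)

w = np.zeros(2)                       # initial parameters
eta = 0.1                             # learning rate
for step in range(100):
    w = w - eta * grad(w)             # step against the gradient

print(w, loss(w))                     # w ends up very close to target
```

With this learning rate, each step shrinks the distance to the minimizer by a factor of \( 1 - 2\eta = 0.8 \). A learning rate above 1.0 would make the same iteration diverge for this loss.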
The real key to neural networks is the efficient computation of these gradients. Because we deal with many interconnected layers, naively taking derivatives for all weights individually would be computationally prohibitive—especially for networks with millions of parameters. Instead, one employs backpropagation. In a forward pass, the network computes all intermediate values
\( \mathbf{z}^{(l)} \) and \( \mathbf{a}^{(l)} \) layer by layer. The subsequent backward pass then determines the derivative of the loss function with respect to each layer’s outputs and weights. From a mathematical perspective, backpropagation relies on the chain rule. If the output of layer \( l \) is defined as \( \mathbf{a}^{(l)} = f\bigl(\mathbf{z}^{(l)}\bigr) \) and
\( \mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \), then one can compute the gradient
\( \frac{\partial \mathcal{L}}{\partial W^{(l)}} \)
by successively backtracking through the network. The Hadamard product (elementwise multiplication) plays an important role here, as activations and their derivatives are processed per component. In this way, one can pinpoint how a small change in each individual weight or bias affects the overall error.
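Written out, the backward pass takes the following form. This is a sketch under three assumptions: the shorthand \( \boldsymbol{\delta}^{(l)} = \partial \mathcal{L} / \partial \mathbf{z}^{(l)} \) is introduced here for convenience, \( f \) is applied componentwise in every layer, and \( \mathbf{a}^{(0)} = \mathbf{x} \). The symbol \( \odot \) denotes the Hadamard product.

```latex
\begin{align}
\boldsymbol{\delta}^{(L)} &= \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot f'\bigl(\mathbf{z}^{(L)}\bigr) \\
\boldsymbol{\delta}^{(l)} &= \Bigl( \bigl(W^{(l+1)}\bigr)^{\top} \boldsymbol{\delta}^{(l+1)} \Bigr) \odot f'\bigl(\mathbf{z}^{(l)}\bigr), \qquad l = L-1, \dots, 1 \\
\frac{\partial \mathcal{L}}{\partial W^{(l)}} &= \boldsymbol{\delta}^{(l)} \bigl(\mathbf{a}^{(l-1)}\bigr)^{\top} \\
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} &= \boldsymbol{\delta}^{(l)}
\end{align}
```

Each \( \boldsymbol{\delta}^{(l)} \) reuses \( \boldsymbol{\delta}^{(l+1)} \), so one sweep from the output back to the input yields all gradients at roughly the cost of a second forward pass.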
Implementation of a Neural Network in Python without ML Frameworks
Imports
import numpy as np
import pandas as pd
Activation functions
# Activation functions
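def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expects x to already be a sigmoid output, i.e. x = sigmoid(z),
    # in which case sigmoid'(z) = x * (1 - x).
    return x * (1 - x)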
Implementation
# Neural Network class
class NeuralNetwork:
    def __init__(self, input_neurons, hidden_neurons, output_neurons, learning_rate=0.1):
        self.learning_rate = learning_rate
        # Initialize weights
        self.weights_input_hidden = np.random.uniform(-1, 1, (input_neurons, hidden_neurons))
        self.weights_hidden_output = np.random.uniform(-1, 1, (hidden_neurons, output_neurons))
        # Initialize bias
        self.bias_hidden = np.zeros((1, hidden_neurons))
        self.bias_output = np.zeros((1, output_neurons))

    def forward(self, inputs):
        # Samples are stored as rows, so each layer computes inputs @ W + b
        # (the transpose of the column-vector form W x + b used in the text).
        self.inputs = inputs
        self.hidden = sigmoid(np.dot(inputs, self.weights_input_hidden) + self.bias_hidden)
        self.output = sigmoid(np.dot(self.hidden, self.weights_hidden_output) + self.bias_output)
        return self.output

    def backward(self, expected):
        # Must be called after forward(), which caches the activations used here.
        error = expected - self.output
        # Output delta: error scaled by the sigmoid derivative
        d_output = error * sigmoid_derivative(self.output)
        # Propagate the delta back through the output weights
        error_hidden = d_output.dot(self.weights_hidden_output.T)
        d_hidden = error_hidden * sigmoid_derivative(self.hidden)
        # Gradient-descent step; "+=" because error is (expected - output),
        # which is the negative gradient of the squared error.
        self.weights_hidden_output += self.hidden.T.dot(d_output) * self.learning_rate
        self.weights_input_hidden += self.inputs.T.dot(d_hidden) * self.learning_rate
        self.bias_output += np.sum(d_output, axis=0, keepdims=True) * self.learning_rate
        self.bias_hidden += np.sum(d_hidden, axis=0, keepdims=True) * self.learning_rate
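A common sanity check for hand-written backpropagation is to compare the analytic gradients with finite differences. The sketch below repeats the formulas from `backward` in standalone form and assumes the loss is the summed squared error \( \tfrac{1}{2} \sum (y - \hat{y})^2 \), which is the loss those update rules correspond to. The helper `numerical_grad`, the random data, and the layer sizes are hypothetical and exist only for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(W1, b1, W2, b2, X, y):
    # Assumed loss: half the summed squared error of a two-layer sigmoid net.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return 0.5 * np.sum((y - output) ** 2)

# Small random problem (sizes chosen arbitrarily for the check).
rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 3))
y = rng.integers(0, 2, size=(6, 1)).astype(float)
W1, b1 = rng.uniform(-1, 1, (3, 4)), np.zeros((1, 4))
W2, b2 = rng.uniform(-1, 1, (4, 1)), np.zeros((1, 1))

# Analytic gradients, using the same expressions as the backward method.
hidden = sigmoid(X @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)
d_output = (y - output) * output * (1 - output)
d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
grad_W2 = -(hidden.T @ d_output)      # minus sign: d_output is the negative delta
grad_W1 = -(X.T @ d_hidden)

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences of f() with respect to each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old                  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_W1 = numerical_grad(lambda: loss(W1, b1, W2, b2, X, y), W1)
num_W2 = numerical_grad(lambda: loss(W1, b1, W2, b2, X, y), W2)

# Both differences should be tiny if the backward formulas are correct.
print(np.max(np.abs(num_W1 - grad_W1)), np.max(np.abs(num_W2 - grad_W2)))
```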
Loading, preparing and running the Algorithm
# 1. Load data
df = pd.read_csv('data.csv')
# 2. Prepare data
y = df['TargetColumn'].values.reshape(-1, 1) # Adjust target column
X = df.drop(columns=['TargetColumn']).values
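# Normalize each feature column by its maximum
# (assumes non-negative features and a nonzero maximum in every column)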
X = X / np.max(X, axis=0)
# 3. Initialize Neural Network
nn = NeuralNetwork(input_neurons=X.shape[1], hidden_neurons=5, output_neurons=1, learning_rate=0.1)
# 4. Train Neural Network
for i in range(1000):
    nn.forward(X)
    nn.backward(y)
# 5. Prediction
output = nn.forward(X)
print("Predictions:", output)