PyTorch vs NumPy SGD
Created on: January 10, 2025
```python
import torch
import torch.nn as nn

x_train_tensor = torch.tensor(x_train, dtype=torch.float32)
y_train_tensor = torch.tensor(((y_train + 1) // 2).reshape(-1, 1), dtype=torch.float32)

class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_dim, 1)
        # Initialize weights to zero
        nn.init.constant_(self.linear.weight, 0)
        nn.init.constant_(self.linear.bias, 0)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

input_dim = x_train.shape[1]
model = LogisticRegression(input_dim)
criterion = nn.BCELoss()
epochs = 10
learning_rate = 0.001

for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    correct_predictions = 0

    for i in range(x_train_tensor.size(0)):
        # Forward pass
        output = model(x_train_tensor[i].unsqueeze(0))
        loss = criterion(output, y_train_tensor[i].unsqueeze(0))

        # Calculate accuracy
        predicted = (output >= 0.5).float()
        correct_predictions += (predicted == y_train_tensor[i].unsqueeze(0)).float().item()

        # Backward pass
        model.zero_grad()
        loss.backward()

        # Manually update weights
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad

        epoch_loss += loss.item()

    accuracy = correct_predictions / x_train_tensor.size(0)
    if (epoch + 1) % 1 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.4f}')

def predict(model, x):
    model.eval()
    with torch.no_grad():
        outputs = model(x)
    return (outputs >= 0.5).float()

x_test_tensor = torch.tensor(x_test, dtype=torch.float32)
train_predictions = predict(model, x_train_tensor)
test_predictions = predict(model, x_test_tensor)

train_accuracy = (train_predictions == y_train_tensor).float().mean()
test_accuracy = (test_predictions == torch.tensor((y_test + 1) // 2).reshape(-1, 1)).float().mean()

print(f'Final Train Accuracy: {train_accuracy.item():.4f}')
print(f'Final Test Accuracy: {test_accuracy.item():.4f}')
```
What special tools does PyTorch implement internally that let the result reach roughly a 15% error rate, even though training is not batched and no optimizer object is used?

Below is my custom implementation, and it barely reaches a 30% error rate. What improvements can I make to get better results without batching and without using optimizers? Is there an issue in the implementation? Is it faulty?
```python
import numpy as np
import matplotlib.pyplot as plt

class SGDWithMomentum:
    """Stochastic Gradient Descent with Momentum."""

    def __init__(self, learning_rate, l2_penalty, momentum):
        self.learning_rate = learning_rate
        self.l2_penalty = l2_penalty
        self.momentum = momentum
        self.velocity = None  # Initialize velocity to None; will be set when first used

    def step(self, X, y, W):
        """Perform a single SGD with Momentum update."""
        if self.velocity is None:
            self.velocity = np.zeros_like(W)  # Initialize velocity with the same shape as W

        logits = X @ W
        sigmoid = 1 / (1 + np.exp(-logits))  # Sigmoid function

        # gradient = (sigmoid - y) * X + self.l2_penalty * np.r_[[0], W[1:]]  # Regularization excludes bias
        gradient = (sigmoid - y) * X + self.l2_penalty * W

        # Update velocity and weights
        self.velocity = self.momentum * self.velocity - (1 - self.momentum) * gradient
        W += self.learning_rate * self.velocity
        return W


def logreg_loss(X, y, W, l2_penalty):
    """Compute the logistic regression loss for a single sample with L2 regularization."""
    logits = X @ W
    sigmoid = 1 / (1 + np.exp(-logits))  # Sigmoid function
    loss = -y * np.log(np.clip(sigmoid, 1e-9, 1 - 1e-9)) - (1 - y) * np.log(np.clip(1 - sigmoid, 1e-9, 1 - 1e-9))  # Logistic loss for y in {0, 1}
    l2_loss = (l2_penalty / 2) * np.sum(W[1:] ** 2)  # Exclude bias from L2 regularization
    return loss + l2_loss


def logreg_error(X, y, W):
    """Compute the misclassification error for the dataset."""
    logits = 1 / (1 + np.exp(-X @ W))
    predictions = (logits >= 0.5).astype(int)  # Predict 0 or 1 based on logits
    return np.mean(predictions != y)


X = np.hstack([np.ones((x_train.shape[0], 1)), x_train])  # Shape: (32561, 124)
Y = ((y_train + 1) // 2).reshape(-1, 1)  # Shape: (32561, 1); convert {-1, 1} to {0, 1}
W = np.zeros(X.shape[1])  # Shape: (124,)

learning_rate = 0.001
l2_penalty = 0.0001
momentum = 0.0  # Typical value for momentum is 0.9; 0 gives plain SGD
epochs = 10

optimizer = SGDWithMomentum(learning_rate, l2_penalty, momentum)
losses = []
errors = []

for epoch in range(-1, epochs):  # epoch -1 records metrics before any update
    total_loss = 0

    # Shuffle data for SGD
    indices = np.arange(X.shape[0])
    # np.random.shuffle(indices)

    for i in indices:
        x_sample = X[i]  # Shape: (124,)
        y_sample = Y[i]  # Shape: (1,)

        # Update weights using optimizer
        if epoch > -1:
            W = optimizer.step(x_sample, y_sample, W)

        # Accumulate loss
        total_loss += logreg_loss(x_sample, y_sample, W, l2_penalty)

    # Compute average loss and misclassification error for the epoch
    avg_loss = total_loss / X.shape[0]
    avg_error = logreg_error(X, Y, W)
    losses.extend(avg_loss)
    errors.append(avg_error)
    print(f"Epoch {epoch + 1}: Loss = {losses[-1]:.4f}, Misclassified = {errors[-1]:.4f}")

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].plot(range(0, epochs + 1), losses, label="Loss", marker="o")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Loss")
ax[0].set_title("Loss Across Epochs")
ax[0].legend()
ax[1].plot(range(0, epochs + 1), errors, label="Misclassified Samples", marker="o", color="red")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Misclassified Samples")
ax[1].set_title("Misclassification Error Across Epochs")
ax[1].legend()
plt.tight_layout()
plt.show()
```
Below are a few points that should help explain why the PyTorch snippet can reach around a 15% error rate (even with single‐sample updates and manually applied parameter updates), whereas the custom NumPy-based SGD code might only reach around 30%. I’ll also give some improvements you can try in your own implementation to get closer to the PyTorch results—even without batching or an official “optimizer.”
Even when we manually update PyTorch parameters in a loop, PyTorch's core autograd machinery and its nn.BCELoss (or F.binary_cross_entropy) routines offer some numerical and implementation advantages:
Stable log-sigmoid computations
PyTorch's implementations of torch.sigmoid and of the log-based functions inside nn.BCELoss use numerically stable kernels. This helps avoid exploding or vanishing values when your logits become large in either direction. In NumPy, the naive 1 / (1 + np.exp(-z)) overflows when z is a large negative number, and when z is a large positive number the sigmoid saturates to exactly 1.0, so a subsequent log(1 - sigmoid) hits log(0). Even with the np.clip(..., 1e-9, 1 - 1e-9) in the NumPy code, you might still be hitting less stable regions than PyTorch's carefully implemented versions, which typically work with numerically stable log-based formulations internally.
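To make that concrete, here is a minimal NumPy sketch, not the code PyTorch actually uses, of computing the binary cross-entropy directly from the logit with np.logaddexp, so neither the exponential nor the log ever overflows:

```python
import numpy as np

def stable_bce_with_logits(z, y):
    """Binary cross-entropy from the raw logit z, for targets y in {0, 1}.

    Uses the identity -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))
                    = logaddexp(0, z) - y*z,
    which never forms sigmoid(z) explicitly and so cannot overflow or hit log(0).
    """
    return np.logaddexp(0.0, z) - y * z

# Well-behaved even for extreme logits where the naive formula breaks down:
print(stable_bce_with_logits(np.array([-1000.0, 0.0, 1000.0]),
                             np.array([1.0, 1.0, 0.0])))
```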
Consistent and correct gradient flow
By default, PyTorch's linear layer (nn.Linear) uses random initialization for its weights unless you explicitly zero them out. In your snippet, you did:

```python
nn.init.constant_(self.linear.weight, 0)
nn.init.constant_(self.linear.bias, 0)
```

which indeed zeros out the parameters. Although this can cause some degeneracy in certain tasks, it's also possible that for this particular dataset the zero initialization is not the end of the world. Still, PyTorch automatically tracks the correct gradients and applies them reliably (no mismatch with the objective and no indexing errors).
Accurate summation and in-place updates
PyTorch’s internal usage of GPU/CPU vectorization, fused ops, etc., can lead to slight differences in the effective floating‐point arithmetic compared to custom Python loops that do single-sample updates in NumPy. Sometimes these differences help with numerical stability.
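As a small, generic illustration of how accumulation order alone changes floating-point results (this is not PyTorch-specific), summing the same float32 values with a Python loop versus NumPy's vectorized reduction typically disagrees in the last digits:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1_000_000).astype(np.float32)

loop_sum = np.float32(0.0)
for x in v:              # sequential accumulation, one element at a time
    loop_sum += x

print(loop_sum)          # loop-order accumulation
print(v.sum())           # NumPy's vectorized (pairwise) summation
# The two results usually agree only to a few decimal places in float32.
```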
Loss layering
Under the hood, criterion = nn.BCELoss() computes the standard binary cross-entropy, loss = -[y * log(p) + (1 - y) * log(1 - p)], but it also includes stability tricks to handle edge cases (for instance, the log terms are clamped so that a probability of exactly 0 or 1 does not produce an infinite loss). Even if we do not see it in the code snippet, PyTorch's built-in functions handle a lot of those edge conditions for free.
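As a quick sanity check, here is a small sketch (using illustrative probabilities away from 0 and 1) showing that nn.BCELoss matches the hand-written formula:

```python
import torch
import torch.nn as nn

p = torch.tensor([0.1, 0.5, 0.9])   # predicted probabilities
y = torch.tensor([0.0, 1.0, 1.0])   # targets in {0, 1}

manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = nn.BCELoss()(p, y)        # default reduction='mean'

print(manual.item(), builtin.item())  # the two values match
```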
None of these alone is “magic.” But together, they can give PyTorch an edge in both stability and speed of convergence, even if you do parameter updates manually.
Below are key places where the custom code often deviates from “best practice” and might cause higher error:
Regularization mismatch
In your step() method the active line is:

```python
gradient = (sigmoid - y) * X + self.l2_penalty * W
```

This does penalize the bias term, because you multiply the entire weight vector by l2_penalty. The commented-out line (and your loss function) tell a different story:

```python
# gradient = (sigmoid - y) * X + self.l2_penalty * np.r_[[0], W[1:]]  # Regularization excludes bias
```

If you really want to exclude the bias from L2, you must similarly skip index 0 in the gradient step. Right now you are adding an extra penalty on the bias that the "theoretical" loss excludes. That mismatch can degrade your final accuracy because the bias may shift more slowly or incorrectly.
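For example, here is a sketch of a variant whose gradient matches the loss (the subclass name is just illustrative; it reuses the attributes of your SGDWithMomentum class and assumes the bias is the first entry of W):

```python
import numpy as np

class SGDWithMomentumNoBiasPenalty(SGDWithMomentum):
    """Variant of the optimizer whose L2 penalty skips the bias at index 0."""

    def step(self, X, y, W):
        if self.velocity is None:
            self.velocity = np.zeros_like(W)

        sigmoid = 1 / (1 + np.exp(-(X @ W)))

        gradient = (sigmoid - y) * X
        gradient[1:] += self.l2_penalty * W[1:]   # penalize weights only, never the bias

        self.velocity = self.momentum * self.velocity - (1 - self.momentum) * gradient
        W += self.learning_rate * self.velocity
        return W
```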
Initialization
You start from W = np.zeros(X.shape[1]). Zero initialization is usually less harmful for logistic regression than it is for deep nets, but it can still slow you down, especially if your dataset is tricky. You might experiment with a small random normal or uniform initialization.
Momentum at 0
You set momentum = 0.0, which basically disables momentum. If you want to see better gains, try a typical momentum value (e.g. 0.9). With single-sample SGD, momentum can help smooth out the updates.
Learning rate and number of epochs
A learning rate of 0.001 combined with only 10 single-sample epochs may simply not be enough; see the suggestions on the learning rate and epoch count below.
Data shuffling
You commented out np.random.shuffle(indices):

```python
indices = np.arange(X.shape[0])
# np.random.shuffle(indices)
```

In single-sample SGD, randomizing the order of samples each epoch is crucial for avoiding cyclical or stuck behavior. If you do not reshuffle, you traverse the samples in the same order every epoch, which can hamper convergence. Try un-commenting that shuffle line; single-sample SGD typically benefits from a fresh shuffle every epoch.
Edge cases in the log and numerical stability
In the loss you clip the probabilities with np.clip(sigmoid, 1e-9, 1 - 1e-9). That's good for preventing NaNs, but make sure everything is consistent: the gradient is computed from (sigmoid - y) * X, which does not contain that same clipping logic, so very large logits can still behave oddly. In PyTorch, the underlying code for BCELoss handles that log and its gradient more gracefully.

Even if you want to keep single-sample updates and avoid PyTorch's built-in optimizers (torch.optim.Adam, etc.), you can still do quite a bit to bring performance and accuracy closer to PyTorch's:
Exclude the bias from L2 in both the loss and the gradient
Either penalize everything consistently or explicitly skip the bias. This mismatch is a common silent bug. For example:

```python
gradient = (sigmoid - y) * X
# skip index 0 (the bias) when applying the L2 penalty
gradient[1:] += self.l2_penalty * W[1:]
```
Initialize with small random values instead of zeros

```python
W = 0.01 * np.random.randn(X.shape[1])
```
Shuffle your data each epoch

```python
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
for i in indices:
    ...
```
Increase the number of epochs
With single-sample updates (and especially with a small learning rate like 0.001), 10 epochs might not be enough. You can easily go to 50–100 epochs or more with logistic regression.
Try a momentum value around 0.9
This can help your updates escape small local traps and also reduce variance in single-sample updates:

```python
momentum = 0.9
```
Adjust learning rate
If your error does not improve after many epochs, try a bigger or smaller learning rate. Start with 0.001, then try 0.01, and so on.
Check for data preprocessing
Sometimes PyTorch’s example code normalizes or standardizes input features. If you are feeding raw data in one case and standardized data in another, that alone could explain the difference in final accuracy.
Verify your label conversion
Make sure your final “0/1” labels match exactly in both the PyTorch version and the custom version. Typos in the sign flipping or in the thresholding code can creep in.
Test multiple seeds
For a fair comparison, set the same random seed for everything—both PyTorch and NumPy. Even minor differences can lead to large divergences in single‐sample training.
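Pulling several of these suggestions together, here is a sketch of a revised training loop. It assumes x_train and y_train as in your snippets and reuses your SGDWithMomentum class (ideally with the bias-excluded gradient fix above); the standardization step, seed, and hyperparameters are illustrative choices, not the only reasonable ones:

```python
import numpy as np

np.random.seed(0)  # fix the seed so runs are comparable

# Optional: standardize the features (the bias column of ones is added afterwards)
mu, sigma = x_train.mean(axis=0), x_train.std(axis=0) + 1e-8
X = np.hstack([np.ones((x_train.shape[0], 1)), (x_train - mu) / sigma])
Y = ((y_train + 1) // 2).reshape(-1, 1)          # {-1, 1} -> {0, 1}

W = 0.01 * np.random.randn(X.shape[1])           # small random init instead of zeros
optimizer = SGDWithMomentum(learning_rate=0.01, l2_penalty=0.0001, momentum=0.9)
epochs = 50                                      # more epochs than before

for epoch in range(epochs):
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)                   # fresh shuffle every epoch
    for i in indices:
        W = optimizer.step(X[i], Y[i], W)

    # misclassification error on the training set, with shapes kept aligned
    preds = (1 / (1 + np.exp(-(X @ W))) >= 0.5).astype(int).reshape(-1, 1)
    print(f"Epoch {epoch + 1}: error = {np.mean(preds != Y):.4f}")
```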
In short, there's no single "secret sauce" in PyTorch that automatically grants ~15% error for logistic regression while your code hovers at ~30%. The main factors are numerically stable sigmoid/log computations, a regularization term applied consistently in the loss and the gradient, sensible initialization, per-epoch shuffling, and enough epochs at a reasonable learning rate.
By carefully aligning these practices in your custom code (particularly the gradient regularization, the initialization, the data shuffle, and enough epochs), you should see your error rate drop significantly—very likely closer to PyTorch’s results, even without batching or a more advanced optimizer.