PyTorch vs NumPy SGD
Created on: January 10, 2025
```python
import torch
import torch.nn as nn

x_train_tensor = torch.tensor(x_train, dtype=torch.float32)
y_train_tensor = torch.tensor(((y_train + 1) // 2).reshape(-1, 1), dtype=torch.float32)

class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_dim, 1)
        # Initialize weights to zero
        nn.init.constant_(self.linear.weight, 0)
        nn.init.constant_(self.linear.bias, 0)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

input_dim = x_train.shape[1]
model = LogisticRegression(input_dim)
criterion = nn.BCELoss()
epochs = 10
learning_rate = 0.001

for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    correct_predictions = 0

    for i in range(x_train_tensor.size(0)):
        # Forward pass
        output = model(x_train_tensor[i].unsqueeze(0))
        loss = criterion(output, y_train_tensor[i].unsqueeze(0))

        # Calculate accuracy
        predicted = (output >= 0.5).float()
        correct_predictions += (predicted == y_train_tensor[i].unsqueeze(0)).float().item()

        # Backward pass
        model.zero_grad()
        loss.backward()

        # Manually update weights
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad

        epoch_loss += loss.item()

    accuracy = correct_predictions / x_train_tensor.size(0)
    if (epoch + 1) % 1 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.4f}')

def predict(model, x):
    model.eval()
    with torch.no_grad():
        outputs = model(x)
    return (outputs >= 0.5).float()

x_test_tensor = torch.tensor(x_test, dtype=torch.float32)
train_predictions = predict(model, x_train_tensor)
test_predictions = predict(model, x_test_tensor)

train_accuracy = (train_predictions == y_train_tensor).float().mean()
test_accuracy = (test_predictions == torch.tensor((y_test + 1) // 2).reshape(-1, 1)).float().mean()

print(f'Final Train Accuracy: {train_accuracy.item():.4f}')
print(f'Final Test Accuracy: {test_accuracy.item():.4f}')
```
What special tools does PyTorch implement internally that let the result reach roughly a 15% error rate, even though training is not batched and no optimizer object is used?

Below is my custom implementation, and it barely reaches a 30% error rate. What improvements can I make to get better results without batching and without using optimizers? Is there an issue in the implementation? Is it faulty?
```python
import numpy as np
import matplotlib.pyplot as plt

class SGDWithMomentum:
    """Stochastic Gradient Descent with Momentum."""

    def __init__(self, learning_rate, l2_penalty, momentum):
        self.learning_rate = learning_rate
        self.l2_penalty = l2_penalty
        self.momentum = momentum
        self.velocity = None  # Initialize velocity to None; will be set when first used

    def step(self, X, y, W):
        """Perform a single SGD with Momentum update."""
        if self.velocity is None:
            self.velocity = np.zeros_like(W)  # Initialize velocity with the same shape as W

        logits = X @ W
        sigmoid = 1 / (1 + np.exp(-logits))  # Sigmoid function

        # gradient = (sigmoid - y) * X + self.l2_penalty * np.r_[[0], W[1:]]  # Regularization excludes bias
        gradient = (sigmoid - y) * X + self.l2_penalty * W

        # Update velocity and weights
        self.velocity = self.momentum * self.velocity - (1 - self.momentum) * gradient
        W += self.learning_rate * self.velocity
        return W


def logreg_loss(X, y, W, l2_penalty):
    """Compute the logistic regression loss for a single sample with L2 regularization."""
    logits = X @ W
    sigmoid = 1 / (1 + np.exp(-logits))  # Sigmoid function
    loss = -y * np.log(np.clip(sigmoid, 1e-9, 1 - 1e-9)) - (1 - y) * np.log(np.clip(1 - sigmoid, 1e-9, 1 - 1e-9))  # Logistic loss for y in {0, 1}
    l2_loss = (l2_penalty / 2) * np.sum(W[1:] ** 2)  # Exclude bias from L2 regularization
    return loss + l2_loss


def logreg_error(X, y, W):
    """Compute the misclassification error for the dataset."""
    logits = 1 / (1 + np.exp(-X @ W))
    predictions = (logits >= 0.5).astype(int)  # Predict 0 or 1 based on logits
    return np.mean(predictions != y)


X = np.hstack([np.ones((x_train.shape[0], 1)), x_train])  # Shape: (32561, 124)
Y = ((y_train + 1) // 2).reshape(-1, 1)  # Shape: (32561, 1); convert {-1, 1} to {0, 1}
W = np.zeros(X.shape[1])  # Shape: (124,)

learning_rate = 0.001
l2_penalty = 0.0001
momentum = 0.0  # Typical value for momentum is 0.9; 0 gives plain SGD
epochs = 10

optimizer = SGDWithMomentum(learning_rate, l2_penalty, momentum)
losses = []
errors = []

for epoch in range(-1, epochs):  # epoch -1 records metrics before any update
    total_loss = 0

    # Shuffle data for SGD
    indices = np.arange(X.shape[0])
    # np.random.shuffle(indices)

    for i in indices:
        x_sample = X[i]  # Shape: (124,)
        y_sample = Y[i]  # Shape: (1,)

        # Update weights using optimizer
        if epoch > -1:
            W = optimizer.step(x_sample, y_sample, W)

        # Accumulate loss
        total_loss += logreg_loss(x_sample, y_sample, W, l2_penalty)

    # Compute average loss and misclassification error for the epoch
    avg_loss = total_loss / X.shape[0]
    avg_error = logreg_error(X, Y, W)
    losses.extend(avg_loss)
    errors.append(avg_error)
    print(f"Epoch {epoch + 1}: Loss = {losses[-1]:.4f}, Misclassified = {errors[-1]:.4f}")

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].plot(range(0, epochs + 1), losses, label="Loss", marker="o")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Loss")
ax[0].set_title("Loss Across Epochs")
ax[0].legend()
ax[1].plot(range(0, epochs + 1), errors, label="Misclassified Samples", marker="o", color="red")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Misclassified Samples")
ax[1].set_title("Misclassification Error Across Epochs")
ax[1].legend()
plt.tight_layout()
plt.show()
```
Below are a few points that should help explain why the PyTorch snippet can reach around a 15% error rate (even with single‐sample updates and manually applied parameter updates), whereas the custom NumPy-based SGD code might only reach around 30%. I’ll also give some improvements you can try in your own implementation to get closer to the PyTorch results—even without batching or an official “optimizer.”
Even when we manually update PyTorch parameters in a loop, PyTorch's core autograd machinery and its nn.BCELoss (or F.binary_cross_entropy) routines offer some numerical and implementation advantages:
Stable log-sigmoid computations
PyTorch's implementations of torch.sigmoid and of the log-based functions inside nn.BCELoss use numerically stable kernels. This helps avoid exploding or vanishing values when your logits become large in either direction. In NumPy, the naive 1 / (1 + np.exp(-z)) overflows when z is a large negative number, and when z is a large positive number the sigmoid saturates to exactly 1.0, so a subsequent log(1 - sigmoid) hits log(0). Even with the np.clip(..., 1e-9, 1 - 1e-9) in the NumPy code, you might still be hitting less stable regions than PyTorch's carefully implemented versions, which typically work with numerically stable log-based formulations internally.
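To make that concrete, here is a minimal NumPy sketch, not the code PyTorch actually uses, of computing the binary cross-entropy directly from the logit with np.logaddexp, so neither the exponential nor the log ever overflows:

```python
import numpy as np

def stable_bce_with_logits(z, y):
    """Binary cross-entropy from the raw logit z, for targets y in {0, 1}.

    Uses the identity -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))
                    = logaddexp(0, z) - y*z,
    which never forms sigmoid(z) explicitly and so cannot overflow or hit log(0).
    """
    return np.logaddexp(0.0, z) - y * z

# Well-behaved even for extreme logits where the naive formula breaks down:
print(stable_bce_with_logits(np.array([-1000.0, 0.0, 1000.0]),
                             np.array([1.0, 1.0, 0.0])))
```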
Consistent and correct gradient flow
By default, PyTorch's linear layer (nn.Linear) uses random initialization for its weights unless you explicitly zero them out. In your snippet, you did:

```python
nn.init.constant_(self.linear.weight, 0)
nn.init.constant_(self.linear.bias, 0)
```

which indeed zeros out the parameters. Although this can cause some degeneracy in certain tasks, it's also possible that for this particular dataset the zero initialization is not the end of the world. Still, PyTorch automatically tracks the correct gradients and applies them reliably (no mismatch with the objective and no indexing errors).
Accurate summation and in-place updates
PyTorch’s internal usage of GPU/CPU vectorization, fused ops, etc., can lead to slight differences in the effective floating‐point arithmetic compared to custom Python loops that do single-sample updates in NumPy. Sometimes these differences help with numerical stability.
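As a small, generic illustration of how accumulation order alone changes floating-point results (this is not PyTorch-specific), summing the same float32 values with a Python loop versus NumPy's vectorized reduction typically disagrees in the last digits:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1_000_000).astype(np.float32)

loop_sum = np.float32(0.0)
for x in v:              # sequential accumulation, one element at a time
    loop_sum += x

print(loop_sum)          # loop-order accumulation
print(v.sum())           # NumPy's vectorized (pairwise) summation
# The two results usually agree only to a few decimal places in float32.
```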
Loss layering
Under the hood, criterion = nn.BCELoss() computes the standard binary cross-entropy, loss = -[y * log(p) + (1 - y) * log(1 - p)], but it also includes stability tricks to handle edge cases (for instance, the log terms are clamped so that a probability of exactly 0 or 1 does not produce an infinite loss). Even if we do not see it in the code snippet, PyTorch's built-in functions handle a lot of those edge conditions for free.
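As a quick sanity check, here is a small sketch (using illustrative probabilities away from 0 and 1) showing that nn.BCELoss matches the hand-written formula:

```python
import torch
import torch.nn as nn

p = torch.tensor([0.1, 0.5, 0.9])   # predicted probabilities
y = torch.tensor([0.0, 1.0, 1.0])   # targets in {0, 1}

manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = nn.BCELoss()(p, y)        # default reduction='mean'

print(manual.item(), builtin.item())  # the two values match
```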
None of these alone is “magic.” But together, they can give PyTorch an edge in both stability and speed of convergence, even if you do parameter updates manually.
Below are key places where the custom code often deviates from “best practice” and might cause higher error:
Regularization mismatch
In your step() method the active line is:

```python
gradient = (sigmoid - y) * X + self.l2_penalty * W
```

This does penalize the bias term, because you multiply the entire weight vector by l2_penalty. The commented-out line (and your loss function) tell a different story:

```python
# gradient = (sigmoid - y) * X + self.l2_penalty * np.r_[[0], W[1:]]  # Regularization excludes bias
```

If you really want to exclude the bias from L2, you must similarly skip index 0 in the gradient step. Right now you are adding an extra penalty on the bias that the "theoretical" loss excludes. That mismatch can degrade your final accuracy because the bias may shift more slowly or incorrectly.
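For example, here is a sketch of a variant whose gradient matches the loss (the subclass name is just illustrative; it reuses the attributes of your SGDWithMomentum class and assumes the bias is the first entry of W):

```python
import numpy as np

class SGDWithMomentumNoBiasPenalty(SGDWithMomentum):
    """Variant of the optimizer whose L2 penalty skips the bias at index 0."""

    def step(self, X, y, W):
        if self.velocity is None:
            self.velocity = np.zeros_like(W)

        sigmoid = 1 / (1 + np.exp(-(X @ W)))

        gradient = (sigmoid - y) * X
        gradient[1:] += self.l2_penalty * W[1:]   # penalize weights only, never the bias

        self.velocity = self.momentum * self.velocity - (1 - self.momentum) * gradient
        W += self.learning_rate * self.velocity
        return W
```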
Initialization
You start from W = np.zeros(X.shape[1]). Zero initialization is usually less harmful for logistic regression than it is for deep nets, but it can still slow you down, especially if your dataset is tricky. You might experiment with a small random normal or uniform initialization.
Momentum at 0
You set momentum = 0.0, which basically disables momentum. If you want to see better gains, try a typical momentum value (e.g. 0.9). With single-sample SGD, momentum can help smooth out the updates.
Learning rate and number of epochs
A learning rate of 0.001 combined with only 10 single-sample epochs may simply not be enough; see the suggestions on the learning rate and epoch count below.
Data shuffling
You commented out np.random.shuffle(indices):

```python
indices = np.arange(X.shape[0])
# np.random.shuffle(indices)
```

In single-sample SGD, randomizing the order of samples each epoch is crucial for avoiding cyclical or stuck behavior. If you do not reshuffle, you traverse the samples in the same order every epoch, which can hamper convergence. Try un-commenting that shuffle line; single-sample SGD typically benefits from a fresh shuffle every epoch.
Edge cases in the log and numerical stability
In the loss you clip the probabilities with np.clip(sigmoid, 1e-9, 1 - 1e-9). That's good for preventing NaNs, but make sure everything is consistent: the gradient is computed from (sigmoid - y) * X, which does not contain that same clipping logic, so very large logits can still behave oddly. In PyTorch, the underlying code for BCELoss handles that log and its gradient more gracefully.

Even if you want to keep single-sample updates and avoid PyTorch's built-in optimizers (torch.optim.Adam, etc.), you can still do quite a bit to bring performance and accuracy closer to PyTorch's:
Exclude the bias from L2 in both the loss and the gradient
Either penalize everything consistently or explicitly skip the bias. This mismatch is a common silent bug. For example:

```python
gradient = (sigmoid - y) * X
# skip index 0 (the bias) when applying the L2 penalty
gradient[1:] += self.l2_penalty * W[1:]
```
Initialize with small random values instead of zeros

```python
W = 0.01 * np.random.randn(X.shape[1])
```
Shuffle your data each epoch

```python
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
for i in indices:
    ...
```
Increase the number of epochs
With single-sample updates (and especially with a small learning rate like 0.001), 10 epochs might not be enough. You can easily go to 50–100 epochs or more with logistic regression.
Try a momentum value around 0.9
This can help your updates escape small local traps and also reduce variance in single-sample updates:

```python
momentum = 0.9
```
Adjust learning rate
If your error does not improve after many epochs, try a bigger or smaller learning rate. Start with 0.001, then try 0.01, and so on.
Check for data preprocessing
Sometimes PyTorch’s example code normalizes or standardizes input features. If you are feeding raw data in one case and standardized data in another, that alone could explain the difference in final accuracy.
Verify your label conversion
Make sure your final “0/1” labels match exactly in both the PyTorch version and the custom version. Typos in the sign flipping or in the thresholding code can creep in.
Test multiple seeds
For a fair comparison, set the same random seed for everything—both PyTorch and NumPy. Even minor differences can lead to large divergences in single‐sample training.
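Pulling several of these suggestions together, here is a sketch of a revised training loop. It assumes x_train and y_train as in your snippets and reuses your SGDWithMomentum class (ideally with the bias-excluded gradient fix above); the standardization step, seed, and hyperparameters are illustrative choices, not the only reasonable ones:

```python
import numpy as np

np.random.seed(0)  # fix the seed so runs are comparable

# Optional: standardize the features (the bias column of ones is added afterwards)
mu, sigma = x_train.mean(axis=0), x_train.std(axis=0) + 1e-8
X = np.hstack([np.ones((x_train.shape[0], 1)), (x_train - mu) / sigma])
Y = ((y_train + 1) // 2).reshape(-1, 1)          # {-1, 1} -> {0, 1}

W = 0.01 * np.random.randn(X.shape[1])           # small random init instead of zeros
optimizer = SGDWithMomentum(learning_rate=0.01, l2_penalty=0.0001, momentum=0.9)
epochs = 50                                      # more epochs than before

for epoch in range(epochs):
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)                   # fresh shuffle every epoch
    for i in indices:
        W = optimizer.step(X[i], Y[i], W)

    # misclassification error on the training set, with shapes kept aligned
    preds = (1 / (1 + np.exp(-(X @ W))) >= 0.5).astype(int).reshape(-1, 1)
    print(f"Epoch {epoch + 1}: error = {np.mean(preds != Y):.4f}")
```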
In short, there's no single "secret sauce" in PyTorch that automatically grants ~15% error for logistic regression while your code hovers at ~30%. The main factors are numerically stable sigmoid/log computations, a regularization term applied consistently in the loss and the gradient, sensible initialization, per-epoch shuffling, and enough epochs at a reasonable learning rate.
By carefully aligning these practices in your custom code (particularly the gradient regularization, the initialization, the data shuffle, and enough epochs), you should see your error rate drop significantly—very likely closer to PyTorch’s results, even without batching or a more advanced optimizer.