PyTorch vs NumPy SGD
Created on: January 10, 2025
The user is curious about the distinctions between logreg code in PyTorch and custom SGD code. They aim for PyTorch's tools to achieve 15% error and seek ways to enhance their custom code's performance.
Matching PyTorch could involve tweaking factors like data shuffling, learning rate adjustment, and more training epochs. It’s worth considering these tweaks to reduce the custom code's error rate.
Contemplating whether to include bias in the penalty and the impact of PyTorch's random weight initialization. This points to potential paths for tweaks to enhance error rate outcomes.
I'm adjusting the gradient formula to exclude the bias, ensuring consistent L2 penalty calculation. This change should bring the code in line with PyTorch's stable formulas.
I’m piecing together stable log-sum-exp for the logistic function, embracing shuffling, expanding epochs for mini-batch gradient descent, all while using techniques like np.clip for numerical stability.
Thought about logistic regression optimization for 11 seconds
Below are a few points that should help explain why the PyTorch snippet can reach around a 15% error rate (even with single‐sample updates and manually applied parameter updates), whereas the custom NumPy-based SGD code might only reach around 30%. I’ll also give some improvements you can try in your own implementation to get closer to the PyTorch results—even without batching or an official “optimizer.”
Even when we manually update PyTorch parameters in a loop, PyTorch’s core autograd and nn.BCELoss (or F.binary_cross_entropy) routines offer some numerical and implementation advantages:
Stable log-sigmoid computations
PyTorch’s implementation of torch.sigmoid and the internal log-based functions in nn.BCELoss use numerically stable kernels. This helps avoid exploding or vanishing values when your logits become large (positively or negatively). In NumPy, the naive 1 / (1 + np.exp(-z)) overflows inside np.exp when z is a large negative number, and for large |z| the sigmoid saturates to exactly 0 or 1, which turns the log in the loss into log(0).
Even with np.clip(..., 1e-9, 1 - 1e-9) in the NumPy code, you might still be hitting less stable regions than PyTorch’s carefully implemented versions (which typically rely on log-sum-exp style forms internally).
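For reference, a minimal sketch of a numerically stable sigmoid in NumPy (the helper name stable_sigmoid is mine, not from either snippet):
import numpy as np

def stable_sigmoid(z):
    # Compute 1 / (1 + exp(-z)) without overflowing exp() for large |z|.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # here -z[pos] <= 0, so exp() cannot overflow
    expz = np.exp(z[~pos])                     # here z < 0, so exp(z) <= 1
    out[~pos] = expz / (1.0 + expz)
    return out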
Consistent and correct gradient flow
By default, PyTorch’s linear layer (nn.Linear) uses random initialization for weights (unless you explicitly zero them out). In your snippet, you did:
nn.init.constant_(self.linear.weight, 0)
nn.init.constant_(self.linear.bias, 0)
which indeed zeros out the parameters. Although this can cause some degeneracy in certain tasks, it’s also possible that for this particular dataset the zero initialization is not the end of the world. Still, PyTorch automatically tracks correct gradients and updates them reliably (no mismatch with the objective or indexing errors).
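If you want to try a non-zero initialization on the PyTorch side instead, a minimal sketch (num_features is a placeholder for your input dimension):
import torch.nn as nn

num_features = 20                                   # placeholder for your input dimension
linear = nn.Linear(num_features, 1)                 # default init is already small and random
nn.init.normal_(linear.weight, mean=0.0, std=0.01)  # or pick an explicit small random init
nn.init.zeros_(linear.bias)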
Accurate summation and in-place updates
PyTorch’s internal usage of GPU/CPU vectorization, fused ops, etc., can lead to slight differences in the effective floating‐point arithmetic compared to custom Python loops that do single-sample updates in NumPy. Sometimes these differences help with numerical stability.
Loss layering
criterion = nn.BCELoss()
under the hood computes the binary cross-entropy
loss = -[y * log(p) + (1 - y) * log(1 - p)]
but also includes the stable tricks to handle edge cases (p very close to 0 or 1). Even if we do not see it in a code snippet, PyTorch’s built-in functions handle a lot of those edge conditions for free.
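For comparison, the standard numerically stable way to get this value works directly from the logit z rather than from a clipped probability (this is essentially the trick behind nn.BCEWithLogitsLoss). A minimal NumPy sketch, with z as the logit and y as the 0/1 label; the helper name is just illustrative:
import numpy as np

def bce_from_logit(z, y):
    # -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))] simplifies to log(1 + exp(z)) - y*z;
    # np.logaddexp(0, z) evaluates log(exp(0) + exp(z)) without overflow.
    return np.logaddexp(0.0, z) - y * z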
None of these alone is “magic.” But together, they can give PyTorch an edge in both stability and speed of convergence, even if you do parameter updates manually.
Below are key places where the custom code often deviates from “best practice” and might cause higher error:
Regularization mismatch
In your update you compute:
gradient = (sigmoid - y) * X + self.l2_penalty * W
This does penalize the bias term, because you multiply the entire vector by l2_penalty. The code snippet’s comment, however, says:
# gradient = (sigmoid - y) * X + self.l2_penalty * np.r_[[0], W[1:]]
If you really want to exclude the bias from L2, you must similarly skip index 0 in the gradient step. Right now, you’re adding an extra penalty on the bias that the “theoretical” formula excludes. That mismatch can degrade your final accuracy because the bias might be shifting more slowly or incorrectly (the corrected gradient appears in the suggestions further below).
Initialization
You start from W = np.zeros(X.shape[1]). Zero initialization for logistic regression is usually less harmful than it is for deep nets, but it can still slow you down, especially if your dataset is tricky. You might experiment with a small random normal or uniform initialization.
Momentum at 0
You set momentum = 0.0, which basically disables momentum. If you want to see better gains, try a typical momentum value (e.g. 0.9). With single-sample SGD, momentum can help smooth out the updates.
Learning rate and number of epochs
With single-sample updates, a small learning rate like 0.001, and only 10 epochs, the model may simply not have converged yet; concrete suggestions for both follow below.
Data Shuffling
Your code keeps the shuffle commented out:
indices = np.arange(X.shape[0])
# np.random.shuffle(indices)
In single-sample SGD, randomizing the order of samples each epoch is crucial to avoiding cyclical or stuck-in-local-minima behavior. If you do not reshuffle, you traverse the samples in the same order every epoch, which can hamper your convergence. Try un-commenting that shuffle line; single-sample SGD typically benefits from a fresh shuffle every epoch.
Edge cases in the log and numerical stability
You clip the sigmoid with np.clip(sigmoid, 1e-9, 1 - 1e-9). That’s good for preventing NaNs, but ensure that everything is consistent: you compute the gradient from (sigmoid - y) * X, which does not contain that same clipping logic, so with large logits the loss you measure (clipped) and the gradient you apply (unclipped) can become inconsistent at the extremes. In PyTorch, the underlying code for BCELoss handles that log and gradient more gracefully.
Even if you want to keep single-sample updates and avoid PyTorch’s built-in optimizers (torch.optim.Adam, etc.), you can still do quite a bit to bring performance and accuracy closer to PyTorch’s:
Exclude the bias from L2 in both the loss and gradient
Either penalize everything consistently or explicitly skip the bias. This mismatch is a common silent bug. For example:
gradient = (sigmoid - y) * X
# skip index 0 for bias penalty
gradient[1:] += self.l2_penalty * W[1:]
Initialize with small random values instead of zeros
W = 0.01 * np.random.randn(X.shape[1])
Shuffle your data each epoch
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
for i in indices:
    ...
Increase the number of epochs
With a single sample update (and especially with a small learning rate like 0.001), 10 epochs might not be enough. You can easily go to 50–100 epochs or more with logistic regression.
Try a momentum value around 0.9
This can help your updates escape small local traps and also reduce variance in single-sample updates:
momentum = 0.9
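For reference, a minimal sketch of how classical momentum typically enters the single-sample loop; the velocity buffer v and the loop variables here are illustrative, not taken from your code:
v = np.zeros_like(W)                    # velocity buffer, same shape as W
for epoch in range(num_epochs):
    np.random.shuffle(indices)          # fresh shuffle each epoch, as suggested above
    for i in indices:
        # ... compute `gradient` for sample i as before ...
        v = momentum * v - learning_rate * gradient
        W = W + v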
Adjust learning rate
If your error does not improve after many epochs, try a bigger or smaller learning rate. Start with 0.001, try 0.01, etc.
Check for data preprocessing
Sometimes PyTorch’s example code normalizes or standardizes input features. If you are feeding raw data in one case and standardized data in another, that alone could explain the difference in final accuracy.
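If in doubt, standardize the features in the NumPy pipeline yourself; a minimal sketch, where X_train and X_test are illustrative names for your raw feature arrays:
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8      # guard against zero-variance (constant) features
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma          # reuse the training-set statistics on the test set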
Verify your label conversion
Make sure your final “0/1” labels match exactly in both the PyTorch version and the custom version. Typos in the sign flipping or in the thresholding code can creep in.
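A quick sanity check in each pipeline can rule this out; y_np and y_torch are illustrative names for the two label arrays:
assert set(np.unique(y_np)).issubset({0, 1})
assert set(torch.unique(y_torch).tolist()).issubset({0, 1})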
Test multiple seeds
For a fair comparison, set the same random seed for everything—both PyTorch and NumPy. Even minor differences can lead to large divergences in single‐sample training.
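A minimal sketch of pinning the seeds in both frameworks (the seed value 0 is arbitrary):
import numpy as np
import torch

np.random.seed(0)
torch.manual_seed(0)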
In short, there’s no single “secret sauce” in PyTorch that automatically grants ~15% error for logistic regression while your code hovers at ~30%. The main factors are numerically stable loss and gradient computation, consistent L2 regularization (excluding the bias in both the loss and the gradient), per-epoch data shuffling, sensible initialization, and enough epochs at a reasonable learning rate.
By carefully aligning these practices in your custom code (particularly the gradient regularization, the initialization, the data shuffle, and enough epochs), you should see your error rate drop significantly—very likely closer to PyTorch’s results, even without batching or a more advanced optimizer.