Optimizer
optimizer.py (to be implemented)
This is where AdamW is defined.
You will need to update the step() function based on Decoupled Weight Decay Regularization (Loshchilov & Hutter) and Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014).
There are a few slight variations on AdamW; please note the following (a sketch of the resulting update follows the list):
- The reference uses the "efficient" method of computing the bias correction, mentioned at the end of Section 2 ("Algorithm") of Kingma & Ba (2014), in place of the intermediate m-hat and v-hat method.
- The learning rate is incorporated into the weight decay update.
- There is no learning rate schedule: we'll use the same alpha throughout.
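Below is a minimal sketch of what such a step() could look like, assuming a PyTorch-style torch.optim.Optimizer subclass; the class name and hyperparameter names (lr, betas, eps, weight_decay) are illustrative assumptions, not the required signature for the assignment.

```python
# Hypothetical AdamW sketch: efficient bias correction folded into the step
# size, decoupled weight decay scaled by the learning rate, no LR schedule.
import math
import torch


class AdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None if closure is None else closure()
        for group in self.param_groups:
            lr = group["lr"]
            beta1, beta2 = group["betas"]
            eps = group["eps"]
            wd = group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["t"] = 0
                    state["m"] = torch.zeros_like(p)  # first moment estimate
                    state["v"] = torch.zeros_like(p)  # second moment estimate
                state["t"] += 1
                t, m, v = state["t"], state["m"], state["v"]
                g = p.grad
                # Update biased moment estimates (Kingma & Ba, Algorithm 1).
                m.mul_(beta1).add_(g, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                # "Efficient" bias correction: fold the correction into the
                # step size instead of computing m_hat and v_hat explicitly.
                alpha_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
                p.addcdiv_(m, v.sqrt().add_(eps), value=-alpha_t)
                # Decoupled weight decay, scaled by the learning rate.
                p.mul_(1 - lr * wd)
        return loss
```

As a usage example, something like `opt = AdamW(model.parameters(), lr=1e-3)` followed by the usual `loss.backward(); opt.step(); opt.zero_grad()` loop should exercise it.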
You can check your optimizer implementation using optimizer_test.py.
See Section 6.3 of the project description.
Edited by Qumeng Sun