Optimizer
optimizer.py (to be implemented)
This is where AdamW is defined.
You will need to update the step() function based on Decoupled Weight Decay Regularization (Loshchilov & Hutter) and Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014).
There are a few slight variations on AdamW; please note the following (a sketch of the resulting update follows the list):
- The reference uses the "efficient" method of computing the bias correction, mentioned at the end of Section 2 ("Algorithm") of Kingma & Ba (2014), in place of the intermediate m-hat and v-hat method.
- The learning rate is incorporated into the weight decay update.
- There is no learning rate schedule: we'll use the same alpha throughout.
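Below is a minimal sketch of what such a step() could look like, assuming a PyTorch-style torch.optim.Optimizer subclass; the class name and hyperparameter names (lr, betas, eps, weight_decay) are illustrative assumptions, not the required signature for the assignment.

```python
# Hypothetical AdamW sketch: efficient bias correction folded into the step
# size, decoupled weight decay scaled by the learning rate, no LR schedule.
import math
import torch


class AdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None if closure is None else closure()
        for group in self.param_groups:
            lr = group["lr"]
            beta1, beta2 = group["betas"]
            eps = group["eps"]
            wd = group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["t"] = 0
                    state["m"] = torch.zeros_like(p)  # first moment estimate
                    state["v"] = torch.zeros_like(p)  # second moment estimate
                state["t"] += 1
                t, m, v = state["t"], state["m"], state["v"]
                g = p.grad
                # Update biased moment estimates (Kingma & Ba, Algorithm 1).
                m.mul_(beta1).add_(g, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                # "Efficient" bias correction: fold the correction into the
                # step size instead of computing m_hat and v_hat explicitly.
                alpha_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
                p.addcdiv_(m, v.sqrt().add_(eps), value=-alpha_t)
                # Decoupled weight decay, scaled by the learning rate.
                p.mul_(1 - lr * wd)
        return loss
```

As a usage example, something like `opt = AdamW(model.parameters(), lr=1e-3)` followed by the usual `loss.backward(); opt.step(); opt.zero_grad()` loop should exercise it.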
You can check your optimizer implementation using optimizer_test.py.
See Section 6.3 of the project description.
Edited by Qumeng Sun