
Commit eeae25f

Frightera authored and tensorflower-gardener committed
PR #18124: Some cleanup // Optimizers
Imported from GitHub PR #18124

This PR:
* Migrates docstrings into 4 indents
* Removes unused args in adam & adamw

Copybara import of the project:

-- f8c2982 by Kaan Bıçakcı <[email protected]>: Remove unused args
-- 0775e6b by Kaan Bıçakcı <[email protected]>: adamw docstring update
-- 7999975 by Kaan Bıçakcı <[email protected]>: Adadelta docstring update
-- a99442b by Kaan Bıçakcı <[email protected]>: adafactor docstring update
-- 070dd88 by Kaan Bıçakcı <[email protected]>: adagrad docstring update
-- 49e5689 by Kaan Bıçakcı <[email protected]>: Remove unused args // adam
-- 6b0a4da by Kaan Bıçakcı <[email protected]>: Adam docstring update
-- 53a87b9 by Kaan Bıçakcı <[email protected]>: FTRL docstring update
-- e47d096 by Kaan Bıçakcı <[email protected]>: Adamax docstring update
-- 3811ee2 by Kaan Bıçakcı <[email protected]>: Lion docstring update
-- 7452374 by Kaan Bıçakcı <[email protected]>: Nadam docstring update
-- 11d682e by Kaan Bıçakcı <[email protected]>: RMS docstring update
-- 69070b4 by Kaan Bıçakcı <[email protected]>: Fix linting
-- 7fe95d6 by Kaan Bıçakcı <[email protected]>: Update indent for unupdated params
-- 4855354 by Kaan Bıçakcı <[email protected]>: SGD docstring update
-- 3b3403c by Kaan Bıçakcı <[email protected]>: Fix adagrad indent

Merging this change closes #18124

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18124 from Frightera:cleanup 3b3403c
PiperOrigin-RevId: 534171670
1 parent 88cb20d commit eeae25f

File tree

11 files changed: +180 -177 lines


keras/optimizers/adadelta.py

Lines changed: 11 additions & 10 deletions
@@ -47,19 +47,20 @@ class Adadelta(optimizer.Optimizer):
  learning rate can be set, as in most other Keras optimizers.

  Args:
- learning_rate: Initial value for the learning rate: either a floating
- point value, or a `tf.keras.optimizers.schedules.LearningRateSchedule`
- instance. Defaults to 0.001. Note that `Adadelta` tends to benefit from
- higher initial learning rate values compared to other optimizers. To
- match the exact form in the original paper, use 1.0.
- rho: A `Tensor` or a floating point value. The decay rate. Defaults to
- 0.95.
- epsilon: Small floating point value used to maintain numerical stability.
- Defaults to 1e-7.
+ learning_rate: Initial value for the learning rate: either a floating
+ point value, or a
+ `tf.keras.optimizers.schedules.LearningRateSchedule` instance.
+ Defaults to 0.001. Note that `Adadelta` tends to benefit from
+ higher initial learning rate values compared to other optimizers. To
+ match the exact form in the original paper, use 1.0.
+ rho: A `Tensor` or a floating point value. The decay rate. Defaults to
+ 0.95.
+ epsilon: Small floating point value used to maintain numerical
+ stability. Defaults to 1e-7.
  {{base_optimizer_keyword_args}}

  Reference:
- - [Zeiler, 2012](http://arxiv.org/abs/1212.5701)
+ - [Zeiler, 2012](http://arxiv.org/abs/1212.5701)
  """

  def __init__(
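
As the updated docstring notes, Adadelta benefits from a higher initial learning rate than most optimizers. A minimal usage sketch of the documented arguments (assuming the `tf.keras.optimizers.Adadelta` constructor this file defines):

import tensorflow as tf

# Paper-matching configuration: learning_rate=1.0 as the docstring suggests;
# rho and epsilon stay at their documented defaults.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95, epsilon=1e-7)
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")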

keras/optimizers/adafactor.py

Lines changed: 18 additions & 18 deletions
@@ -42,26 +42,26 @@ class Adafactor(optimizer.Optimizer):
  last 2 dimensions separately in its accumulator variables.

  Args:
- learning_rate: Initial value for the learning rate:
- either a floating point value,
- or a `tf.keras.optimizers.schedules.LearningRateSchedule` instance.
- Defaults to 0.001.
- beta_2_decay: float, defaults to -0.8. The decay rate of `beta_2`.
- epsilon_1: float, defaults to 1e-30. A small offset to keep demoninator
- away from 0.
- epsilon_2: float, defaults to 1e-3. A small offset to avoid learning
- rate becoming too small by time.
- clip_threshold: float, defaults to 1.0. Clipping threshold. This is a part
- of Adafactor algorithm, independent from `clipnorm`, `clipvalue` and
- `global_clipnorm`.
- relative_step: bool, defaults to True. If `learning_rate` is a
- constant and `relative_step=True`, learning rate will be adjusted
- based on current iterations. This is a default learning rate decay
- in Adafactor.
+ learning_rate: Initial value for the learning rate:
+ either a floating point value,
+ or a `tf.keras.optimizers.schedules.LearningRateSchedule` instance.
+ Defaults to 0.001.
+ beta_2_decay: float, defaults to -0.8. The decay rate of `beta_2`.
+ epsilon_1: float, defaults to 1e-30. A small offset to keep denominator
+ away from 0.
+ epsilon_2: float, defaults to 1e-3. A small offset to avoid learning
+ rate becoming too small by time.
+ clip_threshold: float, defaults to 1.0. Clipping threshold. This is a
+ part of Adafactor algorithm, independent from `clipnorm`,
+ `clipvalue` and `global_clipnorm`.
+ relative_step: bool, defaults to True. If `learning_rate` is a
+ constant and `relative_step=True`, learning rate will be adjusted
+ based on current iterations. This is a default learning rate decay
+ in Adafactor.
  {{base_optimizer_keyword_args}}

  Reference:
- - [Shazeer, Noam et al., 2018](https://arxiv.org/abs/1804.04235).
+ - [Shazeer, Noam et al., 2018](https://arxiv.org/abs/1804.04235).

  """

@@ -110,7 +110,7 @@ def build(self, var_list):
  velocity_hat (only set when amsgrad is applied),

  Args:
- var_list: list of model variables to build Adam variables on.
+ var_list: list of model variables to build Adam variables on.
  """
  super().build(var_list)
  if hasattr(self, "_built") and self._built:
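
The `relative_step` flag documented above swaps a fixed learning rate for an iteration-dependent decay. A hedged sketch of both modes (assuming the `tf.keras.optimizers.Adafactor` constructor this file defines):

import tensorflow as tf

# Default behaviour: the learning rate is adjusted based on the current iteration.
auto_decay = tf.keras.optimizers.Adafactor(learning_rate=0.001, relative_step=True)

# Fixed learning rate: disable the built-in relative-step decay.
fixed_lr = tf.keras.optimizers.Adafactor(learning_rate=0.001, relative_step=False)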

keras/optimizers/adagrad.py

Lines changed: 14 additions & 14 deletions
@@ -40,22 +40,22 @@ class Adagrad(optimizer.Optimizer):
  the smaller the updates.

  Args:
- learning_rate: Initial value for the learning rate:
- either a floating point value,
- or a `tf.keras.optimizers.schedules.LearningRateSchedule` instance.
- Defaults to 0.001.
- Note that `Adagrad` tends to benefit from higher initial learning rate
- values compared to other optimizers.
- To match the exact form in the original paper, use 1.0.
- initial_accumulator_value: Floating point value.
- Starting value for the accumulators (per-parameter momentum values).
- Must be non-negative.
- epsilon: Small floating point value used to maintain numerical stability.
- {{base_optimizer_keyword_args}}
+ learning_rate: Initial value for the learning rate:
+ either a floating point value,
+ or a `tf.keras.optimizers.schedules.LearningRateSchedule` instance.
+ Defaults to 0.001. Note that `Adagrad` tends to benefit from higher
+ initial learning rate values compared to other optimizers. To match
+ the exact form in the original paper, use 1.0.
+ initial_accumulator_value: Floating point value.
+ Starting value for the accumulators (per-parameter momentum values).
+ Must be non-negative.
+ epsilon: Small floating point value used to maintain numerical
+ stability.
+ {{base_optimizer_keyword_args}}

  Reference:
- - [Duchi et al., 2011](
- http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).
+ - [Duchi et al., 2011](
+ http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).
  """

  def __init__(
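
A short sketch of the arguments described above (assuming the `tf.keras.optimizers.Adagrad` constructor this file defines; the paper-matching learning rate of 1.0 is the docstring's suggestion, not the default):

import tensorflow as tf

# initial_accumulator_value must be non-negative; epsilon guards the division.
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=1.0,
    initial_accumulator_value=0.1,
    epsilon=1e-7,
)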

keras/optimizers/adam.py

Lines changed: 24 additions & 24 deletions
@@ -44,29 +44,31 @@ class Adam(optimizer.Optimizer):
  data/parameters*".

  Args:
- learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
- `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
- that takes no arguments and returns the actual value to use. The
- learning rate. Defaults to `0.001`.
- beta_1: A float value or a constant float tensor, or a callable
- that takes no arguments and returns the actual value to use. The
- exponential decay rate for the 1st moment estimates. Defaults to `0.9`.
- beta_2: A float value or a constant float tensor, or a callable
- that takes no arguments and returns the actual value to use. The
- exponential decay rate for the 2nd moment estimates. Defaults to
- `0.999`.
- epsilon: A small constant for numerical stability. This epsilon is
- "epsilon hat" in the Kingma and Ba paper (in the formula just before
- Section 2.1), not the epsilon in Algorithm 1 of the paper. Defaults to
- `1e-7`.
- amsgrad: Boolean. Whether to apply AMSGrad variant of this algorithm from
- the paper "On the Convergence of Adam and beyond". Defaults to `False`.
- {{base_optimizer_keyword_args}}
+ learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
+ `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
+ that takes no arguments and returns the actual value to use. The
+ learning rate. Defaults to `0.001`.
+ beta_1: A float value or a constant float tensor, or a callable
+ that takes no arguments and returns the actual value to use. The
+ exponential decay rate for the 1st moment estimates.
+ Defaults to `0.9`.
+ beta_2: A float value or a constant float tensor, or a callable
+ that takes no arguments and returns the actual value to use. The
+ exponential decay rate for the 2nd moment estimates.
+ Defaults to `0.999`.
+ epsilon: A small constant for numerical stability. This epsilon is
+ "epsilon hat" in the Kingma and Ba paper (in the formula just before
+ Section 2.1), not the epsilon in Algorithm 1 of the paper.
+ Defaults to `1e-7`.
+ amsgrad: Boolean. Whether to apply AMSGrad variant of this algorithm
+ from the paper "On the Convergence of Adam and beyond".
+ Defaults to `False`.
+ {{base_optimizer_keyword_args}}

  Reference:
- - [Kingma et al., 2014](http://arxiv.org/abs/1412.6980)
- - [Reddi et al., 2018](
- https://openreview.net/pdf?id=ryQu7f-RZ) for `amsgrad`.
+ - [Kingma et al., 2014](http://arxiv.org/abs/1412.6980)
+ - [Reddi et al., 2018](
+ https://openreview.net/pdf?id=ryQu7f-RZ) for `amsgrad`.

  Notes:

@@ -130,7 +132,7 @@ def build(self, var_list):
  velocity_hat (only set when amsgrad is applied),

  Args:
- var_list: list of model variables to build Adam variables on.
+ var_list: list of model variables to build Adam variables on.
  """
  super().build(var_list)
  if hasattr(self, "_built") and self._built:

@@ -160,8 +162,6 @@ def build(self, var_list):

  def update_step(self, gradient, variable):
  """Update step given gradient and the associated model variable."""
- beta_1_power = None
- beta_2_power = None
  lr = tf.cast(self.learning_rate, variable.dtype)
  local_step = tf.cast(self.iterations + 1, variable.dtype)
  beta_1_power = tf.pow(tf.cast(self.beta_1, variable.dtype), local_step)
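
A minimal configuration sketch for the arguments above (assuming the `tf.keras.optimizers.Adam` signature shown in this diff):

import tensorflow as tf

# amsgrad=True enables the variant from "On the Convergence of Adam and beyond".
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    amsgrad=True,
)

# The learning rate may also be a LearningRateSchedule instead of a float.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=10_000, decay_rate=0.96
)
optimizer_with_schedule = tf.keras.optimizers.Adam(learning_rate=schedule)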

keras/optimizers/adamax.py

Lines changed: 12 additions & 12 deletions
@@ -57,19 +57,19 @@ class Adamax(optimizer.Optimizer):
  ```

  Args:
- learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
- `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
- that takes no arguments and returns the actual value to use. The
- learning rate. Defaults to `0.001`.
- beta_1: A float value or a constant float tensor. The exponential decay
- rate for the 1st moment estimates.
- beta_2: A float value or a constant float tensor. The exponential decay
- rate for the exponentially weighted infinity norm.
- epsilon: A small constant for numerical stability.
- {{base_optimizer_keyword_args}}
+ learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
+ `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
+ that takes no arguments and returns the actual value to use. The
+ learning rate. Defaults to `0.001`.
+ beta_1: A float value or a constant float tensor. The exponential decay
+ rate for the 1st moment estimates.
+ beta_2: A float value or a constant float tensor. The exponential decay
+ rate for the exponentially weighted infinity norm.
+ epsilon: A small constant for numerical stability.
+ {{base_optimizer_keyword_args}}

  Reference:
- - [Kingma et al., 2014](http://arxiv.org/abs/1412.6980)
+ - [Kingma et al., 2014](http://arxiv.org/abs/1412.6980)
  """

  def __init__(

@@ -113,7 +113,7 @@ def build(self, var_list):
  exponentially weighted infinity norm (denoted as u).

  Args:
- var_list: list of model variables to build Adamax variables on.
+ var_list: list of model variables to build Adamax variables on.
  """
  super().build(var_list)
  if hasattr(self, "_built") and self._built:
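
Per the docstring above, `learning_rate` may also be a zero-argument callable. A brief sketch (assuming the `tf.keras.optimizers.Adamax` constructor this file defines):

import tensorflow as tf

# A zero-argument callable is accepted anywhere a float learning rate is.
optimizer = tf.keras.optimizers.Adamax(
    learning_rate=lambda: 0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)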

keras/optimizers/adamw.py

Lines changed: 20 additions & 19 deletions
@@ -48,23 +48,26 @@ class AdamW(optimizer.Optimizer):
  data/parameters*".

  Args:
- learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
- `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
- that takes no arguments and returns the actual value to use. The
- learning rate. Defaults to 0.001.
- beta_1: A float value or a constant float tensor, or a callable
- that takes no arguments and returns the actual value to use. The
- exponential decay rate for the 1st moment estimates. Defaults to 0.9.
- beta_2: A float value or a constant float tensor, or a callable
- that takes no arguments and returns the actual value to use. The
- exponential decay rate for the 2nd moment estimates. Defaults to 0.999.
- epsilon: A small constant for numerical stability. This epsilon is
- "epsilon hat" in the Kingma and Ba paper (in the formula just before
- Section 2.1), not the epsilon in Algorithm 1 of the paper. Defaults to
- 1e-7.
- amsgrad: Boolean. Whether to apply AMSGrad variant of this algorithm from
- the paper "On the Convergence of Adam and beyond". Defaults to `False`.
- {{base_optimizer_keyword_args}}
+ learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
+ `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
+ that takes no arguments and returns the actual value to use. The
+ learning rate. Defaults to 0.001.
+ beta_1: A float value or a constant float tensor, or a callable
+ that takes no arguments and returns the actual value to use. The
+ exponential decay rate for the 1st moment estimates.
+ Defaults to 0.9.
+ beta_2: A float value or a constant float tensor, or a callable
+ that takes no arguments and returns the actual value to use. The
+ exponential decay rate for the 2nd moment estimates.
+ Defaults to 0.999.
+ epsilon: A small constant for numerical stability. This epsilon is
+ "epsilon hat" in the Kingma and Ba paper (in the formula just before
+ Section 2.1), not the epsilon in Algorithm 1 of the paper.
+ Defaults to 1e-7.
+ amsgrad: Boolean. Whether to apply AMSGrad variant of this algorithm
+ from the paper "On the Convergence of Adam and beyond".
+ Defaults to `False`.
+ {{base_optimizer_keyword_args}}

  Reference:
  - [Loshchilov et al., 2019](https://arxiv.org/abs/1711.05101)

@@ -163,8 +166,6 @@ def build(self, var_list):

  def update_step(self, gradient, variable):
  """Update step given gradient and the associated model variable."""
- beta_1_power = None
- beta_2_power = None
  lr = tf.cast(self.learning_rate, variable.dtype)
  local_step = tf.cast(self.iterations + 1, variable.dtype)
  beta_1_power = tf.pow(tf.cast(self.beta_1, variable.dtype), local_step)
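
A configuration sketch mirroring the documented defaults (assuming the `tf.keras.optimizers.AdamW` constructor this file defines; the decoupled `weight_decay` argument is not shown in this hunk and is included here as an assumption):

import tensorflow as tf

# weight_decay (applied outside the gradient-based update) is assumed to be
# accepted by this constructor; the remaining values mirror the docstring above.
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=0.001,
    weight_decay=0.004,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)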

keras/optimizers/ftrl.py

Lines changed: 21 additions & 20 deletions
@@ -74,26 +74,27 @@ class Ftrl(optimizer.Optimizer):
  is replaced with a gradient with shrinkage.

  Args:
- learning_rate: A `Tensor`, floating point value, a schedule that is a
- `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable that
- takes no arguments and returns the actual value to use. The learning
- rate. Defaults to `0.001`.
- learning_rate_power: A float value, must be less or equal to zero.
- Controls how the learning rate decreases during training. Use zero for a
- fixed learning rate.
- initial_accumulator_value: The starting value for accumulators. Only zero
- or positive values are allowed.
- l1_regularization_strength: A float value, must be greater than or equal
- to zero. Defaults to `0.0`.
- l2_regularization_strength: A float value, must be greater than or equal
- to zero. Defaults to `0.0`.
- l2_shrinkage_regularization_strength: A float value, must be greater than
- or equal to zero. This differs from L2 above in that the L2 above is a
- stabilization penalty, whereas this L2 shrinkage is a magnitude penalty.
- When input is sparse shrinkage will only happen on the active weights.
- beta: A float value, representing the beta value from the paper. Defaults
- to 0.0.
- {{base_optimizer_keyword_args}}
+ learning_rate: A `Tensor`, floating point value, a schedule that is a
+ `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
+ that takes no arguments and returns the actual value to use. The
+ learning rate. Defaults to `0.001`.
+ learning_rate_power: A float value, must be less or equal to zero.
+ Controls how the learning rate decreases during training. Use zero
+ for a fixed learning rate.
+ initial_accumulator_value: The starting value for accumulators. Only
+ zero or positive values are allowed.
+ l1_regularization_strength: A float value, must be greater than or equal
+ to zero. Defaults to `0.0`.
+ l2_regularization_strength: A float value, must be greater than or equal
+ to zero. Defaults to `0.0`.
+ l2_shrinkage_regularization_strength: A float value, must be greater
+ than or equal to zero. This differs from L2 above in that the L2
+ above is a stabilization penalty, whereas this L2 shrinkage is a
+ magnitude penalty. When input is sparse shrinkage will only happen
+ on the active weights.
+ beta: A float value, representing the beta value from the paper.
+ Defaults to 0.0.
+ {{base_optimizer_keyword_args}}
  """

  def __init__(
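
A sketch of the regularization arguments described above (assuming the `tf.keras.optimizers.Ftrl` constructor this file defines):

import tensorflow as tf

# l1/l2 act as stabilization penalties; l2_shrinkage is a separate magnitude
# penalty that, for sparse input, only touches the active weights.
optimizer = tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,
    initial_accumulator_value=0.1,
    l1_regularization_strength=0.01,
    l2_regularization_strength=0.01,
    l2_shrinkage_regularization_strength=0.0,
    beta=0.0,
)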

keras/optimizers/lion.py

Lines changed: 15 additions & 15 deletions
@@ -40,22 +40,22 @@ class Lion(optimizer.Optimizer):
  similar strength (lr * wd).

  Args:
- learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
- `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
- that takes no arguments and returns the actual value to use. The
- learning rate. Defaults to 0.0001.
- beta_1: A float value or a constant float tensor, or a callable
- that takes no arguments and returns the actual value to use. The rate
- to combine the current gradient and the 1st moment estimate.
- beta_2: A float value or a constant float tensor, or a callable
- that takes no arguments and returns the actual value to use. The
- exponential decay rate for the 1st moment estimate.
- {{base_optimizer_keyword_args}}
+ learning_rate: A `tf.Tensor`, floating point value, a schedule that is a
+ `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable
+ that takes no arguments and returns the actual value to use. The
+ learning rate. Defaults to 0.0001.
+ beta_1: A float value or a constant float tensor, or a callable
+ that takes no arguments and returns the actual value to use. The
+ rate to combine the current gradient and the 1st moment estimate.
+ beta_2: A float value or a constant float tensor, or a callable
+ that takes no arguments and returns the actual value to use. The
+ exponential decay rate for the 1st moment estimate.
+ {{base_optimizer_keyword_args}}

  References:
- - [Chen et al., 2023](http://arxiv.org/abs/2302.06675)
- - [Authors' implementation](
- http://github.com/google/automl/tree/master/lion)
+ - [Chen et al., 2023](http://arxiv.org/abs/2302.06675)
+ - [Authors' implementation](
+ http://github.com/google/automl/tree/master/lion)

  """

@@ -102,7 +102,7 @@ def build(self, var_list):
  Lion optimizer has one variable `momentums`.

  Args:
- var_list: list of model variables to build Lion variables on.
+ var_list: list of model variables to build Lion variables on.
  """
  super().build(var_list)
  if hasattr(self, "_built") and self._built:
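
Because Lion's effective regularization scales with lr * wd (see the context line above), a smaller learning rate is usually paired with a larger decay. A hedged sketch (the `weight_decay` keyword comes from the shared base-optimizer arguments and is an assumption here):

import tensorflow as tf

# Lion typically uses a smaller learning rate than Adam; weight_decay is assumed
# to be available via the shared base-optimizer keyword arguments.
optimizer = tf.keras.optimizers.Lion(
    learning_rate=0.0001,
    beta_1=0.9,
    beta_2=0.99,
    weight_decay=0.01,
)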

0 commit comments
