Welcome to ESTorch’s documentation!
estorch.rank_transformation(rewards)

Applies rank transformation to the returns.

Examples

>>> rewards = [-123, -50, 3, -5, 20, 10, 100]
>>> estorch.rank_transformation(rewards)
array([-0.5       , -0.33333333,  0.        , -0.16666667,  0.33333333,
        0.16666667,  0.5       ])
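The transformation maps rewards to centered, evenly spaced ranks in [-0.5, 0.5], which makes the gradient estimate insensitive to the scale of the rewards and robust to outliers. A minimal sketch of how such a transformation can be computed (it reproduces the array above, though it is not necessarily estorch’s exact implementation):

import numpy as np

def centered_rank_transformation(rewards):
    # argsort of argsort gives each element's rank in ascending order.
    ranks = np.argsort(np.argsort(rewards))
    # Rescale the ranks to be evenly spaced in [-0.5, 0.5].
    return ranks / (len(rewards) - 1) - 0.5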
class estorch.VirtualBatchNorm(num_features, eps=1e-05)

Applies Virtual Batch Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension) as described in the paper Improved Techniques for Training GANs: https://arxiv.org/abs/1606.03498

\[y = \frac{x - \mathrm{E}[x_\text{ref}]}{\sqrt{\mathrm{Var}[x_\text{ref}] + \epsilon}} * \gamma + \beta\]

VirtualBatchNorm requires two forward passes: the first calculates the mean and variance over a reference batch, and the second calculates the actual output.

Parameters:
- num_features – \(C\) from an expected input of size \((N, C, H, W)\)
- eps – a value added to the denominator for numerical stability. Default: 1e-5
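The two-pass protocol might look like the following sketch. The exact call pattern (reference statistics being taken from the first forward call after construction) is an assumption based on the description above:

import torch
from estorch import VirtualBatchNorm

vbn = VirtualBatchNorm(num_features=3)
reference_batch = torch.randn(16, 3, 32, 32)  # fixed batch, sampled once

# First pass: compute mean and variance over the reference batch.
vbn(reference_batch)
# Second pass: normalize the actual input using the reference statistics.
x = torch.randn(8, 3, 32, 32)
y = vbn(x)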
class estorch.ES(policy, agent, optimizer, population_size, sigma=0.01, device=device(type='cpu'), policy_kwargs={}, agent_kwargs={}, optimizer_kwargs={})

Classic Evolution Strategy algorithm. It optimizes the given policy for the maximum reward return. For example usage refer to https://github.com/goktug97/estorch/blob/master/examples/cartpole_es.py

\[\nabla_{\theta} \mathbb{E}_{\epsilon \sim N(0, I)} F(\theta+\sigma \epsilon)=\frac{1}{\sigma} \mathbb{E}_{\epsilon \sim N(0, I)}\{F(\theta+\sigma \epsilon) \epsilon\}\]

- Evolution Strategies as a Scalable Alternative to Reinforcement Learning: https://arxiv.org/abs/1703.03864
Parameters:
- policy – PyTorch Module. Should be passed as a class.
- agent – The policy will be optimized to maximize the output of this class’s rollout function. For an example agent class refer to https://github.com/goktug97/estorch/blob/master/examples/cartpole_es.py. Should be passed as a class.
- optimizer – Optimizer that will be used to update the parameters of the policy. Any PyTorch optimizer can be used. Should be passed as a class.
- population_size – Population size of the evolution strategy.
  Note: if you are using multiprocessing, make sure population_size is a multiple of n_proc.
- sigma – Standard deviation to use while sampling the generation from the policy.
- device – Torch device.
  Note: for every process a target network will be created to use during rollouts. That is why I don’t recommend using torch.device('cuda').
- policy_kwargs – This dictionary of arguments will be passed to the policy during initialization.
- agent_kwargs – This dictionary of arguments will be passed to the agent during initialization.
- optimizer_kwargs – This dictionary of arguments will be passed to the optimizer during initialization.
Variables:
- policy – Each step this policy is optimized. Only in the master process.
- optimizer – Optimizer that is used to optimize the policy. Only in the master process.
- agent – Used for rollouts in each process.
- n_parameters – Number of trainable parameters of the policy.
- best_reward – Best reward achieved during the training.
- episode_reward – Reward of the policy after the optimization step.
- best_policy_dict – PyTorch state_dict of the policy with the highest reward.
- population_returns – Current population’s rewards.
- population_parameters – Parameter vectors of the current population.
log()

The log function is called after every optimization step. This function can be used to interact with the model during the training. By default its contents are:

print(f'Step: {self.step}')
print(f'Episode Reward: {self.episode_reward}')
print(f'Max Population Reward: {np.max(self.population_returns)}')
print(f'Max Reward: {self.best_reward}')

For example usage see https://github.com/goktug97/estorch/blob/master/examples/early_stopping.py
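Since log is an ordinary method, it can be customized by subclassing. A minimal sketch that prints the documented attributes (anything beyond logging, such as early stopping, should follow the linked example):

import numpy as np
from estorch import ES

class LoggingES(ES):
    def log(self):
        # Called after every optimization step; the documented
        # attributes (step, episode_reward, population_returns, ...)
        # are available on self here.
        print(f'Step: {self.step}, '
              f'Episode Reward: {self.episode_reward}, '
              f'Max Population Reward: {np.max(self.population_returns)}')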
train(n_steps, n_proc=1, hwthread=False, hostfile=None)

Train the Evolution Strategy algorithm for n_steps in n_proc processes.

Note
This function can not be called more than once in the same script if n_proc is set to more than 1, because it executes the same script n_proc times, which means it will start from the beginning of the script every time.

Parameters:
- n_steps – Number of training steps.
- n_proc – Number of processes. Processes are used for rollouts.
- hwthread – A boolean value; if True, hardware threads are used as independent CPUs. Some processors are hyperthreaded, which means 1 CPU core is split into multiple threads. For example, on Linux the nproc command returns the number of cores; if that number doesn’t work here, set hwthread to True and try again.
- hostfile – If set, n_proc and hwthread will be ignored and the hostfile will be used to initialize multiprocessing. For more information visit https://github.com/open-mpi/ompi/blob/9c0a2bb2d675583934efd5e6e22ce8245dd5554c/README#L1904

Raises: RuntimeError – the train function can not be called more than once.
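Putting the pieces together, here is a minimal end-to-end sketch modeled on the linked cartpole_es.py example. The Agent interface shown (a rollout method that takes the target policy and returns the episode reward) and the n_input/n_output constructor arguments are assumptions; refer to the example for the exact interface:

import gym
import torch
import torch.nn as nn
from estorch import ES

class Policy(nn.Module):
    def __init__(self, n_input, n_output):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_input, 32), nn.ReLU(),
            nn.Linear(32, n_output))

    def forward(self, x):
        return self.net(x)

class Agent():
    def __init__(self):
        self.env = gym.make('CartPole-v1')

    def rollout(self, policy):
        # Assumed interface: run one episode with the given policy and
        # return the total reward.
        observation = self.env.reset()
        done, total_reward = False, 0.0
        while not done:
            with torch.no_grad():
                action = policy(
                    torch.from_numpy(observation).float()).argmax().item()
            observation, reward, done, _ = self.env.step(action)
            total_reward += reward
        return total_reward

# Classes (not instances) are passed; constructor arguments go through
# the *_kwargs dictionaries.
es = ES(policy=Policy, agent=Agent, optimizer=torch.optim.Adam,
        population_size=100, sigma=0.01,
        policy_kwargs={'n_input': 4, 'n_output': 2})
es.train(n_steps=100, n_proc=1)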
class estorch.NS_ES(policy, agent, optimizer, population_size, sigma=0.01, meta_population_size=3, k=10, device=device(type='cpu'), policy_kwargs={}, agent_kwargs={}, optimizer_kwargs={})

Novelty Search Evolution Strategy algorithm. It optimizes the given policy for the maximum novelty return. For example usage refer to https://github.com/goktug97/estorch/blob/master/examples/nsra_es.py This class inherits from ES, so every function described in ES can be used in this class too.

\[\nabla_{\theta_{t}} \mathbb{E}_{\epsilon \sim N(0, I)}\left[N\left(\theta_{t}+\sigma \epsilon, A\right) | A\right] \approx \frac{1}{n \sigma} \sum_{i=1}^{n} N\left(\theta_{t}^{i}, A\right) \epsilon_{i}\]

where \(N\left(\theta_{t}^{i}, A\right)\) is calculated as

\[N(\theta, A)=N\left(b\left(\pi_{\theta}\right), A\right)=\frac{1}{|S|} \sum_{j \in S}\left\|b\left(\pi_{\theta}\right)-b\left(\pi_{j}\right)\right\|_{2}\]

\[S=k N N\left(b\left(\pi_{\theta}\right), A\right)=\left\{b\left(\pi_{1}\right), b\left(\pi_{2}\right), \ldots, b\left(\pi_{k}\right)\right\}\]

- Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents: http://papers.nips.cc/paper/7750-improving-exploration-in-evolution-strategies-for-deep-reinforcement-learning-via-a-population-of-novelty-seeking-agents.pdf
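In words: the novelty of a policy is the mean distance between its behavior characterization \(b(\pi_{\theta})\) and the k nearest behavior characterizations stored in the archive \(A\). A minimal sketch of this formula (not estorch’s internal code), assuming the archive is a 2D array of behavior characterization vectors:

import numpy as np

def novelty(bc, archive, k=10):
    # Distance from b(pi_theta) to every entry of the archive A.
    distances = np.linalg.norm(archive - bc, axis=1)
    # Mean distance to the k nearest neighbours: N(theta, A).
    return np.sort(distances)[:k].mean()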
Parameters:
- policy – PyTorch Module. Should be passed as a class.
- agent – The policy will be optimized to maximize the output of this class’s rollout function. For an example agent class refer to https://github.com/goktug97/estorch/blob/master/examples/cartpole_es.py. Should be passed as a class.
- optimizer – Optimizer that will be used to update the parameters of the policy. Any PyTorch optimizer can be used. Should be passed as a class.
- population_size – Population size of the evolution strategy.
  Note: if you are using multiprocessing, make sure population_size is a multiple of n_proc.
- sigma – Standard deviation to use while sampling the generation from the policy.
- meta_population_size – Instead of one policy, a meta population of policies is optimized during training. Each step a policy is chosen from the meta population. The probability of each policy is calculated as
  \[P\left(\theta^{m}\right)=\frac{N\left(\theta^{m}, A\right)}{\sum_{j=1}^{M} N\left(\theta^{j}, A\right)}\]
- k – Number of nearest neighbours used in the calculation of the novelty.
- device – Torch device.
  Note: for every process a target network will be created to use during rollouts. That is why I don’t recommend using torch.device('cuda').
- policy_kwargs – This dictionary of arguments will be passed to the policy during initialization.
- agent_kwargs – This dictionary of arguments will be passed to the agent during initialization.
- optimizer_kwargs – This dictionary of arguments will be passed to the optimizer during initialization.
Variables:
- meta_population – List of (policy, optimizer) tuples.
- idx – Index of the selected (policy, optimizer) tuple in the current step.
- agent – Used for rollouts in each process.
- n_parameters – Number of trainable parameters.
- best_reward – Best reward achieved during the training.
- episode_reward – Reward of the chosen policy after the optimization step.
- best_policy_dict – PyTorch state_dict of the policy with the highest reward.
- population_returns – List of (novelty, reward) tuples of the current population.
- population_parameters – Parameter vectors of the current population, sampled from the chosen policy.
class estorch.NSR_ES(policy, agent, optimizer, population_size, sigma=0.01, meta_population_size=3, k=10, device=device(type='cpu'), policy_kwargs={}, agent_kwargs={}, optimizer_kwargs={})

Quality Diversity Evolution Strategy algorithm. It optimizes the given policy for the maximum average of the novelty and reward returns. For example usage refer to https://github.com/goktug97/estorch/blob/master/examples/nsra_es.py This class inherits from NS_ES, which inherits from ES, so every function described in ES can be used in this class too.

\[\theta_{t+1}^{m} \leftarrow \theta_{t}^{m}+\alpha \frac{1}{n \sigma} \sum_{i=1}^{n} \frac{f\left(\theta_{t}^{i, m}\right)+N\left(\theta_{t}^{i, m}, A\right)}{2} \epsilon_{i}\]

- Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents: http://papers.nips.cc/paper/7750-improving-exploration-in-evolution-strategies-for-deep-reinforcement-learning-via-a-population-of-novelty-seeking-agents.pdf
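The update direction thus averages the reward signal f and the novelty signal N per perturbation before weighting the noise vectors. A sketch of this estimate, assuming the rewards and novelties have already been rank-transformed (estorch’s actual implementation may differ):

import numpy as np

def nsr_es_update_direction(rewards, novelties, epsilons, sigma):
    # rewards, novelties: shape (n,); epsilons: shape (n, n_parameters).
    fitness = (rewards + novelties) / 2.0
    # (1 / (n * sigma)) * sum_i fitness_i * epsilon_i
    return fitness @ epsilons / (len(fitness) * sigma)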
Parameters:
- policy – PyTorch Module. Should be passed as a class.
- agent – The policy will be optimized to maximize the output of this class’s rollout function. For an example agent class refer to https://github.com/goktug97/estorch/blob/master/examples/cartpole_es.py. Should be passed as a class.
- optimizer – Optimizer that will be used to update the parameters of the policy. Any PyTorch optimizer can be used. Should be passed as a class.
- population_size – Population size of the evolution strategy.
  Note: if you are using multiprocessing, make sure population_size is a multiple of n_proc.
- sigma – Standard deviation to use while sampling the generation from the policy.
- meta_population_size – Instead of one policy, a meta population of policies is optimized during training. Each step a policy is chosen from the meta population. The probability of each policy is calculated as
  \[P\left(\theta^{m}\right)=\frac{N\left(\theta^{m}, A\right)}{\sum_{j=1}^{M} N\left(\theta^{j}, A\right)}\]
- k – Number of nearest neighbours used in the calculation of the novelty.
- device – Torch device.
  Note: for every process a target network will be created to use during rollouts. That is why I don’t recommend using torch.device('cuda').
- policy_kwargs – This dictionary of arguments will be passed to the policy during initialization.
- agent_kwargs – This dictionary of arguments will be passed to the agent during initialization.
- optimizer_kwargs – This dictionary of arguments will be passed to the optimizer during initialization.
Variables:
- meta_population – List of (policy, optimizer) tuples.
- idx – Index of the selected (policy, optimizer) tuple in the current step.
- agent – Used for rollouts in each process.
- n_parameters – Number of trainable parameters.
- best_reward – Best reward achieved during the training.
- episode_reward – Reward of the chosen policy after the optimization step.
- best_policy_dict – PyTorch state_dict of the policy with the highest reward.
- population_returns – List of (novelty, reward) tuples of the current population.
- population_parameters – Parameter vectors of the current population, sampled from the chosen policy.
class estorch.NSRA_ES(policy, agent, optimizer, population_size, sigma=0.01, meta_population_size=3, k=10, min_weight=0.0, weight_t=50, weight_delta=0.05, device=device(type='cpu'), policy_kwargs={}, agent_kwargs={}, optimizer_kwargs={})

Quality Diversity Evolution Strategy algorithm. It optimizes the given policy for the maximum weighted average of the novelty and reward returns. For example usage refer to https://github.com/goktug97/estorch/blob/master/examples/nsra_es.py This class inherits from NS_ES, which inherits from ES, so every function described in ES can be used in this class too.

\[\theta_{t+1}^{m} \leftarrow \theta_{t}^{m}+\alpha \frac{1}{n \sigma} \sum_{i=1}^{n}\left[w f\left(\theta_{t}^{i, m}\right) \epsilon_{i}+(1-w) N\left(\theta_{t}^{i, m}, A\right) \epsilon_{i}\right]\]

- Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents: http://papers.nips.cc/paper/7750-improving-exploration-in-evolution-strategies-for-deep-reinforcement-learning-via-a-population-of-novelty-seeking-agents.pdf
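Unlike NSR_ES, the reward/novelty trade-off weight w adapts during training: if the max reward has not improved for weight_t consecutive steps, w is lowered by weight_delta, shifting the update toward novelty, but never below min_weight. A minimal sketch of that schedule (the function and variable names are illustrative, not estorch’s internals):

def adapt_weight(w, improved, stall_steps,
                 weight_t=50, weight_delta=0.05, min_weight=0.0):
    # improved: whether the max reward improved this step.
    stall_steps = 0 if improved else stall_steps + 1
    if stall_steps >= weight_t:
        # Shift the update from reward toward novelty, bounded below.
        w = max(min_weight, w - weight_delta)
        stall_steps = 0
    return w, stall_steps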
Parameters:
- policy – PyTorch Module. Should be passed as a class.
- agent – The policy will be optimized to maximize the output of this class’s rollout function. For an example agent class refer to https://github.com/goktug97/estorch/blob/master/examples/cartpole_es.py. Should be passed as a class.
- optimizer – Optimizer that will be used to update the parameters of the policy. Any PyTorch optimizer can be used. Should be passed as a class.
- population_size – Population size of the evolution strategy.
  Note: if you are using multiprocessing, make sure population_size is a multiple of n_proc.
- sigma – Standard deviation to use while sampling the generation from the policy.
- meta_population_size – Instead of one policy, a meta population of policies is optimized during training. Each step a policy is chosen from the meta population. The probability of each policy is calculated as
  \[P\left(\theta^{m}\right)=\frac{N\left(\theta^{m}, A\right)}{\sum_{j=1}^{M} N\left(\theta^{j}, A\right)}\]
- k – Number of nearest neighbours used in the calculation of the novelty.
- min_weight, weight_t, weight_delta – If the max reward doesn’t improve for weight_t steps, the weight is lowered by weight_delta. It can’t get lower than min_weight.
- device – Torch device.
  Note: for every process a target network will be created to use during rollouts. That is why I don’t recommend using torch.device('cuda').
- policy_kwargs – This dictionary of arguments will be passed to the policy during initialization.
- agent_kwargs – This dictionary of arguments will be passed to the agent during initialization.
- optimizer_kwargs – This dictionary of arguments will be passed to the optimizer during initialization.
Variables:
- meta_population – List of (policy, optimizer) tuples.
- idx – Index of the selected (policy, optimizer) tuple in the current step.
- agent – Used for rollouts in each process.
- n_parameters – Number of trainable parameters.
- best_reward – Best reward achieved during the training.
- episode_reward – Reward of the chosen policy after the optimization step.
- best_policy_dict – PyTorch state_dict of the policy with the highest reward.
- population_returns – List of (novelty, reward) tuples of the current population.
- population_parameters – Parameter vectors of the current population, sampled from the chosen policy.