IQN by qgallouedec · Pull Request #139 · Stable-Baselines-Team/stable-baselines3-contrib

qgallouedec · 2023-01-26T15:46:39Z

Description

Context

I have raised an issue to propose this change (required)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)

Checklist:

Note: we are using a maximum length of 127 characters per line

qgallouedec · 2023-01-27T12:38:10Z

Results comparison

Current implementation	Reference
(6 seeds)	from https://github.com/toshikwa/fqf-iqn-qrdqn.pytorch (1 seed, same parameters)
	(2 seeds, same parameters)
	https://di-engine-docs.readthedocs.io/en/latest/12_policies/iqn.html#benchmark

emrul · 2023-02-25T12:53:23Z

@qgallouedec Thank you for adding this. I wanted to report that for me it works well and I was able to adapt it to implement the paper Self-Imitation Advantage Learning
. I'm not sure how useful it is for you but I'm happy to share my modifications to add SAIL-IQN to your IQN implementation (I don't have the resources right now to submit this as a separate PR):

New replay buffer to store discounted returns (G):

import warnings
import itertools
from typing import Generator, Optional, Union, NamedTuple, List, Dict, Any
import numpy as np
import torch as th
from stable_baselines3.common.type_aliases import ReplayBufferSamples, RolloutBufferSamples
from stable_baselines3.common.buffers import ReplayBuffer
from stable_baselines3.common.vec_env import VecNormalize
from gymnasium import spaces

PLACEHOLDER_RETURN_VALUE = np.finfo(np.float32).min

class SAILReplayBufferSamples(NamedTuple):
    observations: th.Tensor
    actions: th.Tensor
    next_observations: th.Tensor
    dones: th.Tensor
    rewards: th.Tensor
    returns: th.Tensor

class SAILReplayBuffer(ReplayBuffer):
    def __init__(
        self,
        buffer_size: int,
        observation_space: spaces.Space,
        action_space: spaces.Space,
        device: Union[th.device, str] = "cpu",
        n_envs: int = 1,
        optimize_memory_usage: bool = False,
        gamma: float = 0.99
    ):
        super().__init__(buffer_size, observation_space, action_space, device, n_envs, optimize_memory_usage)
        ## TODO: Haven't looked at supporting optimize_memory_usage true yet
        # assert optimize_memory_usage == False, 'optimize_memory_usage does not work with SAIL currently'
        self.gamma = gamma
        self.returns = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
        # For each env store where the episode starts (0 for all envs at the beginning)
        # but will vary as each episode can end at a different point
        self.episode_start_indices = np.zeros(self.n_envs, dtype=np.int32)

    def update_episodic_return(self, completed_env_indices: np.ndarray, episode_end_idx: int):
        # completed_env_indices - indices of envs with completed episode
        # For all episodes that have ended episode_end_pos will contain the end position though
        # it may be infrequent for multiple episodes to end at the same position
        for env_idx in completed_env_indices:
            # episode_start_idx can be > episode_end_idx due to buffer wrap-around
            episode_start_idx = self.episode_start_indices[env_idx]
            G = 0
            x = 0
            i = episode_end_idx # index used to calculate discounted return
            if episode_start_idx < episode_end_idx:
                max_episode_steps = episode_end_idx - episode_start_idx
            else:
                # This won't be accurate if we've wrapped around more than once but we should somewhere require
                # max_episode_steps to be less than buffer size to prevent that from happening.
                max_episode_steps = self.buffer_size - episode_end_idx + episode_start_idx
            while x <= max_episode_steps:
                G = self.rewards[i, env_idx] + self.gamma * G
                self.returns[i, env_idx] = G
                i = (i - 1) % self.buffer_size
                x += 1
                pass

        pass
    def add(
        self,
        obs: np.ndarray,
        next_obs: np.ndarray,
        action: np.ndarray,
        reward: np.ndarray,
        done: np.ndarray,
        infos: List[Dict[str, Any]],
    ) -> None:
        # we want position before it gets updated
        pos = self.pos
        super().add(obs=obs, next_obs=next_obs, action=action, reward=reward, done=done, infos=infos)
        self.returns[pos] = np.repeat(PLACEHOLDER_RETURN_VALUE, repeats=self.n_envs)
        if np.any(done):
            # Only use dones that are not due to timeouts
            true_dones = done * (1 - self.timeouts[pos])
            if np.any(true_dones):
                self.update_episodic_return(np.flatnonzero(true_dones), pos)
            # Update episode start indices (whether due to timeout or not) to the current start index
            # of the next episode (self.pos)
            np.put_along_axis(self.episode_start_indices, np.flatnonzero(done), self.pos, axis=0)
            pass
        return

    def sample(self, batch_size: int, env: Optional[VecNormalize] = None) -> SAILReplayBufferSamples:
        # noinspection PyTypeChecker
        return super().sample(batch_size=batch_size, env=env)

    def _get_samples(self, batch_inds: np.ndarray, env: Optional[VecNormalize] = None) -> SAILReplayBufferSamples:
        # Sample randomly the env idx
        env_indices = np.random.randint(0, high=self.n_envs, size=(len(batch_inds),))

        if self.optimize_memory_usage:
            next_obs = self._normalize_obs(self.observations[(batch_inds + 1) % self.buffer_size, env_indices, :], env)
        else:
            next_obs = self._normalize_obs(self.next_observations[batch_inds, env_indices, :], env)

        data = (
            self._normalize_obs(self.observations[batch_inds, env_indices, :], env),
            self.actions[batch_inds, env_indices, :],
            next_obs,
            # Only use dones that are not due to timeouts
            # deactivated by default (timeouts is initialized as an array of False)
            (self.dones[batch_inds, env_indices] * (1 - self.timeouts[batch_inds, env_indices])).reshape(-1, 1),
            self._normalize_reward(self.rewards[batch_inds, env_indices].reshape(-1, 1), env),
            self.returns[batch_inds, env_indices]
        )
        return SAILReplayBufferSamples(*tuple(map(self.to_torch, data)))

and updated training loop:

        # Sample replay buffer
        replay_data = self.replay_buffer.sample(batch_size, env=self._vec_normalize_env)
        with th.no_grad():
            # BEGIN - SAIL addition.
            rewards = replay_data.rewards
            # Ref to https://github.com/google-research/google-research/blob/master/sail_rl/agents/sail_iqn.py
            # Calculates **current state** action-values.
            # Shape: batch_size x n_quantiles x num_actions.
            replay_target_net_outputs = self.quantile_net_target(replay_data.observations, self.n_quantiles)

            # Shape: batch_size x num_actions
            replay_target_q_values = replay_target_net_outputs.mean(dim=1)

            replay_action_one_hot = th.nn.functional.one_hot(replay_data.actions.squeeze(-1), self.action_space.n).type(th.float32)
            replay_target_q = th.max(replay_target_q_values, dim=1).values
            replay_target_q_al = th.sum(replay_action_one_hot * replay_target_q_values, dim=1)
            comp_value = th.max(replay_target_q_al, replay_data.returns)

            if self.clip > 0.:
                sil_bonus = self.alpha * th.clamp(comp_value - replay_target_q, min=-self.clip, max=self.clip)
            else:
                sil_bonus = self.alpha * (comp_value - replay_target_q)

            rewards = rewards + sil_bonus.unsqueeze(-1)
            # END - SAIL addition

            # Compute the quantiles of next observation
            next_quantiles = self.quantile_net_target(replay_data.next_observations, self.n_quantiles)

            # Shape of next_quantiles:
            # batch_size x n_quantiles x num_actions.
            # e.g. if num_actions is 2, it might look something like this:
            # Vals for Quantile .2  Vals for Quantile .4  Vals for Quantile .6
            #    [[0.1, 0.5],         [0.15, -0.3],          [0.15, -0.2]]
            # Q-values = [(0.1 + 0.15 + 0.15)/3, (0.5 + 0.15 + -0.2)/3].

            # Compute the greedy actions which maximize the next Q values
            next_greedy_actions = next_quantiles.mean(dim=1, keepdim=True).argmax(dim=2, keepdim=True)

            # Make "num_tau_prime_samples" copies of actions, and reshape to (batch_size, num_tau_prime_samples, 1)
            next_greedy_actions = next_greedy_actions.expand(batch_size, self.num_tau_prime_samples, 1)

            # Compute the quantiles of next observation, but with another number of tau samples
            next_quantiles = self.quantile_net_target(replay_data.next_observations, self.num_tau_prime_samples)

            # Follow greedy policy: use the one with the highest Q values
            next_quantiles = next_quantiles.gather(dim=2, index=next_greedy_actions).squeeze(dim=2)

            # 1-step TD target
            target_quantiles = rewards + (1 - replay_data.dones) * self.gamma * next_quantiles

        # Get current quantile estimates
        current_quantiles = self.quantile_net(replay_data.observations, self.num_tau_samples)

        # Make "num_tau_samples" copies of actions, and reshape to (batch_size, num_tau_samples, 1).
        actions = replay_data.actions[..., None].long().expand(batch_size, self.num_tau_samples, 1)

        # Retrieve the quantiles for the actions from the replay buffer
        current_quantiles = th.gather(current_quantiles, dim=2, index=actions).squeeze(dim=2)

        # Compute Quantile Huber loss, summing over a quantile dimension as in the paper.
        loss = quantile_huber_loss(current_quantiles, target_quantiles, sum_over_quantiles=True)
        return loss

The extra parameters alpha and clip are defaulted to 0.9 and 1.0.

I found immediately that SAIL-IQN performs nicely on sparse rewards so am quite happy with my initial results but by no means has my testing been thorough.

qgallouedec · 2023-02-26T08:52:40Z

Thanks for your feedback @emrul! This PR is still draft because I can't replicate exactly the results of the paper for Qbert. I don't know if it's a hyperparameter problem or something else, I'm still looking.

I think SIL (and probably maybe SAIL) would fit in SB3-contrib. However, it would be best to discuss it in a dedicated issue. I'll open it right away.

emrul · 2023-02-26T21:45:17Z

Thanks @qgallouedec - I didn't know there's a reproduction issue, I will look into this also - I compared your implementation with the Dopamine one and the Medipexel/pytorch port of that and it looked quite different. I will dig in to see where they differ and feedback if I find anything to assist.

qgallouedec added 3 commits January 26, 2023 16:44

IQN implementation

ddef1de

Add example in documentation

1e16cc2

Add IQN tests

42f0e46

qgallouedec changed the title ~~Feat/iqn~~ IQN Jan 26, 2023

Fix docstring [ci skip]

f80ee50

qgallouedec added 2 commits January 27, 2023 14:12

Mypy compliance

a434ba9

add iqn to index.rst

da37cbc

qgallouedec mentioned this pull request Feb 26, 2023

SIL #158

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

IQN#139

IQN#139
qgallouedec wants to merge 6 commits intomasterfrom
feat/iqn

qgallouedec commented Jan 26, 2023 •

edited

Loading

Uh oh!

qgallouedec commented Jan 27, 2023 •

edited

Loading

Uh oh!

emrul commented Feb 25, 2023 •

edited

Loading

Uh oh!

qgallouedec commented Feb 26, 2023

Uh oh!

emrul commented Feb 26, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

qgallouedec commented Jan 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Context

Types of changes

Checklist:

Uh oh!

qgallouedec commented Jan 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Results comparison

Uh oh!

emrul commented Feb 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qgallouedec commented Feb 26, 2023

Uh oh!

emrul commented Feb 26, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qgallouedec commented Jan 26, 2023 •

edited

Loading

qgallouedec commented Jan 27, 2023 •

edited

Loading

emrul commented Feb 25, 2023 •

edited

Loading