
SB3 PPO, 16 vectorized envs, ~9_000_000 training timesteps. mean_reward=163 +/- 103. Training for an additional 50_000_000 timesteps resulted in a worse evaluation reward.
28a0b97
{"mean_reward": 162.9, "std_reward": 102.90038872618508, "is_deterministic": true, "n_eval_episodes": 10, "eval_datetime": "2023-01-13T08:25:01.744757"}
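
The rounded figure in the commit message can be reproduced from the evaluation record. The following is a minimal sketch, assuming the record is available as a JSON string; the helper name `summarize` is hypothetical and not part of SB3 or of this repository.

```python
import json

# Evaluation record copied verbatim from the results line above.
record = json.loads(
    '{"mean_reward": 162.9, "std_reward": 102.90038872618508, '
    '"is_deterministic": true, "n_eval_episodes": 10, '
    '"eval_datetime": "2023-01-13T08:25:01.744757"}'
)


def summarize(rec):
    """Hypothetical helper: format a record as 'mean_reward=<mean> +/- <std>'."""
    # Round both values to whole numbers, as the commit message does.
    return f"mean_reward={rec['mean_reward']:.0f} +/- {rec['std_reward']:.0f}"


summary = summarize(record)
print(summary)  # mean_reward=163 +/- 103
```

Note that the standard deviation is large relative to the mean and comes from only 10 episodes, so comparisons between checkpoints based on this number alone are noisy.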