This project develops a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm to solve two multi-agent environments: a custom Vehicle Scheduling environment and Simple Adversary, an OpenAI multi-agent particle environment.
Multi-Agent Environments
Vehicle Scheduling Environment
Two cars in a 4x4 grid-world environment
1st car – Goal: reach the top-right corner of the grid
2nd car – Goal: reach the top-left corner of the grid
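For concreteness, here is a minimal Python sketch of such a grid-world. The class name, start positions, and negative-Manhattan-distance reward shaping are illustrative assumptions, not the repository's actual implementation.

```python
class VehicleGridWorld:
    """Two cars on a 4x4 grid; car 0 targets the top-right corner,
    car 1 targets the top-left corner (assumed layout)."""

    GOALS = {0: (0, 3), 1: (0, 0)}  # (row, col), row 0 is the top row
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def reset(self):
        # Assumed start positions on the bottom row.
        self.pos = {0: (3, 0), 1: (3, 3)}
        return dict(self.pos)

    def step(self, actions):
        rewards, done = {}, True
        for car, a in actions.items():
            dr, dc = self.MOVES[a]
            # Clamp moves so cars stay inside the 4x4 grid.
            r = min(max(self.pos[car][0] + dr, 0), 3)
            c = min(max(self.pos[car][1] + dc, 0), 3)
            self.pos[car] = (r, c)
            # Assumed shaping: negative Manhattan distance to the goal.
            gr, gc = self.GOALS[car]
            dist = abs(r - gr) + abs(c - gc)
            rewards[car] = -dist
            done = done and dist == 0
        return dict(self.pos), rewards, done
```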
Simple Adversary Environment (OpenAI multi-agent particle environment)
For the good agents:
Positive reward – based on the distance between the closest agent and the target landmark
Negative reward – based on the distance between the adversary and the target landmark
For the adversary:
Positive reward – based on the distance between the adversary and the target landmark
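A hedged sketch of this reward structure, using Euclidean distances as in the OpenAI particle environments; the exact scaling in the original environment may differ. Positions are assumed to be NumPy arrays.

```python
import numpy as np

def agent_reward(agent_positions, adversary_pos, target_landmark):
    # Positive term: reward the good agents for the closest of them
    # being near the target (negated distance).
    closest = min(np.linalg.norm(p - target_landmark) for p in agent_positions)
    # Negative term: penalize the adversary being near the target.
    adv_dist = np.linalg.norm(adversary_pos - target_landmark)
    return -closest + adv_dist

def adversary_reward(adversary_pos, target_landmark):
    # The adversary is rewarded for approaching the target landmark.
    return -np.linalg.norm(adversary_pos - target_landmark)
```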
Implementation:
Implemented both Q-learning and MADDPG on the Vehicle Scheduling and Simple Adversary environments.
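For reference, a minimal sketch of the tabular Q-learning baseline; the hyperparameters (alpha, gamma, epsilon) are illustrative assumptions, and each agent would keep its own independent Q-table.

```python
import random
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.95):
    # Standard one-step target: r + gamma * max_a' Q(s', a').
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

Q = defaultdict(float)  # missing (state, action) pairs default to 0.0
```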
MADDPG:
Every agent has:
Actor Network:
Inputs: the agent's own observations (states)
Outputs: action probabilities (used as the deterministic policy's action)
Critic Network (centralized):
Inputs: the observations and actions of all agents
Outputs: Q-values
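A hedged PyTorch sketch of these per-agent networks; the layer sizes and the softmax output are illustrative assumptions. Note that in MADDPG the critic is centralized: it scores the joint observations and actions of all agents, while each actor sees only its own observation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        # Softmax yields action probabilities for a discrete action space.
        return torch.softmax(self.net(obs), dim=-1)

class Critic(nn.Module):
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        # Q(s_1..s_N, a_1..a_N): one scalar for the joint state-action.
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))
```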
To avoid chasing a moving target (i.e., the TD target changing with every update), frozen copies of the weights are kept in target networks:
Target Actor Network (updated with soft updates)
Target Critic Network (updated with soft updates)
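The soft (Polyak) update applied to both target networks can be sketched as follows; it operates on the PyTorch modules above, and tau is an illustrative assumption (values around 0.01 are common).

```python
def soft_update(target_net, source_net, tau=0.01):
    # theta_target <- tau * theta_source + (1 - tau) * theta_target
    for t_param, s_param in zip(target_net.parameters(),
                                source_net.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```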
Improved Version of MADDPG
I developed an improved version of MADDPG that applies an ε-greedy exploration step on top of the noise already added to actions chosen from the deterministic policy.
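A sketch of what this combined exploration scheme might look like; the Gaussian noise model, noise scale, and epsilon value are assumptions for illustration, not the repository's exact settings.

```python
import numpy as np
import torch

def select_action(actor, obs, epsilon=0.1, noise_scale=0.1):
    with torch.no_grad():
        probs = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    # Step 1: perturb the deterministic policy output with Gaussian noise,
    # then clip to non-negative values and renormalize.
    noisy = np.clip(probs + np.random.normal(0.0, noise_scale, probs.shape),
                    0.0, None)
    noisy = noisy / noisy.sum()
    # Step 2: epsilon-greedy on top of the noisy action: with probability
    # epsilon, pick a uniformly random action instead.
    if np.random.rand() < epsilon:
        return np.random.randint(len(noisy))
    return int(np.argmax(noisy))
```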
Observations:
Q-learning does not work well on the Vehicle Scheduling environment.
MADDPG performs better than Q-learning.
Care should be taken when implementing MADDPG, since the Critic network can over-estimate Q-values.
MADDPG works well in environments with continuous state-action spaces (e.g., Simple Adversary).
About
Developed a Multi-Agent DDPG to solve the Vehicle Scheduling problem.