This project implements two reinforcement learning algorithms to solve the MountainCarContinuous-v0 environment: Deep Deterministic Policy Gradient (DDPG) and REINFORCE (Monte Carlo Policy Gradient). The project provides a comprehensive comparison of these approaches in terms of learning efficiency, stability, and performance.
Note: Code for key components has been withheld to prevent misuse or plagiarism. Please contact me for collaboration or more details.
MountainCarContinuous-v0 is a classic control problem in reinforcement learning where an under-powered car must drive up a steep mountain. The car's engine is not strong enough to scale the mountain in a single pass, so the agent must learn to build momentum by driving back and forth.
- 2-dimensional continuous state: position (-1.2 to 0.6) and velocity (-0.07 to 0.07)
- 1-dimensional continuous action: force applied to the car (-1.0 to 1.0)
- A reward of +100 is given for reaching the goal (position ≥ 0.45)
- Small negative reward proportional to the squared action (energy penalty)
- Episode terminates when the goal is reached or after 999 steps
- The environment is considered solved when the agent achieves an average reward of 90 over 100 consecutive episodes
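For orientation, here is a minimal interaction loop with the environment using the Gym 0.21-style API listed in the requirements; the random policy is only a placeholder and will almost never reach the goal:

```python
import gym

# Gym 0.21-style API: reset() returns obs, step() returns (obs, reward, done, info)
env = gym.make("MountainCarContinuous-v0")

obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder: random force in [-1.0, 1.0]
    obs, reward, done, info = env.step(action)
    total_reward += reward

print("episode return:", total_reward)  # random actions rarely earn the +100 goal bonus
env.close()
```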
The project implements two distinct reinforcement learning approaches:
DDPG is an actor-critic, model-free algorithm designed for continuous action spaces. It combines ideas from DQN and policy gradients to handle the continuous action domain effectively.
Components:
- Actor Network: Deterministic policy that maps states to specific actions (see the sketch after this list)
  - Architecture: 2 → 64 → 64 → 1 with ReLU activations and a tanh output
- Critic Network: Evaluates state-action pairs
  - Architecture: (2+1) → 64 → 64 → 1 with ReLU activations
- Replay Buffer: Stores transitions (state, action, reward, next_state) for experience replay
  - Capacity: 10,000 transitions
- Target Networks: Separate, slowly updated copies of the actor and critic for stable learning
  - Soft update rate (τ): 0.01
- Exploration Strategy: Ornstein-Uhlenbeck process for temporally correlated noise
  - Parameters: θ = 0.15, σ = 0.3 (decayed over the course of training)
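Since the project's implementation is withheld, the following is only a minimal sketch of the actor/critic layouts and the soft target update described above; the class layout and the `soft_update` helper are illustrative assumptions, not the project's exact code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: state (2) -> action (1) bounded to [-1, 1]."""
    def __init__(self, state_dim=2, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # tanh bounds the force
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q-value estimator: (state, action) -> scalar value."""
    def __init__(self, state_dim=2, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def soft_update(target, source, tau=0.01):
    """Polyak-average the target network toward the online network (tau = soft update rate)."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```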
REINFORCE is a policy gradient method that learns a stochastic policy by directly maximizing expected cumulative rewards.
Components:
- Policy Network: Stochastic policy that outputs a normal distribution over actions (see the sketch after this list)
  - Architecture: 2 → 16 → 16 → 2 (mean and log_std) with ReLU activations
- Optimization: Gradient ascent on the expected return
  - Learning rate: 5e-4
- Variance Reduction: Baseline subtraction (normalizing episode returns)
- Exploration: Inherent in the stochastic policy (sampling from a normal distribution)
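A minimal sketch of such a Gaussian policy head follows; the actual `Reinforce` class in reinforce.py is withheld, so the module below only illustrates the 2 → 16 → 16 → 2 layout and the sampling step.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Maps a 2-D state to the mean and log-std of a Normal over the 1-D action."""
    def __init__(self, state_dim=2, hidden=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs [mean, log_std]
        )

    def forward(self, state):
        mean, log_std = self.body(state).unbind(dim=-1)
        return Normal(mean, log_std.exp())

policy = GaussianPolicy()
dist = policy(torch.tensor([[-0.5, 0.0]]))  # example state near the valley floor
action = dist.sample()                      # exploration comes from this sampling step
log_prob = dist.log_prob(action)            # needed for the policy-gradient update
# In practice the sampled action is clipped to [-1, 1] before being passed to env.step().
```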
| Feature | DDPG | REINFORCE |
|---|---|---|
| Policy Type | Deterministic | Stochastic |
| Action Selection | Direct action output + noise | Sample from distribution |
| Learning Method | TD learning (bootstrapping) | Monte Carlo (complete episodes) |
| Sample Efficiency | Higher (replay buffer) | Lower (on-policy) |
| Update Frequency | After each step (once the buffer has enough samples) | After each episode |
| Exploration | External noise process | Built into the policy |
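To make the "Learning Method" row concrete, the sketch below contrasts the two update rules in their textbook form. The project's own update code is withheld, so the discount factor (`gamma = 0.99`), the batch layout, and the function names are illustrative assumptions that reuse the network sketches above.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(actor, critic, actor_target, critic_target, batch, gamma=0.99):
    """TD-style (bootstrapped) losses from a replay-buffer mini-batch.

    batch = (state, action, reward, next_state, done), each a column-shaped tensor.
    """
    state, action, reward, next_state, done = batch
    with torch.no_grad():
        # Bootstrap the target Q-value from the target networks.
        target_q = reward + gamma * (1.0 - done) * critic_target(next_state, actor_target(next_state))
    critic_loss = F.mse_loss(critic(state, action), target_q)
    actor_loss = -critic(state, actor(state)).mean()  # ascend Q with respect to the actor
    return critic_loss, actor_loss

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo loss from one complete episode.

    log_probs: list of per-step log pi(a_t | s_t) tensors; rewards: list of floats.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):                     # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize as a baseline
    logp = torch.cat([lp.reshape(-1) for lp in log_probs])
    return -(logp * returns).sum()
```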
The project requires the following main dependencies:
- Python 3.9+
- PyTorch 1.9+
- OpenAI Gym 0.21.0
- NumPy
- Matplotlib
- Create a Conda Environment:

  ```bash
  conda env create -f gym_torch_env1.yml
  conda activate gym_env
  ```

- Install Additional Dependencies (if needed):

  ```bash
  pip install -r requirements.txt
  ```
To train the agents from scratch, use the following commands:
```bash
python train_MountainCarContinuous.py --agent ddpg
python train_MountainCarContinuous.py --agent reinforce
```
The training process will save model weights automatically when the environment is solved or when the training completes.
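As an illustration of that checkpointing step (the actual script is withheld), a solve check over the last 100 episode returns might look like the sketch below; the helper name and file path are placeholders, while the 100-episode window and 90-point threshold come from the environment description above.

```python
from collections import deque

import numpy as np
import torch

def save_if_solved(recent_returns: deque, model: torch.nn.Module,
                   path: str = "checkpoint.pth", threshold: float = 90.0) -> bool:
    """Save the model once the rolling 100-episode average return clears the threshold."""
    if len(recent_returns) == recent_returns.maxlen and np.mean(recent_returns) >= threshold:
        torch.save(model.state_dict(), path)
        return True
    return False

# Inside the training loop (hypothetical names):
#   recent_returns = deque(maxlen=100)
#   recent_returns.append(episode_return)
#   if save_if_solved(recent_returns, agent.actor):
#       break
```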
To visualize the performance of trained agents:
```bash
python test_MountainCarContinuous.py -a ddpg
python test_MountainCarContinuous.py -a reinforce
```
This will run the agent in the environment with rendering enabled, allowing you to observe the learned behavior.
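A minimal evaluation loop with rendering, assuming a trained DDPG actor like the `Actor` network listed under utils.py below, could look like this; the checkpoint filename is a placeholder, the constructor signature follows the earlier sketch rather than the withheld code, and the rendering call uses the Gym 0.21-style API.

```python
import gym
import torch

from utils import Actor  # the repository's deterministic policy network

env = gym.make("MountainCarContinuous-v0")
actor = Actor()                                      # constructor signature assumed
actor.load_state_dict(torch.load("ddpg_actor.pth"))  # placeholder checkpoint path
actor.eval()

obs, done, total_reward = env.reset(), False, 0.0
while not done:
    env.render()                                     # per-step rendering window
    with torch.no_grad():
        action = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    obs, reward, done, _ = env.step(action)
    total_reward += reward

print("evaluation return:", total_reward)
env.close()
```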
- ddpg.py: Implementation of the DDPG algorithm
  - ReplayBuffer: Memory for experience replay
  - OUNoise: Ornstein-Uhlenbeck noise generator for exploration
  - DDPGAgent: Main agent integrating the actor-critic architecture
- reinforce.py: Implementation of the REINFORCE algorithm
  - Reinforce: Policy network with action distribution sampling
- train_MountainCarContinuous.py: Training procedures for both algorithms
  - Parameter settings and environment configuration
  - Episode-by-episode training loops
  - Metrics tracking and visualization
- test_MountainCarContinuous.py: Evaluation scripts for trained agents
  - Visualizes agent behavior in the environment
  - Collects performance statistics
- utils.py: Shared utility functions and network architectures
  - Actor: Neural network for DDPG's deterministic policy
  - Critic: Neural network for DDPG's Q-value estimation
Both DDPG and REINFORCE successfully learned to solve the MountainCarContinuous-v0 environment, but with different learning characteristics and performance profiles.
Figure 1: Episode rewards and running average during DDPG training
Figure 2: Episode rewards and running average during REINFORCE training
| Metric | DDPG | REINFORCE |
|---|---|---|
| Episodes to solve | ~200-300 | ~500-600 |
| Final average score | ~95 | ~92 |
| Training stability | Medium (high variance) | Low (very high variance) |
| Sample efficiency | Higher | Lower |
| Exploration effectiveness | Good (directed exploration) | Variable (stochastic) |
| Computational efficiency | Medium (requires target networks) | High (single network) |
- Learning Speed: DDPG generally learns faster than REINFORCE due to its off-policy nature and experience replay, which allows more efficient use of collected experiences.
- Stability: REINFORCE shows higher variance in training, with more pronounced fluctuations in episode rewards. This is characteristic of Monte Carlo methods that rely on complete episode returns.
- Exploration Strategy: DDPG's directed exploration through the Ornstein-Uhlenbeck process enables more consistent momentum-building strategies. REINFORCE's stochastic policy allows for diverse exploration but can be less directed.
- Final Performance: Both algorithms achieve similar final performance once solved, with DDPG showing slightly more consistent behavior in the solved state.
This project demonstrates the application of two different reinforcement learning approaches to a continuous control problem. DDPG offers faster learning and greater sample efficiency at the cost of implementation complexity, while REINFORCE offers simplicity but requires more training episodes to converge.
The results highlight the trade-offs between on-policy and off-policy methods, as well as between stochastic and deterministic policies in continuous action spaces. For the MountainCarContinuous environment, both approaches are viable, with DDPG having a slight edge in terms of learning speed and final performance stability.
This project is licensed under the GNU General Public License v2.0 (GPL-2.0) - see the LICENSE file for details.
This means you:
- Are free to use, modify, and distribute the software
- Can make derivative works
- Must include the original copyright notice and license
- Must make source code available when distributing the software
- Must license derivative works under the same license
For more information about this license, visit GNU General Public License v2.0.