Lunar Lander QD Optimization
Explore a CVT Archive of diverse landing strategies discovered through Quality-Diversity optimization. Each point represents an elite solution with unique behavioral characteristics.
Drag to rotate, scroll to zoom, click cells to inspect
Archive Stats
Selected Elite
Click on a cell in the archive to view its details
Behavior Space
Sample Elite Landings
Different elites from the archive exhibit diverse landing behaviors, yet all of them land successfully.

Environment Seed: 321
Stable descent with minimal horizontal drift

Environment Seed: 87
Aggressive correction under different wind conditions
The Problem
Traditional reinforcement learning converges to a single policy that maximizes expected reward. But what if we want to discover many different ways to land successfully? Quality-Diversity (QD) optimization instead returns a diverse collection of high-performing solutions.
In the LunarLander environment, the agent must land safely on a platform. Different landing behaviors (fast vs slow impact, left vs right position) can all be successful. QD discovers the full spectrum of viable strategies.
Quality-Diversity Optimization
CVT Archive
A Centroidal Voronoi Tessellation (CVT) partitions the behavior space into cells. Each cell stores the best solution found so far for that behavioral niche.
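As a concrete illustration, here is a minimal sketch of how such an archive could be set up with the pyribs library. The library choice, the policy parameterization, and the measure ranges are assumptions for this sketch, not details taken from the project.

```python
import numpy as np
from ribs.archives import CVTArchive

# Hypothetical policy parameterization: a linear map from the 8-dim
# LunarLander observation to the 4 discrete action logits, flattened.
SOLUTION_DIM = 8 * 4

# 500 CVT cells partition the 3-D behavior space; the measure ranges
# below are assumed, not taken from the project.
archive = CVTArchive(
    solution_dim=SOLUTION_DIM,
    cells=500,
    ranges=[
        (-3.0, 0.0),  # vertical impact velocity
        (-1.0, 1.0),  # horizontal position at touchdown
        (-1.0, 1.0),  # horizontal velocity at touchdown
    ],
    seed=42,
)
```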
Evolution Strategies
Multiple emitters generate candidate solutions by mutating existing elites. Each candidate is mapped to a cell by its behavior descriptor and is kept only if it outperforms that cell's current elite, so the archive improves in both quality and diversity.
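Continuing the archive sketch above, the five emitters and their shared scheduler might look like this in pyribs; the starting point `x0`, the initial step size `sigma0`, and the default ranker are assumed values.

```python
from ribs.emitters import EvolutionStrategyEmitter
from ribs.schedulers import Scheduler

# Five CMA-ES-based emitters, each proposing 30 candidate solutions
# per iteration by perturbing solutions around elites in the archive.
emitters = [
    EvolutionStrategyEmitter(
        archive,                    # the CVTArchive from the sketch above
        x0=np.zeros(SOLUTION_DIM),  # assumed starting point
        sigma0=0.5,                 # assumed initial mutation scale
        batch_size=30,
        seed=seed,
    )
    for seed in range(5)
]

# The scheduler coordinates ask/tell calls across all emitters.
scheduler = Scheduler(archive, emitters)
```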
Behavior Descriptors
Three measures capture landing behavior: vertical impact velocity, horizontal position, and horizontal velocity at touchdown.
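One plausible way to extract these three measures from a LunarLander-v3 rollout is sketched below; the exact descriptor definitions used by this project (for instance, whether they are read at first leg contact or at episode end) are assumptions.

```python
import gymnasium as gym
import numpy as np

def evaluate(policy_weights, seed=321):
    """Roll out one episode and return (total_reward, measures).

    measures = (vertical impact velocity, horizontal position,
    horizontal velocity), read at the first leg contact. This
    definition is an assumption, not taken from the project.
    """
    env = gym.make("LunarLander-v3", enable_wind=True)
    obs, _ = env.reset(seed=seed)
    total_reward, measures, done = 0.0, None, False
    while not done:
        # Hypothetical linear policy: pick the action with the largest logit.
        action = int(np.argmax(obs @ policy_weights.reshape(8, 4)))
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        # obs = [x, y, vx, vy, angle, angular vel, left contact, right contact]
        if measures is None and (obs[6] or obs[7]):
            measures = (obs[3], obs[0], obs[2])
        done = terminated or truncated
    env.close()
    if measures is None:  # never touched down; fall back to the final state
        measures = (obs[3], obs[0], obs[2])
    return total_reward, measures
```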
QD Score
Success is measured by both coverage (the fraction of behavior-space cells that contain an elite) and quality (the QD score: the sum of the fitnesses of all elites in the archive).
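With a pyribs archive as sketched earlier, both numbers are available from the archive statistics; the attribute names below assume pyribs and the project may report them differently.

```python
stats = archive.stats
print(f"Coverage: {stats.coverage:.1%} of 500 cells filled")
print(f"QD score: {stats.qd_score:.1f} (sum of elite objectives)")
print(f"Best elite objective: {stats.obj_max:.1f}")
```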
Archive Insights
Exploring the archive reveals intuitive patterns about what makes a successful landing:
Soft Landings Score Higher
Elites with lower vertical impact velocity (closer to 0) consistently achieve higher fitness. This makes physical sense—gentler touchdowns avoid crash penalties and earn landing bonuses. The gradient from purple to yellow across the Y-velocity axis visualizes this directly.
Centered Landings Are Rewarded
Solutions landing near the center of the pad (X-position ≈ 0) tend to have higher fitness. The environment rewards precision—landing too far left or right wastes fuel on correction maneuvers and risks missing the target entirely.
Horizontal Stability Matters
Low horizontal velocity at impact correlates with higher rewards. A lander drifting sideways on touchdown is unstable and may tip over. The best elites approach the pad with minimal lateral movement.
Diversity Has Value
Despite the optimal behavior being a slow, centered landing, QD finds successful policies across the entire behavior space. These "suboptimal but viable" solutions are valuable—they demonstrate robustness and could be used when environmental conditions change.
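These observations can also be checked numerically. Below is a hedged sketch, assuming the pyribs archive from the earlier snippets and its `data()` export; since each insight says fitness drops as a measure moves away from zero, the absolute value of each measure should correlate negatively with fitness.

```python
import numpy as np

data = archive.data()  # dict of arrays ("objective", "measures", ...), pyribs assumed
fitness = data["objective"]
names = ["vertical impact velocity", "horizontal position", "horizontal velocity"]
for i, name in enumerate(names):
    r = np.corrcoef(np.abs(data["measures"][:, i]), fitness)[0, 1]
    print(f"corr(fitness, |{name}|): {r:+.2f}")
```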
Project Configuration
| Parameter | Value |
|---|---|
| Archive Cells | 500 |
| Epochs | 1000 |
| Emitters | 5 |
| Batch Size | 30 |
| Environment | LunarLander-v3 (wind enabled) |
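Tied together, this configuration corresponds to an outer loop along the following lines. It is a sketch built on the pyribs objects and the `evaluate` helper from the earlier snippets; the fixed evaluation seed and the logging cadence are assumptions.

```python
# 1000 epochs; each one asks the 5 emitters for 30 solutions apiece.
for itr in range(1000):
    solutions = scheduler.ask()          # shape: (5 * 30, SOLUTION_DIM)
    objectives, measures = [], []
    for sol in solutions:
        reward, meas = evaluate(sol, seed=321)  # evaluation seed is assumed
        objectives.append(reward)
        measures.append(meas)
    scheduler.tell(objectives, measures)  # insert candidates into the archive
    if (itr + 1) % 100 == 0:
        print(f"epoch {itr + 1}: {archive.stats.num_elites} elites, "
              f"QD score {archive.stats.qd_score:.1f}")
```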