TD Learning Minefield
Watch an agent learn to navigate a minefield using Temporal Difference (TD) learning. Q-values update in real time as the agent explores.
[Interactive demo: a 🤖 agent navigates a grid of 💣 mines toward a 🎯 goal, with cells shaded from green (high Q) to red (low Q) and adjustable playback speed. A stats panel tracks episodes, steps, successes, deaths, and the success rate over the last 100 episodes. Default hyperparameters: learning rate α = 0.30, discount γ = 0.95, exploration ε = 0.20.]
The Problem
An agent starts at the top-left corner and must navigate to the goal at the bottom-right, avoiding randomly placed mines. The agent has no prior knowledge of the environment and must learn through trial and error.
Each step costs -1, stepping on a mine gives -100, and reaching the goal gives +100. From these rewards alone, the agent learns which actions lead to high long-term return.
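As a rough sketch of how such an environment might be wired up (the grid size, cell types, and function names below are illustrative assumptions, not the demo's actual source):

```typescript
// Hypothetical minefield environment with the reward structure described above:
// -1 per step, -100 for hitting a mine, +100 for reaching the goal.
type Action = "up" | "down" | "left" | "right";
type Cell = "empty" | "mine" | "goal";

const GRID_SIZE = 10; // assumed grid dimensions

function step(
  grid: Cell[][],
  row: number,
  col: number,
  action: Action
): { row: number; col: number; reward: number; done: boolean } {
  const moves: Record<Action, [number, number]> = {
    up: [-1, 0],
    down: [1, 0],
    left: [0, -1],
    right: [0, 1],
  };
  const [dr, dc] = moves[action];
  // Clamp to the grid so the agent cannot step off the edge.
  const nr = Math.min(GRID_SIZE - 1, Math.max(0, row + dr));
  const nc = Math.min(GRID_SIZE - 1, Math.max(0, col + dc));

  if (grid[nr][nc] === "mine") return { row: nr, col: nc, reward: -100, done: true };
  if (grid[nr][nc] === "goal") return { row: nr, col: nc, reward: 100, done: true };
  return { row: nr, col: nc, reward: -1, done: false };
}
```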
Temporal Difference Learning
Q(s,a) ← Q(s,a) + α × [r + γ × max Q(s',a') - Q(s,a)]

α (Learning Rate)
How quickly to update Q-values. Higher means faster learning but more instability.
γ (Discount)
How much to value future rewards vs immediate. Higher means more forward-thinking.
ε (Exploration)
Probability of taking a random action instead of the best known action.
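Putting the update rule and ε-greedy exploration together, a minimal tabular sketch might look like the following. The flattened Q-table layout and helper names are assumptions about the demo's internals; the hyperparameter values are the defaults shown above.

```typescript
// Minimal tabular Q-learning sketch (layout assumed: one row per cell,
// one column per action).
const ALPHA = 0.3;    // learning rate α
const GAMMA = 0.95;   // discount γ
const EPSILON = 0.2;  // exploration ε
const NUM_ACTIONS = 4;       // up, down, left, right
const NUM_STATES = 10 * 10;  // assumed 10×10 grid, flattened to one index per cell

const Q: number[][] = Array.from({ length: NUM_STATES }, () =>
  new Array(NUM_ACTIONS).fill(0)
);

// ε-greedy: explore with probability ε, otherwise exploit the best known action.
function chooseAction(state: number): number {
  if (Math.random() < EPSILON) return Math.floor(Math.random() * NUM_ACTIONS);
  return Q[state].indexOf(Math.max(...Q[state]));
}

// TD update: Q(s,a) ← Q(s,a) + α × [r + γ × max Q(s',a') - Q(s,a)].
// On terminal transitions (mine or goal) there is no next state to bootstrap from.
function tdUpdate(s: number, a: number, r: number, sNext: number, done: boolean): void {
  const target = done ? r : r + GAMMA * Math.max(...Q[sNext]);
  Q[s][a] += ALPHA * (target - Q[s][a]);
}
```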
What to Observe
- Q-values (green shading) gradually spread from the goal as the agent learns
- Cells near mines develop negative Q-values (red shading)
- Policy arrows show the best action at each cell, forming a path to the goal (see the sketch after this list)
- Success rate improves as the agent learns to avoid mines
- The highlighted cell shows the most recently updated Q-value
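The policy arrows and shading can be derived directly from the Q-table. Here is one possible sketch; the arrow symbols, color scale, and Q-table layout are assumptions rather than the demo's actual rendering code:

```typescript
// Hypothetical helpers that turn a cell's Q-values into an arrow and a shade.
const ARROWS = ["↑", "↓", "←", "→"]; // same order as the Q-table's action columns

// The arrow points along the action with the highest Q-value in that cell.
function policyArrow(qRow: number[]): string {
  return ARROWS[qRow.indexOf(Math.max(...qRow))];
}

// Shade by the best achievable Q-value: green when positive, red when negative.
function cellShade(qRow: number[]): string {
  const v = Math.max(...qRow);
  const intensity = Math.min(1, Math.abs(v) / 100);
  return v >= 0 ? `rgba(0, 180, 0, ${intensity})` : `rgba(200, 0, 0, ${intensity})`;
}
```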