
TD Learning Minefield

Watch an agent learn to navigate a minefield using Temporal Difference learning. Q-values update in real time as the agent explores.

[Interactive visualization: the 🤖 agent starts in the top-left corner of a mined grid (💣), with the 🎯 goal in the bottom-right. Cells are shaded green for high Q-values and red for low, and the playback speed is adjustable.]

Training Stats

The stats panel tracks the episode count, steps taken, successes, deaths, and the success rate over the last 100 episodes, all updating live as the agent trains.

Hyperparameters

  • Learning Rate (α): 0.30
  • Discount (γ): 0.95
  • Exploration (ε): 0.20

The Problem

An agent starts at the top-left corner and must navigate to the goal at the bottom-right, avoiding randomly placed mines. The agent has no prior knowledge of the environment and must learn through trial and error.

Each step costs -1, hitting a mine gives -100, and reaching the goal gives +100. The agent learns which actions lead to high long-term reward.
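A minimal Python sketch may help make that reward structure concrete (the grid size, mine count, and function names here are illustrative assumptions, not the demo's actual source):

```python
import random

GRID_SIZE = 10             # assumed dimensions; the demo's grid may differ
N_MINES = 20               # assumed mine count
STEP_REWARD = -1           # cost of every move
MINE_REWARD = -100         # penalty for stepping on a mine
GOAL_REWARD = +100         # reward for reaching the goal
START = (0, 0)
GOAL = (GRID_SIZE - 1, GRID_SIZE - 1)

def make_mines(seed=None):
    """Scatter mines randomly, keeping the start and goal cells clear."""
    rng = random.Random(seed)
    candidates = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)
                  if (r, c) not in (START, GOAL)]
    return set(rng.sample(candidates, N_MINES))

def step(state, action, mines):
    """Apply an action (0=up, 1=right, 2=down, 3=left); return (next_state, reward, done)."""
    dr, dc = [(-1, 0), (0, 1), (1, 0), (0, -1)][action]
    r = min(max(state[0] + dr, 0), GRID_SIZE - 1)   # clamp moves to stay on the grid
    c = min(max(state[1] + dc, 0), GRID_SIZE - 1)
    if (r, c) in mines:
        return (r, c), MINE_REWARD, True            # assumed: episode ends on a mine
    if (r, c) == GOAL:
        return (r, c), GOAL_REWARD, True            # episode ends at the goal
    return (r, c), STEP_REWARD, False
```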

Temporal Difference Learning

After each transition (s, a, r, s'), the agent nudges its current estimate toward the observed reward plus the discounted value of the best next action:

Q(s,a) ← Q(s,a) + α × [r + γ × max_a' Q(s',a') - Q(s,a)]
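In code, that update is a single line applied after every transition. A sketch under the same assumptions as the environment above (the tabular dict-based Q representation is an assumption; the α and γ defaults mirror the slider values shown earlier):

```python
from collections import defaultdict

Q = defaultdict(lambda: [0.0] * 4)    # Q[state][action], initialized to zero

def td_update(state, action, reward, next_state, done, alpha=0.30, gamma=0.95):
    """One temporal-difference (Q-learning) update for a single observed transition."""
    best_next = 0.0 if done else max(Q[next_state])   # terminal states bootstrap to zero
    td_target = reward + gamma * best_next            # r + γ · max_a' Q(s', a')
    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
```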

α (Learning Rate)

How quickly to update Q-values. Higher means faster learning but more instability.

γ (Discount)

How much to value future rewards versus immediate ones. Higher means more forward-thinking.

ε (Exploration)

Probability of taking a random action instead of the best known action.
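During training, action selection is typically ε-greedy: with probability ε take a random action, otherwise take the highest-valued one. A minimal sketch, assuming the Q table from the update above (the ε default mirrors the slider value):

```python
import random

def choose_action(state, epsilon=0.20):
    """ε-greedy selection over the four actions."""
    if random.random() < epsilon:
        return random.randrange(4)                     # explore
    values = Q[state]
    return max(range(4), key=lambda a: values[a])      # exploit the best known action
```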

What to Observe

  • Q-values (green shading) gradually spread from the goal as the agent learns
  • Cells near mines develop negative Q-values (red shading)
  • Policy arrows show the best action at each cell, forming a path to the goal (see the extraction sketch after this list)
  • Success rate improves as the agent learns to avoid mines
  • The highlighted cell shows the most recently updated Q-value
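Those policy arrows are simply the greedy policy read off the learned table. A sketch of that extraction, assuming the Q table and grid constants from the earlier snippets (the arrow glyphs and helper name are illustrative):

```python
ARROWS = "↑→↓←"    # one glyph per action index (up, right, down, left)

def greedy_policy_arrows(mines):
    """Map each ordinary cell to the arrow of its highest-valued action."""
    arrows = {}
    for r in range(GRID_SIZE):
        for c in range(GRID_SIZE):
            if (r, c) in mines or (r, c) == GOAL:
                continue                               # no arrow on mines or the goal
            values = Q[(r, c)]
            arrows[(r, c)] = ARROWS[max(range(4), key=lambda a: values[a])]
    return arrows
```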