TD Learning Minefield
Watch an agent learn to navigate a minefield using Temporal Difference (TD) learning. Q-values update in real time as the agent explores.
[Interactive demo: a 🤖 agent navigates a grid of 💣 mines toward a 🎯 goal, with cells shaded from green (high Q) to red (low Q) and adjustable playback speed. A stats panel tracks episodes, steps, successes, deaths, and the success rate over the last 100 episodes. Default hyperparameters: learning rate α = 0.30, discount γ = 0.95, exploration ε = 0.20.]
The Problem
An agent starts at the top-left corner and must navigate to the goal at the bottom-right, avoiding randomly placed mines. The agent has no prior knowledge of the environment and must learn through trial and error.
Each step costs -1, stepping on a mine gives -100, and reaching the goal gives +100. From these rewards alone, the agent learns which actions lead to high long-term return.
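As a rough sketch of how such an environment might be wired up (the grid size, cell types, and function names below are illustrative assumptions, not the demo's actual source):

```typescript
// Hypothetical minefield environment with the reward structure described above:
// -1 per step, -100 for hitting a mine, +100 for reaching the goal.
type Action = "up" | "down" | "left" | "right";
type Cell = "empty" | "mine" | "goal";

const GRID_SIZE = 10; // assumed grid dimensions

function step(
  grid: Cell[][],
  row: number,
  col: number,
  action: Action
): { row: number; col: number; reward: number; done: boolean } {
  const moves: Record<Action, [number, number]> = {
    up: [-1, 0],
    down: [1, 0],
    left: [0, -1],
    right: [0, 1],
  };
  const [dr, dc] = moves[action];
  // Clamp to the grid so the agent cannot step off the edge.
  const nr = Math.min(GRID_SIZE - 1, Math.max(0, row + dr));
  const nc = Math.min(GRID_SIZE - 1, Math.max(0, col + dc));

  if (grid[nr][nc] === "mine") return { row: nr, col: nc, reward: -100, done: true };
  if (grid[nr][nc] === "goal") return { row: nr, col: nc, reward: 100, done: true };
  return { row: nr, col: nc, reward: -1, done: false };
}
```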
Temporal Difference Learning
Q(s,a) ← Q(s,a) + α × [r + γ × max Q(s',a') - Q(s,a)]

α (Learning Rate)
How quickly to update Q-values. Higher means faster learning but more instability.
γ (Discount)
How much to value future rewards vs immediate. Higher means more forward-thinking.
ε (Exploration)
Probability of taking a random action instead of the best known action.
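Putting the update rule and ε-greedy exploration together, a minimal tabular sketch might look like the following. The flattened Q-table layout and helper names are assumptions about the demo's internals; the hyperparameter values are the defaults shown above.

```typescript
// Minimal tabular Q-learning sketch (layout assumed: one row per cell,
// one column per action).
const ALPHA = 0.3;    // learning rate α
const GAMMA = 0.95;   // discount γ
const EPSILON = 0.2;  // exploration ε
const NUM_ACTIONS = 4;       // up, down, left, right
const NUM_STATES = 10 * 10;  // assumed 10×10 grid, flattened to one index per cell

const Q: number[][] = Array.from({ length: NUM_STATES }, () =>
  new Array(NUM_ACTIONS).fill(0)
);

// ε-greedy: explore with probability ε, otherwise exploit the best known action.
function chooseAction(state: number): number {
  if (Math.random() < EPSILON) return Math.floor(Math.random() * NUM_ACTIONS);
  return Q[state].indexOf(Math.max(...Q[state]));
}

// TD update: Q(s,a) ← Q(s,a) + α × [r + γ × max Q(s',a') - Q(s,a)].
// On terminal transitions (mine or goal) there is no next state to bootstrap from.
function tdUpdate(s: number, a: number, r: number, sNext: number, done: boolean): void {
  const target = done ? r : r + GAMMA * Math.max(...Q[sNext]);
  Q[s][a] += ALPHA * (target - Q[s][a]);
}
```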
What to Observe
- Q-values (green shading) gradually spread from the goal as the agent learns
- Cells near mines develop negative Q-values (red shading)
- Policy arrows show the best action at each cell, forming a path to the goal (see the sketch after this list)
- Success rate improves as the agent learns to avoid mines
- The highlighted cell shows the most recently updated Q-value
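The policy arrows and shading can be derived directly from the Q-table. Here is one possible sketch; the arrow symbols, color scale, and Q-table layout are assumptions rather than the demo's actual rendering code:

```typescript
// Hypothetical helpers that turn a cell's Q-values into an arrow and a shade.
const ARROWS = ["↑", "↓", "←", "→"]; // same order as the Q-table's action columns

// The arrow points along the action with the highest Q-value in that cell.
function policyArrow(qRow: number[]): string {
  return ARROWS[qRow.indexOf(Math.max(...qRow))];
}

// Shade by the best achievable Q-value: green when positive, red when negative.
function cellShade(qRow: number[]): string {
  const v = Math.max(...qRow);
  const intensity = Math.min(1, Math.abs(v) / 100);
  return v >= 0 ? `rgba(0, 180, 0, ${intensity})` : `rgba(200, 0, 0, ${intensity})`;
}
```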