Project 5: Q-Table Reinforcement Learning Maze Solver
Reinforcement learning to solve the maze
The mazeBuilder.py aims at:
- Implementing a Q-learning algorithm to train an agent for solving mazes.
- Incorporating movement rewards and penalties, including incentives for exploration.
- Allowing configuration of training parameters such as learning rate (alpha), discount factor (gamma), and exploration rate (epsilon).
- Tracking training metrics for performance evaluation and debugging.
- Supporting saving the trained Q-table model for future use and testing.
Below is a detailed overview of the key training parameters I used to build and refine my RL model. Each parameter is carefully tuned to strike a balance between exploration, exploitation, and overall performance.
| Parameter | Value | Description | How it works |
|---|---|---|---|
| EPISODES | 1'000-2'000 per maze in the test data set | Number of training episodes to run the agent. | Larger values typically allow more exploration and learning. |
| ALPHA (Learning Rate or LR) | 0.0125 | The learning rate determines how quickly the agent updates its Q-values during training. A small learning rate, such as 0.0125, keeps updates gradual, leading to stable learning. | Higher values can speed up learning but risk overshooting. |
| GAMMA | 0.99 | Gamma controls how much the agent cares about future rewards. In the Q-learning update, the future reward is multiplied by gamma. So a reward received one step ahead is valued at 0.99 times its actual value, two steps ahead at 0.99², and so on. | Closer to 1.0 means the agent heavily favors long-term gains. |
| EPSILON_START | 1.0 | Epsilon determines how often the agent takes a random action (exploration) versus relying on its learned policy (exploitation). | Initially explores fully (random actions). |
| EPSILON_END | 0.15 | Minimum exploration rate after decay. | Agent still explores occasionally even at low epsilon. |
| EPSILON_DECAY | 0.9998 | Rate at which epsilon decreases each episode. | Slowly reduces exploration over time. |
| TEMPERATURE_START | 1.0 | Initial temperature for softmax action selection (if used). Temperature shapes the “smoothness” of the action probabilities when you’re following a softmax strategy. | Higher temperature means more random exploration initially. |
| TEMPERATURE_END | 0.15 | Final temperature after decay. | Encourages more deterministic choices later in training. |
| TEMPERATURE_DECAY | 0.9998 | Rate at which the temperature decreases each episode. | Gradually shifts behavior from exploratory to exploitative. |
| MAX_STEPS | 1'000 | Maximum steps the agent is allowed to take in a single episode. | Prevents episodes from running indefinitely. |
| MAX_STUCK_STEPS | 20 | Maximum number of consecutive steps the agent may spend without moving (i.e., being stuck). | Helps the agent avoid getting stuck in loops or dead ends. |
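The interplay of ALPHA, GAMMA, and the epsilon schedule can be sketched with the standard tabular Q-learning update. This is a minimal illustration with toy states, not the full training loop that appears further down:

```python
# Minimal sketch of the tabular Q-learning update and epsilon decay,
# using the ALPHA/GAMMA/EPSILON values from the parameter table.

def q_update(q, state, action, reward, next_state, alpha=0.0125, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

# Two states with four actions each, all Q-values initialized to zero.
q = {(0, 0): [0.0] * 4, (0, 1): [0.0] * 4}
q_update(q, (0, 0), 0, reward=2.0, next_state=(0, 1))
print(q[(0, 0)][0])  # nudged toward the reward by alpha: 0.0125 * 2.0

# Multiplicative epsilon decay with a floor, as in the table above.
epsilon = 1.0
for _ in range(10000):
    epsilon = max(0.15, epsilon * 0.9998)
print(epsilon)  # has reached the 0.15 floor after 10'000 episodes
```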
Below is a detailed overview of the reward structure I implemented during training. Each component reflects how the agent is incentivized or penalized for specific actions and outcomes, guiding it to navigate the maze more effectively and reach the goal.
| Parameter | Value | Description | How it works |
|---|---|---|---|
| Base Step Penalty | -0.1 | Applies to every step taken, preventing the agent from wandering indefinitely. | Encourages more purposeful movement to reach the goal. |
| Goal Reward | +1000 | Granted when the agent reaches the goal location. | Strong incentive to navigate successfully. |
| Distance-Based: Closer | +2 | Added when the new position is closer to the goal than the old position. | Rewards steps that move the agent toward the goal. |
| Distance-Based: Farther | -1 | Subtracted if the agent moves away from the goal. | Penalizes counterproductive moves. |
| Collision (Wall/Out-of-Bounds) | -1 | Occurs if the agent attempts to move into a wall or outside the maze. | Discourages invalid moves. |
| Visiting New State | +1 | Reward for exploring unvisited cells. | Encourages exploration of unknown territory. |
| Revisiting State Penalty | -1 | Applied if the agent re-enters a previously visited state. | Discourages oscillation or toggling between the same cells. |
| Loop Detection Threshold | 20 | Checks the agent's recent step history for repeated states. | Helps the agent avoid looping. |
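Taken together, these components produce a per-step reward like the following sketch. The helper below mirrors the table rather than the exact code listed later (the real implementation also scales the revisit penalty with the visit count):

```python
# Sketch of the per-step reward shaping described in the table above.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def shaped_reward(old_pos, new_pos, goal, visited, blocked):
    reward = -0.1                      # base step penalty
    if new_pos == goal:
        return 1000                    # goal reward dominates everything else
    if manhattan(new_pos, goal) < manhattan(old_pos, goal):
        reward += 2                    # moved closer to the goal
    elif manhattan(new_pos, goal) > manhattan(old_pos, goal):
        reward -= 1                    # moved farther from the goal
    if blocked:
        reward -= 1                    # wall / out-of-bounds attempt
    if new_pos not in visited:
        reward += 1                    # exploration bonus for a new cell
    else:
        reward -= 1                    # flat revisit penalty (simplified here)
    return reward

# Stepping from (0, 0) to (0, 1) toward a goal at (0, 5), first visit:
# base penalty + closer bonus + exploration bonus.
print(shaped_reward((0, 0), (0, 1), (0, 5), visited={(0, 0)}, blocked=False))
```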
The following table shows my attempts at training the agent in maze solving. Note that the Q-table approach I chose is mainly suited to teaching an agent how to solve a particular problem, not to generalizing across problems. This is why creating an agent that successfully solves "new" mazes was hard, while the agent excelled at solving the mazes it was trained on. I still decided to try to create a generalized maze-solver agent to better understand the overall process:
| Attempt | Parameters / Log Mention | Success Rate |
|---|---|---|
| 1 | First run with a higher base penalty, small goal reward, many steps wandering. | 0.05% |
| 2 | Adjusted parameters (increased goal reward, smaller step penalty). | 6.15% |
| 3 | Another set of reward changes adjusting all rewards slightly. | 8.75% |
| 4 | Increased the goal reward to 1000 and slowed epsilon and temperature decay from 0.999 to 0.9999. | 24.14% |
| 5 | Bigger reward for goal reaching, increased epsilon and temperature decay (0.9995) while also increasing rewards for exploration. | 39.38% |
| 6 | I increased the rewards for exploration further and adjusted the max step count as well as the learning parameters. | 55.80% |
| 7 | Changed the alpha (0.6) and gamma (0.96) values. | 57.77% |
| 8 | Removed penalties for walking into walls and moving farther from the goal, while also adjusting all reward sizes to simplify the model. Adjusted alpha (0.44) and gamma (0.9). | 58.59% |
| 9 | Started to use curriculum learning (increasing complexity over time). | 76.45% |
| 10 | Further decreased the complexity of the starting mazes until they were effectively empty spaces with only a start and an end point. | 87.47% |
| 11 | Fine-tuning of parameters and a lot of trial and error. | 89.93% |
During my training, I initially ran into a significant problem: my agent was racking up high rewards without truly solving new mazes (see video below). I realized it was exploiting a classic “oscillation” loophole, moving back and forth to collect a net positive reward by gaining +6 for moving closer and only −1 for moving farther away. To fix this, I added a penalty for revisiting any cell, ensuring the agent wouldn’t profit by simply toggling positions.
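A back-of-the-envelope check of the loophole (using the +2/−1 shaping from the table above; the +6 figure came from an earlier reward configuration): without a revisit penalty, one closer/farther oscillation cycle nets positive reward, so the agent can farm it forever; with the penalty it turns negative.

```python
# Net reward for one oscillation cycle (one step closer, one step back),
# using the -0.1 step / +2 closer / -1 farther / -1 revisit values.
step_penalty, closer, farther, revisit = -0.1, 2.0, -1.0, -1.0

without_penalty = (step_penalty + closer) + (step_penalty + farther)
with_penalty = (step_penalty + closer + revisit) + (step_penalty + farther + revisit)

print(without_penalty)  # positive: oscillating is profitable
print(with_penalty)     # negative: oscillating now loses reward
```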
I continued by further tweaking the rewards and learning parameters. This part was very hard, as it was not always clear why the agent acted in a specific way or how the parameter changes affected it. I decided to break the process down by starting with a very low-complexity environment and constantly increasing the complexity. My first attempt was to build an agent that could simply walk from start to finish in an empty maze, then increase the complexity of the maze over time.
After some training with mazes and obstacle arenas, I tested the agent on new maps with new obstacles, and it completed them all successfully (see video below). The agent may not have used the best route, but it managed to reach the goal. Further fine-tuning, more learning data, and more training are required. I will continue to train the model with increasing difficulty using curriculum learning.
Some further fine-tuning of rewards, as well as testing different model training parameters, helped make the agent capable of solving somewhat more complex courses. Note that the agent still does not follow any meaningful path and requires more fine-tuning:
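The curriculum idea boils down to a phase schedule over mazes sorted by complexity: early episodes draw only from the simplest mazes, and the pool grows as training progresses. A simplified sketch (phase count and episode totals are illustrative):

```python
# Curriculum schedule sketch: as training progresses, draw from a
# growing prefix of the complexity-sorted maze list.
def curriculum_subset_size(episode, total_episodes, num_mazes, phases=6):
    # Map training progress onto one of `phases` discrete phases.
    phase = min(int((episode / total_episodes) * phases), phases - 1)
    # Each later phase unlocks a larger prefix of the sorted mazes.
    return max(1, (phase + 1) * num_mazes // phases)

# With 60 mazes, 60'000 episodes, and 6 phases, the pool grows in steps of 10:
for ep in (5000, 15000, 45000):
    print(ep, curriculum_subset_size(ep, 60000, 60))
```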
At this point I decided to end the discovery process of creating a Q-Table agent to solve "new" / "unknown" mazes. The results are documented in the results page.
The Code:
```python
import os
import json
import pickle
import random
from collections import deque

import numpy as np
import matplotlib.pyplot as plt

# configuration part
CONFIG = {
    # reward structure
    "step_penalty": -0.1,  # was -0.1
    "goal_reward": 1000,  # was 1000
    "reward_closer": 2,  # was 2
    "reward_farther": -1,  # was -1
    "wall_penalty": -1,  # was -1
    "exploration_bonus": 1,  # was 1
    "revisit_penalty": -1,  # was -1
    "loop_detection_threshold": 1000,  # was 1000
    # learning parameters
    "episodes": 37000,  # was 12000
    "alpha": 0.125,  # was 0.0125
    "gamma": 0.99,
    "epsilon_start": 1.0,
    "epsilon_end": 0.15,  # was 0.15
    "epsilon_decay": 0.9998,
    "temperature_start": 1.0,
    "temperature_end": 0.15,  # was 0.15
    "temperature_decay": 0.9998,
    "max_steps": 1000,
    "max_stuck_steps": 1000,  # was 1000
    "optimistic_value": 10,
    # replay structure
    "use_experience_replay": True,
    "replay_buffer_size": 100000,
    "replay_batch_size": 64,
    "replay_frequency": 4,
    # prioritized sweeping
    "use_prioritized_sweeping": True,
    "priority_threshold": 0.02,
    # curriculum learning option
    "use_curriculum_learning": True,
    "curriculum_phases": 6
}


# maze environment
class MazeEnv:
    def __init__(self, maze, start, goal):
        self.maze = np.array(maze)
        self.start = tuple(start)
        self.goal = tuple(goal)
        # Define actions: right, down, left, up
        self.actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
        self.reset()
        # maze complexity metrics
        self.complexity = self.calculate_complexity()

    def calculate_complexity(self):
        # get manhattan distance
        manhattan_dist = abs(self.start[0] - self.goal[0]) + abs(self.start[1] - self.goal[1])
        # number of walls
        wall_count = np.sum(self.maze == 1)
        total_cells = self.maze.shape[0] * self.maze.shape[1]
        wall_ratio = wall_count / total_cells
        # combine metrics
        return manhattan_dist * (1 + wall_ratio)

    def reset(self):
        self.position = self.start
        self.visited_states = {self.start: 1}  # Track visit count
        self.path = [self.start]  # Track full path for analysis
        return self.position

    def step(self, action):
        x, y = self.position
        dx, dy = self.actions[action]
        nx, ny = x + dx, y + dy
        # check boundaries/walls
        if (0 <= nx < self.maze.shape[0] and 0 <= ny < self.maze.shape[1] and self.maze[nx, ny] == 0):
            new_position = (nx, ny)
        else:
            new_position = self.position
        done = False
        # base step reward
        reward = CONFIG["step_penalty"]
        # check if goal is reached
        if new_position == self.goal:
            reward = CONFIG["goal_reward"]
            done = True
        else:
            # distance-based reward/penalty
            old_dist = abs(x - self.goal[0]) + abs(y - self.goal[1])
            new_dist = abs(new_position[0] - self.goal[0]) + abs(new_position[1] - self.goal[1])
            if new_dist < old_dist:
                reward += CONFIG["reward_closer"]
            elif new_dist > old_dist:
                reward += CONFIG["reward_farther"]
            # penalty for hitting a wall or out-of-bounds
            if new_position == self.position and (nx, ny) != self.position:
                reward += CONFIG["wall_penalty"]
            # exploration bonus, or revisit penalty that grows with visit count
            if new_position not in self.visited_states:
                reward += CONFIG["exploration_bonus"]
                self.visited_states[new_position] = 1
            else:
                # increasing penalty for frequently visited states
                visit_count = self.visited_states[new_position]
                revisit_penalty = CONFIG["revisit_penalty"] * (1 + 0.1 * visit_count)
                reward += revisit_penalty
                self.visited_states[new_position] += 1
        self.position = new_position
        self.path.append(new_position)
        return new_position, reward, done


# Helper Functions
def load_mazes(folder):
    mazes = []
    for fname in os.listdir(folder):
        if fname.endswith(".json"):
            path = os.path.join(folder, fname)
            with open(path, "r") as f:
                mazes.append(json.load(f))
    return mazes


def sort_mazes_by_complexity(mazes):
    sorted_mazes = []
    for maze_data in mazes:
        env = MazeEnv(maze_data["maze"], maze_data["start"], maze_data["goal"])
        sorted_mazes.append((env.complexity, maze_data))
    # sort by complexity score
    sorted_mazes.sort(key=lambda x: x[0])
    return sorted_mazes


def select_action(q_values, epsilon, temperature):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    # softmax action selection with temperature
    max_q = max(q_values)
    exp_values = [np.exp((q - max_q) / max(temperature, 1e-8)) for q in q_values]
    sum_exp = sum(exp_values)
    probs = [val / sum_exp for val in exp_values]
    return np.random.choice(len(q_values), p=probs)


def get_q_values(q_table, state, num_actions=4):
    if state not in q_table:
        q_table[state] = [CONFIG["optimistic_value"]] * num_actions
    return q_table[state]


def experience_replay(q_table, experience_buffer, batch_size, alpha, gamma):
    if len(experience_buffer) < batch_size:
        return q_table
    # sample random batch of experiences
    batch = random.sample(experience_buffer, batch_size)
    for state, action, reward, next_state, done in batch:
        q_values = get_q_values(q_table, state)
        if done:
            new_q = q_values[action] + alpha * (reward - q_values[action])
        else:
            next_q_values = get_q_values(q_table, next_state)
            best_next_q = max(next_q_values)
            new_q = q_values[action] + alpha * (reward + gamma * best_next_q - q_values[action])
        q_values[action] = new_q
        q_table[state] = q_values
    return q_table


def prioritized_sweeping(q_table, state_transitions, alpha, gamma, priority_threshold=0.5):
    if not state_transitions:
        return q_table
    # update only transitions whose absolute TD error (priority) is significant
    for state, action, reward, next_state, done in state_transitions:
        q_values = get_q_values(q_table, state)
        if done:
            td_error = abs(reward - q_values[action])
        else:
            next_q_values = get_q_values(q_table, next_state)
            best_next_q = max(next_q_values)
            td_error = abs(reward + gamma * best_next_q - q_values[action])
        # only update if error is significant
        if td_error > priority_threshold:
            if done:
                new_q = q_values[action] + alpha * (reward - q_values[action])
            else:
                new_q = q_values[action] + alpha * (reward + gamma * best_next_q - q_values[action])
            q_values[action] = new_q
            q_table[state] = q_values
    return q_table


def detect_early_stopping(rewards, episode_window=500, improvement_threshold=0.001):
    return False  # Never trigger early stopping


def visualize_sample_mazes(sorted_mazes, num_to_visualize=3):
    os.makedirs("maze_visualizations", exist_ok=True)
    for i in range(min(num_to_visualize, len(sorted_mazes))):
        complexity, maze_data = sorted_mazes[i]
        maze = np.array(maze_data["maze"])
        start = tuple(maze_data["start"])
        goal = tuple(maze_data["goal"])
        plt.figure(figsize=(8, 8))
        plt.imshow(maze, cmap='binary')
        plt.scatter(start[1], start[0], c='green', s=200, marker='o', label='Start')
        plt.scatter(goal[1], goal[0], c='red', s=200, marker='*', label='Goal')
        plt.title(f'Maze #{i+1} - Complexity: {complexity:.2f}')
        plt.legend()
        plt.grid(False)
        plt.savefig(f"maze_visualizations/maze_sample_{i+1}.png")
        plt.close()
    print(f"Visualized {num_to_visualize} sample mazes in 'maze_visualizations' folder")


def train_q_learning(
    mazes,
    episodes=CONFIG["episodes"],
    alpha=CONFIG["alpha"],
    gamma=CONFIG["gamma"],
    epsilon_start=CONFIG["epsilon_start"],
    epsilon_end=CONFIG["epsilon_end"],
    epsilon_decay=CONFIG["epsilon_decay"],
    temperature_start=CONFIG["temperature_start"],
    temperature_end=CONFIG["temperature_end"],
    temperature_decay=CONFIG["temperature_decay"],
    max_steps=CONFIG["max_steps"],
    max_stuck_steps=CONFIG["max_stuck_steps"]
):
    # tracking training progress
    q_table = {}
    total_rewards = []
    success_count = 0
    steps_per_episode = []
    epsilon_values = []
    success_rates = []
    mean_reward_per_1000 = []
    epsilon = epsilon_start
    temperature = temperature_start
    # sort mazes for curriculum learning if enabled
    if CONFIG["use_curriculum_learning"]:
        sorted_mazes = sort_mazes_by_complexity(mazes)
        print(f"Sorted {len(sorted_mazes)} mazes by complexity for curriculum learning")
        # visualize a few sample mazes to verify they're solvable
        visualize_sample_mazes(sorted_mazes, num_to_visualize=3)
        # print complexity information for the mazes
        print("Maze complexity distribution:")
        for i, (complexity, maze) in enumerate(sorted_mazes):
            print(f"  Maze {i+1}: Complexity score {complexity:.2f}")
        sorted_mazes = [m[1] for m in sorted_mazes]  # Extract just the maze data
    else:
        sorted_mazes = mazes
    # experience replay buffer
    replay_buffer = deque(maxlen=CONFIG["replay_buffer_size"])
    # create a directory for saving figures
    os.makedirs("training_plots", exist_ok=True)
    for ep in range(1, episodes + 1):
        # curriculum learning: gradually introduce more complex mazes
        if CONFIG["use_curriculum_learning"]:
            # Determine which subset of mazes to use based on training progress
            phase = min(int((ep / episodes) * CONFIG["curriculum_phases"]), CONFIG["curriculum_phases"] - 1)
            maze_subset_size = max(1, int((phase + 1) * len(sorted_mazes) / CONFIG["curriculum_phases"]))
            maze_subset = sorted_mazes[:maze_subset_size]
            maze_data = random.choice(maze_subset)
        else:
            maze_data = random.choice(mazes)
        env = MazeEnv(maze_data["maze"], maze_data["start"], maze_data["goal"])
        state = env.reset()
        episode_reward = 0
        done = False
        steps = 0
        # loop detection
        state_history = deque(maxlen=20)  # Larger history for better loop detection
        stuck_count = 0
        # prioritized sweeping
        state_transitions = []
        while not done and steps < max_steps:
            steps += 1
            state_history.append(state)
            # detect if stuck in a loop
            # (note: with a history of maxlen=20 and a threshold of 1000,
            # this branch only fires if the threshold is lowered below 20)
            if state_history.count(state) > CONFIG["loop_detection_threshold"]:
                action = np.random.randint(len(env.actions))
            else:
                q_values = get_q_values(q_table, state)
                action = select_action(q_values, epsilon, temperature)
            next_state, reward, done = env.step(action)
            episode_reward += reward
            # add to experience replay buffer
            replay_buffer.append((state, action, reward, next_state, done))
            # track state transitions for prioritized sweeping
            state_transitions.append((state, action, reward, next_state, done))
            # update stuck counter
            if next_state == state:
                stuck_count += 1
            else:
                stuck_count = 0
            # force a random move if stuck
            if stuck_count >= max_stuck_steps and not done:
                action = np.random.randint(len(env.actions))
                next_state, forced_reward, forced_done = env.step(action)
                episode_reward += forced_reward
                done = forced_done
                stuck_count = 0
                # add forced move to replay buffer
                replay_buffer.append((state, action, forced_reward, next_state, forced_done))
                state_transitions.append((state, action, forced_reward, next_state, forced_done))
            # standard Q-learning update
            q_values = get_q_values(q_table, state)
            old_q = q_values[action]
            next_q_values = get_q_values(q_table, next_state)
            best_next_q = max(next_q_values)
            new_q = old_q + alpha * (reward + gamma * best_next_q - old_q)
            q_values[action] = new_q
            q_table[state] = q_values
            state = next_state
        # experience replay after episode if enabled
        if CONFIG["use_experience_replay"] and ep % CONFIG["replay_frequency"] == 0:
            q_table = experience_replay(
                q_table,
                replay_buffer,
                CONFIG["replay_batch_size"],
                alpha,
                gamma
            )
        # prioritized sweeping if enabled
        if CONFIG["use_prioritized_sweeping"]:
            q_table = prioritized_sweeping(
                q_table,
                state_transitions,
                alpha,
                gamma,
                CONFIG["priority_threshold"]
            )
        # track episode results
        total_rewards.append(episode_reward)
        steps_per_episode.append(steps)
        epsilon_values.append(epsilon)
        if done and episode_reward > 0:  # Successfully reached the goal
            success_count += 1
        # calculate current success rate
        current_success_rate = 100.0 * success_count / ep
        success_rates.append(current_success_rate)
        # decay exploration parameters
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
        temperature = max(temperature_end, temperature * temperature_decay)
        # reporting
        if ep % 1000 == 0:
            last_1000_mean = np.mean(total_rewards[-1000:])
            mean_reward_per_1000.append(last_1000_mean)
            print(f"Episode {ep}/{episodes}: Reward={episode_reward:.1f}, Steps={steps}, "
                  f"Epsilon={epsilon:.3f}, Success Rate={current_success_rate:.2f}%")
            if CONFIG["use_curriculum_learning"]:
                print(f"  Current Maze Complexity: {env.complexity:.2f}, "
                      f"Current Phase: {phase+1}/{CONFIG['curriculum_phases']}")
            print(f"  Average Last 1000 Rewards: {last_1000_mean:.2f}")
        # plot training progress every 10,000 episodes
        if ep % 10000 == 0:
            plot_training_progress(
                total_rewards,
                steps_per_episode,
                epsilon_values,
                success_rates,
                mean_reward_per_1000,
                ep
            )
    # final statistics
    avg_reward = np.mean(total_rewards)
    success_rate = 100.0 * success_count / episodes
    avg_steps_success = np.mean([s for i, s in enumerate(steps_per_episode) if i < len(total_rewards) and total_rewards[i] > 0]) if success_count > 0 else 0
    print("\nTraining Finished!")
    print(f"  Average Reward: {avg_reward:.2f}")
    print(f"  Success Rate: {success_rate:.2f}%")
    print(f"  Average Steps (successful episodes): {avg_steps_success:.2f}")
    # final learning curves
    plot_training_progress(
        total_rewards,
        steps_per_episode,
        epsilon_values,
        success_rates,
        mean_reward_per_1000,
        episodes,
        final=True
    )
    return q_table


def plot_training_progress(rewards, steps, epsilons, success_rates, mean_rewards_per_1000, episode, final=False):
    plt.figure(figsize=(15, 12))
    window_size = min(100, len(rewards))
    if window_size > 0:
        smoothed_rewards = np.convolve(rewards, np.ones(window_size)/window_size, mode='valid')
        plt.subplot(3, 2, 1)
        plt.plot(smoothed_rewards)
        plt.title('Average Reward (Smoothed)')
        plt.xlabel('Episode')
        plt.ylabel('Reward')
        plt.subplot(3, 2, 2)
        smoothed_steps = np.convolve(steps, np.ones(window_size)/window_size, mode='valid')
        plt.plot(smoothed_steps)
        plt.title('Steps per Episode (Smoothed)')
        plt.xlabel('Episode')
        plt.ylabel('Steps')
    plt.subplot(3, 2, 3)
    plt.plot(epsilons)
    plt.title('Exploration Rate (Epsilon)')
    plt.xlabel('Episode')
    plt.ylabel('Epsilon')
    plt.subplot(3, 2, 4)
    plt.plot(success_rates)
    plt.title('Success Rate Progression')
    plt.xlabel('Episode')
    plt.ylabel('Success Rate (%)')
    plt.ylim(0, 100)
    plt.subplot(3, 2, 5)
    plt.plot(range(1000, len(rewards) + 1, 1000), mean_rewards_per_1000)
    plt.title('Mean Reward per 1000 Episodes')
    plt.xlabel('Episode')
    plt.ylabel('Mean Reward')
    plt.subplot(3, 2, 6)
    plt.hist(rewards, bins=50)
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.tight_layout()
    # save figure
    if final:
        plt.savefig("training_plots/final_training_progress.png")
    else:
        plt.savefig(f"training_plots/training_progress_ep{episode}.png")
    plt.close()


def save_q_table(q_table, output_folder, filename):
    os.makedirs(output_folder, exist_ok=True)
    filepath = os.path.join(output_folder, filename)
    with open(filepath, "wb") as f:
        pickle.dump(q_table, f)
    print(f"Q-table saved to {filepath}")


def visualize_maze_policy(q_table, maze_data, output_folder="maze_visualizations"):
    os.makedirs(output_folder, exist_ok=True)
    maze = np.array(maze_data["maze"])
    start = tuple(maze_data["start"])
    goal = tuple(maze_data["goal"])
    # create a visualization of the maze with policy arrows
    plt.figure(figsize=(12, 12))
    plt.imshow(maze, cmap='binary')
    # action directions for visualization
    arrows = ['→', '↓', '←', '↑']
    arrow_colors = ['blue', 'green', 'red', 'purple']
    # plot arrows for each navigable cell
    for i in range(maze.shape[0]):
        for j in range(maze.shape[1]):
            if maze[i, j] == 0:  # If navigable
                state = (i, j)
                if state in q_table:
                    q_values = q_table[state]
                    best_action = np.argmax(q_values)
                    plt.text(j, i-0.3, f"{max(q_values):.1f}",
                             ha='center', va='center',
                             color='black', fontsize=7)
                    plt.text(j, i, arrows[best_action],
                             ha='center', va='center',
                             color=arrow_colors[best_action],
                             fontsize=12, fontweight='bold')
    # mark start and goal
    plt.scatter(start[1], start[0], c='green', s=200, marker='o', label='Start')
    plt.scatter(goal[1], goal[0], c='red', s=200, marker='*', label='Goal')
    plt.title('Maze Policy Visualization')
    plt.legend()
    plt.grid(False)
    # save the visualization
    plt.savefig(f"{output_folder}/policy_visualization.png")
    plt.close()


def visualize_training_metrics(total_rewards, success_rates, output_folder="training_analysis"):
    os.makedirs(output_folder, exist_ok=True)
    # analyze reward distribution
    plt.figure(figsize=(15, 8))
    # plot reward histogram
    plt.subplot(1, 2, 1)
    plt.hist(total_rewards, bins=50, color='skyblue', edgecolor='black')
    plt.axvline(np.mean(total_rewards), color='r', linestyle='--', linewidth=1,
                label=f'Mean: {np.mean(total_rewards):.2f}')
    plt.axvline(np.median(total_rewards), color='g', linestyle='--', linewidth=1,
                label=f'Median: {np.median(total_rewards):.2f}')
    plt.title('Total Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()
    # success rate over time
    plt.subplot(1, 2, 2)
    plt.plot(success_rates, color='green')
    plt.title('Success Rate Progression')
    plt.xlabel('Episode')
    plt.ylabel('Success Rate (%)')
    plt.ylim(0, 100)
    plt.tight_layout()
    plt.savefig(f"{output_folder}/training_metrics_analysis.png")
    plt.close()


# Main
if __name__ == "__main__":
    MAZE_FOLDER = "mazes"
    MODEL_FOLDER = "model"
    MODEL_FILE = "q_table.pkl"
    print("Loading mazes for training...")
    mazes = load_mazes(MAZE_FOLDER)
    print(f"Loaded {len(mazes)} maze(s).")
    print("Starting Q-learning training with enhanced features...")
    print(f"  - Experience Replay: {'Enabled' if CONFIG['use_experience_replay'] else 'Disabled'}")
    print(f"  - Prioritized Sweeping: {'Enabled' if CONFIG['use_prioritized_sweeping'] else 'Disabled'}")
    print(f"  - Curriculum Learning: {'Enabled' if CONFIG['use_curriculum_learning'] else 'Disabled'}")
    q_table = train_q_learning(mazes=mazes)
    save_q_table(q_table, MODEL_FOLDER, MODEL_FILE)
    if mazes:
        print("Generating policy visualization for a sample maze...")
        visualize_maze_policy(q_table, mazes[0])
        # Create additional analysis visualizations
        print("Generating additional training analysis visualizations...")
        visualize_maze_policy(q_table, mazes[-1], output_folder="maze_visualizations")
    print("Training complete! You're ready to go solve some mazes!")
```