Project 5: Q-Table Reinforcement Learning Maze Solver

Reinforcement learning to solve mazes

The mazeBuilder.py script focuses on:

  • Implementing a Q-learning algorithm to train an agent for solving mazes.
  • Incorporating movement rewards and penalties, including incentives for exploration.
  • Allowing configuration of training parameters such as learning rate (alpha), discount factor (gamma), and exploration rate (epsilon).
  • Tracking training metrics for performance evaluation and debugging.
  • Supporting saving the trained Q-table model for future use and testing.
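The Q-learning algorithm mentioned in the first bullet boils down to one tabular update rule. A minimal, self-contained sketch (the state, Q-values, and reward here are made up for illustration; the full implementation is in the code at the bottom of this page):

```python
# One tabular Q-learning update:
#   Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
alpha, gamma = 0.0125, 0.99

# Toy Q-table: state -> Q-values for the four actions (right, down, left, up)
q_table = {
    (0, 0): [0.0, 0.0, 0.0, 0.0],
    (0, 1): [0.0, 5.0, 0.0, 0.0],
}

state, action, reward, next_state = (0, 0), 0, -0.1, (0, 1)
best_next = max(q_table[next_state])
q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])
print(q_table[state][action])  # small positive update: alpha * (-0.1 + 0.99 * 5.0) ≈ 0.0606
```

The update nudges the Q-value of the chosen action toward the observed reward plus the discounted value of the best follow-up action.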

Below is a detailed overview of the key training parameters I used to build and refine my RL model. Each parameter is carefully tuned to strike a balance between exploration, exploitation, and overall performance.

| Parameter | Value | Description | How it works |
| --- | --- | --- | --- |
| EPISODES | 1,000–2,000 per maze in the test dataset | Number of training episodes to run the agent. | Larger values typically allow more exploration and learning. |
| ALPHA (learning rate, LR) | 0.0125 | Determines how strongly each Q-value update moves toward the new estimate. | A small learning rate like 0.0125 keeps updates gradual, leading to stable learning. Higher values can speed up learning but risk overshooting. |
| GAMMA | 0.99 | Controls how much the agent cares about future rewards. | In the Q-learning update, the future reward is multiplied by gamma. A reward received one step ahead is valued at 0.99 times its actual value, two steps ahead at 0.99², and so on. Closer to 1.0 means the agent heavily favors long-term gains. |
| EPSILON_START | 1.0 | How often the agent takes a random action (exploration) versus relying on its learned policy (exploitation). | Initially explores fully (random actions). |
| EPSILON_END | 0.15 | Minimum exploration rate after decay. | The agent still explores occasionally even at low epsilon. |
| EPSILON_DECAY | 0.9998 | Rate at which epsilon decreases each episode. | Slowly reduces exploration over time. |
| TEMPERATURE_START | 1.0 | Initial temperature for softmax action selection (if used). | Temperature shapes the "smoothness" of the action probabilities under a softmax strategy. A higher temperature means more random exploration initially. |
| TEMPERATURE_END | 0.15 | Final temperature after decay. | Encourages more deterministic choices later in training. |
| TEMPERATURE_DECAY | 0.9998 | Rate at which the temperature decreases each episode. | Gradually shifts behavior from exploratory to exploitative. |
| MAX_STEPS | 1,000 | Maximum steps the agent may take in a single episode. | Prevents episodes from running indefinitely. |
| MAX_STUCK_STEPS | 20 | Maximum number of consecutive steps without movement before the agent counts as stuck. | Helps the agent avoid getting stuck in loops or dead ends. |
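Epsilon and temperature share the same schedule: multiply by the decay factor each episode, clipped at the floor value. A minimal sketch using the epsilon values from the table above:

```python
# Multiplicative decay with a floor, as used for both epsilon and temperature.
# Values are taken from the parameter table above.
EPSILON_START, EPSILON_END, EPSILON_DECAY = 1.0, 0.15, 0.9998

epsilon = EPSILON_START
for episode in range(2000):
    # ... one training episode runs here ...
    epsilon = max(EPSILON_END, epsilon * EPSILON_DECAY)

print(round(epsilon, 2))  # after 2,000 episodes epsilon has only decayed to ~0.67
```

With a decay of 0.9998 the agent stays quite exploratory across a whole 2,000-episode run, which is deliberate given the large state space.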

Below is a detailed overview of the reward structure I implemented during training. Each component reflects how the agent is incentivized or penalized for specific actions and outcomes, guiding it to navigate the maze more effectively and reach the goal.

| Parameter | Value | Description | How it works |
| --- | --- | --- | --- |
| Base step penalty | −0.1 | Applies to every step taken, preventing the agent from wandering indefinitely. | Encourages more purposeful movement toward the goal. |
| Goal reward | +1000 | Granted when the agent reaches the goal location. | Strong incentive to navigate successfully. |
| Distance-based: closer | +2 | Added when the new position is closer to the goal than the old position. | Rewards steps that move the agent toward the goal. |
| Distance-based: farther | −1 | Applied when the agent moves away from the goal. | Penalizes counterproductive moves. |
| Collision (wall/out-of-bounds) | −1 | Applied when the agent attempts to move into a wall or outside the maze. | Discourages invalid moves. |
| Visiting a new state | +1 | Reward for exploring unvisited cells. | Encourages exploration of unknown territory. |
| Revisiting-state penalty | −1 | Applied when the agent re-enters a previously visited state. | Discourages oscillating between the same cells. |
| Loop detection threshold | 20 | Looks for repeated states in the recent step history. | Helps the agent avoid looping. |
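Taken together, the table above defines the reward for a single step. A simplified, standalone sketch (the function name and arguments are illustrative; the actual implementation, including a revisit penalty that scales with the visit count, is in `MazeEnv.step` below):

```python
def step_reward(old_dist, new_dist, hit_wall, reached_goal, first_visit):
    """Reward for one step, mirroring the reward table above (simplified)."""
    if reached_goal:
        return 1000                      # goal reward
    reward = -0.1                        # base step penalty
    if new_dist < old_dist:
        reward += 2                      # moved closer to the goal
    elif new_dist > old_dist:
        reward += -1                     # moved farther from the goal
    if hit_wall:
        reward += -1                     # collision with wall / out of bounds
    reward += 1 if first_visit else -1   # exploration bonus vs. revisit penalty
    return reward

# A step that moves closer and lands on a new cell: -0.1 + 2 + 1 = 2.9
print(step_reward(5, 4, False, False, True))
```

Distances here are Manhattan distances to the goal, as used throughout the script.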

The following table shows my attempts at training the agent in maze solving. Note that the Q-table approach is mainly suited to teaching an agent to solve one particular problem rather than to generalize across problems. This is why creating an agent that successfully solves "new" mazes was hard, even though the agent excelled at the mazes it was trained on. I decided to try building a generalized maze-solver agent anyway, to better understand the overall process:

| Attempt | Parameters / log notes | Success rate |
| --- | --- | --- |
| 1 | First run with a higher base penalty, a small goal reward, and many steps spent wandering. | 0.05% |
| 2 | Adjusted parameters (increased goal reward, smaller step penalty). | 6.15% |
| 3 | Another round of reward changes, adjusting all rewards slightly. | 8.75% |
| 4 | Increased the goal reward to 1000 and slowed the epsilon and temperature decay (factor raised from 0.999 to 0.9999). | 24.14% |
| 5 | Bigger reward for reaching the goal, faster epsilon and temperature decay (0.9995), and larger rewards for exploration. | 39.38% |
| 6 | Increased the exploration rewards further and adjusted the max step count as well as the learning parameters. | 55.80% |
| 7 | Changed the alpha (0.6) and gamma (0.96) values. | 57.77% |
| 8 | Removed the penalties for walking into walls and for moving away from the goal, adjusted all reward sizes to simplify the model, and adjusted alpha (0.44) and gamma (0.9). | 58.59% |
| 9 | Started using curriculum learning (increasing complexity over time). | 76.45% |
| 10 | Further decreased the complexity of the starting mazes until they were effectively an empty space with only a start and an end point. | 87.47% |
| 11 | Fine-tuning of parameters and a lot of trial and error. | 89.93% |

During my training, I initially ran into a significant problem: my agent was racking up high rewards without truly solving new mazes (see video below). I realized it was exploiting a classic “oscillation” loophole, moving back and forth to collect a net positive reward by gaining +6 for moving closer and only −1 for moving farther away. To fix this, I added a penalty for revisiting any cell, ensuring the agent wouldn’t profit by simply toggling positions.
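The loophole is easy to check with arithmetic. Using the reward values in play at the time (+6 for moving closer, −1 for moving farther, −0.1 per step), a two-step back-and-forth cycle nets a positive reward:

```python
# Reward values at the time of the bug (the current values differ; see the table above)
STEP_PENALTY, CLOSER, FARTHER = -0.1, 6, -1

# One oscillation cycle: step toward the goal, then step right back.
cycle_reward = (STEP_PENALTY + CLOSER) + (STEP_PENALTY + FARTHER)
print(cycle_reward)  # about +4.8 per cycle: oscillating forever is profitable

# A flat revisit penalty of -1 still leaves the cycle positive (about +2.8),
# which is why the final code scales the penalty with the visit count.
```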



I continued by further tweaking the rewards and learning parameters. This part was very hard, as it was not always clear why the agent acted in a specific way or how parameter changes affected it. I decided to break the process down by starting with a very low-complexity environment and steadily increasing the complexity: first build an agent that can simply walk from start to finish in an empty maze, then make the maze more complex over time.
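The curriculum schedule I ended up with can be sketched as follows: with mazes sorted by complexity, each training phase unlocks a larger prefix of the list (the function and maze names here are illustrative; the real implementation is in the training loop below):

```python
import random

def pick_maze(sorted_mazes, episode, total_episodes, phases=6):
    """Curriculum sampling: early episodes draw only from the easiest mazes,
    later episodes from an ever larger prefix of the complexity-sorted list."""
    phase = min(int((episode / total_episodes) * phases), phases - 1)
    subset_size = max(1, (phase + 1) * len(sorted_mazes) // phases)
    return random.choice(sorted_mazes[:subset_size])

mazes = ["empty", "easy", "medium", "hard", "very hard", "hardest"]  # sorted by complexity
print(pick_maze(mazes, episode=0, total_episodes=6000))     # prints "empty"
print(pick_maze(mazes, episode=5999, total_episodes=6000))  # any of the six may be chosen
```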



After some training with mazes and obstacle arenas, I tested the agent on new maps with new obstacles, and it completed them all successfully (see video below). The agent may not have used the best route, but it managed to reach the goal. Further fine-tuning, more training data, and more training are required. I will continue to train the model with increasing difficulty using curriculum learning.



Some further fine-tuning of rewards, along with testing different training parameters, made the agent capable of solving somewhat more complex obstacle courses. Note that the agent still does not follow any meaningful path and requires more fine-tuning:



At this point I decided to end the discovery process of creating a Q-table agent that solves "new" / "unknown" mazes. The results are documented on the results page.

The Code:


import os
import json
import pickle
import random
from collections import deque
import numpy as np
import matplotlib.pyplot as plt

#configuration part
CONFIG = {

	#reward structure
	"step_penalty": -0.1, #was -0.1
	"goal_reward": 1000, #was 1000
	"reward_closer": 2, #was 2
	"reward_farther": -1, #was -1
	"wall_penalty": -1, #was -1
	"exploration_bonus": 1, #was 1
	"revisit_penalty": -1, #was -1
	"loop_detection_threshold": 20, #was 1000; the step history only holds 20 states, so larger thresholds never trigger

	#learning parameters
	"episodes": 37000, #was 12000
	"alpha": 0.125, #was 0.0125
	"gamma": 0.99,
	"epsilon_start": 1.0,
	"epsilon_end": 0.15, #was 0.15
	"epsilon_decay": 0.9998,
	"temperature_start": 1.0,
	"temperature_end": 0.15, #was 0.15
	"temperature_decay": 0.9998,
	"max_steps": 1000,
	"max_stuck_steps": 1000, #was 1000
	"optimistic_value": 10,

	#replay structure
	"use_experience_replay": True,
	"replay_buffer_size": 100000,
	"replay_batch_size": 64,
	"replay_frequency": 4,

	#prioritized sweeping
	"use_prioritized_sweeping": True,
	"priority_threshold": 0.02,

	#curriculum learning option
	"use_curriculum_learning": True,
	"curriculum_phases": 6
}

#maze environment
class MazeEnv:
	def __init__(self, maze, start, goal):
		self.maze = np.array(maze)
		self.start = tuple(start)
		self.goal = tuple(goal)
		# Define actions: right, down, left, up
		self.actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
		self.reset()
		
		#maze complexity metrics
		self.complexity = self.calculate_complexity()

	def calculate_complexity(self):
		#get manhattan distance
		manhattan_dist = abs(self.start[0] - self.goal[0]) + abs(self.start[1] - self.goal[1])
		
		#number of walls
		wall_count = np.sum(self.maze == 1)
		total_cells = self.maze.shape[0] * self.maze.shape[1]
		wall_ratio = wall_count / total_cells
		
		#combine metrics
		return manhattan_dist * (1 + wall_ratio)

	def reset(self):
		self.position = self.start
		self.visited_states = {self.start: 1}  # Track visit count
		self.path = [self.start]  # Track full path for analysis
		return self.position

	def step(self, action):
		x, y = self.position
		dx, dy = self.actions[action]
		nx, ny = x + dx, y + dy

		#check boundaries/walls
		if (0 <= nx < self.maze.shape[0] and 0 <= ny < self.maze.shape[1] and self.maze[nx, ny] == 0):
			new_position = (nx, ny)
		else:
			new_position = self.position

		done = False

		#base step reward
		reward = CONFIG["step_penalty"]

		#check if goal is reached
		if new_position == self.goal:
			reward = CONFIG["goal_reward"]
			done = True
		else:
			#distance-based reward/penalty
			old_dist = abs(x - self.goal[0]) + abs(y - self.goal[1])
			new_dist = abs(new_position[0] - self.goal[0]) + abs(new_position[1] - self.goal[1])
			if new_dist < old_dist:
				reward += CONFIG["reward_closer"]
			elif new_dist > old_dist:
				reward += CONFIG["reward_farther"]

			#penalty for hitting a wall or out-of-bounds
			if new_position == self.position and (nx, ny) != self.position:
				reward += CONFIG["wall_penalty"]

			#exploration bonus or revisit penalty with increasing penalty for frequently visited states
			if new_position not in self.visited_states:
				reward += CONFIG["exploration_bonus"]
				self.visited_states[new_position] = 1
			else:
				#increasing penalty for frequently visited states
				visit_count = self.visited_states[new_position]
				revisit_penalty = CONFIG["revisit_penalty"] * (1 + 0.1 * visit_count)
				reward += revisit_penalty
				self.visited_states[new_position] += 1

		self.position = new_position
		self.path.append(new_position)
		return new_position, reward, done

#Helper Functions
def load_mazes(folder):
	mazes = []
	for fname in os.listdir(folder):
		if fname.endswith(".json"):
			path = os.path.join(folder, fname)
			with open(path, "r") as f:
				mazes.append(json.load(f))
	return mazes

def sort_mazes_by_complexity(mazes):
	sorted_mazes = []
	for maze_data in mazes:
		env = MazeEnv(maze_data["maze"], maze_data["start"], maze_data["goal"])
		sorted_mazes.append((env.complexity, maze_data))
	
	#sort by complexity score
	sorted_mazes.sort(key=lambda x: x[0])
	return sorted_mazes

def select_action(q_values, epsilon, temperature):
	if np.random.rand() < epsilon:
		return np.random.randint(len(q_values))
	
	#softmax action selection with temperature
	max_q = max(q_values)
	exp_values = [np.exp((q - max_q) / max(temperature, 1e-8)) for q in q_values]
	sum_exp = sum(exp_values)
	probs = [val / sum_exp for val in exp_values]
	return np.random.choice(len(q_values), p=probs)

def get_q_values(q_table, state, num_actions=4):
	if state not in q_table:
		q_table[state] = [CONFIG["optimistic_value"]] * num_actions
	return q_table[state]

def experience_replay(q_table, experience_buffer, batch_size, alpha, gamma):
	if len(experience_buffer) < batch_size:
		return q_table
	
	#sample random batch of experiences
	batch = random.sample(experience_buffer, batch_size)
	
	for state, action, reward, next_state, done in batch:
		q_values = get_q_values(q_table, state)
		if done:
			new_q = q_values[action] + alpha * (reward - q_values[action])
		else:
			next_q_values = get_q_values(q_table, next_state)
			best_next_q = max(next_q_values)
			new_q = q_values[action] + alpha * (reward + gamma * best_next_q - q_values[action])
		
		q_values[action] = new_q
		q_table[state] = q_values
	
	return q_table

def prioritized_sweeping(q_table, state_transitions, alpha, gamma, priority_threshold=0.5):
	if not state_transitions:
		return q_table
	
	#sort transitions by absolute TD error (priority)
	for state, action, reward, next_state, done in state_transitions:
		q_values = get_q_values(q_table, state)
		
		if done:
			td_error = abs(reward - q_values[action])
		else:
			next_q_values = get_q_values(q_table, next_state)
			best_next_q = max(next_q_values)
			td_error = abs(reward + gamma * best_next_q - q_values[action])
		
		#only update if error is significant
		if td_error > priority_threshold:
			if done:
				new_q = q_values[action] + alpha * (reward - q_values[action])
			else:
				new_q = q_values[action] + alpha * (reward + gamma * best_next_q - q_values[action])
			
			q_values[action] = new_q
			q_table[state] = q_values
	
	return q_table

def detect_early_stopping(rewards, episode_window=500, improvement_threshold=0.001):
	return False  # Never trigger early stopping

def visualize_sample_mazes(sorted_mazes, num_to_visualize=3):
	os.makedirs("maze_visualizations", exist_ok=True)
	
	for i in range(min(num_to_visualize, len(sorted_mazes))):
		complexity, maze_data = sorted_mazes[i]
		maze = np.array(maze_data["maze"])
		start = tuple(maze_data["start"])
		goal = tuple(maze_data["goal"])
		
		plt.figure(figsize=(8, 8))
		plt.imshow(maze, cmap='binary')
		plt.scatter(start[1], start[0], c='green', s=200, marker='o', label='Start')
		plt.scatter(goal[1], goal[0], c='red', s=200, marker='*', label='Goal')
		plt.title(f'Maze #{i+1} - Complexity: {complexity:.2f}')
		plt.legend()
		plt.grid(False)
		plt.savefig(f"maze_visualizations/maze_sample_{i+1}.png")
		plt.close()
	
	print(f"Visualized {min(num_to_visualize, len(sorted_mazes))} sample mazes in 'maze_visualizations' folder")

def train_q_learning(
	mazes,
	episodes=CONFIG["episodes"],
	alpha=CONFIG["alpha"],
	gamma=CONFIG["gamma"],
	epsilon_start=CONFIG["epsilon_start"],
	epsilon_end=CONFIG["epsilon_end"],
	epsilon_decay=CONFIG["epsilon_decay"],
	temperature_start=CONFIG["temperature_start"],
	temperature_end=CONFIG["temperature_end"],
	temperature_decay=CONFIG["temperature_decay"],
	max_steps=CONFIG["max_steps"],
	max_stuck_steps=CONFIG["max_stuck_steps"]
):
	#tracking training progress
	q_table = {}
	total_rewards = []
	success_count = 0
	steps_per_episode = []
	epsilon_values = []
	success_rates = []
	mean_reward_per_1000 = []

	epsilon = epsilon_start
	temperature = temperature_start
	
	#sort mazes for curriculum learning if enabled
	if CONFIG["use_curriculum_learning"]:
		sorted_mazes = sort_mazes_by_complexity(mazes)
		print(f"Sorted {len(sorted_mazes)} mazes by complexity for curriculum learning")
		
		#visualize a few sample mazes to verify they're solvable
		visualize_sample_mazes(sorted_mazes, num_to_visualize=3)
		
		#print complexity information for each maze
		print("Maze complexity distribution:")
		for i, (complexity, maze) in enumerate(sorted_mazes):
			print(f"  Maze {i+1}: Complexity score {complexity:.2f}")

		sorted_mazes = [m[1] for m in sorted_mazes]  # Extract just the maze data
	else:
		sorted_mazes = mazes
	
	#experience replay buffer
	replay_buffer = deque(maxlen=CONFIG["replay_buffer_size"])
	
	#create a directory for saving figures
	os.makedirs("training_plots", exist_ok=True)

	for ep in range(1, episodes + 1):
		#curriculum learning: gradually introduce more complex mazes
		if CONFIG["use_curriculum_learning"]:
			# Determine which subset of mazes to use based on training progress
			phase = min(int((ep / episodes) * CONFIG["curriculum_phases"]), CONFIG["curriculum_phases"] - 1)
			maze_subset_size = max(1, int((phase + 1) * len(sorted_mazes) / CONFIG["curriculum_phases"]))
			maze_subset = sorted_mazes[:maze_subset_size]
			maze_data = random.choice(maze_subset)
		else:
			maze_data = random.choice(mazes)
		
		env = MazeEnv(maze_data["maze"], maze_data["start"], maze_data["goal"])
		state = env.reset()
		episode_reward = 0
		done = False
		steps = 0
		
		#loop detection
		state_history = deque(maxlen=20)  # Larger history for better loop detection
		stuck_count = 0
		
		#prioritized sweeping
		state_transitions = []

		while not done and steps < max_steps:
			steps += 1
			state_history.append(state)

			#detect if stuck in a loop
			if state_history.count(state) > CONFIG["loop_detection_threshold"]:
				action = np.random.randint(len(env.actions))
			else:
				q_values = get_q_values(q_table, state)
				action = select_action(q_values, epsilon, temperature)

			next_state, reward, done = env.step(action)
			episode_reward += reward
			
			#add to experience replay buffer
			replay_buffer.append((state, action, reward, next_state, done))
			
			#track state transitions for prioritized sweeping
			state_transitions.append((state, action, reward, next_state, done))

			#update stuck counter
			if next_state == state:
				stuck_count += 1
			else:
				stuck_count = 0

			#force a random move if stuck
			if stuck_count >= max_stuck_steps and not done:
				action = np.random.randint(len(env.actions))
				next_state, forced_reward, forced_done = env.step(action)
				episode_reward += forced_reward
				done = forced_done
				stuck_count = 0
				
				#add forced move to replay buffer
				replay_buffer.append((state, action, forced_reward, next_state, forced_done))
				state_transitions.append((state, action, forced_reward, next_state, forced_done))

			#standard Q-learning update
			q_values = get_q_values(q_table, state)
			old_q = q_values[action]
			next_q_values = get_q_values(q_table, next_state)
			best_next_q = max(next_q_values)
			new_q = old_q + alpha * (reward + gamma * best_next_q - old_q)
			q_values[action] = new_q
			q_table[state] = q_values

			state = next_state
		
		#experience replay after episode if enabled
		if CONFIG["use_experience_replay"] and ep % CONFIG["replay_frequency"] == 0:
			q_table = experience_replay(
				q_table, 
				replay_buffer, 
				CONFIG["replay_batch_size"], 
				alpha, 
				gamma
			)
		
		#prioritized sweeping if enabled
		if CONFIG["use_prioritized_sweeping"]:
			q_table = prioritized_sweeping(
				q_table, 
				state_transitions, 
				alpha, 
				gamma, 
				CONFIG["priority_threshold"]
			)
		
		#track episode results
		total_rewards.append(episode_reward)
		steps_per_episode.append(steps)
		epsilon_values.append(epsilon)
		
		if done and episode_reward > 0:  # Successfully reached the goal
			success_count += 1
		
		#calculate current success rate
		current_success_rate = 100.0 * success_count / ep
		success_rates.append(current_success_rate)

		#decay exploration parameters
		epsilon = max(epsilon_end, epsilon * epsilon_decay)
		temperature = max(temperature_end, temperature * temperature_decay)

		#reporting
		if ep % 1000 == 0:
			last_1000_mean = np.mean(total_rewards[-1000:])
			mean_reward_per_1000.append(last_1000_mean)
			
			print(f"Episode {ep}/{episodes}: Reward={episode_reward:.1f}, Steps={steps}, "
					f"Epsilon={epsilon:.3f}, Success Rate={current_success_rate:.2f}%")
			
			if CONFIG["use_curriculum_learning"]:
				print(f"  Current Maze Complexity: {env.complexity:.2f}, "
						f"Current Phase: {phase+1}/{CONFIG['curriculum_phases']}")
			
			print(f"  Average Last 1000 Rewards: {last_1000_mean:.2f}")
			
			#plot training progress every 10,000 episodes
			if ep % 10000 == 0:
				plot_training_progress(
					total_rewards,
					steps_per_episode,
					epsilon_values,
					success_rates,
					mean_reward_per_1000,
					ep
				)
	
	#final statistics
	avg_reward = np.mean(total_rewards)
	success_rate = 100.0 * success_count / episodes
	avg_steps_success = np.mean([s for i, s in enumerate(steps_per_episode) if i < len(total_rewards) and total_rewards[i] > 0]) if success_count > 0 else 0
	
	print("\nTraining Finished!")
	print(f"  Average Reward: {avg_reward:.2f}")
	print(f"  Success Rate: {success_rate:.2f}%")
	print(f"  Average Steps (successful episodes): {avg_steps_success:.2f}")
	
	#final learning curves
	plot_training_progress(
		total_rewards,
		steps_per_episode,
		epsilon_values,
		success_rates,
		mean_reward_per_1000,
		episodes,
		final=True
	)
	
	return q_table

def plot_training_progress(rewards, steps, epsilons, success_rates, mean_rewards_per_1000, episode, final=False):
	plt.figure(figsize=(15, 12))
	
	window_size = min(100, len(rewards))
	if window_size > 0:
		smoothed_rewards = np.convolve(rewards, np.ones(window_size)/window_size, mode='valid')
		
		plt.subplot(3, 2, 1)
		plt.plot(smoothed_rewards)
		plt.title('Average Reward (Smoothed)')
		plt.xlabel('Episode')
		plt.ylabel('Reward')
		
		plt.subplot(3, 2, 2)
		smoothed_steps = np.convolve(steps, np.ones(window_size)/window_size, mode='valid')
		plt.plot(smoothed_steps)
		plt.title('Steps per Episode (Smoothed)')
		plt.xlabel('Episode')
		plt.ylabel('Steps')
		
		plt.subplot(3, 2, 3)
		plt.plot(epsilons)
		plt.title('Exploration Rate (Epsilon)')
		plt.xlabel('Episode')
		plt.ylabel('Epsilon')
		
		plt.subplot(3, 2, 4)
		plt.plot(success_rates)
		plt.title('Success Rate Progression')
		plt.xlabel('Episode')
		plt.ylabel('Success Rate (%)')
		plt.ylim(0, 100)
		
		plt.subplot(3, 2, 5)
		plt.plot(range(1000, len(rewards) + 1, 1000), mean_rewards_per_1000)
		plt.title('Mean Reward per 1000 Episodes')
		plt.xlabel('Episode')
		plt.ylabel('Mean Reward')
		
		plt.subplot(3, 2, 6)
		plt.hist(rewards, bins=50)
		plt.title('Reward Distribution')
		plt.xlabel('Reward')
		plt.ylabel('Frequency')
	
	plt.tight_layout()
	
	#save figure
	if final:
		plt.savefig(f"training_plots/final_training_progress.png")
	else:
		plt.savefig(f"training_plots/training_progress_ep{episode}.png")
	
	plt.close()

def save_q_table(q_table, output_folder, filename):
	os.makedirs(output_folder, exist_ok=True)
	filepath = os.path.join(output_folder, filename)
	with open(filepath, "wb") as f:
		pickle.dump(q_table, f)
	print(f"Q-table saved to {filepath}")

def visualize_maze_policy(q_table, maze_data, output_folder="maze_visualizations"):
	os.makedirs(output_folder, exist_ok=True)
	
	maze = np.array(maze_data["maze"])
	start = tuple(maze_data["start"])
	goal = tuple(maze_data["goal"])
	
	#create a visualization of the maze with policy arrows
	plt.figure(figsize=(12, 12))
	plt.imshow(maze, cmap='binary')
	
	#action directions for visualization
	arrows = ['→', '↓', '←', '↑']
	arrow_colors = ['blue', 'green', 'red', 'purple']
	
	#plot arrows for each navigable cell
	for i in range(maze.shape[0]):
		for j in range(maze.shape[1]):
			if maze[i, j] == 0:  # If navigable
				state = (i, j)
				if state in q_table:
					q_values = q_table[state]
					best_action = np.argmax(q_values)
					
					plt.text(j, i-0.3, f"{max(q_values):.1f}", 
								ha='center', va='center', 
								color='black', fontsize=7)
					
					plt.text(j, i, arrows[best_action], 
								ha='center', va='center', 
								color=arrow_colors[best_action],
								fontsize=12, fontweight='bold')
	
	#mark start and goal
	plt.scatter(start[1], start[0], c='green', s=200, marker='o', label='Start')
	plt.scatter(goal[1], goal[0], c='red', s=200, marker='*', label='Goal')
	
	plt.title('Maze Policy Visualization')
	plt.legend()
	plt.grid(False)
	
	#save the visualization
	plt.savefig(f"{output_folder}/policy_visualization.png")
	plt.close()

def visualize_training_metrics(total_rewards, success_rates, output_folder="training_analysis"):
	os.makedirs(output_folder, exist_ok=True)
	
	#analyze reward distribution
	plt.figure(figsize=(15, 8))
	
	#plot reward histogram 
	plt.subplot(1, 2, 1)
	plt.hist(total_rewards, bins=50, color='skyblue', edgecolor='black')
	plt.axvline(np.mean(total_rewards), color='r', linestyle='--', linewidth=1, 
				label=f'Mean: {np.mean(total_rewards):.2f}')
	plt.axvline(np.median(total_rewards), color='g', linestyle='--', linewidth=1,
				label=f'Median: {np.median(total_rewards):.2f}')
	plt.title('Total Reward Distribution')
	plt.xlabel('Reward')
	plt.ylabel('Frequency')
	plt.legend()
	
	#success rate over time
	plt.subplot(1, 2, 2)
	plt.plot(success_rates, color='green')
	plt.title('Success Rate Progression')
	plt.xlabel('Episode')
	plt.ylabel('Success Rate (%)')
	plt.ylim(0, 100)
	
	plt.tight_layout()
	plt.savefig(f"{output_folder}/training_metrics_analysis.png")
	plt.close()

#Main
if __name__ == "__main__":
	MAZE_FOLDER = "mazes"
	MODEL_FOLDER = "model"
	MODEL_FILE = "q_table.pkl"

	print("Loading mazes for training...")
	mazes = load_mazes(MAZE_FOLDER)
	print(f"Loaded {len(mazes)} maze(s).")

	print("Starting Q-learning training with enhanced features...")
	print(f"  - Experience Replay: {'Enabled' if CONFIG['use_experience_replay'] else 'Disabled'}")
	print(f"  - Prioritized Sweeping: {'Enabled' if CONFIG['use_prioritized_sweeping'] else 'Disabled'}")
	print(f"  - Curriculum Learning: {'Enabled' if CONFIG['use_curriculum_learning'] else 'Disabled'}")
	
	q_table = train_q_learning(mazes=mazes)

	save_q_table(q_table, MODEL_FOLDER, MODEL_FILE)
	
	if mazes:
		print("Generating policy visualization for a sample maze...")
		visualize_maze_policy(q_table, mazes[0])
		
		# Visualize the learned policy on the last loaded maze as well
		print("Generating policy visualization for the last loaded maze...")
		visualize_maze_policy(q_table, mazes[-1], output_folder="maze_visualizations")
	
	print("Training complete! You're ready to go solve some mazes!")
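One of the stated goals is reusing the saved Q-table, which the script itself does not demonstrate. A minimal sketch of loading a pickled Q-table and acting greedily (the toy table here is made up; in practice you would load `model/q_table.pkl` as written by `save_q_table`):

```python
import pickle

# Toy Q-table standing in for a trained one: state -> Q-values for the
# four actions in the order (right, down, left, up)
q_table = {(0, 0): [0.5, 2.0, -1.0, 0.1]}

with open("q_table.pkl", "wb") as f:
    pickle.dump(q_table, f)

# Later, or in a separate script: load the table and act greedily (no exploration)
with open("q_table.pkl", "rb") as f:
    loaded = pickle.load(f)

state = (0, 0)
best_action = max(range(4), key=lambda a: loaded[state][a])
print(best_action)  # 1 -> "down", the action with the highest Q-value
```

At inference time epsilon and temperature are no longer needed; the agent simply follows the argmax action in each state, falling back to a default only for states the table has never seen.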