Understanding Reinforcement Learning — A Primer


Imagine teaching a dog to fetch a ball. You don’t hand the dog a manual titled “The Complete Guide to Ball Retrieval.” Instead, you throw the ball, and when the dog brings it back, you give it a treat. When the dog gets distracted and wanders off, you withhold the treat. Over dozens of repetitions, the dog learns that bringing the ball back leads to rewards, while ignoring the ball doesn’t. This process of learning through interaction, experimentation, and feedback is exactly what reinforcement learning does for artificial intelligence.

Reinforcement learning is fundamentally different from the other types of machine learning you might be familiar with. In supervised learning, we show the algorithm thousands of examples with correct answers, like showing a child flashcards where one side has a picture of an apple and the other side has the word “apple.”

In unsupervised learning, we give the algorithm data without answers and ask it to find patterns, like asking someone to organize a messy drawer without telling them how. But in reinforcement learning, we do something more interesting: we place an agent in an environment, give it a goal, and let it figure out how to achieve that goal through experimentation.

The agent doesn’t know the right answer in advance. It doesn’t have a dataset of correct moves to learn from. Instead, it takes actions, observes what happens, receives rewards or penalties, and gradually learns which actions tend to lead to good outcomes and which ones don’t. This is how DeepMind’s AlphaGo learned to beat world champions at Go, how robotic arms learn to grasp objects, and how autonomous vehicles learn to navigate roads. The agent learns by doing, making mistakes, and slowly improving its strategy based on the consequences of its actions.

“In reinforcement learning, the agent doesn’t know the right answer in advance. It doesn’t have a dataset of correct moves to learn from. Instead, it takes actions, observes what happens, receives rewards or penalties and gradually learns which actions tend to lead to good outcomes and which ones don’t.”

At the heart of every reinforcement learning problem are five fundamental components that work together in a continuous loop. Understanding each of these components and how they interact is essential to grasping how reinforcement learning actually works.

The agent is the learner or decision-maker. In our dog example, the dog is the agent. In a video game, the agent might be the character you control. In a self-driving car, the agent is the AI system making decisions about steering, acceleration, and braking. The agent exists to make decisions, and its entire purpose is to learn which decisions lead to the best outcomes. The agent doesn’t start out knowing anything; it begins with a blank slate and learns entirely from experience.

The environment is everything the agent interacts with. It’s the world in which the agent operates. For the dog, the environment includes the room, the ball, you as the trainer, and all the physical laws that govern how balls bounce and roll. For a chess-playing agent, the environment is the chessboard and the rules of chess. For a trading algorithm, the environment is the stock market with all its complexity, volatility, and rules. The environment responds to the agent’s actions and provides feedback. It’s important to note that the agent doesn’t control the environment; it can only influence it through its actions.

“The agent doesn’t control the environment; it can only influence it through its actions.”

A state represents a specific situation or configuration of the environment at a particular moment in time. When you’re teaching the dog to fetch, one state might be “ball has just been thrown and is in the air,” another state might be “ball has landed fifteen feet away,” and another might be “dog has ball in mouth and is five feet from owner.” States capture all the relevant information the agent needs to make a decision. In a video game, the state might include the positions of all characters, their health levels, available items, and the current score. The quality of the state representation is key: if you don’t include important information in your state, the agent won’t be able to make good decisions.

“A state represents a specific situation or configuration of the environment at a particular moment in time.”

An action is something the agent can do to interact with the environment. Actions are the agent’s way of influencing its world. For the dog, actions might include “run toward ball,” “pick up ball,” “run toward owner,” or “lie down and take a nap.” For a chess agent, actions are the legal moves available given the current board position. For a robot learning to walk, actions are the specific motor commands sent to each joint and actuator. The set of available actions can change depending on the current state. In chess, the legal moves change with every move made. In the fetch example, the dog can’t pick up the ball if the ball isn’t within reach.

“An action is the agent’s way of interacting with and influencing the environment.”

The reward is the feedback signal that tells the agent whether its action was good or bad. Rewards are numbers: positive numbers for good outcomes and negative numbers (penalties) for bad outcomes. When the dog brings the ball back, it gets a positive reward (the treat, which we might represent as +10). When it ignores the ball, it gets zero or even a small negative reward (no treat, perhaps represented as -1 or 0). The reward is the only way the environment communicates value to the agent. The agent’s entire learning process is driven by a single objective: maximize the total reward it receives over time. This is crucial to understand — the agent doesn’t know what “fetching” means or why it’s good. It only knows that certain sequences of actions lead to higher rewards, and it will do whatever it takes to get those rewards.

“The reward is how the environment communicates value to the agent.”

These five components interact in a continuous cycle that we call the agent-environment loop. The agent observes the current state of the environment, chooses an action based on that state, executes that action, and then observes the new state and the reward it received. This new state becomes the current state, and the cycle repeats. Over thousands or millions of these cycles, the agent gradually learns which actions to take in which states to maximize its long-term rewards.
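The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FetchEnv`, its two actions ("toward" and "nap"), the reward values, and `run_episode` are all hypothetical names invented for this example, loosely modeled on the fetch scenario.

```python
import random


class FetchEnv:
    """Hypothetical toy environment: the dog starts `distance` steps from the ball."""

    def __init__(self, distance=3):
        self.start_distance = distance
        self.distance = distance

    def reset(self):
        # Start a new episode and return the initial state.
        self.distance = self.start_distance
        return self.distance  # the state is simply the distance to the ball

    def step(self, action):
        # "toward" moves one step closer; any other action (e.g. "nap") does nothing.
        if action == "toward":
            self.distance -= 1
        done = self.distance == 0
        reward = 10 if done else -1  # treat on reaching the ball, small penalty otherwise
        return self.distance, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one pass of the agent-environment loop and return the total reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = choose_action(state)                # agent picks an action from the state
        next_state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        state = next_state                           # the new state becomes the current state
        if done:
            break
    return total_reward


rng = random.Random(0)
random_policy = lambda state: rng.choice(["toward", "nap"])
print(run_episode(FetchEnv(), random_policy))
```

Here the agent is just a function from state to action. Swapping `random_policy` for something smarter is what the rest of the learning process is about.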

When a reinforcement learning agent first encounters an environment, it knows absolutely nothing. It’s like a newborn baby encountering the world for the first time — it doesn’t know what actions are good, what states are dangerous, or what strategies might lead to success. So what does it do? It explores randomly.

In the beginning, the agent takes random actions just to see what happens. If we’re training an agent to play a video game, it might randomly jump, run left, run right, shoot, or stand still without any rhyme or reason. Most of these random actions will lead to poor outcomes. The character might run off a cliff, walk into an enemy, or simply wander aimlessly. But occasionally, by pure chance, the agent will stumble upon something good. Maybe it randomly walks near a power-up and receives a reward, or it accidentally defeats an enemy and gets points. These moments are crucial because they’re the seeds from which learning grows.

Every time the agent takes an action, it records what happened: “I was in state S, I took action A, I ended up in new state S’, and I received reward R.” This experience is stored in the agent’s memory. Over time, the agent accumulates thousands or millions of these experiences. The agent then uses these experiences to build an understanding of the environment’s dynamics and the value of different actions in different states.
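One way to picture this memory is as a list of (S, A, S', R) records. The sketch below assumes a bounded buffer that drops the oldest experience once it is full and can hand back a random sample; the class name `ExperienceMemory`, its methods, and the capacity are illustrative choices, not something the article prescribes.

```python
import random
from collections import deque, namedtuple

# One record per step: "I was in state S, took action A, ended up in S', received R."
Experience = namedtuple("Experience", ["state", "action", "next_state", "reward"])


class ExperienceMemory:
    """Hypothetical bounded store of past experiences."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def record(self, state, action, next_state, reward):
        self.buffer.append(Experience(state, action, next_state, reward))

    def sample(self, batch_size, rng=random):
        # Return up to batch_size distinct experiences chosen at random.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


memory = ExperienceMemory(capacity=1000)
memory.record(state="ball_thrown", action="run_toward_ball", next_state="at_ball", reward=0)
memory.record(state="at_ball", action="pick_up_ball", next_state="has_ball", reward=0)
print(len(memory))  # 2
```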

The key insight is that the agent doesn’t just care about immediate rewards. If you’re playing a video game, picking up a small coin might give you 1 point right now, but positioning yourself correctly might let you access a treasure chest worth 100 points a few moves later. The challenge is learning to value actions not just by their immediate payoff, but by their long-term consequences. This is what makes reinforcement learning both powerful and challenging.

To handle this problem of weighing immediate against delayed rewards (known as temporal credit assignment), reinforcement learning agents use a concept called discounted future rewards. The idea is simple: a reward you’ll receive right now is worth more than the same reward you might receive far in the future. We capture this idea mathematically with a discount factor (usually called gamma, γ) between 0 and 1. If gamma is 0.9, then a reward of 10 points one step in the future is worth 9 points to you right now (10 × 0.9). A reward of 10 points two steps in the future is worth 8.1 points right now (10 × 0.9 × 0.9). The farther in the future a reward is, the less it influences your decision right now.
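The arithmetic in the paragraph above can be checked with a short function. This is a minimal sketch: `discounted_return` is a name chosen for this example, and it assumes the rewards are listed in the order they arrive, with the first entry being the reward received right now.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward k steps ahead is scaled by gamma ** k."""
    total = 0.0
    # Work backwards so each earlier step adds its reward plus the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A reward of 10 arriving one step from now, then two steps from now.
print(round(discounted_return([0, 10]), 2))
print(round(discounted_return([0, 0, 10]), 2))
```

With gamma at 0.9 these reproduce the 9 and 8.1 from the text. A gamma close to 0 makes the agent short-sighted, while a gamma close to 1 makes distant rewards count almost as much as immediate ones.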

The agent’s goal is to learn a policy — a strategy that maps states to actions. A policy tells the agent, “When you’re in this state, you should take this action.” In the beginning, the policy is essentially random. But as the agent learns from experience, the policy improves. It starts to encode useful strategies: “When I see an enemy, move away” or “When I see a power-up, move toward it” or “When I’m near the goal, take the action that gets me there fastest.”
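In its simplest form, a policy is nothing more than a lookup from state to action. The sketch below assumes a handful of made-up state and action names (`enemy_visible`, `move_away`, and so on) and falls back to a random action for states the policy has not learned yet.

```python
import random

ACTIONS = ["move_away", "move_toward", "wait"]  # hypothetical action names


def random_policy(state, rng=random):
    # What the agent does before it has learned anything: ignore the state.
    return rng.choice(ACTIONS)


# A partially learned policy: known states map directly to an action.
learned_policy = {
    "enemy_visible": "move_away",
    "power_up_visible": "move_toward",
}


def act(state, policy=learned_policy, rng=random):
    # Use the learned rule when one exists, otherwise act randomly.
    if state in policy:
        return policy[state]
    return random_policy(state, rng)


print(act("enemy_visible"))  # move_away
```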

There are different ways to learn this policy, but one of the most intuitive approaches is called Q-learning, which we’ll implement in detail in a later post. Q-learning works by learning the “quality” or “value” of taking each action in each state. The Q-value Q(s, a) answers the question: “If I’m in state s and I take action a, how much total reward can I expect to receive from now until the end?” By learning these Q-values, the agent can always choose the action with the highest Q-value, which is the action expected to lead to the best long-term outcome.
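As a preview, a Q-table can be sketched as a dictionary keyed by (state, action). The update rule below is the standard textbook tabular Q-learning update, not something spelled out in this article; the learning rate `alpha`, the state names, and the two-action set are assumptions made for illustration, and terminal states are not handled specially.

```python
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical action set


def best_action(q, state):
    # Pick the action with the highest Q-value in this state.
    return max(ACTIONS, key=lambda action: q[(state, action)])


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Target = immediate reward + discounted value of the best next action.
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    # Nudge the current estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_update(q_table, "s0", "right", reward=10, next_state="s1")
print(best_action(q_table, "s0"))  # right
```

After a single rewarded experience, "right" is the only action in `s0` with a nonzero estimate, so the greedy choice switches to it.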

One of the most fascinating challenges in reinforcement learning is the exploration-exploitation tradeoff, a problem that exists not just in AI but in human decision-making as well. Imagine you’ve found a restaurant you really like. Every time you go there, you have a great meal. But there are dozens of other restaurants nearby you’ve never tried. Do you keep going to your favorite restaurant (exploitation) or try new ones (exploration)? If you only exploit, you might miss out on an even better restaurant. If you only explore, you’ll never settle down and enjoy consistently good meals.

Reinforcement learning agents face this exact dilemma at every decision point. Should the agent take the action it currently believes is best based on what it’s learned so far (exploitation), or should it try a different action to see if it might discover something even better (exploration)? This is fundamental to whether learning will succeed or fail.

If an agent only exploits, it becomes stuck in what we call a local optimum. Imagine an agent learning to navigate a maze. It might discover a path that leads to a small reward of 5 points. If it only exploits this known path, it will never discover the path on the other side of the maze that leads to a reward of 100 points. The agent has found something that works, but not the best possible solution. It’s like thinking the first restaurant you tried is the best in the city simply because you never tried the others.

“Exploiting is like thinking the first restaurant you tried is the best in the city simply because you never tried the others.”

On the other hand, if an agent only explores and never exploits, it never benefits from what it’s learned. It keeps trying random actions even after it’s discovered good strategies. This is inefficient and can prevent the agent from ever converging on a good solution. Imagine if every single time you went out to eat, you picked a completely random restaurant, even after you’d tried hundreds of them. You’d never build up a list of favorites or avoid places you know are bad.

The solution to this dilemma is to do both: explore when you’re uncertain or when you haven’t learned much yet, and exploit increasingly as you become more confident in your knowledge. One popular approach is called epsilon-greedy exploration. The idea is straightforward: most of the time (with probability 1 - ε), choose the action you currently believe is best. But with a small probability ε (epsilon), choose a completely random action instead. This ensures that the agent mostly does what it thinks is right but occasionally tries something new.
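The epsilon-greedy rule fits in a few lines. In this sketch, `q_values` is assumed to be a dictionary mapping each available action to the agent's current value estimate for the present state; the function name and the restaurant-themed action names are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict of action -> estimated value."""
    if rng.random() < epsilon:
        # Explore: with probability epsilon, pick any action uniformly at random.
        return rng.choice(list(q_values))
    # Exploit: otherwise pick the action with the highest estimated value.
    return max(q_values, key=q_values.get)


estimates = {"favorite_restaurant": 8.0, "new_restaurant": 0.0}
print(epsilon_greedy(estimates, epsilon=0.0))  # favorite_restaurant
```

With `epsilon=0.0` the agent never explores, and with `epsilon=1.0` it always acts randomly; useful values sit in between.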

“Explore when you’re uncertain or when you haven’t learned much yet.

Exploit increasingly as you become more confident in your knowledge.”

Typically, we start with a high epsilon value (like 0.9 or 1.0) at the beginning of training, meaning the agent explores almost randomly. As training progresses and the agent learns more, we gradually decrease epsilon (a process called epsilon decay) so that the agent explores less and exploits its knowledge more. By the end of training, epsilon might be as low as 0.01, meaning the agent only explores 1% of the time and follows its learned policy 99% of the time.
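Epsilon decay can follow many schedules. The sketch below assumes a multiplicative one, shrinking epsilon by a fixed factor each episode until it reaches a floor. The start and end values come from the paragraph above, while the decay rate of 0.995 and the function name are assumptions chosen for illustration.

```python
def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.995):
    """Epsilon for a given episode: shrinks geometrically, never below `end`."""
    return max(end, start * decay ** episode)


# Print epsilon at a few points during training to see the schedule.
for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 3))
```

A slower decay rate keeps the agent exploring for longer, which can help in environments where good rewards are hard to stumble upon.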

The following posts in this series will go into more detail on implementing these concepts.

Understanding Reinforcement Learning — A Primer was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

