Model-free reinforcement learning refers to reinforcement learning algorithms that do not use the transition probability distribution or the reward function of the Markov decision process (MDP) that represents the problem being solved.
In reinforcement learning, the transition probability distribution (or transition model) and the reward function are collectively called the “model.” Model-free algorithms get their name because they do not rely on this model.
Since the algorithm does not use a model, you can think of it as learning by trial and error. The machine tries actions, observes the results, and gradually favors the actions that lead to better outcomes.
To better understand what model-free reinforcement learning is, let us compare it with model-based reinforcement learning.
How Do Model-Based and Model-Free Reinforcement Learning Differ?
Model-based reinforcement learning, as opposed to model-free, has an agent that tries to understand its environment and builds a model of it based on its interactions with that environment. The agent can then use this model to predict the consequences of its actions and plan ahead to maximize reward.
In model-free reinforcement learning, meanwhile, the machine acts repeatedly and adjusts its strategy for optimal rewards based on the outcomes it observes.
Simply put, if the machine can predict the reward of an action before taking it, or plan what it should do, its algorithm is model-based. But if it needs to act first to see what happens and learn from that, its algorithm is model-free.
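The model-free side of this distinction can be sketched with tabular Q-learning, one of the simplest model-free methods: the agent improves its value estimates using only sampled transitions, never querying transition probabilities or a reward function directly. The toy chain environment and hyperparameters below are illustrative assumptions for this sketch, not part of any standard library.

```python
import random

# Toy 5-state chain environment: move left/right, reward 1.0 on reaching the
# last state. The environment itself is an illustrative assumption.
N_STATES, ACTIONS = 5, [0, 1]  # 0 = left, 1 = right

def step(state, action):
    """Sample one transition; the agent never sees the rule inside."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1  # next state, reward, done

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma = 0.5, 0.9  # learning rate, discount factor

for _ in range(500):  # episodes of pure trial and error
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)  # explore with a random behavior policy
        s2, r, done = step(s, a)
        # Model-free update: uses only the sampled (s, a, r, s') tuple,
        # never the transition probabilities or the reward function.
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The greedy policy learned from samples alone prefers "right" in every state.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

A model-based planner solving the same chain would instead use the known transition rule inside `step` to compute values without ever acting in the environment.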
What Is the Origin of Model-Free Reinforcement Learning?
Model-free reinforcement learning’s origin can be traced back to the late 19th century when Edward Thorndike proposed the “law of effect.” The law states that “actions with positive effects in a particular situation become more likely to occur again in that situation, and responses that produce negative effects become less likely to occur in the future.”
Thorndike explored the law of effect with an experiment. He placed a cat inside a puzzle box and measured how long it took the cat to escape. To get out, the cat had to operate various devices, such as strings and levers.
Thorndike found that as the cat interacted more with the puzzle box, it learned what behavioral responses could help it escape. Over time, the cat grew more adept at escaping. As such, he concluded that the cat learned from the rewards and punishments its actions provided.
The law of effect paved the way for behaviorism—a branch of psychology that attempts to explain human and animal behaviors in terms of stimuli and responses. It is also the foundation of model-free reinforcement learning, where an agent (cat) perceives the world (puzzle box), takes action (pulls a lever, for example), and measures the reward (escape). The agent performs random actions and gradually repeats those that lead to more rewards.
In model-free reinforcement learning, the agent or machine has no direct knowledge or model of the world. It must directly experience every outcome of each action through trial and error.
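This law-of-effect loop, act, observe the reward, repeat what worked, can be sketched as a two-armed bandit: the agent has no model of which lever pays off and must find out purely by trial and error. The payout probabilities below are illustrative assumptions.

```python
import random

random.seed(1)
# Two "levers": arm 1 pays off more often, but the agent does not know this.
payout_prob = [0.2, 0.8]  # illustrative assumption, hidden from the agent

values = [0.0, 0.0]  # running estimate of each arm's reward
counts = [0, 0]

for t in range(2000):
    # Explore occasionally; otherwise repeat the action that has paid off most.
    arm = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
    reward = 1.0 if random.random() < payout_prob[arm] else 0.0
    counts[arm] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    values[arm] += (reward - values[arm]) / counts[arm]

print(values)  # estimates approach the true payout probabilities
```

Like Thorndike's cat, the agent starts with random behavior and ends up pulling the rewarding lever almost every time, having experienced the outcomes directly rather than predicted them from a model.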
What Are Examples of Model-Free Reinforcement Learning Algorithms?
Some popular model-free reinforcement learning algorithms are:
- Deep Q Network (DQN): DQN is an algorithm that teaches a machine via deep Q-learning, using a deep neural network to estimate the value of each possible action. Typically used in games, DQN famously reached human-level play on many Atari titles, achieving the desired outcome of beating the opponent.
- Deep Deterministic Policy Gradient (DDPG): DDPG is an actor-critic algorithm suited to continuous action spaces. It can be applied, for example, to a robot arm with two joints tasked with tracking a balloon: as the balloon moves, the machine adjusts the two joints to follow it.
- Asynchronous Advantage Actor-Critic Algorithm (A3C): A3C is said to be a state-of-the-art reinforcement learning algorithm. According to a Google DeepMind paper, it beat DQN in the Atari domain. It runs several workers in parallel, each interacting with its own copy of the environment while asynchronously updating a shared global network, which also helps generalization.
- Trust Region Policy Optimization (TRPO): TRPO updates a policy by taking the largest step that improves performance while satisfying a special constraint on how close the new policy must stay to the old one.
- Proximal Policy Optimization (PPO): PPO is another state-of-the-art reinforcement learning algorithm, introduced by OpenAI in 2017. It is said to strike the right balance between performance and ease of implementation.
- Twin Delayed Deep Deterministic Policy Gradient (TD3): TD3 builds on DDPG, using a pair of critics to reduce the overestimation of action values. A TD3 agent looks for an optimal policy that maximizes the expected cumulative long-term reward.
- Soft Actor-Critic (SAC): SAC, another state-of-the-art reinforcement learning algorithm, was jointly developed by UC Berkeley and Google. It maximizes reward while also maximizing the entropy of its policy, which encourages exploration, and it is considered one of the most efficient algorithms for real-world robotics.
Key Takeaways
- Model-free reinforcement learning refers to reinforcement learning algorithms that do not use the transition probability distribution or the reward function of the Markov decision process that represents the problem being solved.
- If an AI system can predict the reward of an action before taking it, or plan what it should do, its algorithm is model-based. But if it needs to act first to see what happens and learn from that, its algorithm is model-free.
- Edward Thorndike proposed the “law of effect” in the late 19th century, and it is considered the conceptual foundation of model-free reinforcement learning.