Learn about the Markov decision process (MDP), a stochastic decision-making process that undergirds reinforcement learning, machine learning, and artificial intelligence.
In machine learning, Markov decision processes (MDPs) are models for making optimal decisions when outcomes are partly random. The framework grows out of the theory of Markov chains, discrete-time stochastic processes characterized by the Markov property. MDPs are foundational models for reinforcement learning, a core technique in artificial intelligence and machine learning used by robotics, autonomous vehicles, and other advanced automated systems.
Explore Markov decision processes, their core concepts, their uses, and their applications in many industries.
Several components make up a Markov decision process. The first is the Markov property, which states that the probability of each future state depends only on the current state, not on the states that came before it. In an MDP, the agent is the decision-maker that executes actions in the system in an attempt to optimize performance. The agent makes decisions at specific points in time, known as decision epochs. At each epoch, the agent incurs a reward or a cost, which shapes the actions it takes in future decisions.
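As a quick illustration of the Markov property, here is a minimal Python sketch of a two-state weather chain; the states and probabilities are invented for illustration. Sampling the next state uses only the current state, never the history of earlier states.

```python
import random

# Hypothetical two-state chain used only to illustrate the Markov
# property: the distribution of the next state depends solely on the
# current state, never on the path taken to reach it.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(state):
    """Sample the successor state using only the current state."""
    successors = list(transition[state].keys())
    probs = list(transition[state].values())
    return random.choices(successors, weights=probs)[0]

state = "sunny"
for t in range(5):  # five decision epochs
    state = next_state(state)
    print(t, state)
```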
The core concepts of an MDP, as described above, take the form of a tuple (S, A, p, r), with each element defined below (a code sketch of the tuple follows the list):
States (S): The set of states, or state space, of the system; for example, all positions a vehicle can occupy
Actions (A): The set of actions available to the agent; for example, all the ways a vehicle can move: turning the wheel, driving forward, reversing, or stopping
Transition probabilities (p): The probability distribution over the next state, given the current state and the action taken in it
Reward function (r): The immediate cost or reward the agent incurs for performing a given action in a given state
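To make the tuple concrete, here is a minimal sketch in Python, assuming a made-up two-state vehicle example; all names and numbers are illustrative, not from any real system.

```python
# A minimal sketch of the (S, A, p, r) tuple for a hypothetical
# two-position vehicle example; names and numbers are illustrative.
S = ["garage", "road"]                      # states
A = ["drive", "park"]                       # actions

# p[(state, action)] maps each successor state to its probability
p = {
    ("garage", "drive"): {"road": 0.9, "garage": 0.1},
    ("garage", "park"):  {"garage": 1.0},
    ("road", "drive"):   {"road": 1.0},
    ("road", "park"):    {"garage": 0.8, "road": 0.2},
}

# r[(state, action)] is the immediate reward for taking that action there
r = {
    ("garage", "drive"): 1.0,
    ("garage", "park"):  0.0,
    ("road", "drive"):   2.0,
    ("road", "park"):    0.5,
}
```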
When the agent decides which action to take in the Markov decision process framework, it must do so under a predetermined policy. Policies in MDPs are the rules an agent follows when making decisions. They fall into two classes (a short sketch of each follows the list):
Stationary: A stationary policy is static in that the decision remains the same when a given state presents itself. For example, if you are playing poker, you could set a policy in which you always bet five dollars if you are dealt a pair.
Nonstationary: A nonstationary policy allows different actions to be taken in the same state at different times. The action chosen depends on the specific instance of time, or decision epoch, the system is in.
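Here is a short sketch of the two policy classes, reusing the hypothetical S and A from the tuple example above; the decision rules themselves are invented for illustration.

```python
# Stationary policy: a fixed mapping from state to action; the choice
# never changes for a given state (like the poker rule above).
stationary = {"garage": "drive", "road": "drive"}

# Nonstationary policy: the action may also depend on the decision
# epoch t, so the same state can yield different actions over time.
def nonstationary(t, state):
    if state == "road" and t >= 3:
        return "park"   # late in the horizon, head back
    return "drive"
```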
As you progress through an MDP framework, each decision (an action taken) changes the current state into a new state according to the policies in place, until you reach the last stage. Much of the work in this framework goes into optimizing the number of steps it takes to reach the final stage.
To solve a Markov decision process problem, you need to find the optimal policy: the one that yields the best return or reward. To find the optimal policy for a given state, you need to know the return the agent can expect from every state; the function that gives this is known as the value function. You use the Bellman equation for the value function to find the needed optimization steps. The Bellman equation splits the value function into two parts (written out after the list):
Immediate reward: The expected reward the agent receives when leaving a state
Discounted value: The value of the successor state the agent moves to
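Written out in standard notation, the decomposition for a fixed policy looks like the sketch below. The policy π and the discount factor γ are not named in the list above but are conventional; γ lies between 0 and 1 and weights future rewards against immediate ones.

```latex
% Bellman equation for the value of state s under policy \pi:
% immediate reward plus the discounted value of the successor states.
V^{\pi}(s) = r\bigl(s, \pi(s)\bigr)
  + \gamma \sum_{s'} p\bigl(s' \mid s, \pi(s)\bigr)\, V^{\pi}(s')
```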
Once you decompose the value function into the Bellman equation, you subject the equation to a specific policy, which means the value function depends on the policy. Solving this equation is a core task of dynamic programming, which tackles multi-step optimization problems using recursive algorithms.
Two popular algorithms use the Bellman equation to find the optimal policy for a system (a value-iteration sketch follows the list):
Value iteration: Calculates the optimal value function iteratively, then extracts the optimal policy from the converged result
Policy iteration: Starts from an arbitrary initial policy, then alternates between evaluating the value function of the current policy and greedily improving the policy using the Bellman equation, repeating until the policy stops changing
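For a concrete feel for value iteration, here is a compact sketch assuming the dictionary-based MDP (S, A, p, r) from the earlier example, with an assumed discount factor gamma. It repeats the Bellman optimality backup until the values stop changing, then reads the policy off the result.

```python
# A compact value-iteration sketch over the hypothetical MDP defined
# earlier; gamma is an assumed discount factor, tol a stopping threshold.
gamma = 0.9

def value_iteration(S, A, p, r, tol=1e-6):
    V = {s: 0.0 for s in S}                       # start with zero values
    while True:
        delta = 0.0
        for s in S:
            # Bellman optimality backup: best one-step lookahead value
            best = max(r[(s, a)] + gamma * sum(prob * V[s2]
                       for s2, prob in p[(s, a)].items())
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                           # values have converged
            break
    # Read the optimal policy off the converged value function
    policy = {s: max(A, key=lambda a: r[(s, a)] + gamma *
                     sum(prob * V[s2] for s2, prob in p[(s, a)].items()))
              for s in S}
    return V, policy
```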
Reinforcement learning (RL) is an important class of autonomous machine learning algorithms based on the Markov decision process. The RL agent follows the process of the MDP as it explores the state space (consisting of all possible states) and the action space (made up of all possible actions it can take). As it explores, the RL agent receives rewards for making good decisions and learns to repeat those decisions when it encounters similar states in the future. The RL agent eventually learns how to operate in its environment as it meets its goals over time.
The RL agent reinforces the actions that earn rewards while still exploring new states and actions. In doing so, it improves its decision-making by balancing its exploitation of previously learned knowledge against its exploration of the unknown.
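One common way to implement this balance is epsilon-greedy Q-learning. The sketch below assumes a hypothetical environment object with reset() and step() methods (not a real library API), plus assumed values for the learning rate alpha, the discount gamma, and the exploration rate epsilon.

```python
import random

# Bare-bones Q-learning with epsilon-greedy exploration, sketched against
# a hypothetical env exposing reset() and step(action) -> (next_state,
# reward, done); alpha, gamma, and epsilon values are assumptions.
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def q_learning(env, S, A, episodes=1000):
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(A)
            else:
                a = max(A, key=lambda act: Q[(s, act)])
            s2, reward, done = env.step(a)
            # Move Q(s, a) toward reward plus discounted best successor value
            target = reward + gamma * max(Q[(s2, a2)] for a2 in A)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```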
MDPs and reinforcement learning have many real-world applications that use dynamic programming and recursive algorithms. Some of the applications include:
Robotics: Robots use deep reinforcement learning for complex movement, decision-making, and processing sensory input.
Natural language processing: Deep reinforcement learning helps train large language models, such as those that power chatbots.
Autonomous vehicle decision-making: Reinforcement learning trains autonomous vehicles to respond like a human driver.
Financial investment and insurance: MDPs help analyze current investment practices based on previous decisions.
Maintenance and repair of equipment: MDPs model equipment degradation and help decide whether maintenance or replacement is the better choice over time.
Epidemics and public health: MDPs can help model epidemic outbreaks and make decisions based on the number of infections present at a given time.
Markov decision processes are a key component of reinforcement learning and help in the creation of machine learning and artificial intelligence. If you want to gain in-demand skills in machine learning, try the Machine Learning Specialization from Stanford and DeepLearning.AI on Coursera. Also, try the IBM AI Engineering Professional Certificate to help you build practical AI experience.
This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.