# Markov decision process (MDP)

by Stephen M. Walker II, Co-Founder / CEO

## What is a Markov decision process (MDP)?

Imagine you're playing a board game where each move you make is a mix of your choice and a roll of the dice. A Markov decision process is like the rules of the game, guiding your decisions and the random chance to find the best way to win.

A Markov decision process (MDP) is a mathematical framework used for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. MDPs are an extension of Markov chains, which are models for stochastic processes without decision-making. It's a key tool in reinforcement learning and is used to make optimal decisions in scenarios where the results are either random or controlled by a decision maker, which makes sequential decisions over time.

The key difference in MDPs is the addition of actions and rewards, which introduce the concepts of choice and motivation, respectively.

MDPs have five core elements:

- S: a set of possible states for an agent to be in
- A: a set of possible actions an agent can take at a particular state
- R: the rewards for making an action A at state S
- P: the probabilities for transitioning to a new state S' after taking action A at original state S
- Gamma: which controls how far-looking the Markov Decision Process agent will be

MDPs follow the Markov Property, which states that the next state can be determined purely by the current state. They are used in various domains such as computer science, electrical engineering, manufacturing, operations research, finance and economics, telecommunications, and more.

In reinforcement learning, the problem to resolve is described as a Markov Decision Process (MDP). Theoretical results in reinforcement learning rely on the MDP description being a correct match to the problem.

There are several algorithms used to solve MDPs, including policy iteration and value iteration. These algorithms are based on the idea of using value functions to organize and structure the search for good policies.

In real-world applications, MDPs have been used in areas like routing problems, water resource management, robotic control, and more.

## What is a real-world example?

Imagine a delivery drone navigating a bustling cityscape to deliver packages. The drone has to decide at each intersection whether to turn left, right, or continue straight. Turning left might lead to a quicker route but with potential wind hazards, while turning right offers a longer but safer path. Continuing straight could lead to an area with unpredictable weather patterns. The drone receives positive rewards for timely deliveries and negative rewards for delays or damage to the package.

In this MDP scenario, the agent is the delivery drone, and the environment is the city with its dynamic conditions. The drone's actions at each intersection lead to different outcomes or states. For instance, choosing to turn right might avoid the wind hazard but add time to the delivery. Over time, the drone learns the optimal route that balances speed and safety to maximize rewards.

Formalizing this in the context of MDP, the drone develops a strategy to navigate the city efficiently. It learns to avoid certain paths that may lead to negative outcomes, such as package damage or excessive delays. This example demonstrates how MDPs can be applied to real-life situations, guiding autonomous systems in complex, uncertain environments.

## What is the Bellman equation?

The Bellman equation is like a recipe that helps you make the best long-term decisions by considering the immediate and future consequences of your actions. It's a mathematical formula that tells you how to maximize your rewards over time by choosing the right actions now.

The Bellman equation, named after Richard E. Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It's a recursive equation that breaks a dynamic optimization problem into a sequence of simpler subproblems, following Bellman's “principle of optimality".

In the context of reinforcement learning, the Bellman equation is used to calculate the expected long-term reward for being in a particular state and following a specific policy.

The Bellman equation is a fundamental building block of reinforcement learning, helping to produce more calculated and better-bearing outcomes by merging several reinforcement learning functions. It was first applied to engineering control theory and has since become an important tool in economic theory.

## What is dynamic programming?

Dynamic programming is like a chef who keeps leftovers to use in future meals instead of cooking everything from scratch each time. It's a smart way to solve complex problems by remembering past results, making future calculations faster and easier.

Dynamic programming is a computer programming technique that optimizes algorithmic problems by breaking them down into simpler sub-problems, solving each sub-problem only once, and storing the results for future use. This technique is particularly useful for problems that can be divided into overlapping sub-problems or optimal substructures.

The concept of dynamic programming was introduced by Richard Bellman in the 1950s and is used in various fields, from computer science to economics. It's a method of mathematical optimization and a methodology for computer programming.

Dynamic programming works by storing the results of sub-problems so that when their solutions are needed again, they are readily available and do not need to be recalculated. This technique of storing the value of sub-problems is called memoization. By saving the values, we save time for computations of sub-problems we have already come across.

Dynamic programming can be applied in two ways: top-down and bottom-up. The top-down approach, also known as memoization, starts with the original problem and breaks it down into sub-problems, storing the results for future use. The bottom-up approach, on the other hand, starts with the simplest sub-problems and works its way up to the solution of the original problem.

Dynamic programming is often used to solve optimization problems, where the goal is to find the best solution among a set of possible solutions. It's a powerful technique that can significantly reduce the time complexity of algorithms, transforming them from exponential to polynomial time.

Examples of problems that can be solved using dynamic programming include the Fibonacci sequence, the knapsack problem, and the longest common subsequence problem. It's a versatile skill with applications in various areas, from game development to DevOps.

## Value Iteration and Policy Iteration in AI

Value iteration and policy iteration are two dynamic programming techniques used in artificial intelligence to solve Markov decision processes (MDPs). Value iteration improves the value function estimates iteratively, converging on the optimal values by considering the neighboring states' values. It is particularly useful for determining the best path in a graph or network.

Policy iteration, on the other hand, alternates between evaluating the value function and improving the policy based on that evaluation. It is capable of finding optimal policies for MDPs with large or infinite state spaces, although it can be computationally intensive. This method is typically employed when simpler approaches are ineffective.