What is temporal difference learning?
by Stephen M. Walker II, Co-Founder / CEO
What is temporal difference learning?
Temporal Difference (TD) learning is a class of model-free reinforcement learning methods. These methods sample from the environment, similar to Monte Carlo methods, and perform updates based on current estimates, akin to dynamic programming methods. Unlike Monte Carlo methods, which adjust their estimates only once the final outcome is known, TD methods adjust predictions to match later, more accurate predictions before the final outcome is known.
TD learning is essentially a way to learn how to predict a quantity that depends on future values. It is used to compute the long-term utility of a pattern of behavior from a series of short-term outcomes. The simplest version of temporal-difference learning is called TD(0), or one-step TD. When transitioning from a state S to a new state S', the TD(0) algorithm computes a backed-up value, the immediate reward plus the discounted estimate of S', and updates V(S) accordingly. The difference between this backed-up target and the current estimate V(S) is called the TD error.
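The contrast with Monte Carlo methods can be made concrete with a small numeric sketch (all numbers below are illustrative, not taken from any particular task):

```python
# Illustrative episode: rewards received after leaving some state s (made-up numbers).
rewards = [1.0, 0.0, 2.0]
gamma = 0.9  # discount factor

# Monte Carlo target: the full discounted return, available only once the
# episode has ended.
mc_target = sum(gamma**k * r for k, r in enumerate(rewards))  # 1.0 + 0.0 + 1.62 = 2.62

# TD target: the immediate reward plus the discounted current *estimate* of the
# next state's value, available after a single step (bootstrapping).
V_next = 1.5  # current value estimate for the successor state
td_target = rewards[0] + gamma * V_next  # 1.0 + 1.35 = 2.35

# The TD error is the gap between this backed-up target and the current estimate.
V_s = 2.0
td_error = td_target - V_s  # 2.35 - 2.0 = 0.35
```

Note that the TD target is available after one step, while the Monte Carlo target requires waiting for the whole episode.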
Temporal Difference Learning has a wide range of applications across numerous domains, from robotics and control systems to artificial intelligence and game playing. It also finds applications in neuroscience for understanding dopamine neurons and studying conditions like schizophrenia.
One of the key benefits of Temporal Difference Learning is its efficiency. By updating value estimates using the differences between successive timesteps, Temporal Difference Learning greatly accelerates the learning process. It doesn't require the full details of an episode to be known in advance, unlike Monte Carlo methods, which gives it added versatility in practical applications. TD learning methods can learn at every step, online or offline, and are capable of learning from incomplete sequences.
What are the benefits of temporal difference learning?
Temporal Difference (TD) learning is a method used in reinforcement learning that offers several benefits:

Learning in Each Step — TD learning methods can learn at every step of an episode, whether online or offline, which allows for continuous improvement and immediate incorporation of new information.

Handling Incomplete Sequences — TD learning can handle incomplete sequences of data, making it suitable for continuous problems where the full trajectory is not known in advance.

Non-Terminating Environments — It is capable of functioning in non-terminating environments, which is useful for ongoing tasks without a clear endpoint.

Efficiency — By updating value estimates using differences between timesteps, TD learning accelerates the learning process, leading to more efficient learning.

Adaptability — TD learning adapts to new information while balancing it with known actions, which is crucial in dynamic environments where conditions can change.

Versatility in Applications — The method is applicable across various domains, including robotics, control systems, AI, and game playing, demonstrating its wide-ranging utility.

Statistical Advantages — TD methods fit value functions by minimizing the degree of temporal inconsistency between estimates, which can lead to dramatic improvements in estimates of the difference in value-to-go for different states.

Learning Without a Perfect Model — TD learning does not require a perfect model of the environment's dynamics, which is beneficial in complex or unpredictable environments.

Combination of Approaches — TD learning combines aspects of Monte Carlo methods, which learn from complete episodes, and Dynamic Programming methods, which bootstrap and update estimates based on other learned estimates.
These benefits make TD learning a powerful tool in the field of machine learning, particularly for problems that involve sequential decisionmaking and where the environment may be partially observable or the model of the environment is unknown.
How does temporal difference learning differ from other reinforcement learning techniques?
Temporal Difference (TD) learning, Q-learning, and model-based reinforcement learning are all techniques used in reinforcement learning, but they differ in their approaches and use cases.
TD learning is a model-free method that learns by sampling from the environment and updating estimates based on current predictions. It's particularly effective when the Markov Decision Process (MDP) is known or can be learned but can't be solved directly. TD learning is efficient and can learn from incomplete sequences, making it versatile in practical applications. It exploits the Markov property, which makes it more effective in Markov environments. However, it is more sensitive to initial value estimates and produces biased estimates.
Q-learning, on the other hand, is another model-free method that performs updates based on the seemingly optimal action, regardless of which action will actually be chosen. It's best used when the MDP can't be solved. Q-learning is an off-policy method, meaning it learns the value of the optimal policy irrespective of the policy being followed. This contrasts with on-policy methods like SARSA (a type of TD learning), which estimate the value of the current policy being used.
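The on-policy vs off-policy distinction shows up directly in the two update targets. A minimal sketch, where the Q-table and all numbers are hypothetical, chosen only to illustrate the difference:

```python
# Hypothetical action-value table: a current state "s" and a successor "s2",
# each with two actions.
Q = {("s", "a1"): 0.2, ("s", "a2"): 0.6,
     ("s2", "a1"): 0.5, ("s2", "a2"): 0.9}
gamma = 0.9
r = 1.0  # reward observed after taking a1 in s and landing in s2

# SARSA (on-policy): bootstraps from the action the behavior policy actually
# chose in s2 -- here, suppose it explored and picked a1.
a_next = "a1"
sarsa_target = r + gamma * Q[("s2", a_next)]  # 1.0 + 0.9 * 0.5 = 1.45

# Q-learning (off-policy): bootstraps from the greedy action in s2, regardless
# of what the policy will actually do next.
q_target = r + gamma * max(Q[("s2", "a1")], Q[("s2", "a2")])  # 1.0 + 0.9 * 0.9 = 1.81

# Both methods would then move Q[("s", "a1")] toward their respective targets.
```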
Model-based reinforcement learning, in contrast, attempts to model the environment and uses this model to plan the best action. It's best used when the MDP can't be learned. This approach can be more efficient than model-free methods, as it can leverage the model to simulate experiences and learn from them, but it requires a good model of the environment to be effective.
How does temporal difference learning work?
Temporal Difference (TD) learning is a model-free reinforcement learning technique that learns the value of states or actions by using the differences between consecutive predictions. Here's how it works:

Bootstrapping — TD learning updates estimates based on other learned estimates, without waiting for a final outcome. This is known as bootstrapping and is a key feature that differentiates TD learning from other methods like Monte Carlo, which require the end of an episode to update values.

TD Error — The core of TD learning is the TD error: the difference between the target, formed by the reward actually received plus the discounted estimated value of the next state, and the current value estimate. This error is used to update the value function.

TD(0) Algorithm — The simplest form of TD learning is TD(0), or one-step TD, which updates the value of the current state based on the reward received for moving to the next state and the estimated value of that next state. The update rule for TD(0) is given by:
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
where V(s_t) is the current estimate of the state's value, α is the learning rate, r_{t+1} is the reward received after transitioning to the next state, γ is the discount factor, and V(s_{t+1}) is the estimated value of the next state.
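This update rule translates almost line for line into code. A minimal sketch, where V is a dictionary of value estimates and all names are illustrative rather than part of any standard API:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]  # backed-up target minus current estimate
    V[s] += alpha * td_error
    return td_error

# Example transition: from state "A" to state "B" with reward 0.5.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", 0.5, "B", alpha=0.5, gamma=0.9)
# td_error = 0.5 + 0.9 * 1.0 - 0.0 = 1.4, so V["A"] becomes 0.0 + 0.5 * 1.4 = 0.7
```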
Learning from Experience — TD learning can learn from raw experience without a model of the environment's dynamics. It updates its estimates incrementally after each time step, which allows it to learn online from incomplete sequences.

Policy Evaluation — In the context of a policy π, TD learning aims to evaluate the policy by updating the value function V to more closely approximate the true value function V^π for all states under that policy.
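Putting the pieces together, TD(0) policy evaluation can be sketched on a toy problem. The environment below is a hypothetical 5-state random walk with an equiprobable left/right policy; all parameter values are illustrative choices:

```python
import random

def evaluate_random_walk(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) evaluation of the equiprobable random policy on a 5-state random walk.

    States 0..4; episodes start at state 2 and end at 0 or 4. Reaching state 4
    yields reward 1, every other transition yields 0. Under the random policy,
    the true values of states 1..3 are 0.25, 0.5, and 0.75.
    """
    rng = random.Random(seed)
    V = [0.0] * 5  # value estimates; terminal states stay at 0
    for _ in range(episodes):
        s = 2
        while s not in (0, 4):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 4 else 0.0
            # TD(0): move V(s) toward the one-step target r + gamma * V(s_next)
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

values = evaluate_random_walk()
# Estimates for states 1..3 should approach 0.25, 0.5, and 0.75.
```

Note that the value estimates are updated inside every episode, after each individual step, rather than once per episode as a Monte Carlo method would do.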
TD learning has been successfully applied to various tasks, including playing games like Atari, Go, and poker, where it has proven effective in environments with delayed rewards. TD learning's ability to learn before the final outcome is known and to update values online makes it a powerful tool for problems where the environment is constantly changing or the agent has to make decisions on the fly.