Q-learning vs dynamic programming











up vote
4
down vote

favorite
2












Is the classic Q-learning algorithm, using lookup table (instead of function approximation), equivalent to dynamic programming?










share|improve this question




























    up vote
    4
    down vote

    favorite
    2












    Is the classic Q-learning algorithm, using lookup table (instead of function approximation), equivalent to dynamic programming?










    share|improve this question


























      up vote
      4
      down vote

      favorite
      2









      up vote
      4
      down vote

      favorite
      2






      2





      Is the classic Q-learning algorithm, using lookup table (instead of function approximation), equivalent to dynamic programming?










      share|improve this question















      Is the classic Q-learning algorithm, using lookup table (instead of function approximation), equivalent to dynamic programming?







      machine-learning dynamic-programming reinforcement-learning q-learning






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 22 at 16:01









      nbro

      5,51584692




      5,51584692










      asked Aug 17 '16 at 5:16









      D_Wills

      7818




      7818
























          3 Answers
          3






          active

          oldest

          votes

















          up vote
          4
          down vote



          accepted










          Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.



          Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.



          Q-Learning is a model-free reinforcement learning method. This one is "model-free", not because it doesn't use a machine learning model or anything like that, but because they don't require, and don't use a model of the environment, also known as MDP, to obtain an optimal policy. You also have "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model, not simply using one. And, unlike model-free methods, don't throw away samples after estimating values, they instead try to reconstruct the transition and reward function to get better performance.



          Model-based methods combine model-free and planning algorithms to get same good results with less amount of samples than required by model-free methods (Q-Learning), and without needing a model like Dynamic Programming methods (Value/Policy Iteration).






          share|improve this answer






























            up vote
            10
            down vote













            From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)




            The term dynamic programming (DP) refers to a collection of algorithms
            that can be used to compute optimal policies given a perfect model of
            the environment as a Markov decision process (MDP). Classical DP
            algorithms are of limited utility in reinforcement learning both
            because of their assumption of a perfect model and because of their
            great computational expense, but they are still important
            theoretically.




            So, although both share the same working principles (either using tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the first assume the model is known. This basically means knowing the transition probabilities (which indicates the probability of change from state s to state s' given action a) and the expected immediate reward function.



            On the contrary, RL methods only require to have access to a set of samples, either collected online or offline (depending on the algorithm).



            Of course, there are hybrid methods that can be place between RL and DP, for example those that learn a model from the samples, and then use that model in the learning process.



            NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related with RL, is also used to refer a "general algorithmic pattern", as pointed in some comment. In both case, the fundaments are the same, but depending on the context may have a different meaning.






            share|improve this answer






























              up vote
              -1
              down vote













              If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.






              share|improve this answer

















              • 1




                no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                – cel
                Aug 17 '16 at 6:16










              • @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                – Don Reba
                Aug 17 '16 at 11:22











              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f38988582%2fq-learning-vs-dynamic-programming%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes








              up vote
              4
              down vote



              accepted










              Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.



              Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.



              Q-Learning is a model-free reinforcement learning method. This one is "model-free", not because it doesn't use a machine learning model or anything like that, but because they don't require, and don't use a model of the environment, also known as MDP, to obtain an optimal policy. You also have "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model, not simply using one. And, unlike model-free methods, don't throw away samples after estimating values, they instead try to reconstruct the transition and reward function to get better performance.



              Model-based methods combine model-free and planning algorithms to get same good results with less amount of samples than required by model-free methods (Q-Learning), and without needing a model like Dynamic Programming methods (Value/Policy Iteration).






              share|improve this answer



























                up vote
                4
                down vote



                accepted










                Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.



                Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.



                Q-Learning is a model-free reinforcement learning method. This one is "model-free", not because it doesn't use a machine learning model or anything like that, but because they don't require, and don't use a model of the environment, also known as MDP, to obtain an optimal policy. You also have "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model, not simply using one. And, unlike model-free methods, don't throw away samples after estimating values, they instead try to reconstruct the transition and reward function to get better performance.



                Model-based methods combine model-free and planning algorithms to get same good results with less amount of samples than required by model-free methods (Q-Learning), and without needing a model like Dynamic Programming methods (Value/Policy Iteration).






                share|improve this answer

























                  up vote
                  4
                  down vote



                  accepted







                  up vote
                  4
                  down vote



                  accepted






                  Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.



                  Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.



                  Q-Learning is a model-free reinforcement learning method. This one is "model-free", not because it doesn't use a machine learning model or anything like that, but because they don't require, and don't use a model of the environment, also known as MDP, to obtain an optimal policy. You also have "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model, not simply using one. And, unlike model-free methods, don't throw away samples after estimating values, they instead try to reconstruct the transition and reward function to get better performance.



                  Model-based methods combine model-free and planning algorithms to get same good results with less amount of samples than required by model-free methods (Q-Learning), and without needing a model like Dynamic Programming methods (Value/Policy Iteration).






                  share|improve this answer














                  Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.



                  Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.



                  Q-Learning is a model-free reinforcement learning method. This one is "model-free", not because it doesn't use a machine learning model or anything like that, but because they don't require, and don't use a model of the environment, also known as MDP, to obtain an optimal policy. You also have "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model, not simply using one. And, unlike model-free methods, don't throw away samples after estimating values, they instead try to reconstruct the transition and reward function to get better performance.



                  Model-based methods combine model-free and planning algorithms to get same good results with less amount of samples than required by model-free methods (Q-Learning), and without needing a model like Dynamic Programming methods (Value/Policy Iteration).







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 16 '17 at 15:54

























                  answered Nov 16 '17 at 15:48









                  mimoralea

                  6,01543846




                  6,01543846
























                      up vote
                      10
                      down vote













                      From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)




                      The term dynamic programming (DP) refers to a collection of algorithms
                      that can be used to compute optimal policies given a perfect model of
                      the environment as a Markov decision process (MDP). Classical DP
                      algorithms are of limited utility in reinforcement learning both
                      because of their assumption of a perfect model and because of their
                      great computational expense, but they are still important
                      theoretically.




                      So, although both share the same working principles (either using tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the first assume the model is known. This basically means knowing the transition probabilities (which indicates the probability of change from state s to state s' given action a) and the expected immediate reward function.



                      On the contrary, RL methods only require to have access to a set of samples, either collected online or offline (depending on the algorithm).



                      Of course, there are hybrid methods that can be place between RL and DP, for example those that learn a model from the samples, and then use that model in the learning process.



                      NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related with RL, is also used to refer a "general algorithmic pattern", as pointed in some comment. In both case, the fundaments are the same, but depending on the context may have a different meaning.






                      share|improve this answer



























                        up vote
                        10
                        down vote













                        From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)




                        The term dynamic programming (DP) refers to a collection of algorithms
                        that can be used to compute optimal policies given a perfect model of
                        the environment as a Markov decision process (MDP). Classical DP
                        algorithms are of limited utility in reinforcement learning both
                        because of their assumption of a perfect model and because of their
                        great computational expense, but they are still important
                        theoretically.




                        So, although both share the same working principles (either using tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the first assume the model is known. This basically means knowing the transition probabilities (which indicates the probability of change from state s to state s' given action a) and the expected immediate reward function.



                        On the contrary, RL methods only require to have access to a set of samples, either collected online or offline (depending on the algorithm).



                        Of course, there are hybrid methods that can be place between RL and DP, for example those that learn a model from the samples, and then use that model in the learning process.



                        NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related with RL, is also used to refer a "general algorithmic pattern", as pointed in some comment. In both case, the fundaments are the same, but depending on the context may have a different meaning.






                        share|improve this answer

























                          up vote
                          10
                          down vote










                          up vote
                          10
                          down vote









                          From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)




                          The term dynamic programming (DP) refers to a collection of algorithms
                          that can be used to compute optimal policies given a perfect model of
                          the environment as a Markov decision process (MDP). Classical DP
                          algorithms are of limited utility in reinforcement learning both
                          because of their assumption of a perfect model and because of their
                          great computational expense, but they are still important
                          theoretically.




                          So, although both share the same working principles (either using tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the first assume the model is known. This basically means knowing the transition probabilities (which indicates the probability of change from state s to state s' given action a) and the expected immediate reward function.



                          On the contrary, RL methods only require to have access to a set of samples, either collected online or offline (depending on the algorithm).



                          Of course, there are hybrid methods that can be place between RL and DP, for example those that learn a model from the samples, and then use that model in the learning process.



                          NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related with RL, is also used to refer a "general algorithmic pattern", as pointed in some comment. In both case, the fundaments are the same, but depending on the context may have a different meaning.






                          share|improve this answer














                          From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)




                          The term dynamic programming (DP) refers to a collection of algorithms
                          that can be used to compute optimal policies given a perfect model of
                          the environment as a Markov decision process (MDP). Classical DP
                          algorithms are of limited utility in reinforcement learning both
                          because of their assumption of a perfect model and because of their
                          great computational expense, but they are still important
                          theoretically.




                          So, although both share the same working principles (either using tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the first assume the model is known. This basically means knowing the transition probabilities (which indicates the probability of change from state s to state s' given action a) and the expected immediate reward function.



                          On the contrary, RL methods only require to have access to a set of samples, either collected online or offline (depending on the algorithm).



                          Of course, there are hybrid methods that can be place between RL and DP, for example those that learn a model from the samples, and then use that model in the learning process.



                          NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related with RL, is also used to refer a "general algorithmic pattern", as pointed in some comment. In both case, the fundaments are the same, but depending on the context may have a different meaning.







                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Nov 22 at 15:57









                          nbro

                          5,51584692




                          5,51584692










                          answered Aug 17 '16 at 8:24









                          Pablo EM

                          2,59411626




                          2,59411626






















                              up vote
                              -1
                              down vote













                              If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.






                              share|improve this answer

















                              • 1




                                no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                                – cel
                                Aug 17 '16 at 6:16










                              • @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                                – Don Reba
                                Aug 17 '16 at 11:22















                              up vote
                              -1
                              down vote













                              If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.






                              share|improve this answer

















                              • 1




                                no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                                – cel
                                Aug 17 '16 at 6:16










                              • @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                                – Don Reba
                                Aug 17 '16 at 11:22













                              up vote
                              -1
                              down vote










                              up vote
                              -1
                              down vote









                              If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.






                              share|improve this answer












                              If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.







                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Aug 17 '16 at 5:49









                              Don Reba

                              10.7k23252




                              10.7k23252








                              • 1




                                no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                                – cel
                                Aug 17 '16 at 6:16










                              • @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                                – Don Reba
                                Aug 17 '16 at 11:22














                              • 1




                                no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                                – cel
                                Aug 17 '16 at 6:16










                              • @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                                – Don Reba
                                Aug 17 '16 at 11:22








                              1




                              1




                              no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                              – cel
                              Aug 17 '16 at 6:16




                              no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.
                              – cel
                              Aug 17 '16 at 6:16












                              @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                              – Don Reba
                              Aug 17 '16 at 11:22




                              @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.
                              – Don Reba
                              Aug 17 '16 at 11:22


















                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.





                              Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                              Please pay close attention to the following guidance:


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f38988582%2fq-learning-vs-dynamic-programming%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              What visual should I use to simply compare current year value vs last year in Power BI desktop

                              Alexandru Averescu

                              Trompette piccolo