Optimize Random Forest regressor due to computational limits


























Fitting a Random Forest regressor consumes all available RAM, which causes the hosted notebook environment (Google Colab or a Kaggle kernel) to crash. Could you help me optimize the model?



I have already tried tuning hyperparameters, such as reducing the number of estimators, but it doesn't help. df.info() shows 4,446,965 records in the training data, taking up ~1 GB of memory.



I can't post the whole notebook code here, as it would be too long, but please check this link for reference. I've provided some information below about the training dataframe.



from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2,
                            min_samples_split=3, max_features=0.5, n_jobs=-1)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)



train_X.info() shows 3,557,572 records taking up almost 542 MB of memory.
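For reference, one generic way to shrink a dataframe of this size before fitting is to downcast the column dtypes: scikit-learn's tree code converts input to float32 internally anyway, so 64-bit columns force an extra conversion copy. A minimal sketch, with made-up column names standing in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_X -- column names and sizes are made up.
df = pd.DataFrame({
    "trip_distance": np.random.rand(100_000),                             # float64
    "passenger_count": np.random.randint(1, 7, 100_000).astype("int64"),  # int64
})

before = df.memory_usage(deep=True).sum()

# Downcast each column to the smallest dtype that still holds its values.
for col in df.select_dtypes(include="float64").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
for col in df.select_dtypes(include="int64").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")

after = df.memory_usage(deep=True).sum()
print(f"{before / 2**20:.1f} MiB -> {after / 2**20:.1f} MiB")
```

Halving the dtype width roughly halves the dataframe's footprint, which also shrinks the copy scikit-learn makes at fit time.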




I'm still getting started with ML and any help would be appreciated. Thank you!










python machine-learning scikit-learn random-forest kaggle

asked Nov 22 at 20:02 by specbug
          1 Answer
Random Forest is by nature heavy on CPU and RAM; that is one of its well-known drawbacks. So there is nothing unusual in your situation.



More specifically, several factors contribute to this issue, to name a few:




1. The number of attributes (features) in the dataset.

2. The number of trees (n_estimators).

3. The maximum depth of each tree (max_depth).

4. The minimum number of samples required at a leaf node (min_samples_leaf).


Moreover, Scikit-learn's documentation states this explicitly, and I am quoting here:




          The default values for the parameters controlling the size of the
          trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
          and unpruned trees which can potentially be very large on some data
          sets. To reduce memory consumption, the complexity and size of the
          trees should be controlled by setting those parameter values.






          What to Do?



There is not much you can do beyond tuning, especially since Scikit-learn does not (as far as I am aware) offer an option to manage the storage issue on the fly.



Rather, you need to change the values of the above-mentioned parameters. For example:




1. If the number of features is high, try to keep only the most important ones (see Feature Selection in Scikit-learn and "Feature importances with forests of trees").

2. Try to reduce the number of estimators.

3. max_depth is None by default, which means nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

4. min_samples_leaf is 1 by default: a split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This can have the effect of smoothing the model, especially in regression.



So try changing these parameters with an understanding of their effect on performance; the reference you need is this.
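To illustrate how much the tree-size parameters quoted above matter, here is a minimal sketch on synthetic data (the parameter values are illustrative, not tuned) comparing a default, fully grown forest against a constrained one:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data for the dataset in the question.
X, y = make_regression(n_samples=5_000, n_features=10, random_state=0)

def total_nodes(forest):
    # Total node count across all trees: a rough proxy for model memory.
    return sum(est.tree_.node_count for est in forest.estimators_)

# Default settings: fully grown, unpruned trees.
full = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Constrained trees: each parameter below caps tree size.
small = RandomForestRegressor(n_estimators=20, max_depth=10,
                              min_samples_leaf=5, random_state=0).fit(X, y)

print(total_nodes(full), "nodes vs", total_nodes(small), "nodes")
```

On continuous regression targets, unpruned trees grow almost one leaf per sample, so capping depth and leaf size typically shrinks the node count (and hence the fitted model) by an order of magnitude.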




5. The final option is to create your own customized Random Forest from scratch, spilling metadata to the hard disk, etc., or to do any other optimization. It is awkward, but worth mentioning as an option; here is an example of a basic implementation!
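Before resorting to a from-scratch rewrite, a lighter alternative not mentioned above is scikit-learn's warm_start flag, which grows the forest in chunks so you can monitor memory and stop early. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2_000, n_features=10, random_state=0)

# warm_start=True keeps already-fitted trees, so each fit() call only
# builds the newly requested trees on top of the existing ones.
forest = RandomForestRegressor(n_estimators=10, warm_start=True,
                               max_depth=10, random_state=0)
forest.fit(X, y)  # builds the first 10 trees

for n in (20, 30):
    forest.n_estimators = n
    forest.fit(X, y)  # adds 10 more trees each iteration
    # here you could check RAM usage or validation error and stop early

print(len(forest.estimators_))  # prints 30
```

All trees still live in memory, so this does not reduce the final model size; it only lets you stop growing the ensemble before the notebook runs out of RAM.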




Side note:

In practice, on my Core i7 laptop I have found that setting n_jobs to -1 overwhelms the machine, and I find it more efficient to keep the default n_jobs=None, although in theory it should be the opposite!
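Whether n_jobs=-1 helps or hurts depends on the machine, so rather than assuming either way, it is worth timing a fit on a small subsample of your own data. A quick sketch (synthetic data standing in for a subsample):

```python
from time import perf_counter
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic subsample; use a slice of your real data instead.
X, y = make_regression(n_samples=5_000, n_features=10, random_state=0)

def fit_seconds(n_jobs):
    # Time a single fit with the given degree of parallelism.
    t0 = perf_counter()
    RandomForestRegressor(n_estimators=40, max_depth=10,
                          n_jobs=n_jobs, random_state=0).fit(X, y)
    return perf_counter() - t0

print(f"n_jobs=1:  {fit_seconds(1):.2f} s")
print(f"n_jobs=-1: {fit_seconds(-1):.2f} s")
```

Note that parallel fitting also raises peak memory, since several trees are built at once, so on a RAM-limited kernel a smaller n_jobs can be the safer choice even when it is slower.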






• Thanks a lot for the detailed and helpful reply. I was able to get it running within capacity by greatly reducing n_estimators! I will keep the above in mind when using RF from Scikit-learn. – specbug, Nov 24 at 6:43

• @specbug Glad I could help :) – Yahya, Nov 24 at 10:16











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53437426%2foptimize-random-forest-regressor-due-to-computational-limits%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          Random Forest by nature puts a massive load on the CPU and RAM and that's one of its very known drawbacks! So there is nothing unusual in your question.



          Furthermore and more specifically, there are different factors that contribute in this issue, to name a few:




          1. The Number of Attributes (features) in Dataset.

          2. The Number of Trees (n_estimators).

          3. The Maximum Depth of the Tree (max_depth).

          4. The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).


          Moreover, it's clearly stated by Scikit-learn about this issue, and I am quoting here:




          The default values for the parameters controlling the size of the
          trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
          and unpruned trees which can potentially be very large on some data
          sets. To reduce memory consumption, the complexity and size of the
          trees should be controlled by setting those parameter values.






          What to Do?



          There's not too much that you can do especially Scikit-learn did not add an option to manipulate the storage issue on the fly (as far I am aware of).



          Rather you need to change the value of the above mentioned parameters, for example:




          1. Try to keep the most important features only if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).


          2. Try to reduce the number of estimators.


          3. max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


          4. min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.



          So try to change the the parameters by understanding their effects on the performance, the reference you need is this.




          1. The final and last option you have is to create your own customized Random Forest from scratch and load the metadata to hard disk..etc or do any optimization, it's awkward but just to mention such option, here is an example of the basic implementation!




          Side-Note:



          Practically I experienced on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine, I always find it more efficient to keep the default setting that is n_jobs=None! Although theoretically speaking it should be the opposite!






          share|improve this answer



















          • 1




            Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
            – specbug
            Nov 24 at 6:43










          • @specbug Glad I could help :)
            – Yahya
            Nov 24 at 10:16
















          0














          Random Forest by nature puts a massive load on the CPU and RAM and that's one of its very known drawbacks! So there is nothing unusual in your question.



          Furthermore and more specifically, there are different factors that contribute in this issue, to name a few:




          1. The Number of Attributes (features) in Dataset.

          2. The Number of Trees (n_estimators).

          3. The Maximum Depth of the Tree (max_depth).

          4. The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).


          Moreover, it's clearly stated by Scikit-learn about this issue, and I am quoting here:




          The default values for the parameters controlling the size of the
          trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
          and unpruned trees which can potentially be very large on some data
          sets. To reduce memory consumption, the complexity and size of the
          trees should be controlled by setting those parameter values.






          What to Do?



          There's not too much that you can do especially Scikit-learn did not add an option to manipulate the storage issue on the fly (as far I am aware of).



          Rather you need to change the value of the above mentioned parameters, for example:




          1. Try to keep the most important features only if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).


          2. Try to reduce the number of estimators.


          3. max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


          4. min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.



          So try to change the the parameters by understanding their effects on the performance, the reference you need is this.




          1. The final and last option you have is to create your own customized Random Forest from scratch and load the metadata to hard disk..etc or do any optimization, it's awkward but just to mention such option, here is an example of the basic implementation!




          Side-Note:



          Practically I experienced on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine, I always find it more efficient to keep the default setting that is n_jobs=None! Although theoretically speaking it should be the opposite!






          share|improve this answer



















          • 1




            Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
            – specbug
            Nov 24 at 6:43










          • @specbug Glad I could help :)
            – Yahya
            Nov 24 at 10:16














          0












          0








          0






          Random Forest by nature puts a massive load on the CPU and RAM and that's one of its very known drawbacks! So there is nothing unusual in your question.



          Furthermore and more specifically, there are different factors that contribute in this issue, to name a few:




          1. The Number of Attributes (features) in Dataset.

          2. The Number of Trees (n_estimators).

          3. The Maximum Depth of the Tree (max_depth).

          4. The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).


          Moreover, it's clearly stated by Scikit-learn about this issue, and I am quoting here:




          The default values for the parameters controlling the size of the
          trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
          and unpruned trees which can potentially be very large on some data
          sets. To reduce memory consumption, the complexity and size of the
          trees should be controlled by setting those parameter values.






          What to Do?



          There's not too much that you can do especially Scikit-learn did not add an option to manipulate the storage issue on the fly (as far I am aware of).



          Rather you need to change the value of the above mentioned parameters, for example:




          1. Try to keep the most important features only if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).


          2. Try to reduce the number of estimators.


          3. max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


          4. min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.



          So try to change the the parameters by understanding their effects on the performance, the reference you need is this.




          1. The final and last option you have is to create your own customized Random Forest from scratch and load the metadata to hard disk..etc or do any optimization, it's awkward but just to mention such option, here is an example of the basic implementation!




          Side-Note:



          Practically I experienced on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine, I always find it more efficient to keep the default setting that is n_jobs=None! Although theoretically speaking it should be the opposite!






          share|improve this answer














          Random Forest by nature puts a massive load on the CPU and RAM and that's one of its very known drawbacks! So there is nothing unusual in your question.



          Furthermore and more specifically, there are different factors that contribute in this issue, to name a few:




          1. The Number of Attributes (features) in Dataset.

          2. The Number of Trees (n_estimators).

          3. The Maximum Depth of the Tree (max_depth).

          4. The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).


          Moreover, it's clearly stated by Scikit-learn about this issue, and I am quoting here:




          The default values for the parameters controlling the size of the
          trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
          and unpruned trees which can potentially be very large on some data
          sets. To reduce memory consumption, the complexity and size of the
          trees should be controlled by setting those parameter values.






          What to Do?



          There's not too much that you can do especially Scikit-learn did not add an option to manipulate the storage issue on the fly (as far I am aware of).



          Rather you need to change the value of the above mentioned parameters, for example:




          1. Try to keep the most important features only if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).


          2. Try to reduce the number of estimators.


          3. max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


          4. min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.



          So try to change the the parameters by understanding their effects on the performance, the reference you need is this.




          1. The final and last option you have is to create your own customized Random Forest from scratch and load the metadata to hard disk..etc or do any optimization, it's awkward but just to mention such option, here is an example of the basic implementation!




          Side-Note:



          Practically I experienced on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine, I always find it more efficient to keep the default setting that is n_jobs=None! Although theoretically speaking it should be the opposite!







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 23 at 20:25

























          answered Nov 23 at 20:20









          Yahya

          3,5192828




          3,5192828








          • 1




            Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
            – specbug
            Nov 24 at 6:43










          • @specbug Glad I could help :)
            – Yahya
            Nov 24 at 10:16














          • 1




            Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
            – specbug
            Nov 24 at 6:43










          • @specbug Glad I could help :)
            – Yahya
            Nov 24 at 10:16








          1




          1




          Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
          – specbug
          Nov 24 at 6:43




          Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
          – specbug
          Nov 24 at 6:43












          @specbug Glad I could help :)
          – Yahya
          Nov 24 at 10:16




          @specbug Glad I could help :)
          – Yahya
          Nov 24 at 10:16


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53437426%2foptimize-random-forest-regressor-due-to-computational-limits%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          What visual should I use to simply compare current year value vs last year in Power BI desktop

          How to ignore python UserWarning in pytest?

          Alexandru Averescu