Optimize Random Forest regressor due to computational limits
Fitting a Random Forest regressor takes up all the RAM, which causes the hosted notebook environment (Google Colab or Kaggle kernel) to crash. Could you help me optimize the model?
I have already tried tuning hyperparameters, such as reducing the number of estimators, but it doesn't help. `df.info()` shows 4,446,965 records in the training data, taking up ~1 GB of memory.
I can't post the whole notebook code here as it would be too long, but please check this link for reference. I've provided some information below about the dataframe used for training.
```python
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2,
                            min_samples_split=3, max_features=0.5, n_jobs=-1)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)
```
`train_X.info()` shows 3,557,572 records taking up almost 542 MB of memory.
I'm still getting started with ML and any help would be appreciated. Thank you!
python machine-learning scikit-learn random-forest kaggle
asked Nov 22 at 20:02 by specbug
1 Answer
Random Forest by nature puts a massive load on the CPU and RAM, and that is one of its well-known drawbacks! So there is nothing unusual in your situation.

Furthermore, and more specifically, several factors contribute to this issue, to name a few:

- The number of attributes (features) in the dataset.
- The number of trees (`n_estimators`).
- The maximum depth of each tree (`max_depth`).
- The minimum number of samples required to be at a leaf node (`min_samples_leaf`).
Moreover, Scikit-learn states this clearly about the issue, and I am quoting here:

> The default values for the parameters controlling the size of the trees (e.g. `max_depth`, `min_samples_leaf`, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
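For illustration, here is a minimal sketch of a more memory-conscious configuration. The specific values are only a starting point, not a recommendation, and it assumes the same `train_X`, `train_y`, and `val_X` as in the question:

```python
from sklearn.ensemble import RandomForestRegressor

# Fewer, shallower trees with larger leaves keep every estimator
# (and therefore the whole ensemble) far smaller in memory.
clf = RandomForestRegressor(
    n_estimators=30,      # fewer trees than the question's 100
    max_depth=20,         # cap depth instead of growing trees fully
    min_samples_leaf=50,  # larger leaves -> far fewer nodes per tree
    max_features=0.5,
    n_jobs=2,             # fewer parallel workers also lowers peak RAM
    random_state=0,
)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)
```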
What to Do?

There is not much you can do, especially since Scikit-learn does not offer an option to manage the storage issue on the fly (as far as I am aware). Rather, you need to change the values of the above-mentioned parameters, for example:

- Try to keep only the most important features if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees); a sketch of this idea is shown after this list.
- Try to reduce the number of estimators.
- `max_depth` is `None` by default, which means nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples. Set it to a finite value to bound tree size.
- `min_samples_leaf` is `1` by default: a split point at any depth will only be considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches. Raising it may have the effect of smoothing the model, especially in regression, and it also yields smaller trees.

So try to change these parameters while understanding their effect on performance; the reference you need is this.

A final option is to create your own customized Random Forest from scratch and write the metadata to disk, etc., or apply any other optimization. It is awkward, but just to mention such an option, here is an example of a basic implementation!
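As a rough sketch of the feature-selection suggestion above (again assuming the `train_X`, `train_y`, and `val_X` from the question; the probe-forest size and the `"median"` threshold are only placeholders), you could fit a small forest first and keep only the features it ranks as important:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Fit a deliberately small forest just to estimate feature importances.
probe = RandomForestRegressor(n_estimators=10, max_depth=10,
                              n_jobs=2, random_state=0)
probe.fit(train_X, train_y)

# Keep only the features whose importance is above the median importance.
selector = SelectFromModel(probe, threshold="median", prefit=True)
train_X_small = selector.transform(train_X)
val_X_small = selector.transform(val_X)
print(train_X.shape, "->", train_X_small.shape)

# Train the (still constrained) forest on the reduced feature set.
clf = RandomForestRegressor(n_estimators=30, max_depth=20,
                            min_samples_leaf=50, n_jobs=2, random_state=0)
clf.fit(train_X_small, train_y)
pred = clf.predict(val_X_small)
```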
Side note:

In practice, I have found on my Core i7 laptop that setting the parameter `n_jobs` to `-1` overwhelms the machine; I always find it more efficient to keep the default setting, `n_jobs=None`, although theoretically speaking it should be the opposite!
answered Nov 23 at 20:20, edited Nov 23 at 20:25, by Yahya
Thanks a lot for the detailed and helpful reply. I was able to get it running within the available capacity by reducing `n_estimators` greatly! I will keep the above in mind while using RF from Scikit-learn. – specbug, Nov 24 at 6:43

@specbug Glad I could help :) – Yahya, Nov 24 at 10:16