Optimize Random Forest regressor due to computational limits

Fitting a Random Forest regressor uses up all the RAM, which crashes the hosted notebook environment (Google Colab or a Kaggle kernel). Could you help me optimize the model?

I already tried tuning the hyperparameters, such as reducing the number of estimators, but that doesn't work. df.info() shows 4446965 records for the train data, which takes up ~1GB of memory.

I can't post the whole notebook code here as it would be too long, but please check this link for reference. I've provided some information about the training dataframe below.

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, min_samples_split=3, max_features=0.5, n_jobs=-1)

clf.fit(train_X, train_y)

pred = clf.predict(val_X)

train_X.info() shows 3557572 records, taking up almost 542 MB of memory.

I'm still getting started with ML and any help would be appreciated. Thank you!

asked Nov 22 at 20:02

specbug

7719

python machine-learning scikit-learn random-forest kaggle


1 Answer

Random Forest by nature puts a heavy load on the CPU and RAM, and that's one of its well-known drawbacks! So there is nothing unusual about what you are seeing.

More specifically, several factors contribute to this issue, to name a few:

The number of attributes (features) in the dataset.

The number of trees (n_estimators).

The maximum depth of each tree (max_depth).

The minimum number of samples required at a leaf node (min_samples_leaf).

Moreover, the Scikit-learn documentation states this clearly, and I am quoting here:

The default values for the parameters controlling the size of the
trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
and unpruned trees which can potentially be very large on some data
sets. To reduce memory consumption, the complexity and size of the
trees should be controlled by setting those parameter values.
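As a rough illustration of the quoted advice, the sketch below fits two small forests and compares their total number of tree nodes, which is a reasonable proxy for how much memory the fitted model holds. The synthetic data and the specific values of max_depth and min_samples_leaf are assumptions chosen only for the demo, not recommendations for the asker's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used here only for illustration.
rng = np.random.RandomState(0)
X = rng.rand(2000, 10)
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=2000)


def total_nodes(forest):
    """Sum the node counts of all fitted trees in the forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Default size parameters: fully grown, unpruned trees.
full = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Constrained trees: limited depth and larger leaves.
small = RandomForestRegressor(
    n_estimators=10, max_depth=8, min_samples_leaf=20, random_state=0
).fit(X, y)

full_nodes = total_nodes(full)
small_nodes = total_nodes(small)
print("nodes with defaults:", full_nodes)
print("nodes when constrained:", small_nodes)
```

On real data the accuracy cost of the smaller trees should be checked on a validation set before settling on values.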

What to Do?

There's not much you can do, especially since Scikit-learn does not offer an option to manage the storage issue on the fly (as far as I am aware).

Rather, you need to change the values of the above-mentioned parameters, for example:

Try to keep only the most important features if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).

Try to reduce the number of estimators.

max_depth is None by default, which means the nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Setting it to a finite value caps the size of each tree.

min_samples_leaf is 1 by default: a split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. Raising it gives smaller trees and may have the effect of smoothing the model, especially in regression.

So try to change the parameters with an understanding of their effect on performance; the reference you need is this.
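The feature-selection suggestion above could be sketched roughly as follows, using SelectFromModel with a small, shallow forest as the ranker. The synthetic data, the "mean" threshold, and the ranker's settings are all assumptions made for the illustration; on the real data you would fit the ranker on a subsample that fits in memory.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only the first three columns influence the target.
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = 5.0 * X[:, 0] + 4.0 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(scale=0.05, size=1000)

# A small, shallow forest is enough to rank the features.
ranker = RandomForestRegressor(n_estimators=20, max_depth=6, random_state=0)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(ranker, threshold="mean").fit(X, y)
X_reduced = selector.transform(X)
kept = np.flatnonzero(selector.get_support())

print("kept feature indices:", kept)
print("shape before:", X.shape, "after:", X_reduced.shape)
```

The final model would then be trained on X_reduced, which lowers both the size of the training matrix and the work done at each split.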

The last option you have is to create your own customized Random Forest from scratch and offload the metadata to hard disk, or do any other optimization. It's awkward, but just to mention the option, here is an example of a basic implementation!
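To make that idea a little more concrete, here is a minimal sketch (not the implementation linked above) that fits trees one at a time on bootstrap samples, writes each one to disk with joblib, and averages predictions by loading the trees back one by one. The helper names fit_forest_to_disk and predict_from_disk, the file naming, and all parameter values are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest_to_disk(X, y, n_trees, out_dir, max_depth=8, seed=0):
    """Fit trees one by one on bootstrap samples and save each to out_dir."""
    rng = np.random.RandomState(seed)
    paths = []
    for i in range(n_trees):
        # Bootstrap sample: draw row indices with replacement.
        idx = rng.randint(0, len(X), size=len(X))
        tree = DecisionTreeRegressor(
            max_depth=max_depth,
            max_features=0.5,
            random_state=rng.randint(2**31 - 1),
        )
        tree.fit(X[idx], y[idx])
        path = os.path.join(out_dir, f"tree_{i}.joblib")
        joblib.dump(tree, path)
        paths.append(path)
        del tree  # only one fitted tree is held in memory at a time
    return paths


def predict_from_disk(paths, X):
    """Average the predictions of the saved trees, loading one at a time."""
    total = np.zeros(len(X))
    for path in paths:
        total += joblib.load(path).predict(X)
    return total / len(paths)


# Small synthetic example, for illustration only.
rng = np.random.RandomState(1)
X = rng.rand(500, 5)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

with tempfile.TemporaryDirectory() as tmp:
    paths = fit_forest_to_disk(X, y, n_trees=5, out_dir=tmp)
    pred = predict_from_disk(paths, X)

print(pred[:5])
```

This trades speed for memory: the fitted forest never sits in RAM as a whole, but the training data still does, and each bootstrap sample is a temporary copy of it.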

Side-Note:

In practice, I found on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine. I always find it more efficient to keep the default setting, n_jobs=None, although theoretically speaking it should be the opposite!
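Whether this holds on a given machine is easy to check. The sketch below times the same fit under a few n_jobs settings; the helper time_fit, the synthetic data, and the forest size are assumptions for the demo, and the timings will differ from machine to machine.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(5000, 10)
y = X[:, 0] + rng.normal(scale=0.1, size=5000)


def time_fit(n_jobs):
    """Return wall-clock seconds to fit a small forest with the given n_jobs."""
    model = RandomForestRegressor(
        n_estimators=20, max_depth=10, n_jobs=n_jobs, random_state=0
    )
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# None is the default, 2 is a fixed cap, -1 uses all available cores.
timings = {n_jobs: time_fit(n_jobs) for n_jobs in (None, 2, -1)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: {seconds:.2f}s")
```

A fixed small value such as n_jobs=2 is a middle ground worth trying when -1 makes a shared or memory-limited environment unresponsive.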

edited Nov 23 at 20:25

answered Nov 23 at 20:20

Yahya

3,5192828

Thanks a lot for the detailed and helpful reply. I was able to get it running in almost the capacity by reducing the n_estimators greatly! But will keep the above in mind while using RF from Scikit-learn.
– specbug
Nov 24 at 6:43

@specbug Glad I could help :)
– Yahya
Nov 24 at 10:16
