Feature importance determination and correlation

I want to know which of my variables have the strongest effect on SalePrice
in my DataFrame df_train.

   Id  MSSubClass MSZoning    ...     SaleType  SaleCondition SalePrice
0   1          60       RL    ...           WD         Normal    208500
1   2          20       RL    ...           WD         Normal    181500
2   3          60       RL    ...           WD         Normal    223500
3   4          70       RL    ...           WD        Abnorml    140000
4   5          60       RL    ...           WD         Normal    250000

For this purpose, I have analyzed the correlation as well as sklearn's feature_importances_.
The code for the correlation and its heatmap visualization is:

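import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlation of the columns
# (on pandas >= 2.0, pass numeric_only=True because df_train has string columns)
corrmat = df_train.corr()
k = 20  # number of variables for heatmap
# The k columns most correlated with SalePrice
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()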

And the code for determining feature importance is:

feature_labels = np.array(['OverallQual', 'GrLivArea', 'SimplOverallQual', 'ExterQual', 'GarageCars',
                           'KitchenQual', 'SimplExterQual', 'GarageArea', 'SimplKitchenQual', 'TotalBsmtSF',
                           'FullBath', 'YearBuilt', '1stFlrSF', 'YearRemodAdd', 'TotRmsAbvGrd',
                           'Fireplaces', 'HeatingQC', 'LotArea', 'MasVnrArea'])
importance = model.feature_importances_
feature_indexes_by_importance = importance.argsort()  # ascending order
indices = np.argsort(importance)[::-1]  # descending order (not used below)
for index in feature_indexes_by_importance:
    print('{}-{:.2f}%'.format(feature_labels[index], (importance[index] *100.0)))
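A side note on the loop above: it walks the indices in ascending order of importance, so the most important feature is printed last. Below is a minimal sketch of a descending ranking. The helper name `rank_features` is made up for illustration, and the length check is an assumption about what may go wrong: a hand-written label list only makes sense if it has one entry per column the model was fitted on, in the same order. The three values are the ones reported in the question.

```python
import numpy as np


def rank_features(labels, importance):
    """Return (label, importance) pairs sorted from most to least important.

    Hypothetical helper: `labels` must list the training columns in the
    same order as the columns of the matrix the model was fitted on.
    """
    labels = np.asarray(labels)
    importance = np.asarray(importance, dtype=float)
    if labels.shape[0] != importance.shape[0]:
        raise ValueError(
            "got {} labels for {} importances".format(
                labels.shape[0], importance.shape[0]
            )
        )
    order = np.argsort(importance)[::-1]  # largest importance first
    return [(str(labels[i]), float(importance[i])) for i in order]


# The three importances quoted in the question, as fractions.
ranked = rank_features(
    ["GarageArea", "GrLivArea", "LotArea"], [0.0971, 0.1543, 0.1746]
)
for name, score in ranked:
    print("{}-{:.2f}%".format(name, score * 100.0))
```

If the label list and the fitted columns ever get out of step, the check raises instead of silently attaching importances to the wrong names.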

'OverallQual', 'GrLivArea' and 'SimplQual' are the variables most correlated with SalePrice according to the heatmap.
According to the feature importances, the most important ones are:

GarageArea-9.71%
GrLivArea-15.43%
LotArea-17.46%

What could explain why the correlation and sklearn's feature_importances_ don't agree?
Thanks

asked Nov 22 at 18:01

Ley

193

How are these features correlated among themselves?
– Vivek Kumar
Nov 23 at 8:41
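One way to look into the question raised in this comment is to list the feature pairs whose mutual correlation is high, since strongly correlated features can end up sharing importance in a tree ensemble. This is only a sketch on made-up data: the helper name `correlated_pairs`, the 0.8 threshold, and the columns `a`, `b`, `c` are all assumptions for illustration, not part of the original code.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.8):
    """List (name_i, name_j, r) for column pairs with |Pearson r| >= threshold."""
    # rowvar=False tells NumPy that each column (not each row) is a variable.
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Toy data: b is an exact linear function of a, c is an alternating pattern.
a = np.arange(10.0)
b = 2.0 * a + 1.0
c = np.array([1.0, -1.0] * 5)
X = np.column_stack([a, b, c])

pairs = correlated_pairs(X, ["a", "b", "c"])
print(pairs)
```

On the real data the same idea would be applied to the numeric feature columns only, leaving SalePrice out.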

python heatmap correlation feature-selection


1 Answer

I suppose you are talking about the feature_importances_ of a forest of trees? (https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)

Correlation measures the linear relationship between each feature and your output. A random forest is a non-linear model, and its importances are not based on linear correlation, so it can pick out the features that matter most to the task in a non-linear way.
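To make the point about linear correlation concrete, here is a small NumPy sketch on synthetic data (the variables `x` and `y` are made up and have nothing to do with the housing columns). The target is fully determined by the feature, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# Synthetic feature on a symmetric grid and a purely quadratic target.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation between the raw feature and the target.
r_raw = np.corrcoef(x, y)[0, 1]

# Pearson correlation after applying the right non-linear transform.
r_squared_feature = np.corrcoef(x ** 2, y)[0, 1]

print("corr(x, y)    = {:.3f}".format(r_raw))
print("corr(x**2, y) = {:.3f}".format(r_squared_feature))
```

A tree-based model can split on `x` at several thresholds and recover this kind of dependence, which is one reason a feature may rank low in a correlation heatmap and still receive a high importance.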

answered Nov 22 at 18:05

Matthieu Brucher

11.8k22137
