Regression with Lots of Categorical Variables

I'm facing a regression task with many categorical and few numeric features. I encoded the categorical features as dummies and dropped the first dummy column for each feature.

I am not getting a very good R2 at all. I am wondering whether, aside from creating dummies, there are any special strategies for situations with so many categorical features.

For example, are any regressors better suited to dummy variables?

asked 4 hours ago

Odisseo

124

What do you want to accomplish with this regression?
– Dimitriy V. Masterov
3 hours ago

@DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!
– Odisseo
3 hours ago

How much data do you have in terms of observations and non-excluded dummies?
– Dimitriy V. Masterov
3 hours ago

I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.
– Odisseo
3 hours ago

regression categorical-data categorical-encoding

4 Answers

$R^2$ on its own isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11, and assume that you are modeling human behavior. Think about how many behaviors you engage in each day and how many distinct categories they fall into. Now consider the diversity of people on the planet, and add the differences in environment, education, and language. Consider the various incentives in place, and then the population size. Depending on what you are doing, an $R^2$ of 0.11 may be huge: you have accounted for 11% of the variation in observed behavior. On the other hand, for a well-understood process such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.

A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all. If a joint category contains only one or two observations, how is that affecting your parameter estimates?

Consider gender as a category alongside a variable counting the children a person has given birth to. The regression would still calculate a parameter for men, but what would it imply? What if there was an encoding error and some men were recorded as having given birth?

Consider three possible solutions: an information criterion, Bayes factors, or Bayesian model selection. For the information criterion, rather than AIC or BIC you should look for one appropriate to your problem. I would suggest an information criterion over the other two. Bayesian methods do not appear to be what you are doing, because Bayesian hypotheses are combinatoric, and they are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.

If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons not to use step-wise regression, but a good reason to use it is that the set of candidate models is too large to search any other way.

Finally, just create a multi-dimensional table of the combinations of your categorical variables and look at the counts. Are any of the counts so small that they are probably creating a poor model? Leave out one variable at a time: is there any subset that markedly improves the worst-case counts? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.
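As a rough illustration of the cell-count check described above, here is a minimal sketch in plain Python. The `region` and `plan` categories and the `min_count` threshold are made up for the example; nothing here comes from the asker's data.

```python
from collections import Counter
from itertools import product

# Hypothetical records: one (region, plan) pair per observation.
rows = [
    ("north", "basic"), ("north", "basic"), ("north", "premium"),
    ("south", "basic"), ("south", "basic"), ("south", "basic"),
    ("east", "premium"),
]

def joint_cell_counts(rows):
    """Count observations in every joint cell, including cells that are empty."""
    counts = Counter(rows)
    # Distinct levels of each categorical variable, column by column.
    levels = [sorted(set(col)) for col in zip(*rows)]
    return {cell: counts.get(cell, 0) for cell in product(*levels)}

def sparse_cells(rows, min_count):
    """Return the joint cells holding fewer than min_count observations."""
    return {c: n for c, n in joint_cell_counts(rows).items() if n < min_count}

cells = joint_cell_counts(rows)
small = sparse_cells(rows, min_count=2)  # threshold chosen arbitrarily
for cell, n in sorted(small.items()):
    print(cell, n)
```

Empty cells are listed explicitly because a combination that never occurs is exactly the kind of thing that is easy to miss. With a pandas DataFrame, a `groupby` on the categorical columns followed by `size()` gives the non-empty cells directly.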

If the set is too large to inspect the categories visually, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You want as much variability in your model as possible, so you want to drop categorical or ordinal variables that are strongly associated with others already in the model.
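One common measure of association between two categorical variables is Cramér's V, which rescales the chi-squared statistic to the range 0 to 1. The answer does not name a specific measure, so treat this as one possible choice. The sketch below computes the uncorrected version (no small-sample bias correction) in plain Python.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length sequences of labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed joint counts
    row_tot = Counter(xs)          # marginal counts of the first variable
    col_tot = Counter(ys)          # marginal counts of the second variable
    chi2 = 0.0
    for r, r_n in row_tot.items():
        for c, c_n in col_tot.items():
            expected = r_n * c_n / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))

# Two made-up variables that encode the same information under different labels.
a = ["a", "a", "b", "b"] * 5
b = ["x", "x", "y", "y"] * 5
v_same = cramers_v(a, b)

# Two made-up variables whose levels are perfectly balanced against each other.
v_indep = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(v_same, v_indep)
```

A pair with V close to 1 is a candidate for dropping one of the two variables, while a value near 0 suggests they carry separate information.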

Unfortunately, there isn't an automatic, clean answer to your problem. Some reduction tools, such as principal components analysis or factor analysis, probably will not work as intended, because overlaps in categorical variables can have the effect of forcing orthogonality, depending on the specifics of what you are doing.

answered 3 hours ago

Dave Harris

3,484515

Thank you, great answer!!
– Odisseo
2 hours ago

I will give you a stats answer to an ML question.

First, a good $R^2$ need not be high (whatever that means). It really depends on your initial hypothesis, and sometimes there is only so much that you can do with the data you have (i.e., you have poor explanatory variables).

Second, too many categories (that is, explanatory variables or features) can constrain the model too much; this depends greatly on your $N$. One indication is to compare the $R^2$ to the adjusted $R^2$, which is adjusted downwards to account for the number of explanatory variables. If the two are similar, the number of variables is not a problem relative to your $N$. If there is a large difference, you might want to cut back (see next).

Third, from a theoretical point of view, too many categories can be too much. There is something to be said for theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left alone? e.g., multilevel models, clustering...).

Finally, grouping the categories of a categorical variable can also help with managing interactions. Often the main effect is small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).
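To make the comparison between $R^2$ and adjusted $R^2$ concrete, here is a small sketch using the standard formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$. The sample size and predictor count roughly match the figures given in the comments (about 50k rows and 40 predictors); the $R^2$ of 0.30 and the small-sample case are assumed purely for illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a linear model with an intercept."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof

r2 = 0.30  # hypothetical in-sample R^2

# Roughly the asker's setting: many rows, about 40 predictors.
large_n = adjusted_r2(r2, n_obs=50_000, n_predictors=40)

# Same predictors on a hypothetical sample of only 100 rows.
small_n = adjusted_r2(r2, n_obs=100, n_predictors=40)

print(f"n=50000: gap = {r2 - large_n:.4f}")
print(f"n=100:   gap = {r2 - small_n:.4f}")
```

With tens of thousands of rows the penalty for 40 predictors is tiny, whereas with 100 rows the same model would have a negative adjusted $R^2$.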

Just as a side note: you don't have to manually encode categorical variables as dummies. Most statistical languages have ways to handle this (e.g., in R you can use factors, and in Stata the i.categorical_var notation).

answered 3 hours ago

Yuval Spiegler

1,4141827

Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was 0.32, with polynomial features and a gradient boosting regressor; all my other models score very low. I am doing some grid search cross-validation with different alphas, but I think it will be hard to move beyond this limit.
– Odisseo
3 hours ago

After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think that's very many. In any case, I also tried calculating a Pearson correlation matrix earlier, and none of the original features had any correlation with the target variable, so I might give up.
– Odisseo
3 hours ago

How about doing a PCA (Principal Component Analysis)? It is a dimension reduction technique: it lets you summarize and visualize a multivariate dataset by expressing its most important information as a few new variables.

answered 3 hours ago

Ana

What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have about 50k instances. About 25 of these features are dummies.
– Odisseo
3 hours ago

Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.

If you want to stick with regression, I would try to include some interactions between your variables, chosen based on domain expertise, while ensuring that the cells don't get too small.
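The two suggestions above can be compared on a toy problem. This is a sketch on synthetic data, assuming NumPy and scikit-learn are installed. The variables `group` and `x` are invented, and the data is deliberately built so that the effect of `x` flips sign with the category, which is a pure interaction. Real data will rarely be this clean.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)  # hypothetical dummy-coded category
x = rng.normal(size=n)              # hypothetical numeric feature
# The slope of x is +3 in one group and -3 in the other, plus noise.
y = np.where(group == 1, 3.0 * x, -3.0 * x) + rng.normal(scale=0.5, size=n)

X_main = np.column_stack([group, x])              # main effects only
X_inter = np.column_stack([group, x, group * x])  # plus the interaction term

def cv_r2(model, X):
    """Mean 5-fold cross-validated R^2 of model on (X, y)."""
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

r2_main = cv_r2(LinearRegression(), X_main)
r2_inter = cv_r2(LinearRegression(), X_inter)
r2_forest = cv_r2(RandomForestRegressor(n_estimators=100, random_state=0), X_main)

print(f"linear, main effects only: {r2_main:.2f}")
print(f"linear, with interaction:  {r2_inter:.2f}")
print(f"random forest, no manual interaction: {r2_forest:.2f}")
```

In this constructed case the main-effects regression explains almost nothing. Adding the one interaction column fixes the linear model, and the random forest finds the same structure without being told about it.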

edited 3 hours ago

answered 3 hours ago

Dimitriy V. Masterov

20.5k14092

Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well.
– Odisseo
3 hours ago
