Stepwise regression using p-values to drop variables with nonsignificant p-values

I want to perform a stepwise linear regression using p-values as a selection criterion: at each step, drop the variable with the highest (i.e. most insignificant) p-value, and stop when all remaining p-values are significant at some threshold alpha.



I am fully aware that I should use the AIC (e.g. the command step or stepAIC) or some other criterion instead, but my boss has no grasp of statistics and insists on using p-values.



If necessary, I could program my own routine, but I am wondering if there is an already implemented version of this.

Tags: r, statistics, regression, p-value

asked Sep 13 '10 at 14:05 by DainisZ (question last edited Feb 12 at 10:29 by zx8754)

  • I hope your boss doesn't read stackoverflow. ;-)
    – Joshua Ulrich, Sep 13 '10 at 14:06

  • You might consider asking this question over at stats.stackexchange.com.
    – Shane, Sep 13 '10 at 15:40

  • It might be easier to teach your boss good stats than to get R to do bad stats.
    – Richard Herron, Sep 13 '10 at 19:26

  • Just pick three variables at random - you'll probably do just as well as step-wise regression.
    – hadley, Sep 13 '10 at 21:37

  • Does your boss also tell his doctor what medicine to prescribe and his mechanic how to fix his car?
    – Rob Hyndman, Sep 14 '10 at 1:51

6 Answers

Accepted answer (27 votes), answered Sep 13 '10 at 15:32 by Joris Meys, edited Sep 13 '10 at 23:57

Show your boss the following:



set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))

y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))


Which gives:



            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1525     0.3066  -0.498  0.61995    
x1            1.8693     0.6045   3.092  0.00261 ** 
x2b           2.5149     0.4334   5.802 8.77e-08 ***
x2c           0.3089     0.4475   0.690  0.49180    
x1:x2b       -1.1239     0.8022  -1.401  0.16451    
x1:x2c       -1.0497     0.7873  -1.333  0.18566    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Now, which one would you exclude based on the p-values? x2 is the most significant and the most non-significant variable at the same time.





Edit, to clarify: this example is not the best, as indicated in the comments. The procedure in Stata and SPSS is, AFAIK, also not based on the p-values of the t-test on the coefficients, but on the F-test after removal of one of the variables.



I have a function that does exactly that. It is a selection on "the p-value", but not on the p-values of the t-test on the coefficients or of the anova results. Well, feel free to use it if it looks useful to you.



#####################################
# Automated model selection
# Author : Joris Meys
# version : 0.2
# date : 12/01/09
#####################################
# CHANGE LOG
# 0.2 : check for empty scopevar vector
#####################################

# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x, terms){
    out <- sapply(terms, function(i){
        sum(1 - (strsplit(x, ":")[[1]] %in% strsplit(i, ":")[[1]])) == 0
    })
    return(sum(out) > 0)
}

# Function model.select
# model is the lm object of the full model
# keep is a list of model terms to keep in the model at all times
# sig gives the significance level for removal of a variable. Can be 0.1 too (see SPSS)
# verbose=TRUE prints the F-tests, the dropped variable and the resulting model after each step
model.select <- function(model, keep, sig = 0.05, verbose = FALSE){
    counter <- 1
    # check input
    if(!is(model, "lm")) stop(paste(deparse(substitute(model)), "is not an lm object\n"))
    # calculate scope for drop1 function
    terms <- attr(model$terms, "term.labels")
    if(missing(keep)){ # set scopevars to all terms
        scopevars <- terms
    } else { # select the scopevars if keep is used
        index <- match(keep, terms)
        # check if all is specified correctly
        if(sum(is.na(index)) > 0){
            novar <- keep[is.na(index)]
            warning(paste(
                c(novar, "cannot be found in the model",
                  "\nThese terms are ignored in the model selection."),
                collapse = " "))
            index <- as.vector(na.omit(index))
        }
        scopevars <- terms[-index]
    }

    # Backward model selection:
    while(TRUE){
        # extract the test statistics from drop1
        test <- drop1(model, scope = scopevars, test = "F")

        if(verbose){
            cat("-------------STEP ", counter, "-------------\n",
                "The drop statistics :\n")
            print(test)
        }

        pval <- test[, dim(test)[2]]
        names(pval) <- rownames(test)
        pval <- sort(pval, decreasing = TRUE)

        if(sum(is.na(pval)) > 0) stop(paste("Model", deparse(substitute(model)),
            "is invalid. Check if all coefficients are estimated."))

        # check if all significant
        if(pval[1] < sig) break # stops the loop if all remaining vars are significant

        # select var to drop: skip terms that are part of a higher-order interaction
        i <- 1
        while(TRUE){
            dropvar <- names(pval)[i]
            check.terms <- terms[-match(dropvar, terms)]
            x <- has.interaction(dropvar, check.terms)
            if(x){ i <- i + 1; next } else { break }
        } # end while(TRUE) drop var

        if(pval[i] < sig) break # stops the loop if the var to remove is significant

        if(verbose){
            cat("\n--------\nTerm dropped in step", counter, ":", dropvar, "\n--------\n\n")
        }

        # update terms, scopevars and model
        scopevars <- scopevars[-match(dropvar, scopevars)]
        terms <- terms[-match(dropvar, terms)]

        formul <- as.formula(paste(". ~ . -", dropvar))
        model <- update(model, formul)

        if(length(scopevars) == 0){
            warning("All variables are thrown out of the model.\n",
                    "No model could be specified.")
            return()
        }
        counter <- counter + 1
    } # end while(TRUE) main loop
    return(model)
}

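A minimal usage sketch (assuming the simulated y, x1 and x2 from the start of this answer are still in the workspace):

fit <- lm(y ~ x1*x2)
# backward selection on the drop1 F-test p-values; verbose=TRUE prints the table at each step
reduced <- model.select(fit, sig = 0.05, verbose = TRUE)
summary(reduced)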




  • Nice example - I could have used this a couple of months ago!
    – Matt Parker, Sep 13 '10 at 15:43

  • I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions, and then the potential quadratic terms, before considering any explanatory variables for removal.
    – George Dontas, Sep 13 '10 at 15:56

  • Of course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
    – Joris Meys, Sep 13 '10 at 15:59

  • No chance of arguing with the boss at a level of statistics that sophisticated ;-) Nevertheless I'll try, and will do the stepwise selection "by hand" until then. It's just weird that there isn't an R function for that, even though the p-value approach dates from the time when the computational cost of computing the AIC was too high ...
    – DainisZ, Sep 13 '10 at 16:00

  • I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
    – Joris Meys, Sep 13 '10 at 16:56


Answer (17 votes), answered Jun 14 '12 at 3:37 by leonie, edited Feb 12 at 10:36 by Jaap

Why not try using the step() function specifying your testing method?



For example, for backward elimination, you need only one command:



step(FullModel, direction = "backward", test = "F")


and for stepwise selection, simply:



step(FullModel, direction = "both", test = "F")


This displays the AIC values as well as the F statistics and p-values.

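As a minimal runnable sketch, where FullModel stands in for whatever full model you have fitted (here illustrated, purely as an assumption, with the built-in mtcars data):

FullModel <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # illustrative full model
reduced <- step(FullModel, direction = "backward", test = "F")
summary(reduced)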





  • Note that this procedure will add/remove the variable with the most significant test value, but the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables left (as the OP wants).
    – Davor Josipovic, Dec 5 '16 at 16:10





Answer (10 votes), answered Sep 13 '10 at 14:39 by George Dontas

Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.



model1 <- lm(ozone ~ temp*wind*rad)
summary(model1)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    5.683e+02  2.073e+02   2.741  0.00725 **
temp          -1.076e+01  4.303e+00  -2.501  0.01401 * 
wind          -3.237e+01  1.173e+01  -2.760  0.00687 **
rad           -3.117e-01  5.585e-01  -0.558  0.57799   
temp:wind      2.377e-01  1.367e-01   1.739  0.08519   
temp:rad       8.402e-03  7.512e-03   1.119  0.26602   
wind:rad       2.054e-02  4.892e-02   0.420  0.47552   
temp:wind:rad -4.324e-04  6.595e-04  -0.656  0.51358   


The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:



model2 <- update(model1,~. - temp:wind:rad)
summary(model2)


Depending on the results, you can continue simplifying your model:



model3 <- update(model2,~. - temp:rad)
summary(model3)
...


Alternatively, you can use the automatic model simplification function step to see how well it does:



model_step <- step(model1)





  • Thanks for your answer. I just realized that I did not state my question clearly enough: I am looking for a command or package that does this automatically, using the p-value as a criterion. The command "step" uses the AIC. I could also do it by "hand" as you suggest, or program some routine that does this automatically, but an already implemented version would come in very handy.
    – DainisZ, Sep 13 '10 at 15:03



Answer (7 votes)

If you are just trying to get the best predictive model, then perhaps it doesn't matter too much, but for anything else, don't bother with this sort of model selection. It is wrong.



Use a shrinkage method such as ridge regression (e.g. lm.ridge() in package MASS), or the lasso, or the elastic net (a combination of ridge and lasso constraints). Of these, only the lasso and the elastic net will do some form of model selection, i.e. force the coefficients of some covariates to zero.



See the Regularization and Shrinkage section of the Machine Learning task view on CRAN.

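A minimal, hypothetical sketch of lasso-based selection, assuming the glmnet package (one option from that task view; the answer itself does not name a package) and using the built-in mtcars data purely for illustration:

library(glmnet)                                  # assumed available; fits lasso/elastic net paths
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix, intercept column dropped
y <- mtcars$mpg
cvfit <- cv.glmnet(x, y, alpha = 1)              # alpha = 1 is the pure lasso penalty
coef(cvfit, s = "lambda.min")                    # covariates shrunk exactly to zero are deselected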








Answer (7 votes), by Gavin Simpson

Package rms: Regression Modeling Strategies has fastbw(), which does exactly what you need. There is even a parameter to flip from AIC to p-value-based elimination.











Answer (0 votes), by fhernanb

As mentioned by Gavin Simpson, the function fastbw from the rms package can be used to select variables using the p-value. Below is an example based on the one given by George Dontas. Use the option rule='p' to select the p-value criterion.

require(rms)
model1 <- ols(Ozone ~ Temp * Wind * Solar.R, data=airquality)
model2 <- fastbw(fit=model1, rule="p", sls=0.05)
model2






  • In the text, you meant the rms package. There is also an rsm package, so this could cause confusion.
    – rvl, Nov 23 at 19:47

  • Thanks rvl for the clarification. I fixed it.
    – fhernanb, Nov 23 at 20:11

      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f3701170%2fstepwise-regression-using-p-values-to-drop-variables-with-nonsignificant-p-value%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      27
      down vote



      accepted










      Show your boss the following :



      set.seed(100)
      x1 <- runif(100,0,1)
      x2 <- as.factor(sample(letters[1:3],100,replace=T))

      y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
      summary(lm(y~x1*x2))


      Which gives :



                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -0.1525 0.3066 -0.498 0.61995
      x1 1.8693 0.6045 3.092 0.00261 **
      x2b 2.5149 0.4334 5.802 8.77e-08 ***
      x2c 0.3089 0.4475 0.690 0.49180
      x1:x2b -1.1239 0.8022 -1.401 0.16451
      x1:x2c -1.0497 0.7873 -1.333 0.18566
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


      Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.





      Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.



      I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.



      #####################################
      # Automated model selection
      # Author : Joris Meys
      # version : 0.2
      # date : 12/01/09
      #####################################
      #CHANGE LOG
      # 0.2 : check for empty scopevar vector
      #####################################

      # Function has.interaction checks whether x is part of a term in terms
      # terms is a vector with names of terms from a model
      has.interaction <- function(x,terms){
      out <- sapply(terms,function(i){
      sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
      })
      return(sum(out)>0)
      }

      # Function Model.select
      # model is the lm object of the full model
      # keep is a list of model terms to keep in the model at all times
      # sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
      # verbose=T gives the F-tests, dropped var and resulting model after
      model.select <- function(model,keep,sig=0.05,verbose=F){
      counter=1
      # check input
      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm objectn"))
      # calculate scope for drop1 function
      terms <- attr(model$terms,"term.labels")
      if(missing(keep)){ # set scopevars to all terms
      scopevars <- terms
      } else{ # select the scopevars if keep is used
      index <- match(keep,terms)
      # check if all is specified correctly
      if(sum(is.na(index))>0){
      novar <- keep[is.na(index)]
      warning(paste(
      c(novar,"cannot be found in the model",
      "nThese terms are ignored in the model selection."),
      collapse=" "))
      index <- as.vector(na.omit(index))
      }
      scopevars <- terms[-index]
      }

      # Backward model selection :

      while(T){
      # extract the test statistics from drop.
      test <- drop1(model, scope=scopevars,test="F")

      if(verbose){
      cat("-------------STEP ",counter,"-------------n",
      "The drop statistics : n")
      print(test)
      }

      pval <- test[,dim(test)[2]]

      names(pval) <- rownames(test)
      pval <- sort(pval,decreasing=T)

      if(sum(is.na(pval))>0) stop(paste("Model",
      deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

      # check if all significant
      if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

      # select var to drop
      i=1
      while(T){
      dropvar <- names(pval)[i]
      check.terms <- terms[-match(dropvar,terms)]
      x <- has.interaction(dropvar,check.terms)
      if(x){i=i+1;next} else {break}
      } # end while(T) drop var

      if(pval[i]<sig) break # stops the loop if var to remove is significant

      if(verbose){
      cat("n--------nTerm dropped in step",counter,":",dropvar,"n--------nn")
      }

      #update terms, scopevars and model
      scopevars <- scopevars[-match(dropvar,scopevars)]
      terms <- terms[-match(dropvar,terms)]

      formul <- as.formula(paste(".~.-",dropvar))
      model <- update(model,formul)

      if(length(scopevars)==0) {
      warning("All variables are thrown out of the model.n",
      "No model could be specified.")
      return()
      }
      counter=counter+1
      } # end while(T) main loop
      return(model)
      }





      share|improve this answer























      • Nice example - I could have used this a couple of months ago!
        – Matt Parker
        Sep 13 '10 at 15:43










      • I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
        – George Dontas
        Sep 13 '10 at 15:56










      • off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
        – Joris Meys
        Sep 13 '10 at 15:59












      • No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
        – DainisZ
        Sep 13 '10 at 16:00








      • 1




        I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
        – Joris Meys
        Sep 13 '10 at 16:56















      up vote
      27
      down vote



      accepted










      Show your boss the following :



      set.seed(100)
      x1 <- runif(100,0,1)
      x2 <- as.factor(sample(letters[1:3],100,replace=T))

      y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
      summary(lm(y~x1*x2))


      Which gives :



                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -0.1525 0.3066 -0.498 0.61995
      x1 1.8693 0.6045 3.092 0.00261 **
      x2b 2.5149 0.4334 5.802 8.77e-08 ***
      x2c 0.3089 0.4475 0.690 0.49180
      x1:x2b -1.1239 0.8022 -1.401 0.16451
      x1:x2c -1.0497 0.7873 -1.333 0.18566
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


      Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.





      Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.



      I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.



      #####################################
      # Automated model selection
      # Author : Joris Meys
      # version : 0.2
      # date : 12/01/09
      #####################################
      #CHANGE LOG
      # 0.2 : check for empty scopevar vector
      #####################################

      # Function has.interaction checks whether x is part of a term in terms
      # terms is a vector with names of terms from a model
      has.interaction <- function(x,terms){
      out <- sapply(terms,function(i){
      sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
      })
      return(sum(out)>0)
      }

      # Function Model.select
      # model is the lm object of the full model
      # keep is a list of model terms to keep in the model at all times
      # sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
      # verbose=T gives the F-tests, dropped var and resulting model after
      model.select <- function(model,keep,sig=0.05,verbose=F){
      counter=1
      # check input
      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm objectn"))
      # calculate scope for drop1 function
      terms <- attr(model$terms,"term.labels")
      if(missing(keep)){ # set scopevars to all terms
      scopevars <- terms
      } else{ # select the scopevars if keep is used
      index <- match(keep,terms)
      # check if all is specified correctly
      if(sum(is.na(index))>0){
      novar <- keep[is.na(index)]
      warning(paste(
      c(novar,"cannot be found in the model",
      "nThese terms are ignored in the model selection."),
      collapse=" "))
      index <- as.vector(na.omit(index))
      }
      scopevars <- terms[-index]
      }

      # Backward model selection :

      while(T){
      # extract the test statistics from drop.
      test <- drop1(model, scope=scopevars,test="F")

      if(verbose){
      cat("-------------STEP ",counter,"-------------n",
      "The drop statistics : n")
      print(test)
      }

      pval <- test[,dim(test)[2]]

      names(pval) <- rownames(test)
      pval <- sort(pval,decreasing=T)

      if(sum(is.na(pval))>0) stop(paste("Model",
      deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

      # check if all significant
      if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

      # select var to drop
      i=1
      while(T){
      dropvar <- names(pval)[i]
      check.terms <- terms[-match(dropvar,terms)]
      x <- has.interaction(dropvar,check.terms)
      if(x){i=i+1;next} else {break}
      } # end while(T) drop var

      if(pval[i]<sig) break # stops the loop if var to remove is significant

      if(verbose){
      cat("n--------nTerm dropped in step",counter,":",dropvar,"n--------nn")
      }

      #update terms, scopevars and model
      scopevars <- scopevars[-match(dropvar,scopevars)]
      terms <- terms[-match(dropvar,terms)]

      formul <- as.formula(paste(".~.-",dropvar))
      model <- update(model,formul)

      if(length(scopevars)==0) {
      warning("All variables are thrown out of the model.n",
      "No model could be specified.")
      return()
      }
      counter=counter+1
      } # end while(T) main loop
      return(model)
      }





      share|improve this answer























      • Nice example - I could have used this a couple of months ago!
        – Matt Parker
        Sep 13 '10 at 15:43










      • I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
        – George Dontas
        Sep 13 '10 at 15:56










      • off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
        – Joris Meys
        Sep 13 '10 at 15:59












      • No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
        – DainisZ
        Sep 13 '10 at 16:00








      • 1




        I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
        – Joris Meys
        Sep 13 '10 at 16:56













      up vote
      27
      down vote



      accepted







      up vote
      27
      down vote



      accepted






      Show your boss the following :



      set.seed(100)
      x1 <- runif(100,0,1)
      x2 <- as.factor(sample(letters[1:3],100,replace=T))

      y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
      summary(lm(y~x1*x2))


      Which gives :



                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -0.1525 0.3066 -0.498 0.61995
      x1 1.8693 0.6045 3.092 0.00261 **
      x2b 2.5149 0.4334 5.802 8.77e-08 ***
      x2c 0.3089 0.4475 0.690 0.49180
      x1:x2b -1.1239 0.8022 -1.401 0.16451
      x1:x2c -1.0497 0.7873 -1.333 0.18566
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


      Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.





      Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.



      I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.



      #####################################
      # Automated model selection
      # Author : Joris Meys
      # version : 0.2
      # date : 12/01/09
      #####################################
      #CHANGE LOG
      # 0.2 : check for empty scopevar vector
      #####################################

      # Function has.interaction checks whether x is part of a term in terms
      # terms is a vector with names of terms from a model
      has.interaction <- function(x,terms){
      out <- sapply(terms,function(i){
      sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
      })
      return(sum(out)>0)
      }

      # Function Model.select
      # model is the lm object of the full model
      # keep is a list of model terms to keep in the model at all times
      # sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
      # verbose=T gives the F-tests, dropped var and resulting model after
      model.select <- function(model,keep,sig=0.05,verbose=F){
      counter=1
      # check input
      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm objectn"))
      # calculate scope for drop1 function
      terms <- attr(model$terms,"term.labels")
      if(missing(keep)){ # set scopevars to all terms
      scopevars <- terms
      } else{ # select the scopevars if keep is used
      index <- match(keep,terms)
      # check if all is specified correctly
      if(sum(is.na(index))>0){
      novar <- keep[is.na(index)]
      warning(paste(
      c(novar,"cannot be found in the model",
      "nThese terms are ignored in the model selection."),
      collapse=" "))
      index <- as.vector(na.omit(index))
      }
      scopevars <- terms[-index]
      }

      # Backward model selection :

      while(T){
      # extract the test statistics from drop.
      test <- drop1(model, scope=scopevars,test="F")

      if(verbose){
      cat("-------------STEP ",counter,"-------------n",
      "The drop statistics : n")
      print(test)
      }

      pval <- test[,dim(test)[2]]

      names(pval) <- rownames(test)
      pval <- sort(pval,decreasing=T)

      if(sum(is.na(pval))>0) stop(paste("Model",
      deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

      # check if all significant
      if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

      # select var to drop
      i=1
      while(T){
      dropvar <- names(pval)[i]
      check.terms <- terms[-match(dropvar,terms)]
      x <- has.interaction(dropvar,check.terms)
      if(x){i=i+1;next} else {break}
      } # end while(T) drop var

      if(pval[i]<sig) break # stops the loop if var to remove is significant

      if(verbose){
      cat("n--------nTerm dropped in step",counter,":",dropvar,"n--------nn")
      }

      #update terms, scopevars and model
      scopevars <- scopevars[-match(dropvar,scopevars)]
      terms <- terms[-match(dropvar,terms)]

      formul <- as.formula(paste(".~.-",dropvar))
      model <- update(model,formul)

      if(length(scopevars)==0) {
      warning("All variables are thrown out of the model.n",
      "No model could be specified.")
      return()
      }
      counter=counter+1
      } # end while(T) main loop
      return(model)
      }





      share|improve this answer














      Show your boss the following :



      set.seed(100)
      x1 <- runif(100,0,1)
      x2 <- as.factor(sample(letters[1:3],100,replace=T))

      y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
      summary(lm(y~x1*x2))


      Which gives :



                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -0.1525 0.3066 -0.498 0.61995
      x1 1.8693 0.6045 3.092 0.00261 **
      x2b 2.5149 0.4334 5.802 8.77e-08 ***
      x2c 0.3089 0.4475 0.690 0.49180
      x1:x2b -1.1239 0.8022 -1.401 0.16451
      x1:x2c -1.0497 0.7873 -1.333 0.18566
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


      Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.





      Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.



      I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.



      #####################################
      # Automated model selection
      # Author : Joris Meys
      # version : 0.2
      # date : 12/01/09
      #####################################
      #CHANGE LOG
      # 0.2 : check for empty scopevar vector
      #####################################

      # Function has.interaction checks whether x is part of a term in terms
      # terms is a vector with names of terms from a model
      has.interaction <- function(x,terms){
      out <- sapply(terms,function(i){
      sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
      })
      return(sum(out)>0)
      }

      # Function Model.select
      # model is the lm object of the full model
      # keep is a list of model terms to keep in the model at all times
      # sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
      # verbose=T gives the F-tests, dropped var and resulting model after
      model.select <- function(model,keep,sig=0.05,verbose=F){
      counter=1
      # check input
      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm objectn"))
      # calculate scope for drop1 function
      terms <- attr(model$terms,"term.labels")
      if(missing(keep)){ # set scopevars to all terms
      scopevars <- terms
      } else{ # select the scopevars if keep is used
      index <- match(keep,terms)
      # check if all is specified correctly
      if(sum(is.na(index))>0){
      novar <- keep[is.na(index)]
      warning(paste(
      c(novar,"cannot be found in the model",
      "nThese terms are ignored in the model selection."),
      collapse=" "))
      index <- as.vector(na.omit(index))
      }
      scopevars <- terms[-index]
      }

      # Backward model selection :

      while(T){
      # extract the test statistics from drop.
      test <- drop1(model, scope=scopevars,test="F")

      if(verbose){
      cat("-------------STEP ",counter,"-------------n",
      "The drop statistics : n")
      print(test)
      }

      pval <- test[,dim(test)[2]]

      names(pval) <- rownames(test)
      pval <- sort(pval,decreasing=T)

      if(sum(is.na(pval))>0) stop(paste("Model",
      deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

      # check if all significant
      if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

      # select var to drop
      i=1
      while(T){
      dropvar <- names(pval)[i]
      check.terms <- terms[-match(dropvar,terms)]
      x <- has.interaction(dropvar,check.terms)
      if(x){i=i+1;next} else {break}
      } # end while(T) drop var

      if(pval[i]<sig) break # stops the loop if var to remove is significant

      if(verbose){
      cat("n--------nTerm dropped in step",counter,":",dropvar,"n--------nn")
      }

      #update terms, scopevars and model
      scopevars <- scopevars[-match(dropvar,scopevars)]
      terms <- terms[-match(dropvar,terms)]

      formul <- as.formula(paste(".~.-",dropvar))
      model <- update(model,formul)

      if(length(scopevars)==0) {
      warning("All variables are thrown out of the model.n",
      "No model could be specified.")
      return()
      }
      counter=counter+1
      } # end while(T) main loop
      return(model)
      }






      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Sep 13 '10 at 23:57

























      answered Sep 13 '10 at 15:32









      Joris Meys

      79.3k24179236




      79.3k24179236












      • Nice example - I could have used this a couple of months ago!
        – Matt Parker
        Sep 13 '10 at 15:43










      • I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
        – George Dontas
        Sep 13 '10 at 15:56










      • off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
        – Joris Meys
        Sep 13 '10 at 15:59












      • No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
        – DainisZ
        Sep 13 '10 at 16:00








      • 1




        I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
        – Joris Meys
        Sep 13 '10 at 16:56


















      • Nice example - I could have used this a couple of months ago!
        – Matt Parker
        Sep 13 '10 at 15:43










      • I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
        – George Dontas
        Sep 13 '10 at 15:56










      • off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
        – Joris Meys
        Sep 13 '10 at 15:59












      • No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
        – DainisZ
        Sep 13 '10 at 16:00








      • 1




        I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
        – Joris Meys
        Sep 13 '10 at 16:56
















      Nice example - I could have used this a couple of months ago!
      – Matt Parker
      Sep 13 '10 at 15:43




      Nice example - I could have used this a couple of months ago!
      – Matt Parker
      Sep 13 '10 at 15:43












      I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
      – George Dontas
      Sep 13 '10 at 15:56




      I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
      – George Dontas
      Sep 13 '10 at 15:56












      off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
      – Joris Meys
      Sep 13 '10 at 15:59






      off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
      – Joris Meys
      Sep 13 '10 at 15:59














      No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
      – DainisZ
      Sep 13 '10 at 16:00






      No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
      – DainisZ
      Sep 13 '10 at 16:00






      1




      1




      I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
      – Joris Meys
      Sep 13 '10 at 16:56




      I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
      – Joris Meys
      Sep 13 '10 at 16:56












      up vote
      17
      down vote













      Why not try using the step() function specifying your testing method?



      For example, for backward elimination, you type only a command:



      step(FullModel, direction = "backward", test = "F")


      and for stepwise selection, simply:



      step(FullModel, direction = "both", test = "F")


      This can display both the AIC values as well as the F and P values.






      share|improve this answer



















      • 1




        Note that this procedure will add/remove the variable with the most significant test value. But the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables (like the OP wants).
        – Davor Josipovic
        Dec 5 '16 at 16:10

















      up vote
      17
      down vote













      Why not try using the step() function specifying your testing method?



      For example, for backward elimination, you type only a command:



      step(FullModel, direction = "backward", test = "F")


      and for stepwise selection, simply:



      step(FullModel, direction = "both", test = "F")


      This can display both the AIC values as well as the F and P values.






      share|improve this answer



















      • 1




        Note that this procedure will add/remove the variable with the most significant test value. But the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables (like the OP wants).
        – Davor Josipovic
        Dec 5 '16 at 16:10















      up vote
      17
      down vote










      up vote
      17
      down vote









      Why not try using the step() function specifying your testing method?



      For example, for backward elimination, you type only a command:



      step(FullModel, direction = "backward", test = "F")


      and for stepwise selection, simply:



      step(FullModel, direction = "both", test = "F")


      This can display both the AIC values as well as the F and P values.






      share|improve this answer














      Why not try using the step() function specifying your testing method?



      For example, for backward elimination, you type only a command:



      step(FullModel, direction = "backward", test = "F")


      and for stepwise selection, simply:



      step(FullModel, direction = "both", test = "F")


      This can display both the AIC values as well as the F and P values.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Feb 12 at 10:36









      Jaap

      54.6k20116129




      54.6k20116129










      answered Jun 14 '12 at 3:37









      leonie

      17112




      17112








      • 1




        Note that this procedure will add/remove the variable with the most significant test value. But the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables (like the OP wants).
        – Davor Josipovic
        Dec 5 '16 at 16:10
















      • 1




        Note that this procedure will add/remove the variable with the most significant test value. But the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables (like the OP wants).
        – Davor Josipovic
        Dec 5 '16 at 16:10










      1




      1




      Note that this procedure will add/remove the variable with the most significant test value. But the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables (like the OP wants).
      – Davor Josipovic
      Dec 5 '16 at 16:10






      Note that this procedure will add/remove the variable with the most significant test value. But the choice to continue or stop remains based on AIC. I don't think it is possible to stop the procedure once there are no significant variables (like the OP wants).
      – Davor Josipovic
      Dec 5 '16 at 16:10












      up vote
      10
      down vote













      Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.



      model1 <-lm (ozone~temp*wind*rad)
      summary(model1)

      Coefficients:
      Estimate Std.Error t value Pr(>t)
      (Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
      temp -1.076e+01 4.303e+00 -2.501 0.01401 *
      wind -3.237e+01 1.173e+01 -2.760 0.00687 **
      rad -3.117e-01 5.585e-01 -0.558 0.57799
      temp:wind 2.377e-01 1.367e-01 1.739 0.08519
      temp:rad 8.402e-03 7.512e-03 1.119 0.26602
      wind:rad 2.054e-02 4.892e-02 0.420 0.47552
      temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358


      The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:



      model2 <- update(model1,~. - temp:wind:rad)
      summary(model2)


      Depending on the results, you can continue simplifying your model:



      model3 <- update(model2,~. - temp:rad)
      summary(model3)
      ...


      Alternatively you can use the automatic model simplification function step, to see
      how well it does:



      model_step <- step(model1)





      share|improve this answer

















      • 1




        Thanks for your answer. I just realized that I did not state my question clearlly enough: I am looking for a command or package that does this automatically, using the p-value as a criterion. The command "step" uses the AIC. I could also do it by "hand" as you suggets or programm some routine that does this automatically, but an already implemented version would come in very handy.
        – DainisZ
        Sep 13 '10 at 15:03

















      up vote
      10
      down vote













      Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.



      model1 <-lm (ozone~temp*wind*rad)
      summary(model1)

      Coefficients:
      Estimate Std.Error t value Pr(>t)
      (Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
      temp -1.076e+01 4.303e+00 -2.501 0.01401 *
      wind -3.237e+01 1.173e+01 -2.760 0.00687 **
      rad -3.117e-01 5.585e-01 -0.558 0.57799
      temp:wind 2.377e-01 1.367e-01 1.739 0.08519
      temp:rad 8.402e-03 7.512e-03 1.119 0.26602
      wind:rad 2.054e-02 4.892e-02 0.420 0.47552
      temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358


      The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:



      model2 <- update(model1,~. - temp:wind:rad)
      summary(model2)


      Depending on the results, you can continue simplifying your model:



      model3 <- update(model2,~. - temp:rad)
      summary(model3)
      ...


      Alternatively you can use the automatic model simplification function step, to see
      how well it does:



      model_step <- step(model1)





      share|improve this answer

















      • 1




        Thanks for your answer. I just realized that I did not state my question clearlly enough: I am looking for a command or package that does this automatically, using the p-value as a criterion. The command "step" uses the AIC. I could also do it by "hand" as you suggets or programm some routine that does this automatically, but an already implemented version would come in very handy.
        – DainisZ
        Sep 13 '10 at 15:03















      up vote
      7
      down vote













      If you are just trying to get the best predictive model, then perhaps it doesn't matter too much, but for anything else, don't bother with this sort of model selection. It is wrong.



Use a shrinkage method such as ridge regression (e.g. lm.ridge() in package MASS), or the lasso, or the elastic net (a combination of ridge and lasso constraints). Of these, only the lasso and the elastic net do some form of model selection, i.e. force the coefficients of some covariates to zero.



      See the Regularization and Shrinkage section of the Machine Learning task view on CRAN.
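
As a concrete starting point, here is a minimal ridge sketch with lm.ridge() (illustrative only: the lambda grid is arbitrary, and the airquality data are borrowed from the other answers):

library(MASS)
# Ridge regression over a grid of penalties; coefficients are shrunk
# toward zero, but no variable is dropped entirely
fit <- lm.ridge(Ozone ~ Temp + Wind + Solar.R, data = na.omit(airquality),
                lambda = seq(0, 50, by = 0.5))
select(fit)   # HKB, L-W and GCV suggestions for the penalty

For the lasso or the elastic net, which can zero out coefficients entirely, the glmnet package is a commonly used option.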






edited Feb 12 at 10:27 by zx8754
answered Sep 21 '10 at 17:54 by Gavin Simpson

              up vote
              7
              down vote













The rms package (Regression Modeling Strategies) has fastbw(), which does exactly what you need. There is even a parameter to switch from AIC-based to p-value-based elimination; see the worked example in the answer below.






edited Feb 12 at 10:28 by zx8754
answered Dec 2 '11 at 19:21 by Chris





                      up vote
                      0
                      down vote













As mentioned in Chris's answer above, the function fastbw from the rms package can be used to select variables using the p-value. Below is an example based on the model from George Dontas's answer; use the option rule = "p" to select the p-value criterion.



require(rms)
# Fit the full model with ols(), the rms analogue of lm()
model1 <- ols(Ozone ~ Temp * Wind * Solar.R, data = airquality)
# Backward elimination: rule = "p" drops terms by p-value;
# sls is the significance level a term needs to stay in the model
model2 <- fastbw(fit = model1, rule = "p", sls = 0.05)
model2





edited Nov 23 at 20:10
answered Nov 22 at 15:59 by fhernanb

• In the text, you meant the rms package. There is also an rsm package, so this could cause confusion.
  – rvl
  Nov 23 at 19:47

• Thanks rvl for the clarification. I fixed it.
  – fhernanb
  Nov 23 at 20:11
















