What type of regression model is used when the response variable is categorical and takes on one of the two possible values?
This chapter describes how to compute regression with categorical variables. Show
Categorical variables (also known as factor or qualitative variables) are variables that classify observations into groups. They have a limited number of different values, called levels. For example the gender of individuals are a categorical variable that can take two levels: Male or Female. Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable. In these steps, the categorical variables are recoded into a set of separate binary variables. This recoding is called “dummy coding” and leads to the creation of a table called contrast matrix. This is done automatically by statistical software, such as R. Here, you’ll learn how to build and interpret a linear regression model with categorical predictor variables. We’ll also provide practical examples in R. Contents:
Loading Required R packages
Example of data setWe’ll use the The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.
Categorical variables with two levelsRecall that, the regression equation, for predicting an outcome variable (y) on the basis of a predictor variable (x), can be simply written as Suppose that, we wish to investigate differences in salaries between males and females. Based on the gender variable, we can create a new dummy variable that takes the value:
and use this variable as a predictor in the regression equation, leading to the following the model:
The coefficients can be interpreted as follow:
For simple demonstration purpose, the following example models the salary difference between males and females by computing a simple linear regression model on the
From the output above, the average salary for female is
estimated to be 101002, whereas males are estimated a total of 101002 + 14088 = 115090. The p-value for the dummy variable The
R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise. The decision to code males as 1 and females as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients. You can use the function
The output of the regression fit becomes:
The fact that the coefficient for Now the
estimates for Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable -1 (male) / 1 (female) . This results in the model:
So, if the categorical variable is coded as -1 and 1, then if the regression coefficient is positive, it is subtracted from the group coded as -1 and added to the group coded as 1. If the regression coefficient is negative, then addition and subtraction is reversed. Categorical variables with more than two levelsGenerally, a categorical variable with n levels will be transformed into n-1 variables each with two levels. These n-1 new variables contain the same information than the single variable. This recoding creates a table called contrast matrix. For example
This dummy coding is automatically performed by R. For demonstration purpose, you can use the function
When building linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and interpret the remaining levels relative to this level. Note that, ANOVA (analyse of variance) is just a special case of linear model where the predictors are categorical variables. And, because R understands the fact that ANOVA and regression are both examples of linear models, it lets you extract the classic ANOVA table from your regression model using the R base The results of predicting salary from using a multiple regression procedure are presented below.
Taking other variables (yrs.service, rank and discipline) into account, it can be seen that the categorical variable sex is no longer significantly associated with the variation in salary between individuals. Significant variables are rank and discipline. If you want to interpret the contrasts of the categorical variable, type this:
For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments). DiscussionIn this chapter we described how categorical variables are included in linear regression model. As regression requires numerical inputs, categorical variables need to be recoded into a set of binary variables. We provide practical examples for the situations where you have categorical variables containing two or more levels. Note that, for categorical variables with a large number of levels it might be useful to group together some of the levels. Some categorical variables have levels that are ordered. They can be converted to numerical values and used as is. For example, if the professor grades (“AsstProf”, “AssocProf” and “Prof”) have a special meaning, you can convert them into numerical values, ordered from low to high, corresponding to higher-grade professors. Which regression method is used for predicting a categorical response variable?“Logistic regression and multinomial regression models are specifically designed for analysing binary and categorical response variables.” When the response variable is binary or categorical a standard linear regression model can't be used, but we can use logistic regression models instead.
Which regression is best for categorical data?LOGISTIC REGRESSION MODEL
It is highly recommended to start from this model setting before more sophisticated categorical modeling is carried out. Dependent variable yi can only take two possible outcomes.
What is used in a regression model to represent categorical variables?Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable. In these steps, the categorical variables are recoded into a set of separate binary variables.
Can regression Modelling be used for categorical data?Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot by entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
|