Modeling employee attrition and describing reasons why an employee might quit.
Exploratory Analysis
First, import the data and look at the structure of the variables. There are multiple factors coded as numerics, so I choose to change them all first, then call a summary on the data to get a high-level view of it.
data <- read.csv("/Users/vincentlee/Desktop/UGRID/UGRID2017/HR_analytics.csv")
str(data)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
data[,3] <- as.factor(data[,3])
data[,5] <- as.factor(data[,5])
data[,6] <- as.factor(data[,6])
data[,7] <- as.factor(data[,7])
data[,8] <- as.factor(data[,8])
str(data)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : Factor w/ 6 levels "2","3","4","5",..: 1 4 6 4 1 1 5 4 4 1 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : Factor w/ 8 levels "2","3","4","5",..: 2 5 3 4 2 2 3 4 4 2 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
summary(data)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 2:2388 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 3:4055 1st Qu.:156.0
## Median :0.6400 Median :0.7200 4:4365 Median :200.0
## Mean :0.6128 Mean :0.7161 5:2761 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 6:1174 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 7: 256 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## 3 :6443 0:12830 0:11428 0:14680
## 2 :3244 1: 2169 1: 3571 1: 319
## 4 :2557
## 5 :1473
## 6 : 718
## 10 : 214
## (Other): 350
## sales salary
## sales :4140 high :1237
## technical :2720 low :7316
## support :2229 medium:6446
## IT :1227
## product_mng: 902
## marketing : 858
## (Other) :2923
There are no NAs, and no extreme outliers that suggest false data. Let’s continue by splitting the data between those who left and those who stayed, then comparing the summaries to get a sense of what might differentiate the employees who quit.
library(dplyr)
left <- subset(data, left == 1)
stay <- subset(data, left == 0)
summary(left)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.4500 2:1567 Min. :126.0
## 1st Qu.:0.1300 1st Qu.:0.5200 3: 72 1st Qu.:146.0
## Median :0.4100 Median :0.7900 4: 409 Median :224.0
## Mean :0.4401 Mean :0.7181 5: 612 Mean :207.4
## 3rd Qu.:0.7300 3rd Qu.:0.9000 6: 655 3rd Qu.:262.0
## Max. :0.9200 Max. :1.0000 7: 256 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## 3 :1586 0:3402 0: 0 0:3552
## 4 : 890 1: 169 1:3571 1: 19
## 5 : 833
## 6 : 209
## 2 : 53
## 7 : 0
## (Other): 0
## sales salary
## sales :1014 high : 82
## technical : 697 low :2172
## support : 555 medium:1317
## IT : 273
## hr : 215
## accounting: 204
## (Other) : 613
summary(stay)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.1200 Min. :0.3600 2: 821 Min. : 96.0
## 1st Qu.:0.5400 1st Qu.:0.5800 3:3983 1st Qu.:162.0
## Median :0.6900 Median :0.7100 4:3956 Median :198.0
## Mean :0.6668 Mean :0.7155 5:2149 Mean :199.1
## 3rd Qu.:0.8400 3rd Qu.:0.8500 6: 519 3rd Qu.:238.0
## Max. :1.0000 Max. :1.0000 7: 0 Max. :287.0
##
## time_spend_company Work_accident left promotion_last_5years
## 3 :4857 0:9428 0:11428 0:11128
## 2 :3191 1:2000 1: 0 1: 300
## 4 :1667
## 5 : 640
## 6 : 509
## 10 : 214
## (Other): 350
## sales salary
## sales :3126 high :1155
## technical :2023 low :5144
## support :1674 medium:5129
## IT : 954
## product_mng: 704
## RandD : 666
## (Other) :2281
Some observations:
Those who left had higher monthly hours worked.
Very low level of promotions for those who left. Most high paid employees did not leave.
Those who left had a higher amount of work accidents and ~20% lower satisfaction.
High projects indicates more likely to leave - no one with 7 projects stayed. Since left
had higher monthly hours worked, I made a plot to investigate the effect these hours have on the employee’s satisfaction level among those who left, also taking into account number of projects.
library(ggplot2)
ggplot(left, aes(x = average_montly_hours, y = satisfaction_level)) +
geom_jitter (aes(color = number_project)) +
labs(title = "Employee Satisfaction vs. Hours Worked", x = "Hours per Month", y = "Satisfaction Level")
This plot indicates there are three main groups among the employees who quit: those who worked few hours per month and were moderately dissatisfied with their jobs, those who worked normal hours and were moderately-to-very satisfied with their jobs, and those who worked heavy hours who were very dissatisfied with their jobs. Don’t focus on the group that quit while they were satisfied - it is likely due to outside factors like spouse moving or outside promotion. Instead, make conditions better for unsatisfied workers, especially those with many projects and working more than 8 hours a day. Another potential relationship I wanted to explore was between an employee’s satisfaction level and their last evaluation score based on those who left.
ggplot(left, aes(x = last_evaluation, y = satisfaction_level)) + geom_jitter(aes(col = promotion_last_5years)) + labs(title = "Employee Satisfaction from Last Evaluation", x = "Last Evaluation Score", y = "Satisfaction Level")
Interestingly, the same clusters emerge: this plot indicates that the same group who works fewer hours are the ones who score low on their evaluation; on the other hand, the ones who work normal-to-heavy schedules all recieved mostly positive evaluations, but are still strongly split between enjoying and hating their jobs. The company needs to avoid assigning people who are already working 9-10 hours a day with more projects in order to keep stress down and satisfaction up. The exceedingly low number of promotions among those who left is also clear in this plot, and perhaps more upward mobility would benefit the company’s employee retention rate.
Modeling
Finally, I wanted to create a model from the data to predict an employee’s satisfaction level. I decided to use linear regression for this task, but I think using logistic regression to predict the variable left could also work well. I started off with a model including all variables, then used stepwise elimination to determine which variables were the most significant towards predicting satisfaction.
modleft <- left[,-7]
mod <- lm(satisfaction_level ~., data = modleft)
step(mod)
mod <- lm(satisfaction_level ~ last_evaluation + number_project +
average_montly_hours + time_spend_company,
data = left)
summary(mod)
##
## Call:
## lm(formula = satisfaction_level ~ last_evaluation + number_project +
## average_montly_hours + time_spend_company, data = left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70806 -0.02640 0.00396 0.03314 0.70992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.713e-01 2.453e-02 15.132 < 2e-16 ***
## last_evaluation 2.600e-01 2.287e-02 11.369 < 2e-16 ***
## number_project3 1.020e-01 1.634e-02 6.242 4.82e-10 ***
## number_project4 1.382e-01 1.314e-02 10.519 < 2e-16 ***
## number_project5 1.273e-01 1.323e-02 9.626 < 2e-16 ***
## number_project6 -2.906e-01 1.387e-02 -20.948 < 2e-16 ***
## number_project7 -3.048e-01 1.482e-02 -20.558 < 2e-16 ***
## average_montly_hours -3.932e-04 7.316e-05 -5.374 8.18e-08 ***
## time_spend_company3 -3.717e-02 1.782e-02 -2.086 0.0371 *
## time_spend_company4 -9.925e-02 1.625e-02 -6.109 1.11e-09 ***
## time_spend_company5 1.327e-01 1.583e-02 8.386 < 2e-16 ***
## time_spend_company6 1.611e-01 1.713e-02 9.401 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1071 on 3559 degrees of freedom
## Multiple R-squared: 0.8358, Adjusted R-squared: 0.8353
## F-statistic: 1646 on 11 and 3559 DF, p-value: < 2.2e-16
anova(mod)
## Analysis of Variance Table
##
## Response: satisfaction_level
## Df Sum Sq Mean Sq F value Pr(>F)
## last_evaluation 1 8.264 8.264 720.122 < 2.2e-16 ***
## number_project 5 187.598 37.520 3269.331 < 2.2e-16 ***
## average_montly_hours 1 0.584 0.584 50.881 1.184e-12 ***
## time_spend_company 4 11.399 2.850 248.314 < 2.2e-16 ***
## Residuals 3559 40.844 0.011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model’s parameters confirm the suspicion that higher number of projects and monthly hours decrease employee satisfaction, and also imply that scoring higher on the last evaluation and loner time spent indicate higher satisfaction, which makes sense. The ANOVA indicates that number of projects and time spent have the highest impact on an employee’s happiness. The model also has a fairly high adjusted R-squared of 0.835, which is much higher than the values when the same model is trained on the stayed (or general) dataset. For some basic diagnostics using ggplot:
library(ggfortify)
autoplot(mod)
The residuals vs. fitted plot shows that the residuals follow a roughly linear pattern. The Q-Q plot shows the residuals are normally-distributed towards the center, but taper off at the beginning and end. This could potentially be addressed with the use of a transformation term. The scale-location plot shows the model’s lack of equal variance, mostly caused by the three large clusters. The residuals vs. leverage plot shows that no outliers especially affected the model. We can do some basic model fitting to see how well it works to predict the training set.
actual_scores <- left[,1]
pred <- predict(mod, left[,-1], interval="predict")
mse.train <- mean((pred - actual_scores)^2)
mse.train
## [1] 0.04094686
A MSE of only ~0.04% shows that the model can indeed predict the values it has been trained on well, but how about a more general set of data? We can’t use the exact same model because of some missing factor variables in the left
subset, but if we retrain the model on all of the data and try another test…
mod <- lm(satisfaction_level ~ last_evaluation + number_project +
average_montly_hours + time_spend_company,
data = data)
actual_scores <- data[,1]
pred <- predict(mod, data[,-1], interval="predict")
mse.test <- mean((pred - actual_scores)^2)
mse.test
## [1] 0.1426782
It’s clear the model doesn’t work as well on the more generalized set of data, with a .10 higher MSE.
Conclusions and Next Steps
ggplot(data, aes(x = time_spend_company, y = average_montly_hours, col = left)) + geom_jitter() + labs(title = "Time Spent at Company vs. Average Hours Worked", x = "Years at company", y = "Monthly hours worked")
The main cause of employee churn seems to be caused by too many projects being assigned to those already working more than the average amount per week. Spread these out among those who work the least and have the least amount of projects. Try to focus on those who have been working for 3-5 years since the data indicates that these groups are the most likely to quit and the longer an employee stays, the more satisfied and less likely to quit they are.
More advanced modeling could be done to try to predict employee satisfaction or churn. K-means clustering and random forest algorithms could be used to attempt to describe the data. More analysis could also be done into how salary and department may affect churn. Outside variables like how long employees stayed before leaving and more info about their personal lives could help us make better models and more interesting associations among variables.