In this post, we go over some general tips for approaching a new dataset, which we can then apply to the HR Analytics dataset hosted on Kaggle.
Initial Steps
- Load relevant libraries.
- Set working directory for project.
- Read in all relevant data.
- Rename variables for ease of use (optional).
- Check structure of data and change data types of variables which require it.
- Look for out-of-range observations or values that don’t make sense.
- Hypothesize potential relationships between the response variable and the explanatory variables, as well as interactions among the explanatory variables themselves.
- Employ feature engineering based on existing features (consider all features, not just “good” ones).
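As a rough illustration of the initial steps above, here is a minimal pandas sketch. The column names and values are invented for the example and are not taken from the HR Analytics dataset; the small inline DataFrame stands in for whatever you would load with `pd.read_csv`. The assumption that the satisfaction score should lie in [0, 1] is also only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv(...); the column names and
# values below are invented for illustration only.
raw = pd.DataFrame({
    "Satisfaction Level": [0.38, 0.80, 1.40, 0.72],
    "Monthly Hours": [157, 262, 272, 223],
    "Projects": [2, 5, 7, 5],
    "Dept": ["sales", "hr", "sales", "it"],
    "Left": [1, 0, 1, 0],
})

# Rename variables for ease of use.
df = raw.rename(columns={
    "Satisfaction Level": "satisfaction",
    "Monthly Hours": "monthly_hours",
    "Projects": "projects",
    "Dept": "dept",
    "Left": "left",
})

# Check the structure, then change the data types that need it.
print(df.dtypes)
df["dept"] = df["dept"].astype("category")

# Look for out-of-range values. Assumption: satisfaction lies in [0, 1].
out_of_range = df[~df["satisfaction"].between(0, 1)]
print(out_of_range)

# A simple engineered feature built from existing ones.
df["hours_per_project"] = df["monthly_hours"] / df["projects"]
```

Whether to drop, cap, or correct the flagged rows depends on what the variable means, so the sketch only surfaces them.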
EDA
- Plot distribution of response variable (consider transformations like absolute value, log, etc. if continuous).
- Plot the share of missing values per feature; keep only “good” features whose share falls below a chosen threshold, e.g. 75% missing.
- Plot correlation of good features with response variable.
- Plot how the response variable changes based on “good” features.
- Consider imputation of missing values based on the median or mode of other observations if there are enough non-missing values.
- Examine outliers.
- If given time as a variable, plot how the response variable changes over time to determine seasonal vs. general trends.
- If given locational data, make geographical plots to see how response variable changes by location.
- Bin data into distinct groups to compare trends at a higher level (e.g. highest, lowest, and the 50% around the median based on a given metric).
- Use clustering techniques like k-means to look for natural groupings in the data.
- Consider principal component analysis on numeric datasets to reduce the features to a smaller set of components that capture most of the variance.
- If the dataset is not in numeric form, one-hot encode categorical features and normalize numeric ones before using algorithms like k-means, KNN, PCA, etc.
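Several of the EDA steps above (missing-value filtering, median imputation, one-hot encoding, normalization, PCA, and k-means) can be chained together. The following is a minimal sketch on synthetic data, assuming pandas and scikit-learn are available. The column names, the 75% threshold, the 90% variance target, and k=3 are illustrative choices rather than recommendations, and plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "satisfaction": rng.uniform(0, 1, n),
    "monthly_hours": rng.normal(200, 40, n),
    "dept": rng.choice(["sales", "hr", "it"], n),
    "mostly_missing": np.nan,
})
df.loc[:9, "mostly_missing"] = 1.0  # only 10 of 200 values present
df.loc[rng.choice(n, 20, replace=False), "monthly_hours"] = np.nan

# Share of missing values per feature; keep features below the threshold.
missing_share = df.isna().mean()
keep = missing_share[missing_share < 0.75].index
good = df[keep].copy()

# Median imputation for a numeric feature with enough observed values.
good["monthly_hours"] = good["monthly_hours"].fillna(good["monthly_hours"].median())

# One-hot encode the categorical feature, then standardize every column
# so that no single feature dominates distance-based methods.
encoded = pd.get_dummies(good, columns=["dept"], dtype=float)
X = StandardScaler().fit_transform(encoded)

# PCA: keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.90)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# k-means with an arbitrary k=3; in practice, compare several values of k.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(labels).value_counts())
```

Note that PCA returns new components that are linear combinations of the inputs, not a subset of the original variables, so the loadings in `pca.components_` are what tie the result back to individual features.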
Modeling
- Linear/logistic regression are good baselines for continuous and binary problems respectively.
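- Major Assumptions of Linear Regression: The relationship between the covariates and response is linear. The errors are independent and have constant variance (homoscedasticity). The covariate effects are additive unless interaction terms are included explicitly, and the covariates are not highly collinear.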
- Consider random forest or KNN for multiclass classification problems.
- Use feature importance in models to inform how you view the data.
- Consider XGBoost or other ensemble models for tasks involving lots of data.
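To make the modeling tips concrete, here is a small sketch that fits a logistic regression baseline and a random forest on a synthetic binary problem, then reads off the forest's feature importances. It assumes scikit-learn is available. The data and feature names are made up, and the label is constructed to depend mainly on the first feature, so the numbers say nothing about the HR Analytics dataset itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data; feature names are hypothetical.
rng = np.random.default_rng(0)
n = 500
feature_names = ["satisfaction", "monthly_hours", "noise"]
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.normal(200, 40, n),
    rng.normal(0, 1, n),
])
# By construction, the label depends mainly on the first feature.
y = (X[:, 0] + rng.normal(0, 0.1, n) < 0.4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: logistic regression on standardized features.
baseline = make_pipeline(StandardScaler(), LogisticRegression())
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Comparison model: a random forest, which also exposes feature importances.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)

importances = dict(zip(feature_names, forest.feature_importances_))
print(f"logistic regression accuracy: {baseline_acc:.2f}")
print(f"random forest accuracy: {forest_acc:.2f}")
print(importances)
```

Comparing the two held-out scores shows whether the more flexible model actually earns its complexity over the baseline, and the importances suggest which features deserve a closer look during EDA.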