In this post, we go over some general tips for approaching the analysis of a new dataset. They can be applied to the HR Analytics dataset hosted on Kaggle.com.

Initial Steps

  • Load relevant libraries.
  • Set working directory for project.
  • Read in all relevant data.
  • Rename variables for ease of use (optional).
  • Check structure of data and change data types of variables which require it.
  • Look for out-of-range observations or values that don’t make sense.
  • Hypothesize potential interactions between response variable and explanatory variables, as well as between explanatory variables themselves.
  • Employ feature engineering based on existing features (consider all features, not just “good” ones).
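
As a rough illustration of the initial steps above, here is a minimal pandas sketch. The column names and values are made up for the example and are not the actual HR Analytics schema; in practice the data frame would come from something like `pd.read_csv(...)` on the downloaded file.

```python
import pandas as pd

# Hypothetical toy data standing in for a file read with pd.read_csv(...).
raw = pd.DataFrame({
    "Emp Satisfaction": [0.38, 0.80, 1.70, 0.72],  # 1.70 is invalid on a 0-1 scale
    "Monthly Hours": [157, 262, 272, 223],
    "Num Projects": [2, 5, 7, 5],
    "Dept": ["sales", "sales", "hr", "it"],
    "Left": ["1", "0", "1", "0"],                  # response stored as text
})

# Rename variables for ease of use: lower case, underscores instead of spaces.
df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Check the structure, then change the data types that need it.
print(df.dtypes)
df["left"] = df["left"].astype(int)
df["dept"] = df["dept"].astype("category")

# Look for out-of-range observations (assumes satisfaction is scored from 0 to 1).
out_of_range = df[~df["emp_satisfaction"].between(0, 1)]
print(out_of_range)

# Feature engineering from existing features: workload per project.
df["hours_per_project"] = df["monthly_hours"] / df["num_projects"]
```

Whether a flagged row should be corrected, dropped, or kept depends on what the value was meant to represent, so the sketch only surfaces it.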

EDA

  • Plot the distribution of the response variable (if it is continuous, consider transformations such as absolute value or log).
  • Plot the distribution of missing values; keep only “good” features that fall below a chosen missingness threshold, e.g. 75% missing.
  • Plot the correlation of “good” features with the response variable.
  • Plot how the response variable changes based on “good” features.
  • Consider imputation of missing values based on the median or mode of other observations if there are enough non-missing values.
  • Examine outliers.
  • If given time as a variable, plot how the response variable changes over time to determine seasonal vs. general trends.
  • If given locational data, make geographical plots to see how response variable changes by location.
  • Bin data into distinct groups to compare trends at a higher level (e.g. highest, lowest, and the 50% around the median for a given metric).
  • Use clustering techniques like k-means to look for natural groupings in the data.
  • Consider principal component analysis on numeric datasets to reduce dimensionality to the few components that explain most of the variance.
  • If the dataset contains non-numeric (categorical) variables, one-hot encode them and normalize the numeric features before using algorithms like k-means, KNN, PCA, etc.
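
Several of the EDA steps above can be sketched in a few lines of pandas and scikit-learn. This is only one way to do it, run on synthetic data with hypothetical column names; the 75% threshold, the choice of three clusters, and the two principal components are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "satisfaction": rng.uniform(0, 1, n),
    "monthly_hours": rng.normal(200, 40, n),
    "dept": np.resize(["sales", "hr", "it"], n),
    "mostly_missing": np.nan,
})
df["left"] = (df["satisfaction"] < 0.4).astype(int)   # toy response variable
df.loc[:19, "mostly_missing"] = 1.0                    # 90% of this column is missing
df.loc[rng.choice(n, 20, replace=False), "monthly_hours"] = np.nan  # 10% missing

# Fraction of missing values per feature; keep features below the threshold.
missing_frac = df.isna().mean()
good = df[missing_frac[missing_frac < 0.75].index].copy()

# Impute the remaining gaps with the median of the observed values.
good["monthly_hours"] = good["monthly_hours"].fillna(good["monthly_hours"].median())

# Correlation of each numeric feature with the response.
corr = good.select_dtypes("number").corr()["left"].drop("left")

# Bin a metric into lowest 25%, middle 50%, highest 25% and compare the response.
bins = pd.qcut(good["satisfaction"], q=[0, 0.25, 0.75, 1.0],
               labels=["lowest", "middle", "highest"])
left_rate_by_bin = good.groupby(bins, observed=True)["left"].mean()

# One-hot encode the categorical feature, then standardize everything.
features = pd.get_dummies(good.drop(columns="left"), columns=["dept"], dtype=float)
scaled = StandardScaler().fit_transform(features)

# k-means to look for natural groupings; PCA to reduce to two components.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(corr, left_rate_by_bin, pca.explained_variance_ratio_, sep="\n")
```

Standardizing before k-means and PCA matters because both are sensitive to the scale of each column; without it, a feature measured in hours would dominate one measured on a 0 to 1 scale.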

Modeling

  • Linear/logistic regression are good baselines for continuous and binary problems respectively.
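    • Major assumptions of linear regression: the relationship between the covariates and the response is linear; the errors are independent and have constant variance (homoscedasticity); the covariates are not strongly collinear. Interactions between covariates are not captured unless interaction terms are added explicitly.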
  • Consider random forest or KNN for multinomial classification problems.
  • Use feature importance in models to inform how you view the data.
  • Consider XGBoost or other ensemble models for tasks involving lots of data.
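
To make the modeling tips concrete, the sketch below fits a logistic regression baseline and a random forest on a synthetic binary problem, then reads off the forest's feature importances. The data comes from scikit-learn's `make_classification`, and the hyperparameters are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for the example).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Baseline: logistic regression for a binary response.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Comparison model: random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)

# Feature importances, ranked from most to least important.
importances = forest.feature_importances_
ranked = importances.argsort()[::-1]

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"forest accuracy:   {forest_acc:.3f}")
for idx in ranked:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

Comparing a more flexible model against the baseline on the same held-out split shows how much the extra complexity actually buys, and the importance ranking can point back to features worth a closer look during EDA.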