Use of unconventional sources to predict employee engagement and attrition

Rudy Nausch

Oct 13, 2022 · 5 min read

Updated: May 20, 2023

This was done as part of a master’s degree assignment.

Aim of this Proof of Concept Proposal

Employee attrition is an ongoing and predictable cost that businesses shoulder, which significantly impacts any enterprise financially, culturally, and capability-wise.

By aggregating employees’ public and company-owned data sources, we will create a machine learning (ML) model that estimates the probability of voluntary attrition, so that HR practitioners can pre-emptively approach and assist employees before they leave.

For this proof of concept (PoC), we have used a small sample of 1450 employees’ de-identified records, which has been synthetically and deterministically linked with public information from their LinkedIn accounts and with output from internal communication sentiment analysis software.

Based on this PoC and report, we seek approval to repeat and expand this process with the remaining employee data of this organisation for further analysis and implementation.

Motivation & Feasibility

The costs of employee attrition are high, varying from 30% to 150% of an employee’s salary (Harrison, 2022), due to the following factors:

Productivity loss due to other employees covering a vacant position
In-house hiring costs
Training and induction costs
Administration costs of termination
Loss of productivity in the beginning stages of employment
Loss of productivity during the last stages of employment

The more technical and senior a role, the higher the cost.

Average attrition globally is ~18% (Stowers, 2022). Following Covid, Gartner (2022) projected that total annual employee turnover in the U.S. would jump by nearly 20% over the pre-pandemic annual average.

This is divided into involuntary attrition (~6%), due to poor performance or ill health, and voluntary attrition, where the employee leaves to seek other employment, to retire, or for any other reason.

This puts the average annual cost of normal attrition to a company at ~13.5% of labour costs, and labour is one of the most significant costs for most enterprises.
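The text does not show how the ~13.5% figure was derived. One possible reading, offered here as an assumption only, is the attrition rate multiplied by the replacement cost expressed as a fraction of salary. The sketch below uses the ~18% rate quoted above and a hypothetical replacement cost of 75% of salary, a value picked from inside the 30% to 150% range and not taken from any cited source. It also assumes leavers earn the average salary.

```python
def attrition_cost_share(attrition_rate: float, replacement_cost_ratio: float) -> float:
    """Annual attrition cost as a fraction of total labour cost.

    attrition_rate: fraction of employees leaving per year (e.g. 0.18).
    replacement_cost_ratio: cost of replacing one employee as a fraction
        of that employee's salary (the text cites a 0.30 to 1.50 range).
    """
    return attrition_rate * replacement_cost_ratio


# 0.75 is an assumed ratio, not a figure from the cited sources.
share = attrition_cost_share(0.18, 0.75)
low = attrition_cost_share(0.18, 0.30)
high = attrition_cost_share(0.18, 1.50)
print(f"assumed midpoint: {share:.1%}, range: {low:.1%} to {high:.1%}")
```

Under these assumptions the estimate is very sensitive to the replacement cost ratio, which is one reason the more technical and senior roles weigh so heavily.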

The most widely accepted baseline for a healthy organisation is attrition below ~10% per annum.

It is challenging to get useful information after voluntary attrition: in most cases not all employees have exit interviews or are offboarded effectively, and therefore the available data is relatively limited.

Data Model & Methodology

Step 1: Collection of data sources

Employee database from HR
Internal communications sentiment analysis
LinkedIn data collected via a web scraper
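As a rough illustration of how three such sources might be deterministically linked, the sketch below joins rows that share exactly the same de-identified key. The salted hash, the salt value and every field name (`tenure_years`, `sentiment_score`, `linkedin_connections`, `left`) are hypothetical and are not taken from the PoC.

```python
import hashlib


def link_key(employee_id: str, salt: str = "poc-salt") -> str:
    """De-identified join key: a salted SHA-256 hash of the employee ID."""
    return hashlib.sha256((salt + employee_id).encode("utf-8")).hexdigest()


def link_sources(hr_rows, sentiment_rows, linkedin_rows):
    """Deterministic linkage: merge rows that share exactly the same key.

    HR rows drive the join. An employee missing from a secondary source
    keeps None for that source's field.
    """
    sentiment = {row["key"]: row for row in sentiment_rows}
    linkedin = {row["key"]: row for row in linkedin_rows}
    linked = []
    for hr in hr_rows:
        key = hr["key"]
        merged = dict(hr)
        merged["sentiment_score"] = sentiment.get(key, {}).get("sentiment_score")
        merged["linkedin_connections"] = linkedin.get(key, {}).get("linkedin_connections")
        linked.append(merged)
    return linked


# Toy rows, invented for illustration.
hr = [
    {"key": link_key("E001"), "tenure_years": 4, "left": 0},
    {"key": link_key("E002"), "tenure_years": 1, "left": 1},
]
sentiment = [{"key": link_key("E001"), "sentiment_score": 0.6}]
linkedin = [
    {"key": link_key("E001"), "linkedin_connections": 320},
    {"key": link_key("E002"), "linkedin_connections": 510},
]

linked = link_sources(hr, sentiment, linkedin)
```

Keeping the HR table as the driving side means every employee appears once in the linked set, and gaps in the secondary sources surface as missing values for the cleaning step to handle.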

Step 2: Data cleaning to prepare and validate that the data is accurate, complete, timely and interpretable

Filled in missing values
Smoothed noisy data
Identified and removed outliers
Resolved inconsistencies
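Two of these cleaning actions can be sketched as follows: filling missing values with the median, and removing outliers that fall outside interquartile-range (IQR) fences. The median and the 1.5 × IQR rule are common defaults chosen for illustration. The PoC does not state which techniques were actually used, and the `monthly_income` column is hypothetical.

```python
from statistics import median, quantiles


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Toy column with one missing value and one implausible entry.
monthly_income = [4200, 3900, None, 4100, 4300, 4000, 95000, 4150]
filled = fill_missing(monthly_income)
cleaned = remove_outliers(filled)
```

The median is used for filling because, unlike the mean, it is not dragged upwards by the same extreme values that the outlier step later removes.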

Step 3: Exploratory Data Analysis (EDA) is performed to summarise the dataset’s main characteristics, investigate any correlations in the dataset and formulate hypotheses.
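A minimal sketch of the correlation part of this step is shown below: it computes the Pearson correlation of each feature with a binary attrition flag and ranks the features by magnitude. The feature names and values are toy data invented for illustration, so the resulting coefficients say nothing about the PoC dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Sort (name, r) pairs by absolute correlation with the target, descending."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data: 1 means the employee left, 0 means the employee stayed.
attrition = [1, 0, 0, 1, 0, 1, 0, 0]
features = {
    "tenure_years": [1, 6, 4, 2, 7, 1, 5, 3],
    "sentiment_score": [0.2, 0.7, 0.6, 0.3, 0.8, 0.1, 0.5, 0.6],
    "commute_km": [12, 8, 15, 9, 11, 14, 10, 13],
}
ranked = rank_by_correlation(features, attrition)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

Ranking by absolute value matters because a strongly negative correlation (for example, longer tenure going with lower attrition) is as informative as a strongly positive one.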

Step 4: Feature engineering is then done to remove any redundant or irrelevant correlations, using both common and domain knowledge to produce a curated dataset. In this step we remove all features that will not add value to the learning model. Some features which have no statistical correlation are kept, as they may hypothetically become relevant in the future.

Step 5: Data splitting into a training set for fitting the model, and a held-out testing set to check the Machine Learning predictions against.
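A minimal sketch of such a split is shown below, using the 60% training and 40% testing proportions reported in the Evaluation section. The function name, the fixed seed and the integer stand-ins for employee rows are assumptions made for illustration.

```python
import random


def split_records(records, train_fraction=0.6, seed=42):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(1450))  # stand-ins for the 1450 employee rows
train, test = split_records(records)
print(len(train), len(test))  # 870 580
```

Shuffling before cutting avoids any ordering in the source table (for example, by hire date) leaking into the split, and the fixed seed keeps the split reproducible.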

Step 6: Training and testing Machine Learning models to predict employee attrition.

Correlation Matrix

During the Exploratory Data Analysis (EDA), several features were found to have low correlations (Jaadi, 2021) with attrition. In descending order of magnitude of correlation these were:

Rationale of ML approach

In research on a similar approach that compared several learning models, namely Extra Trees Classifier (ETC), Support Vector Machine (SVM), Logistic Regression (LR), and Decision Tree Classifier (DTC), the ETC and DTC models scored the highest (Raza, 2022).

Based on the EDA, several Machine Learning models were applicable; however, the data has a mix of categorical and integer features, which counts against numeric-only methods such as Perceptrons and Artificial Neural Networks (ANNs).

ID3 Decision Trees support multiway (non-binary) splits on categorical features, unlike Classification and Regression Trees (CART), which make binary splits. ID3 chooses each split by information gain, calculated from entropy, whereas CART typically uses Gini impurity. Based on the above we have used a Decision Tree for this PoC, and we would also add Extra Trees and ensemble methods such as Random Forest as possible avenues for the future, as these have been used to good effect in similar research (Srivastava, 2021).

Another motivating factor in selecting the ID3 (Iterative Dichotomiser 3) Decision Tree is the saliency and auditability of the process and results: we can clearly see how, and why, the probability is calculated.
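To illustrate why such a tree is auditable, the sketch below shows the entropy and information gain calculation that ID3 uses to pick a split, applied to a multiway split on a categorical feature. The rows and the feature names (`overtime`, `department`, `left`) are invented for illustration and are not the PoC's features.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy reduction from a multiway split on one categorical feature."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


def best_split(rows, features, target):
    """ID3 picks the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


rows = [
    {"overtime": "yes", "department": "sales", "left": "yes"},
    {"overtime": "yes", "department": "tech", "left": "yes"},
    {"overtime": "no", "department": "sales", "left": "no"},
    {"overtime": "no", "department": "tech", "left": "no"},
    {"overtime": "no", "department": "hr", "left": "no"},
    {"overtime": "yes", "department": "hr", "left": "no"},
]
chosen = best_split(rows, ["overtime", "department"], "left")
print(chosen)
```

Every branch in the finished tree corresponds to one such calculation, so a reviewer can recompute the gain for any node and confirm why that feature was chosen.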

Evaluation

The Decision Tree model achieved an accuracy score of 97.03%.

This was done with a 60% training and 40% testing split. We would seek to validate this result more intensively, as such a high accuracy may be an outlier and may be due to over-fitting on such a relatively small sample set (1450 records).
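One common way to probe a result like this is k-fold cross-validation, paired with a comparison against a majority-class baseline. The sketch below is generic and not part of the PoC: the baseline model always predicts the most frequent label, and the toy labels assume roughly one leaver in six, a rate chosen purely for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


def cross_validate(records, labels, fit, predict, k=5):
    """Return the mean accuracy and the per-fold accuracies over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, records[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k, scores


def fit_majority(records, labels):
    """Baseline 'model': remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, record):
    """Baseline prediction: always the remembered label."""
    return model


records = list(range(1450))
labels = [1 if i % 6 == 0 else 0 for i in records]  # assumed rate, about 1 in 6
mean_accuracy, fold_scores = cross_validate(records, labels, fit_majority, predict_majority)
print(f"baseline mean accuracy: {mean_accuracy:.2%}")
```

If leavers are a small minority, a model that never predicts attrition already scores well on accuracy, so any candidate model should be judged against this baseline and across all folds rather than on a single split.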

For the PoC, this level of detail and analysis proved time- and cost-effective. However, the final implementation of this model may require alternative comparisons (“Plan Bs”), and we would suggest the use of ensemble algorithms. Ensemble methods combine multiple models and tend to give better predictive performance.

Random Forest classifiers – an ensemble of Decision Trees, which can correct for over-fitting.
Bootstrap aggregating (“Bagging”) – a meta-algorithm which improves the stability and accuracy of models.
Boosting algorithms – which reduce bias in classification.
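As a small illustration of bootstrap aggregating, the sketch below fits several weak learners on bootstrap resamples and combines them by majority vote. To stay short it uses one-feature decision stumps as the base learner instead of full Decision Trees, and the `tenure` and `left` values are toy data, so this is an assumption-laden sketch and not the method proposed for the organisation's data.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """One-feature decision stump: the threshold and side with the fewest training errors."""
    best = None
    for threshold in sorted(set(xs)):
        for label_below in (0, 1):
            errors = sum(
                (label_below if x <= threshold else 1 - label_below) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_below)
    return best[1], best[2]


def predict_stump(stump, x):
    """Predict label_below at or under the threshold, the other label above it."""
    threshold, label_below = stump
    return label_below if x <= threshold else 1 - label_below


def fit_bagged(xs, ys, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return models


def predict_bagged(models, x):
    """Majority vote across the ensemble."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: years of tenure and whether the employee left (1) or stayed (0).
tenure = [1, 1, 2, 2, 3, 5, 6, 7, 8, 9]
left = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
ensemble = fit_bagged(tenure, left)
print(predict_bagged(ensemble, 1), predict_bagged(ensemble, 9))
```

Because each stump sees a slightly different resample, an individual stump's quirks tend to be outvoted, which is the stabilising effect that makes bagging and Random Forest useful checks on a single tree's result.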

The final Decision Tree

Conclusion

This proposal outlines the aim of using machine learning models to predict attrition in an organisation. The motivation is the considerable cost and effort of voluntary attrition, which can be ~13.5% of an organisation’s labour cost, when averaged.

The data model used is based on internal Human Resource data, and a deterministically linked data set from both public social media and internal company sentiment analysis.

The model was chosen based on the referenced research, on the requirement to accept both categorical and integer data as input when estimating the probability of an employee’s voluntary attrition, and on saliency, which rules out any “black-box” models for these kinds of predictions.

The evaluation shows that the Decision Tree model is effective in attaining the required goal. However, due to the extraordinarily high accuracy, we propose several ensemble methods to validate the findings before these models are used in a formal business context.

References

Srivastava, & Eachempati, P. (2021). Intelligent Employee Retention System for Attrition Rate Analysis and Churn Prediction: An Ensemble Machine Learning and Multi-Criteria Decision-Making Approach. Journal of Global Information Management, 29(6), 1–29. https://doi.org/10.4018/JGIM.20211101.oa23

BasuMallick, C. (2021, December 16). 7 Sentiment Analysis Tools to Improve Employee Engagement in 2020. Spiceworks. Retrieved October 1, 2022, from https://www.spiceworks.com/hr/engagement-retention/articles/sentiment-analytics-tools-features-price/

Norsuhada, M., & Sani, S. (2021). Machine Learning for Predicting Employee Attrition. The Science and Information Organisation. Retrieved October 1, 2022, from https://thesai.org/Downloads/Volume12No11/Paper_49-Machine_Learning_for_Predicting_Employee_Attrition.pdf

Predicting Employee Turnover. (2021, February 11). Fast Data Science. Retrieved October 1, 2022, from https://fastdatascience.com/predicting-employee-turnover/

Zhao, Y., Hryniewicki, M., Cheng, F., Fu, B., & Zhu, X. (2019). Employee Turnover Prediction with Machine Learning: A Reliable Approach. https://fastdatascience.com/. Retrieved September 15, 2022, from https://www.andrew.cmu.edu/user/yuezhao2/papers/18-intellisys-employee.pdf

Maswadi, Ghani, N. A., Hamid, S., & Rasheed, M. B. (2021). Human activity classification using Decision Tree and Naïve Bayes classifiers. Multimedia Tools and Applications, 80(14), 21709–21726. https://doi.org/10.1007/s11042-020-10447-x

Gartner Says U.S. Total Annual Employee Turnover Will Likely Jump by N. (n.d.). Gartner. Retrieved October 2, 2022, from https://www.gartner.com/en/newsroom/04-28-2022-gartner-says-us-total-annual-employee-turnover-will-likely-jump-by-nearly-twenty-percent-from-the-prepandemic-annual-average

Harrison, C. (2022, February 3). How to calculate your employee turnover costs. Harrison Human Resources Brisbane. Retrieved October 2, 2022, from https://hhr.com.au/costs-of-employee-turnover/

Jaadi, Z. (2021, December 12). Everything you need to know about interpreting correlations. Medium. Retrieved October 8, 2022, from https://towardsdatascience.com/eveything-you-need-to-know-about-interpreting-correlations-2c485841c0b8

Raza, A. (2022, June). Predicting Employee Attrition Using Machine Learning Approaches. ResearchGate. Retrieved October 1, 2022, from https://www.researchgate.net/publication/361522993_Predicting_Employee_Attrition_Using_Machine_Learning_Approaches

Stowers, J. (2022, September 1). Employee Retention: What Does Your Turnover Rate Tell You? business.com. Retrieved October 2, 2022, from https://www.business.com/articles/employee-turnover-rate/
