This post was originally published by Sheenal Srivastava at Towards Data Science
Big data and machine learning
I was given a large dataset of files, what some would call big data, and told to come up with a solution to the data problem. People often associate big data with machine learning and automatically jump to a machine learning solution. However, after working with this dataset, I realised that machine learning was not the solution. The dataset was provided to me as a case study that I had to complete as part of a four-step interview process.
The dataset consists of a set of files tracking email activity across multiple construction projects. All data has been anonymised. The task was to explore the dataset and report back with any insights. I was informed that clients are concerned with project duration and the number of site instructions and variations seen on a project as these typically cost money.
First, the correspondence data was read in and appended together. The data was checked for duplicates; none were found.
As clients are concerned with project duration, the difference between the response required by date and the sent date was calculated in days. However, there were quite a few missing values for the response required by date and some for the sent date. These records were excluded, reducing the dataset from 20,006,768 records across 7 variables to 3,895,037. The correspondence data was then combined with the mail types file to determine whether the type of correspondence has an impact on duration. Finally, a file containing the number of records for each project was merged in.
Usually, it is not good practice to exclude data without a valid reason; however, as this dataset was assigned to me as part of a job application process, I did not have the opportunity to understand it well enough to impute the missing dates.
As you can see from the code below, we do not have much information on the emails other than projectID, number of records, typeID and typeName.
As a large number of .csv files had to be read in from a single directory, I used lapply with data.table's fread function to read in the files and appended the resulting list using rbindlist.
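A minimal sketch of that import step is below, assuming the files sit in a single data/ directory; the directory name and object name are illustrative rather than taken from the original code.

```r
library(data.table)

# List every .csv in the (assumed) data/ directory
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# Read each file with fread and stack the resulting list into one data.table
correspondence <- rbindlist(lapply(csv_files, fread), use.names = TRUE, fill = TRUE)

# Check for duplicate rows (none were found in this dataset)
sum(duplicated(correspondence))
```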
It is good practice to save your consolidated dataset as an R object to avoid having to rerun the import code at a later date, as this process can be quite time-consuming depending on the number of files.
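For example, with the consolidated data.table from the sketch above (the file name is illustrative):

```r
# Write the consolidated object to disk once...
saveRDS(correspondence, "correspondence_all.rds")

# ...and reload it in later sessions instead of re-reading every .csv
correspondence <- readRDS("correspondence_all.rds")
```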
New variables were created to understand project duration: the duration in days, and whether the correspondence was submitted after the response required by date. If yes, it was late; otherwise, it was early or on time.
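A sketch of those derived variables is below. The date column names (sent_date, response_required_by) are assumptions, and treating a negative gap as "late" is one plausible reading of the rule described above.

```r
# Keep only records where both dates are present (the exclusion step described earlier)
correspondence <- correspondence[!is.na(sent_date) & !is.na(response_required_by)]

# Duration in days: response required by date minus sent date
correspondence[, duration_days := as.numeric(as.Date(response_required_by) - as.Date(sent_date))]

# Flag as late when the required-by date had already passed (assumed interpretation)
correspondence[, late_flag := fifelse(duration_days < 0, "late", "early_or_on_time")]
```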
The unique correspondence dataset was then joined to the mail types file on correspondence type ID (primary key). This was later joined to the main file on project ID.
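Those joins might look something like the following; the table and key column names (mail_types, project_records, typeID, projectID) are assumptions based on the description above.

```r
# Attach the mail type details on the correspondence type ID
correspondence <- merge(correspondence, mail_types, by = "typeID", all.x = TRUE)

# Attach the per-project record counts on the project ID
correspondence <- merge(correspondence, project_records, by = "projectID", all.x = TRUE)
```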
My initial insights after the joins are as follows:
- Correspondence ID is unique, so no aggregation is required on it and it will not assist the analysis.
- There are too many organisation IDs and no sensible way to group them, so they were excluded as a predictor.
- There are too many userIDs, and some only have a frequency count of 1, so they are not a useful predictor either and were also excluded.
To create a unique row for each combination of correspondence type ID, project ID and typeName, I needed to aggregate the other features. I did this by calculating frequencies (counts of late and early submissions) and summary statistics for duration in days, such as the maximum, minimum and mean.
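A data.table sketch of that aggregation, with assumed column names carried over from the earlier sketches:

```r
# One row per project ID, correspondence type ID and typeName,
# with counts of late/early items and summary statistics for duration
project_summary <- correspondence[, .(
  n_late        = sum(late_flag == "late"),
  n_early       = sum(late_flag == "early_or_on_time"),
  min_duration  = min(duration_days),
  max_duration  = max(duration_days),
  mean_duration = mean(duration_days)
), by = .(projectID, typeID, typeName)]
```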
Grouping the data reduced it to 51,156 observations of 7 variables. This sample is smaller than I would like; however, most of the reduction was due to the missing response required by dates, and including every single record per project ID would amount to a comparison per organisation and user ID. It appears that the client is interested in site instructions and variations, which can probably be found via the correspondence type ID and typeName, and keeping the data at record level within each project would be too granular for the task requirements.
We do have an issue in that we do not know what each ID stands for or whether it is important.
Two types of models were run:
- A GLM (Gaussian) was carried out to determine the linear combination of predictors most likely to have an impact on the increase or decrease of average duration in days.
- A GBM (Gaussian) was run to again identify the top predictors and how they relate to average project duration in days.
It is good practice to run multiple models so that the one with the best balance of accuracy and interpretability can be selected.
I first partitioned the dataset into training (70%) and test (30%) sets using a random split. In retrospect, I could have split the data by date to determine how well my model predicts average duration in the future.
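The split could be done with a simple random sample; the seed value is arbitrary and the object name project_summary carries over from the aggregation sketch above.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Randomly assign 70% of rows to training, the rest to test
train_idx <- sample(seq_len(nrow(project_summary)), size = floor(0.7 * nrow(project_summary)))
train <- project_summary[train_idx]
test  <- project_summary[-train_idx]
```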
I used the GLM and GBM functions from the H2O package to take advantage of the parallel processing provided by its server, as my laptop was quite slow, and ran 5-fold cross-validation on the training set. In this method, the training set is partitioned into 5 equal folds; in each round, one fold is held out for validation and the model is trained on the remaining four. The accuracy of each of the five resulting models is then averaged to give a single model accuracy metric.
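A sketch of the two models with H2O's R interface is below; the target column mean_duration is an assumption, and both models use the 5-fold cross-validation described above.

```r
library(h2o)
h2o.init()  # starts a local H2O cluster

# Move the training and test sets into H2O frames
train_h2o <- as.h2o(train)
test_h2o  <- as.h2o(test)

y <- "mean_duration"                  # assumed target variable
x <- setdiff(colnames(train_h2o), y)  # remaining columns as predictors

# Gaussian GLM with 5-fold cross-validation
glm_model <- h2o.glm(x = x, y = y, training_frame = train_h2o,
                     family = "gaussian", nfolds = 5, seed = 123)

# Gaussian GBM with 5-fold cross-validation
gbm_model <- h2o.gbm(x = x, y = y, training_frame = train_h2o,
                     distribution = "gaussian", nfolds = 5, seed = 123)
```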
Output from the GLM is shown below. For continuous target variables, RMSE and R-squared are commonly used as accuracy metrics: we want the RMSE to be as close to 0 as possible and the R-squared value to be as close to 1 as possible. In our case, we can see that our model accuracy is awful on both metrics.
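With H2O, both metrics can be pulled from a performance object computed on the hold-out set, for example:

```r
# Evaluate the GLM on the 30% test set
glm_perf <- h2o.performance(glm_model, newdata = test_h2o)

h2o.rmse(glm_perf)  # root mean squared error (lower is better)
h2o.r2(glm_perf)    # R-squared (closer to 1 is better)
```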
Now, let’s look at the output from the gbm. The results from this model are marginally better but nothing to rave about.
Results show that, of the 1,500 predictors entered in the model (1,500 due to the binarisation of categorical variables), 672 have some degree of influence. Only 2 iterations were run, despite the model being allowed multiple runs to produce the best output, because the best value of lambda was reached after two iterations, giving a poor goodness-of-fit score of 0.38% R-squared.
Though the mean number of records did not come out as a significant predictor in the GLM, it appears to be the most important in the GBM, followed by correspondence type ID and typeName.
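That ranking comes from the GBM's variable importance table, which H2O exposes directly:

```r
# Relative influence of each predictor in the GBM
h2o.varimp(gbm_model)

# Plot the top predictors (10 shown here for illustration)
h2o.varimp_plot(gbm_model, num_of_features = 10)
```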
Due to the low accuracy for both models, I wouldn’t want to make any deductions from the output.
I decided to investigate the output further by plotting the top predictors. From the box plots below, we can see that average duration is longest, and varies the most, for PM request for approval sample correspondence, compared with email and fax, which typically have durations close to zero.
In the box plots below, we can see that the average duration is higher for payment claim than design query and non-conformance notice.
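Plots like these can be produced with a simple ggplot2 box plot over the aggregated data; the column names again follow the assumptions used in the earlier sketches.

```r
library(ggplot2)

# Distribution of mean duration (days) by correspondence type
ggplot(project_summary, aes(x = typeName, y = mean_duration)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Correspondence type", y = "Mean duration (days)")
```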
The quality of a model depends on the quality of the data and its features. In this case, we started off with a very small number of features and a very poor understanding of the dataset, which led to the removal of a large proportion of the data and poor model accuracy despite some feature engineering.
In this example, we found that simple exploration of the dataset would have answered the business question about what impacts project duration: we would have found that payment claim and PM request for approval sample correspondence can lead to an increase in duration.
The number of site instructions and variations per project could similarly have been explored by calculating frequencies by project ID and typeName.
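A sketch of that frequency count, filtering to type names that look like site instructions or variations (the exact typeName values are assumptions):

```r
# Number of records per project and correspondence type
type_counts <- correspondence[, .N, by = .(projectID, typeName)]

# Keep only the types the client asked about
type_counts[grepl("site instruction|variation", typeName, ignore.case = TRUE)]
```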
This is an example of a problem where a predictive model was not required. To build a model with better accuracy, additional features and datasets would be needed, along with a better understanding of the dataset.
I would love to hear what you think and whether I could have approached this problem differently! 🙂
All code can be found here: https://github.com/shedoesdatascience/email_analysis/blob/master/email_analysis_documentation.Rmd