Reducing the Artificial Neural Network complexity by transforming your data


This post was originally published by Walinton Cambronero at Towards Data Science

A practical example in a hard-to-classify dataset

Introduction

The need to reduce the complexity of a model can arise from multiple factors, most often the need to reduce computational requirements. However, complexity cannot be reduced arbitrarily: after many iterations of training and testing, that complex model is the one that was found to provide good results. Research on this topic is active, e.g. [Koning et al., 2019] propose a solution to this problem for CNNs used for exoplanet detection:

Convolutional Neural Networks (CNNs) suffer from having too many trainable parameters, impacting computational performance … We propose and examine two methods for complexity reduction in AstroNet … The first method makes only a tactical reduction of layers in AstroNet while the second method also modifies the original input data by means of a Gaussian pyramid

The second method (modifying or transforming the input data) is common. According to Google's Machine Learning Crash Course, transformations are applied primarily for two reasons (both are illustrated in the short sketch after the list):

  1. Mandatory transformations: these make the data compatible with the algorithm, e.g. converting non-numeric features into numeric ones.
  2. Quality transformations: these help the model perform better, e.g. normalizing numeric features.
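
As a quick illustration of both categories, here is a minimal scikit-learn sketch. The toy features (colors, ages) are made up for the example and have nothing to do with the Poker Hand dataset.

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features (not from the Poker Hand dataset)
colors = np.array([["red"], ["green"], ["red"], ["blue"]])  # categorical
ages = np.array([[12.0], [47.0], [33.0], [8.0]])            # numeric

# 1. Mandatory transformation: convert the non-numeric feature to numbers
colors_numeric = OneHotEncoder().fit_transform(colors).toarray()

# 2. Quality transformation: normalize the numeric feature
ages_normalized = StandardScaler().fit_transform(ages)

print(colors_numeric)
print(ages_normalized)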

The kind of transformation proposed by [Koning et al., 2019], and the one proposed in this article, fit in the second category.

Objective

I present a linear data transformation for the Poker Hand dataset [Cattral et al., 2007] and show how this transformation helps reduce the model complexity for a Multi-layer Perceptron (MLP) Neural Network while maintaining the classifier’s accuracy and reducing the training time up to 50%. The Poker Hand dataset is publicly available and very well-documented at the UCI Machine Learning Repository [Dua et al., 2019].

In a previous story, I talked about the Poker Hand dataset. A 3-layer MLP performed relatively well. Today, I show that it is possible to achieve equivalent accuracy with a less complex model by understanding the data we're working with and transforming it to make it more appropriate for the problem we're trying to solve.

Dataset description

This particular dataset is very human-friendly. It uses an 11-dimensional description of poker hands, explicitly listing the suit and rank of each card together with the associated poker hand. Each data instance contains 5 cards.

Encoding

The following is the dataset's encoding. For details, see the dataset page at the UCI Machine Learning Repository [Cattral et al., 2007].

Suit: 1: Hearts, 2: Spades, 3: Diamonds, 4: Clubs
Rank: 1: Ace, 2: Two, …, 10: Ten, 11: Jack, 12: Queen, 13: King
Hand: 0: Nothing, 1: Pair, 2: Two pairs, …, 8: Straight Flush, 9: Royal Flush

Example

One possible encoding for the Royal Flush of Hearts (a hand can have multiple representations in this model) is:
Data: 1,1,1,10,1,11,1,12,1,13,9
Interpretation: Hearts-Ace, Hearts-Ten, Hearts-Jack, Hearts-Queen, Hearts-King, Royal-Flush

Royal Flush of Hearts Photo: Graeme Main/MOD
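
To make the encoding concrete, here is a small sketch that decodes one 11-dimensional row into readable form. The helper names are mine (not from the article's notebook); the hand-class labels for values 3 through 7 follow the UCI documentation.

# Decode one 11D Poker Hand row into readable form (helper names are mine).
SUITS = {1: "Hearts", 2: "Spades", 3: "Diamonds", 4: "Clubs"}
RANKS = {1: "Ace", 10: "Ten", 11: "Jack", 12: "Queen", 13: "King"}
HANDS = ["Nothing", "Pair", "Two pairs", "Three of a kind", "Straight",
         "Flush", "Full house", "Four of a kind", "Straight Flush", "Royal Flush"]

def decode_row(row):
    # row: [suit1, rank1, ..., suit5, rank5, hand]
    cards = [f"{SUITS[row[i]]}-{RANKS.get(row[i + 1], str(row[i + 1]))}"
             for i in range(0, 10, 2)]
    return cards, HANDS[row[10]]

print(decode_row([1, 1, 1, 10, 1, 11, 1, 12, 1, 13, 9]))
# (['Hearts-Ace', 'Hearts-Ten', 'Hearts-Jack', 'Hearts-Queen', 'Hearts-King'], 'Royal Flush')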

The transformation is based on the fact that the order in which the cards appear in a hand does not matter for classifying the hand; a far more informative attribute is the number of cards (i.e. the cardinality) of each rank or suit present in the hand. The original model gives artificial importance to the order in which the cards appear (samples are ordered lists of 5 cards) and does not explicitly encode the cardinality of each suit or rank. The premise is that by making this attribute explicit in the data, a Neural Network can classify the dataset better than the same Neural Network using the original model, in which the attribute is hidden.

Linear transformation

The following is a linear transformation from the original 11D space to a new 18D space. A linear transformation is preferable due to its reduced computational requirements. The new dimensions and descriptions are:

Attributes 1 through 13: the 13 ranks, i.e. 1: Ace, 2: Two, 3: Three, …, 10: Ten, 11: Jack, 12: Queen, 13: King.
Attributes 14 through 17: the 4 suits, i.e. 14: Hearts, 15: Spades, 16: Diamonds, 17: Clubs.
Domain of attributes 1–17: [0–5]; each dimension holds the cardinality of that rank or suit in the hand.
Attribute 18: the poker hand [0–9] (unchanged).

Encoding and example

The following is an example transformation for the Royal Flush of Hearts.

Representation in original dimensions (11D):

Data: 1,1,1,10,1,11,1,12,1,13,9
Encodes: Hearts-Ace, Hearts-Ten, Hearts-Jack, Hearts-Queen, Hearts-King, Royal-Flush

Representation in new dimensions (18D):

Data: 1,0,0,0,0,0,0,0,0,1,1,1,1,5,0,0,0,9
Encodes: 1st column = 1 Ace, 2nd through 9th columns = nothing (no cards with those ranks), 10th through 13th columns = 1 Ten, 1 Jack, 1 Queen and 1 King, 14th column = 5 Hearts, 15th through 17th columns = nothing (no cards of those suits), and 18th column = Royal Flush.

The following image shows the visual transformation for this particular example.

Linear Transformation from 11D to 18D (Image by author)

The new model represents any given combination of 5 cards the same way regardless of order, and explicitly exposes information useful for classifying poker hands, such as the number of cards of the same rank.
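
A minimal NumPy sketch of this transformation is shown below. The function name is mine and this is only an illustration of the idea; the notebook referenced in the next section contains the actual implementation used in the experiments.

import numpy as np

def transform_11d_to_18d(row):
    # row: [suit1, rank1, suit2, rank2, ..., suit5, rank5, hand]
    out = np.zeros(18, dtype=int)
    for i in range(0, 10, 2):
        suit, rank = row[i], row[i + 1]
        out[rank - 1] += 1       # columns 1-13: cardinality of each rank
        out[13 + suit - 1] += 1  # columns 14-17: cardinality of each suit
    out[17] = row[10]            # column 18: poker hand label (unchanged)
    return out

# Royal Flush of Hearts example from above
print(transform_11d_to_18d([1, 1, 1, 10, 1, 11, 1, 12, 1, 13, 9]))
# -> [1 0 0 0 0 0 0 0 0 1 1 1 1 5 0 0 0 9]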

Tools

Scikit-learn, NumPy and Seaborn are used for machine learning, data processing and visualization, respectively.

Where is the code?

A Jupyter notebook with the MLP, visualization and the linear transformation is here. The Classification Report and Confusion Matrix for each experiment are included in the Jupyter notebook too.

In my previous story, I showed that an MLP with 3 hidden layers of 100 neurons each, with alpha=0.0001 and learning rate=0.01, achieves ~78% accuracy on the original dataset. These hyper-parameters were found after running an extensive grid search over a wide range of values, so the following measurements are based on these same values.
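
For reference, that configuration can be expressed with scikit-learn's MLPClassifier roughly as follows. This is only a sketch: I'm assuming learning_rate_init corresponds to the quoted learning rate, and X_train/y_train stand for the training split; the linked notebook has the actual experiment code.

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100),  # 3 hidden layers of 100 neurons each
    alpha=0.0001,                        # L2 regularization term
    learning_rate_init=0.01,             # initial learning rate
    random_state=0,
)
# The hyper-parameters were originally found with a grid search, e.g. with
# sklearn.model_selection.GridSearchCV over alpha and learning_rate_init.
# mlp.fit(X_train, y_train)  # X_train/y_train: features (11D or 18D) and hand labels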

Metrics

The MLP accuracy is measured with the macro-average F1 metric. This is an appropriate metric for the Poker Hand dataset because it deals nicely with the fact that the dataset is extremely imbalanced. From scikit-learn's documentation:

The F-measure can be interpreted as a weighted harmonic mean of the precision and recall … In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance

The Classification Report is shown for the different experiments. It contains the macro-average F1 metric, among others.

In addition, the MLP training time is measured and reported.
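
A sketch of how the macro-average F1, the classification report and the training time can be produced with scikit-learn; mlp, X_train, X_test, y_train and y_test are the illustrative names from the sketch above, not the notebook's actual variables.

import time
from sklearn.metrics import classification_report, f1_score

start = time.perf_counter()
mlp.fit(X_train, y_train)                  # training time is measured around fit()
train_seconds = time.perf_counter() - start

y_pred = mlp.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average="macro")   # headline metric
print(f"training time: {train_seconds:.1f}s  macro F1: {macro_f1:.2f}")
print(classification_report(y_test, y_pred))           # per-class precision/recall/F1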

3-hidden-layer MLP with original data

Complexity: Each hidden-layer has 100 neurons.

Accuracy: For the 3-layer MLP and the original data (no transformation applied yet), a macro-average F1 score of ~0.78 is obtained (see the classification report below). Refer to the previous post for details on how this result was achieved.

Training time: 20+ seconds

Classification report

              precision    recall  f1-score   support

           0       1.00      0.99      0.99    501209
           1       0.99      0.99      0.99    422498
           2       0.96      1.00      0.98     47622
           3       0.99      0.99      0.99     21121
           4       0.85      0.64      0.73      3885
           5       0.97      0.99      0.98      1996
           6       0.77      0.98      0.86      1424
           7       0.70      0.23      0.35       230
           8       1.00      0.83      0.91        12
           9       0.04      0.33      0.07         3

    accuracy                           0.99   1000000
   macro avg       0.83      0.80      0.78   1000000
weighted avg       0.99      0.99      0.99   1000000

2-hidden-layer MLP with transformed data

In this experiment, the model complexity is reduced by dropping one hidden-layer of 100 neurons, and the transformed (18D) data is being used. Everything else remains identical.

Accuracy: For the 2-layer MLP with the transformed data, a macro-average F1 score of ~0.83 is obtained (see the classification report below), slightly better than the ~0.78 of the more complex model trained on the original data.

Training time: 10–15 seconds

Classification report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    501209
           1       1.00      1.00      1.00    422498
           2       1.00      1.00      1.00     47622
           3       0.97      1.00      0.98     21121
           4       1.00      0.99      1.00      3885
           5       1.00      0.98      0.99      1996
           6       0.83      0.48      0.61      1424
           7       1.00      0.41      0.58       230
           8       0.38      0.75      0.50        12
           9       0.50      1.00      0.67         3

    accuracy                           1.00   1000000
   macro avg       0.87      0.86      0.83   1000000
weighted avg       1.00      1.00      1.00   1000000

1 hidden-layer MLP with transformed and original data

Accuracy:
With a single layer of 100 neurons, the MLP with the transformed data achieved ~70% accuracy. With the original dataset it achieved ~30% accuracy.

Training time:
~10 seconds for the transformed dataset, ~12 seconds with the original data.

Other experiments

Feel free to take a look at the Jupyter notebook that has the code and results for these and other experiments.
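
The layer-count and data-representation comparisons reported above can be scripted along the following lines. This is only a sketch: X11_*/X18_* stand for the original 11D and transformed 18D feature matrices (e.g. produced row-wise by a function like the one sketched earlier), and y_train/y_test for the hand labels.

import time
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Hypothetical names: X11_* hold the original 11D features, X18_* the 18D
# transformed features, y_train/y_test the poker-hand labels.
datasets = {"original (11D)": (X11_train, X11_test),
            "transformed (18D)": (X18_train, X18_test)}
layer_configs = [(100,), (100, 100), (100, 100, 100)]

for name, (Xtr, Xte) in datasets.items():
    for layers in layer_configs:
        mlp = MLPClassifier(hidden_layer_sizes=layers, alpha=1e-4,
                            learning_rate_init=0.01, random_state=0)
        start = time.perf_counter()
        mlp.fit(Xtr, y_train)
        elapsed = time.perf_counter() - start
        macro_f1 = f1_score(y_test, mlp.predict(Xte), average="macro")
        print(f"{name}, {len(layers)} hidden layer(s): "
              f"macro F1 = {macro_f1:.2f}, fit time = {elapsed:.1f}s")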

Conclusion

By applying a simple linear transformation that makes the dataset less human-friendly but more ML-friendly, I show that a simpler MLP model provides equivalent results in less computational time. Specifically, a hidden layer of 100 neurons is removed without compromising the performance of the classifier. The results show that the Neural Network accuracy is similar to or better than the one achieved by the more complex model, while the training time is reduced by 25% to 50%.

References

Cattral, R. and Oppacher, F. (2007). Poker Hand Data Set [https://archive.ics.uci.edu/ml/datasets/Poker+Hand]. Carleton University, Department of Computer Science, Intelligent Systems Research Unit.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Koning, S., Greeven, C. and Postma, E. (2019). Reducing Artificial Neural Network Complexity: A Case Study on Exoplanet Detection. https://arxiv.org/abs/1902.10385
