The Data Science ABCs: A whirlwind tour of the field

This post was originally published by Andre Ye at Medium [AI]

A is for AUC, B is for Batch Normalization…

A is for Area Under Curve (AUC)

The Area Under Curve metric represents, in the case of binary classification, the probability that a classifier will score a randomly chosen positive example higher than a randomly chosen negative example. It is the area under the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate as the decision threshold varies.
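The probabilistic reading of AUC can be checked directly by counting pairs. The sketch below is only illustrative: the function name `pairwise_auc` and the toy labels and scores are assumptions, and ties are counted as half a win, which is a common convention rather than something stated in this post.

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 3 positives, 2 negatives, so 6 pairs in total.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = pairwise_auc(labels, scores)  # 5 of the 6 pairs are ranked correctly
```

Counting every pair is quadratic in the number of examples, so this is meant for intuition on small data rather than for production use.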

B is for Batch Normalization

Batch Normalization is a layer commonly used in state-of-the-art neural networks. It takes the inputs from the previous layer and normalizes them by subtracting the batch mean and dividing by the batch standard deviation. Considered one of the biggest breakthroughs in machine learning, Batch Normalization’s success was once attributed to its reducing “internal covariate shift”, the heavy shifting of input distributions during training, which was thought to contribute to the Vanishing Gradient Problem and hence reduce a model’s efficiency.
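As a rough sketch of the normalization step, here is a version for a single feature across one batch. The names `gamma`, `beta`, and `eps` are assumptions borrowed from common practice (a learnable scale, a learnable shift, and a small constant for numerical stability); the running statistics that real layers keep for inference are left out.

```python
import statistics

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then apply scale and shift."""
    mean = statistics.fmean(batch)
    var = statistics.pvariance(batch, mu=mean)  # population variance of the batch
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)  # mean close to 0, variance close to 1
```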

A few years after the publication of the original Batch Normalization paper, researchers found that the layer’s success could instead be attributed to the fact that it made the error landscape smoother, allowing the optimizer to explore a smooth terrain with fewer convincing local minima to get stuck in.

C is for Categorical Encodings

Categorical variables are an essential part of almost any dataset, but for a categorical variable with many unique values (say, 50 states) the traditional encoding methods become impractical: label encoding imposes an order on a variable that is not ordinal, and one-hot encoding produces a result that is too sparse. By using specialized categorical encodings, one can avoid the impracticalities of both methods.

The most common specialized categorical encoding, often called target (or mean) encoding, replaces each categorical value with the average target value for that unique value. For instance, if all rows marked “California” have an average target value of 0.34, the encoding would replace all instances of “California” in that column with 0.34. Although there are some downsides, such as the risk of leaking the target into the features, it often performs better than other encodings.
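A minimal sketch of this mean-of-target encoding follows. The function name `target_encode` and the toy data are hypothetical. In practice the means are usually computed on training data only (often per fold), which this sketch does not attempt.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means

# Toy column of states with a binary target (made-up values).
states = ["California", "New York", "California", "Texas"]
target = [1, 0, 0, 1]
encoded, mapping = target_encode(states, target)
```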

D is for Dropout Regularization

Dropout Regularization is an attempt to prevent a model from overfitting with “dropout layers”, which randomly block a percentage of incoming inputs and rely on the remaining inputs to pass on important information. As the network works around this disability, the theory goes, it will learn to choose only the most important information to propagate around the network in anticipation that some of it will be blocked.
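The blocking step can be sketched in a few lines. This version uses “inverted” dropout, a common variant that rescales the surviving inputs so their expected sum is unchanged; that rescaling, and the names `rate` and `training`, are assumptions rather than something described above.

```python
import random

def dropout(inputs, rate, rng, training=True):
    """Zero each input with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(inputs)  # dropout is switched off at prediction time
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]

rng = random.Random(0)  # seeded only to make the sketch repeatable
blocked = dropout([1.0] * 10, rate=0.5, rng=rng)
```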

While Dropout has empirically been shown to be a generally well-performing method, it can cause minor instability in prediction, and as there are many other regularization methods, Dropout may not always be the best choice.

E is for Embeddings

An embedding is a categorical feature converted into a continuous-valued feature, typically by representing a high-dimensional vector in a low-dimensional space. For example, consider language, notorious for having sparse, high-dimensional data and hence a heavy reliance on embeddings; one can model the words in a sentence with two methods:

  • As a very high-dimensional sparse vector in which all elements are integers; each cell in a vector represents a unique word, and the value in that cell represents the number of times the word appears in a sentence. Since a word is not likely to appear in the same sentence more than a few times, almost every cell will be 0. Many algorithms have difficulty working with very sparse data structures.
  • As a low-dimensional dense vector where each element holds a real value (for instance, from 0 to 1), which can be thought of as a ‘compressed’ version of the same information.

These embeddings are trained just like other parameters in the model. Over time, the model learns to create richly informational ‘compression’ tactics.
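To make the two representations concrete, here is a toy sketch. The vocabulary, the 2-dimensional embedding table, and the choice to average word vectors are all assumptions made for illustration; in a real model the table would be learned during training rather than filled with random numbers.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary (assumption)
INDEX = {word: i for i, word in enumerate(VOCAB)}

def count_vector(sentence):
    """Sparse representation: one cell per vocabulary word, holding its count."""
    vec = [0] * len(VOCAB)
    for word in sentence.lower().split():
        if word in INDEX:
            vec[INDEX[word]] += 1
    return vec

# Hypothetical embedding table: one 2-dimensional vector per vocabulary word.
_rng = random.Random(0)
EMBEDDINGS = [[_rng.random() for _ in range(2)] for _ in VOCAB]

def dense_vector(sentence):
    """Dense representation: the average embedding of the known words."""
    rows = [EMBEDDINGS[INDEX[w]] for w in sentence.lower().split() if w in INDEX]
    if not rows:
        return [0.0, 0.0]
    return [sum(col) / len(rows) for col in zip(*rows)]

sparse = count_vector("the cat sat on the mat")  # length grows with the vocabulary
dense = dense_vector("the cat sat on the mat")   # length stays at 2
```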

F is for Fairness Constraint

Applying a fairness constraint to an algorithm aims to ensure fairness within its decision-making process. For example, one may decide to post-process their model output, or to alter the loss function such that a penalty is incurred when the model violates a fairness metric. Another option is to directly add a mathematical constraint to the optimization problem. The fairness constraint is part of a greater discussion on ethical and/or statistical or human biases in machine learning and data analysis. Fairness constraints need to be placed carefully so as not to introduce an additional form of bias (‘unfairness’).

G is for Greedy

A greedy policy in reinforcement learning is one that always chooses the action with the highest immediate return. Because these policies are ‘greedy’, they look only one step ahead and neglect long-term thinking. Sometimes greedy policies have been shown to be good solutions, as in the famous Multi-Armed Bandit problem.

Greedy optimizers, similarly, are ones that crave an immediate decline in error or increase in performance. Consider, for instance, a naïve optimizer for a neural network in an error space; it will only choose to go in the direction directly in front of it that yields the greatest decline. When it lands in a local minimum (a dip in the surface, but not the lowest dip in the entire space), it will refuse to “jump out” of that minimum for fear of temporarily increasing error, even if doing so would mean a better end result. Because greedy optimizers tend not to work well with neural networks, various more sophisticated variants utilize concepts like ‘momentum’, which only lets a minimum stop the optimizer if it is deep enough.

It’s important to realize, however, that greedy and naïve don’t necessarily mean bad in the context of modelling: a model that simply predicts that tomorrow’s weather will be the same as today’s in a stable climate, or one that predicts that the stock market will keep going in the same direction it did yesterday, won’t perform too badly. Sometimes simplicity ends up beating complexity in absurd (or beautiful) ways.

H is for Hashing (Features)

Another method to prepare categorical features with a high number of unique values is to hash them. Hashing maps each value to one of a fixed, much smaller number of buckets. For instance, Earth has an estimated trillion species; it is infeasible to label-encode, one-hot encode, or even apply the specialized categorical encoding discussed above, simply because the sheer number of classes makes it nearly impossible to differentiate between unique values.

Ideally, we would lump the Common Raven and the Chihuahuan Raven under the same category, because they have similar genetics and hence similar properties. That kind of similarity-based lumping needs a hand-built or hierarchical grouping, however; a plain hash function assigns buckets without regard to similarity, so unrelated values may share a bucket. In essence, feature hashing reduces a categorical feature with a large number of possible values into one with a vastly reduced number of possible values by grouping them in a deterministic manner.
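A minimal sketch of the deterministic grouping, assuming a general-purpose hash (SHA-256 here) and a hypothetical bucket count of 8. Which values end up sharing a bucket is decided by the hash alone, not by any biological similarity.

```python
import hashlib

def hash_bucket(value, n_buckets=8):
    """Map a category string to a bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

species = ["Common Raven", "Chihuahuan Raven", "Monarch Butterfly"]
buckets = {name: hash_bucket(name) for name in species}
```

The same input always lands in the same bucket, so the mapping never has to be stored. The cost is collisions, where distinct values become indistinguishable to the model.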

I is for Intersection Over Union

Intersection Over Union, or IOU, is an important metric for evaluating bounding boxes (think of the yellow square that pops up over your face when you take a picture on your phone), measuring the area of the intersection of the predicted bounding box and the correct bounding box, divided by the total area taken up by both boxes (counting the intersection only once).

Consider, for instance, the intersection of two bounding boxes (the ground truth bounding box identifies the butterfly).
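The computation itself is short. The sketch below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates, which is one common convention and not something this post specifies.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter  # the intersection is counted only once
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
score = iou(predicted, ground_truth)  # overlap of area 1, union of area 7
```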

L is for Log-Odds

The log-odds of an event is the natural logarithm of the odds of that event. Given the probability p of something happening, the odds of that event can be found by dividing p by (1-p), that is, dividing the probability of the event happening by the probability that it doesn’t happen. For instance, if we are 70% sure a horse will win (odds get their origins from horse-racing gambling), our odds will be 0.7 divided by 0.3, or about 2.3. The odds express a simple but important relationship in probability.

The log-odds function is the inverse of the sigmoid function, in that it is just like the sigmoid function but ‘rotated’, with the x and y-axes swapped. Its input is squeezed between x = 0 and x = 1, while its output ranges over all real numbers, and it has the same relationship between axes as the sigmoid function does. This hints at why the sigmoid function is such an important function in machine learning and modelling; besides being a convenient function with a nice shape, set of properties, and derivative, as is the justification in many articles and courses, it has a very strong connection with the fundamental roots of probability, which may be why it often performs better than other ‘convenient functions’ without such a connection to probability.
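The inverse relationship is easy to verify numerically. A small sketch, with function names chosen for illustration:

```python
import math

def log_odds(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Maps any real number back into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

odds = 0.7 / 0.3                     # the horse-racing example: about 2.3
round_trip = sigmoid(log_odds(0.7))  # recovers 0.7 up to floating-point error
```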

M is for Matrix Factorization

Matrix Factorization is a recommender system algorithm that takes in an interaction matrix — where rows and columns represent sets of identities, and cells represent values of their interactions, like users’ ratings for movies — with missing values and attempts to fill them in.

Given an m by n matrix, the matrix factorization algorithm will factor the matrix into two “sub-matrices” with dimensions m by k and k by n, such that when multiplied by each other, the result is an m by n matrix. In this case, k is an almost-arbitrary small number (say, anything from 2 to 5, depending on the scenario). The trick in matrix factorization is to evaluate these sub-matrices based on how close their predictions are to the known values, on the assumption that if the model is able to closely replicate the known values in the matrix, it will be able to correctly predict the unknown values.
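One simple way to find the two sub-matrices is stochastic gradient descent on the known cells only. The sketch below is an assumption-laden illustration, not the algorithm any particular recommender uses: the learning rate, regularization strength, step count, and toy ratings are all made up, and missing cells are marked with `None`.

```python
import random

def factorize(ratings, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit P (m x k) and Q (n x k) so that P[i] . Q[j] approximates known ratings."""
    rng = random.Random(seed)
    m, n = len(ratings), len(ratings[0])
    P = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(m)]
    Q = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(n)]  # transpose of the k x n factor
    known = [(i, j, r) for i, row in enumerate(ratings)
             for j, r in enumerate(row) if r is not None]
    for _ in range(steps):
        for i, j, r in known:
            err = r - sum(P[i][f] * Q[j][f] for f in range(k))
            for f in range(k):
                p, q = P[i][f], Q[j][f]
                P[i][f] += lr * (err * q - reg * p)
                Q[j][f] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, i, j):
    """Predicted value for cell (i, j), whether or not it was known."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

def mean_abs_error(ratings, P, Q):
    """Average absolute error over the known cells only."""
    errs = [abs(r - predict(P, Q, i, j)) for i, row in enumerate(ratings)
            for j, r in enumerate(row) if r is not None]
    return sum(errs) / len(errs)

# Toy user-by-movie ratings; None marks a missing value to be filled in.
R = [[5, 3, None, 1],
     [4, None, None, 1],
     [1, 1, None, 5],
     [1, None, None, 4],
     [None, 1, 5, 4]]
P, Q = factorize(R)
guess = predict(P, Q, 0, 2)  # the model's fill-in for a missing cell
```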

Matrix Factorization is known to be a very computationally expensive task — while multiplying two matrices is analogous to summarizing Shakespeare, factoring it is like writing Shakespeare from a summary. Because of this and the availability of other less-expensive alternatives, it has generally fallen out of favor with heavy corporate use but still shows up in smaller datasets or in applications that do not need to make on-the-spot predictions (like marketing campaigns).

N is for N-Gram

An N-gram is a term commonly used in the field of Natural Language Processing to refer to a sequence of N consecutive words in a text. For instance, the phrase “machine learning rocks!” is a three-gram, or a trigram. An important difference between N-grams and other forms of data fed into natural language processing models, like one-hot encoded words, is that N-grams retain the sequential aspect of the data.
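Extracting N-grams takes only a sliding window over the tokens. This sketch assumes the simplest possible tokenizer (splitting on whitespace); real pipelines usually handle punctuation and casing more carefully.

```python
def ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("machine learning rocks", 2)
trigrams = ngrams("machine learning rocks", 3)
```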

Whereas naïve methods of encoding do not give the model any information on the order of words in a phrase, training a model on N-grams tends to yield more satisfactory results, since the model can take into account complex aspects of language like sarcasm, idioms, and words whose meaning depends on context. State-of-the-art NLP models that perform incredibly well on standard language tasks like question-answering or difficult classification likewise rely on inputs that preserve word order.

O is for Out-Group Homogeneity Bias

Out-Group Homogeneity Bias is a tendency to see entities outside a group as being more alike than members inside a group when comparing various aspects like personality or values. The in-group refers to people or entities you interact with commonly, whereas the out-group refers to those you do not commonly interact with. If a dataset is created by asking people about out-groups, the results will be less nuanced and more stereotypical than the attributes respondents give for an in-group.

Members of Colony A may be very well-versed and specific about their neighbors, discussing in-depth discrepancies in taste, styles, and habits, whereas they would generalize members of Colony B as living in the same houses and having identical traits.

Out-Group Homogeneity Bias comes from a group of biases that are too often ignored in machine learning. The water is only as good as the ice caps are pure, and hence, the model and analyses are only as good as the data is honest. Messy data collection will lead to a dishonest and flawed result but, worse, will give the data scientist false confidence in their work. Out-group homogeneity bias and many other forms of group attribution bias plague survey data constantly, which can lead organizations to arrive at false conclusions on important issues, be it an election forecast or an important company decision.

P is for Perplexity

Perplexity is an important metric for a model in real-time applications, and measures how uncertain the model remains as it is updated with new information. Formally, it is the exponential of the model’s average negative log-likelihood, so lower values mean a more confident model. For instance, right in your browser, a perplexity-style measure might count the number of characters a user must type into their search bar before a search engine obtains over 60% confidence that the user will search a given query. In applications of models where time is an important factor, simple accuracy variants fail to measure the quick and decisive nature required for that kind of task. Of course, perplexity is only a suitable metric once a model has been trained and is ready for deployment.
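In its standard textbook form, perplexity is the exponential of the average negative log-probability a model assigns to what actually happened. The sketch below takes that formulation as an assumption and uses made-up probabilities; a model that always spreads its belief evenly over four options ends up with a perplexity of about 4.

```python
import math

def perplexity(probs):
    """exp of the mean negative log-probability assigned to the observed outcomes."""
    if not probs or any(p <= 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must be in (0, 1]")
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
confident_model = perplexity([0.9, 0.8, 0.95])  # closer to 1 means less 'perplexed'
```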

Q is for Q-Learning

Q-Learning is a basic form of Reinforcement Learning, using Q-values, or action values, to recursively improve the behavior of an agent. Q-Learning is defined by four components:

  • Q-values (Action values). These are defined for states and actions, and are an estimation of the goodness of taking an action at the current state. This estimation is iteratively computed.
  • Rewards & Episodes. An agent begins its lifetime from a start state and makes a number of transitions from its current state to a next state, based on the environment it is interacting with. At every step, the agent takes an action, gets a reward (or a penalty) from the environment, and transitions to another state. If the agent arrives at a terminating state (no further transition is possible), an episode has been completed.
  • Temporal Difference (TD Update). The Temporal Difference rule is an update rule used to estimate the value of Q, applied at every time step of the agent’s interaction with an environment. The terms involved in its calculation include the current state & action of the agent, the next state & action of the agent, the current reward (or penalty) observed, as well as a “discounting factor”, which determines how much weight the algorithm gives to future rewards relative to immediate ones.
  • Action Chooser (Policy). Q-learning uses a “greedy policy” in that it simply chooses the action with the best Q-value estimation. Of course, it is dependent on the estimations of the Q-value, which includes a discounting factor, capable of controlling the “greediness” of the algorithm in how hungry it is for immediate reward.
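The four components above can be tied together in a short tabular sketch. The names `alpha` (learning rate) and `gamma` (discounting factor), their values, and the two-state toy transition are assumptions for illustration; the update shown is the usual Q-learning form, which bootstraps from the best action available in the next state.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9, done=False):
    """One Temporal Difference update of Q[(state, action)]; returns the new value."""
    # A terminating state has no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

def greedy_action(Q, state, actions):
    """The greedy policy: pick the action with the highest current Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = defaultdict(float)  # every Q-value starts at 0.0
actions = ["left", "right"]
q_update(Q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
choice = greedy_action(Q, "s0", actions)
```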

R is for Regularization

Regularization is a penalty on the model’s complexity, and helps prevent it from overfitting. There are various types of regularization, including Dropout specifically for neural networks, but the most general are L1 and L2, which can apply to almost any machine learning model. L1 regularization penalizes coefficients by their absolute value, so a coefficient of 4 incurs a penalty of 4. This also means that a decrease from 5 to 4 reduces the penalty by the same amount as a decrease from 1 to 0.

On the other hand, L2 regularization penalizes coefficients by the square of their value, hence allowing more tolerance for small values and seeking to eliminate unnaturally large ones. A coefficient of 4 is penalized as 16, and a decrease from 4 to 3 reduces the penalty by 16 - 9 = 7, whereas a decrease from 1 to 0 reduces it by only 1 - 0 = 1.

This means that models trained with L2 tend to have many coefficients that are close to zero, but are not actually zero, because the regularization penalty gives them no big reward for going all the way. On the other hand, L1 regularization encourages coefficients to keep on decreasing to 0 if it is profitable to do so, since a decrease of 1 earns the same reward regardless of what value it decreases from. Which regularization method to use depends on the model type, architecture, parameters, and attributes of the data.

You can read more about differences between linear models trained using L1 and L2 regularizations (as well as mixes between them) here.
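The arithmetic behind the L1 and L2 comparison can be written out as a quick sketch. The function names and the strength parameter `lam` are illustrative assumptions.

```python
def l1_penalty(weights, lam=1.0):
    """Sum of absolute values, scaled by the regularization strength."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=1.0):
    """Sum of squares, scaled by the regularization strength."""
    return lam * sum(w * w for w in weights)

# Reward for shrinking a single weight by 1, from a large and from a small value.
l1_large = l1_penalty([4.0]) - l1_penalty([3.0])  # same as the small case
l1_small = l1_penalty([1.0]) - l1_penalty([0.0])
l2_large = l2_penalty([4.0]) - l2_penalty([3.0])  # 16 - 9
l2_small = l2_penalty([1.0]) - l2_penalty([0.0])  # 1 - 0
```

Under L2 the last step down to zero earns very little, which is why small weights tend to linger; under L1 every step earns the same amount, so weights can be pushed all the way to zero.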

S is for Semi-Supervised Learning

Semi-supervised learning is a mix between supervised learning and unsupervised learning. It was created to address the increasingly common problem of supervised learning struggling when labelled data is scarce. As the amount of unlabeled data (particularly on user interactions on the Internet) rises exponentially, semi-supervised learning takes the best from both the supervised and unsupervised learning worlds so data scientists don’t need to choose just one.

Ideally, a semi-supervised learning dataset would be majority unlabeled and minority labeled.

Consider, for instance, the semi-supervised GAN (Generative Adversarial Network), which trains the discriminator not only to classify inputs as real (taken from the dataset) or fake (artificially generated by the generator) but also to perform a supervised learning task. The discriminator therefore outputs n+1 nodes, where n is the number of classes in the supervised learning task and 1 represents the one node for the real/fake binary output. By utilizing the structure of the GAN, the reasoning goes, the important information the discriminator identifies in the image through unsupervised learning more than makes up for the lack of labelled data, boosting the supervised learning task.

T is for Transfer Learning

Transfer learning, a practice growing in popularity in the machine learning community, is the process of transferring information from one machine learning task to another. For example, consider multitask learning, in which a single model must solve many tasks at once (it has different output nodes for different tasks). There are a few creative applications of this idea:

  • With semi-supervised GANs, as seen above, the discriminator performs two tasks at once and hence is able to perform the supervised task better with help from the unsupervised task.
  • Consider a neural network ensemble in which each member produces an output, as well as a score for how much its voice deserves to be heard in the final aggregation of votes. (This turns out to be pretty successful, and more so than simply using ‘soft voting’, or weighting votes by confidence.)

Most machine learning systems can solve only a single task, and this limitation has been one of the biggest points made by those who argue against real artificial intelligence. Multitask learning is a baby step towards “real” artificial intelligence in its capability to solve multiple tasks at once.

Beyond multitask learning, consider the concept of the pretrained model, which has gained immense popularity in image processing and language modelling, where performing the computation for such a complex model every single time would be inefficient. Instead, if one wants to train a model to classify images, they would take a model architecture pretrained on, say, the ImageNet dataset, and train the model further to tweak the parameters for their specific task. Because a large part of the work is already done — in the case of image processing, the capability to recognize important parts of a picture; and in the case of NLP, a fundamental understanding of words — building models for specific tasks becomes much more efficient.

U is for Upweighting

Upweighting is part of a greater discussion on addressing class imbalance, the unequal representation of labels among data points, and builds on the idea of downsampling.

Downsampling is a method to address class imbalance problems, in which a model naturally tends to side with one class over the other, predicting each class at roughly the rate at which it is represented in the dataset. For instance, in a cancer dataset with 2% positive and 98% negative cases, a model may be inclined to answer ‘negative’ much more often than ‘positive’. It also may tend to focus on learning more about the majority class than the minority class. Although there are other ways to address this, like choosing a specialized metric to emphasize predictions on a certain class value, the easiest method is to simply downsample, or reduce the number of instances of, the majority class by randomly selecting rows.

Rephrased, to downsample is to train the model on a disproportionately low percentage of an overrepresented class in order to improve the model’s training on and understanding of underrepresented classes. Upweighting, in turn, applies a weight to each of the randomly selected downsampled data points, proportional to the factor by which the class was downsampled. This is done to retain the importance of the downsampled class, and has many benefits in training:

  • While training, because the minority class is seen more often, the model converges faster.
  • Upweighting allows our model to remain well-calibrated, and to ensure the interpretability of outputs as honest probabilities.
  • By reducing the majority class to fewer examples with proportionately larger weights, less memory is spent storing them, leaving more room for the minority class. This allows the model to use more, and more informative, examples from that class.
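Downsampling and upweighting together can be sketched as follows. The function name, the `factor` parameter, and the toy 98%/2% dataset are assumptions for illustration; each kept row carries a weight that a training routine could then apply to its loss.

```python
import random

def downsample_and_upweight(rows, labels, majority_label, factor, seed=0):
    """Keep about 1/factor of the majority rows and give each kept one weight `factor`.

    Minority rows are always kept, with weight 1.0.
    Returns a list of (row, label, weight) tuples.
    """
    rng = random.Random(seed)
    kept = []
    for row, label in zip(rows, labels):
        if label != majority_label:
            kept.append((row, label, 1.0))
        elif rng.random() < 1.0 / factor:
            kept.append((row, label, float(factor)))
    return kept

# Toy dataset: 980 negative rows (label 0) and 20 positive rows (label 1).
rows = list(range(1000))
labels = [0] * 980 + [1] * 20
sample = downsample_and_upweight(rows, labels, majority_label=0, factor=10)
```

Because each surviving majority row stands in for roughly `factor` original rows, the total weight of the majority class stays close to its original size, which is what keeps predicted probabilities calibrated.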

V is for Vanishing Gradient

The Vanishing Gradient Problem occurs in neural networks that are too deep, or ‘long’, in architecture. Because they are so long, when the backpropagation signal that updates the parameters of the network spreads backwards throughout the network, the signal gradually diminishes such that the parameters at the front are left untouched and therefore take up extra computational space without the benefit, resulting in inefficiency and limitation.

One reason for the vanishing gradient problem is the use of the sigmoid function as an activation function; because its derivative resembles a bell curve of sorts, sloping dramatically to zero for any ‘extreme’ values, any distribution of inputs not centered at zero will inevitably have part or all of its gradient signal squashed away. Because the distributions of inputs shift around during earlier parts of training, when parameters undergo severe changes, useful information may never be propagated to the front of the neural network, causing a vicious cycle of constant absence of improvement.
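A back-of-the-envelope sketch shows how quickly the signal shrinks. It assumes a deliberately simplified chain of sigmoid layers in which every layer sees the same input `x` and the weights are ignored, so the backpropagated signal is just the sigmoid derivative multiplied by itself once per layer.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s); it peaks at 0.25 when x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def gradient_scale(depth, x=0.0):
    """Factor by which a gradient shrinks after passing through `depth` sigmoids."""
    return sigmoid_grad(x) ** depth

best_case = gradient_scale(10)         # even at the derivative's peak: 0.25 ** 10
off_center = gradient_scale(10, x=4.0) # inputs away from zero shrink it far more
```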

W is for Wasserstein Loss

Wasserstein Loss is one of the most common loss functions used for Generative Adversarial Networks, and is based on the Earth-Mover’s Distance between the distribution of generated data (from the generator) and the real data (pulled from the dataset). Ideally, the generated data’s distribution will closely resemble the ‘real’ one pulled straight from the dataset. Wasserstein Loss is the default loss function for TF-GAN Estimators in TensorFlow.

Earth-Mover’s Distance (EMD) can be informally understood by interpreting each distribution as a way of piling a certain amount of dirt over a region; the EMD is then the minimum cost of turning one pile into the other, where the cost is the amount of dirt moved multiplied by the distance over which it is moved. This type of distance is commonly used in other machine learning concepts to compare the similarity of distributions.
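In one dimension the dirt-moving picture has a simple closed form. The sketch below assumes two histograms over the same equally spaced bins (bin width 1) with equal total mass; under those assumptions the minimum cost is the sum of the absolute running differences between the piles. This is only the 1-D special case, not the loss used to train a GAN.

```python
from itertools import accumulate

def emd_1d(p, q):
    """Earth-Mover's Distance between two histograms on the same unit-width bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must hold the same total amount of dirt")
    # The running surplus after each bin is the dirt that must cross to the next bin.
    return sum(abs(carry) for carry in accumulate(a - b for a, b in zip(p, q)))

far = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])   # one unit moved two bins
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])  # half a unit moved two bins
```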

Y is for “Y is My Model Not Working?!”

Admittedly, one would be hard-pressed to find a machine learning term that begins with “Y”, but “Y is my model not working?!” is perhaps one of the most common things you’ll hear data scientists say (or think to themselves).

Z is for Z-Score

A z-score is simply the number of standard deviations a data point is away from the mean. Checking the z-score is common in statistical outlier detection algorithms, and it’s always a good idea to visualize the z-scores of data points in your data to understand important attributes of the distribution, like how spread out it is, in preparation for other data preprocessing methods, like scaling or normalization.
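A minimal sketch of the computation, assuming the population standard deviation is the one wanted (the sample version would use `statistics.stdev` instead). The outlier threshold of 1.5 is an arbitrary choice for this toy data.

```python
import statistics

def z_scores(data):
    """Number of standard deviations each point lies from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)  # population standard deviation
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / sd for x in data]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
scores = z_scores(data)
outliers = [x for x, z in zip(data, scores) if abs(z) > 1.5]
```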
