Every company generates and consumes data, lots of data. If you work with data, you can not avoid machine learning. Google cloud chief scientist and Stanford professor Feifei Li once said: “In a few years, if you are a software engineer and you don’t understand machine learning, it’s like you are saying you are a SDE, but you don’t know how to code.” Also, nowadays, we can’t talk about machine learning without touching on deep learning.

I’m not a software engineer nor a data scientist, but I do work with them all the time. I understand that being technical enough is critical for a smooth collaboration with engineers, and I believe this is true for all the product managers out there. Thus, I decided to write this post to introduce deep learning from a PM’s perspective. The focus of this article is more on the concepts rather than the detailed math or coding. I’ll take a classic machine learning problem: image classification as an example. We’ll start with conventional machine learning techniques, and gradually move onto neural networks, and finally we’ll review a special form of neural networks designed for image related problems: convolutional neural networks, or CNN in short.

Pre-requisite: basic understanding of machine learning: e.g. I expect that you understand the difference between supervised learning and unsupervised learning, and you can explain what test set, validation set, and training set are for.

## Machine learning approach for image recognition

The goal of image classification is to assign an input image one label from a fix set of categories. This sounds simple: When human looks at a picture of cat, we can easily tell that is a cat, not a horse. Thanks to the super power of our brain system, this is just perceptually easy for us. But this is not the case for computers. Images are just arrays of pixels (numbers) to computers, and translating these numbers into a name of a real object is not trivial. Actually, image classification remains one of the major challenges in the field of computer vision, which has not been completely and nicely solved yet.

So how do we approach an image classification problem with machine learning? Let’s start with a conventional method: Nearest neighbor (NN) classifier. As usual, we split our collection of labeled images into training set, validation set, and test set. The nearest neighbor classifier will take a test image, compare it to every image in the training set, and assign the label of the ‘closest’ training image to the test image. How do we define ‘closest’? We calculate the ‘distance’ between two images. Recall that images are arrays of numbers, and we can treat each image as a vector I. Then we compute the so-called L1 or L2 distance between image I1 and image I2, pixel by pixel.

Study shows that NN classifier with L1 distance can only achieve 38% accuracy on CIFAR-10 data sets, which is way lower than human performance (estimated at 94%). Can we do better? Yes, we can. Instead of using the single closest image, we should expect better result if we use closest 3 images, or 5 images, with a majority vote, since this method is less sensitive to variance. We call this k-nearest neighbor classifier (KNN). K is a hyper-parameter, and it’s tuned using validation set.

Fig. Illustration of k-NN: for k=3, the star will be classified as purple;

for k =6, the star will be classified as yellow

KNN is very easy to implement and interpret, and the training phase is very fast. Actually, there is no training at all, all that we do is to store all the training images. However, the testing phase is very slow, since we need to compare the test image to every single image in the training set. This is the biggest drawback of KNN, and the reason why it’s rarely used for image classification in real life. The other disadvantage of KNN is that we need to store all the training set images so that we can use them during testing phase, and this requires lots of memory. In practice, we prefer some algorithm which can be computationally heavy at training phase, but very light at testing phase. Also, we hope once the training is done, we can discard all the training data, only carry over the learned parameters to testing phase. For that, let’s turn to a more sophisticated and powerful model.

## More advanced model: Softmax classifier

To build this model, we need to define two components: a score function which maps the input data to class scores, and a loss function (or objective function) that measures how close the class scores are to the ground truth. The goal is to minimize the loss with respect to the parameters of the score function.

**Score (mapping) function**: We’ll start with the so-called linear function, the simplest possible mapping function.

Xi is the input, the image vector in our case. W is often referred to as weights, and b is called bias. Note that all the variables and parameters in the function are vectors, and all the operations are vectorized.

Fig. Illustration of mapping an image to class scores using linear mapping.

To calculate Train score: 0.2×32 + 0.5×58 + (-1.1)x132 + 1.8×86 + 3.4 = 48.4

Example above: We assume the image has only 4 pixels, and they are stretched into a single column (xi). There are 3 classes: train, car, and ship. We perform dot multiplication between weights W and input pixels X, and then add the bias b to get the class scores. The result shows that ship class gets the highest score while the correct class shall be car, which means this particular set of parameters (W and b) is no good. We should optimize them by minimizing a loss function.

**Loss function**: Recall that a loss function is to measure the difference between the prediction of our classifier and the ground truth. In the example above, the ground truth is ‘car’, but our prediction indicates ‘ship’. We need a loss function to quantify this ‘difference’. Intuitively, the loss will be high if our classifier is doing a poor job, and the loss will be low if it’s doing well.

We’ll use a popular loss function for classification tasks: cross-entropy loss

where fj means the j-th element of the class score vector f, while yi indicates the correct class. The function inside of the log operation is called softmax function, and a classifier consists of a linear mapping and a cross-entropy loss is called a softmax classifier. the class scores of a softmax classifier is very easy to interpret, since they are presented in probabilities, and sum up to 1. Let’s see how we translate the class scores into cross entropy loss.

Assume that class 3 is the correct class, and the scores are computed by linear mapping. the cross-entropy loss for this set of parameters is 0.45.

Besides cross-entropy loss, other often used losses are hinge loss, L1 loss, and L2 loss, just to name a few.

**Regularization**: One of the major issues of training a machine learning model is overfitting. This happens when the model is trying too hard to capture the noise in the train data so that it has poor generalization power. To solve this problem, we need regularization. We add a penalization term to the loss function, which discourages large weighs or more complex model. The most common regularizations are L1 norm (Lasso) and L2 norm (Ridge):

λ is a hyper-parameter called regularization parameter, which defines the strength of regularization applied. Note that regularization has nothing to do with input, it’s only related to the weights, and usually we don’t regularize bias.

## Model Optimization

The goal of training a model is to find the set of parameters (W and b) that minimizes the loss function. A naive idea, and a very bad one is: random search. We just randomly try out various sets of W and b and keep track of the one which works the best. With this method, we expect the accuracy of the model will not be much higher than 10% on CIFAR-10 dataset, since there are 10 categories, so 10% is what we get by random guessing.

What is actually used in practice is called Gradient Descent, which is the most used learning algorithm in machine learning. We treat the optimization problem as finding the global minimum (hence the lowest loss) of the loss function by following the gradient. We know that the slope of a function is the derivative of the function with respect to a parameter. This slope always points to the local minimum. A caveat is that often times, our optimization process will get stuck at one local minimum point so that the final loss is sub-optimal.

Fig. Global and local maximum/minimum of an example 1D function

Now that we have found the direction we need to nudge the weight, we need to find how much to nudge the weight. Here, we use the Learning Rate. Learning rate is a hyper-parameter, just like the K value for KNN algorithm, and again, we tune this parameter with validation set.

Fig. Gradient descent with effects from different learning rate

Until this point, we can have a formal definition of Gradient Descent (GD): Gradient descent is the procedure of repeatedly evaluating the gradient of the loss function with respect to its parameters, and then performing a parameter update at certain step size (learning rate), so that we reach the global minimum of the loss function when the process is done, which corresponds to the lowest loss.

There are 3 major types of GDs:

- Batch GD: Performing one parameter update after looping over all the training examples.
- Mini-batch GD: You probably realize that it seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update, especially when the training set is huge. We can actually process n training examples at once for one parameter update.
- Stochastic GD: This is the extreme case of Mini-batch GD, that is, n =1. We only use one training example for one parameter update.

The technique we use to compute the gradient is called backpropagation, which is a way of computing gradients of expressions through recursive application of chain rule. I’ll skip the details on this topic, but I highly recommend this post: A Step by Step Backpropagation Example. After reading, you should have an intuitive understanding how backprop works.

## Summary

To tackle an image classification problem, we started with conventional machine learning technique such as k-nearest neighbor (knn) algorithm. We examined the advantages and drawbacks of knn, and decided that a more advanced model was needed. For that, we turned to softmax linear classifier. A softmax classifier consists of two major parts: a linear mapping function which maps the input to class scores; a loss function which translates the scores into loss. Training the classifier means optimizing the loss, and mini-batch gradient descent was suggested for that task. Linear classifier works OK but not great. For better result, neural network is what we need.

Oh, wait, the title of this post is ‘deep learning’, but you may have realized that we haven’t touched anything on deep learning yet! So what are we doing here? 🙂 Alright, that’s a fair question to ask. It appears that we’ve spent tons of time talking about things unrelated to the topic we want to discuss, but that is not true. The foundation of knowledge we’ve built here assures us of a smooth transition to neural networks and deep learning in the next step. All the concepts and components covered in this post still apply to deep learning, but they just get more complex. In the next post, we will start to define Neural Networks.

Until next time, happy learning 🙂

## Credit

The content of this article is inspired by Stanford course CS231n: Convolutional Neural Networks for Visual Recognition. It’s an amazing course, and the lecturers are purely fantastic. I highly recommend this course to all the readers. All course videos are available on YouTube.