Machine Learning

Artificial Intelligence (AI) and Machine Learning (ML) are two of the most talked-about technology terms today. While some people use them interchangeably, ML is really a subset of AI, and the definitions below capture the distinction well:

  • AI is the wider concept of machines being able to execute tasks in a way that we would consider “smart”.
  • ML is a current application of AI based on the idea that we should be able to give machines access to data and let them learn for themselves.

Machine Learning is the field in focus today: technology companies have led the way over the last decade, and organizations in other domains are following suit, leveraging ML to offer differentiated products and services to their customers. All of this was made possible by advances in computing and information technology over the last decade:

  • Wide availability of GPUs, which has made parallel computing cheaper and faster.
  • Data storage becoming cheaper, enabling vast amounts of data to be kept at a fraction of the cost of just a few years ago.
  • Ubiquitous mobile phones with internet access creating a flood of data of all stripes – images, text, mapping data, etc.

In fact, these three advances have transformed IT strategy across industries well beyond ML. Industry leaders are aggressively pursuing cloud-first strategies, re-engineering their applications into microservices-based architectures, and generating insights for businesses and customers through data science that leverages machine learning algorithms.

It is important to get a good understanding of these technologies in order to transform existing platforms. When I was looking for a way to gain a hands-on understanding of Machine Learning, one of my colleagues suggested the online course offered by Stanford University and taught by the renowned ML researcher Andrew Ng. The course has programming assignments that require you to implement the algorithms and solve real-world problems in either Octave or Matlab. You can find my completed assignments on my GitHub page. This blog summarizes my learning from the course – one of the most interesting learning experiences I have had. Kudos to the course!

Let's start with a machine learning definition quoted in the course – Tom Mitchell's (1998) well-posed learning problem: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

Supervised Learning: we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

  • Regression: predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.
  • Classification: predict results in a discrete output. In other words, map input variables into discrete categories.

Unsupervised Learning: allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

Some of the key components of machine learning algorithms along with how they are denoted in this course:

  • Input or Feature: denoted by x or X (lowercase x is a single example's feature vector; uppercase X is the matrix containing all training examples)
  • Output or Label: denoted by y or Y
  • Training example: a pair of input and corresponding output (ith pair) – (x(i), y(i))
  • Training set: a list of m training examples
  • Number of features: denoted by n
  • Hypothesis: the function h is called the hypothesis. Given a training set, the goal is to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y
  • Parameter estimates or Theta: denoted by Θ. Every feature (xj) will have a parameter estimate (Θj) and adding a bias parameter (Θ0) helps shift the output on either side of the axis.
  • Cost function: denoted by J(Θ0, Θ1); the “squared error function” or “mean squared error”, used to measure the accuracy of the hypothesis function (see the sketch after this list):
    J(θ0, θ1) = (1/(2m)) ∑ (i=1 to m) (hθ(x(i)) − y(i))²
  • Gradient descent: given the hypothesis function and a way of measuring how well it fits the data, gradient descent iteratively arrives at the optimum parameters by simultaneously updating each parameter:
    θj := θj − α (∂/∂θj) J(θ0, θ1)
  • Learning rate / step: denoted by α (size of each gradient descent step)
  • Feature scaling: dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
  • Mean normalization: subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
  • Polynomial regression: creating a better fit for the curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form)
  • Underfitting or high bias: when the form of our hypothesis function maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.
  • Overfitting or high variance: caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
  • Regularization parameter or Lambda: denoted by λ. It determines how much the costs of our theta parameters are inflated and can smooth the output of our hypothesis function to reduce overfitting.
  • Learning curves: plots of training error and test (cross-validation) error with training set size on the x-axis and error on the y-axis. In a high-bias (underfit) scenario, the two curves converge quickly to a high error, so adding training examples will not help; in a high-variance (overfit) scenario, there is a gap between the curves that narrows only gradually, so more training examples are likely to help.
  • Metrics for skewed classes (see the sketch after this list):
    • Precision (P): “true positives” / “number of predicted positive”
    • Recall (R): “true positives” / “number of actual positive”
    • F1 score: (2*P*R) / (P+R)
  • Decision boundary: Specific to classification algorithms – is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
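
The course implements these building blocks in Octave/Matlab; as a quick illustration, here is a minimal NumPy sketch of my own (not from the course materials, and the toy data is made up) that puts the cost function, gradient descent and mean normalization together for linear regression:

    import numpy as np

    def mean_normalize(X):
        # Mean normalization plus feature scaling: subtract each feature's
        # average and divide by its range, as described above.
        return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    def cost(theta, X, y):
        # Squared error cost: J(theta) = (1/(2m)) * sum((h(x) - y)^2).
        residuals = X @ theta - y
        return residuals @ residuals / (2 * len(y))

    def gradient_descent(X, y, alpha=0.1, iterations=1000):
        # Batch gradient descent: theta_j := theta_j - alpha * dJ/dtheta_j,
        # updating all parameters simultaneously each step.
        theta = np.zeros(X.shape[1])
        for _ in range(iterations):
            theta -= alpha * X.T @ (X @ theta - y) / len(y)
        return theta

    # Toy usage on made-up data (y = 2x + 1):
    X_raw = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([3.0, 5.0, 7.0, 9.0])
    X = np.hstack([np.ones((len(y), 1)), mean_normalize(X_raw)])  # bias column x0 = 1
    theta = gradient_descent(X, y)
    print(cost(theta, X, y))  # approaches 0 as the fit converges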
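
The skewed-class metrics are simple ratios, so they fit in a few lines; the counts below are made up for illustration:

    def f1_score(tp, fp, fn):
        # Precision, recall and F1 from raw counts, per the formulas above.
        precision = tp / (tp + fp)   # true positives / predicted positives
        recall = tp / (tp + fn)      # true positives / actual positives
        return 2 * precision * recall / (precision + recall)

    # E.g. 85 true positives, 10 false positives, 25 false negatives:
    print(f1_score(tp=85, fp=10, fn=25))  # about 0.83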

Machine Learning algorithms:

  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVM): An alternative to logistic regression for classification problems:
    • Use a kernel, such as the Gaussian kernel, to learn complex non-linear hypotheses.
    • With a Gaussian kernel, appropriate when n is small and m is intermediate.
    • As the number of training examples grows large, computing the Gaussian kernel slows down; with very large m, logistic regression or an SVM with a linear kernel is preferred.
  • K-Means Algorithm: unsupervised learning algorithm for identifying clusters in a dataset (see the sketch after this list).
  • Dimensionality reduction / Principal Component Analysis (PCA): compression reduces the memory needed to store data and also speeds up the learning algorithm (see the sketch after this list).
  • Anomaly Detection: Used when a dataset comprises a small number of positive examples. Typical use cases:
    • Fraud detection
    • Quality testing in manufacturing
    • Monitoring computers in a data center
  • Recommender Systems
  • Stochastic & mini-batch Gradient Descent: for ML with large datasets
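
To make the two alternating steps of K-Means concrete (cluster assignment, then moving the centroids), here is a short NumPy sketch of my own, again not course code:

    import numpy as np

    def kmeans(X, k, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        # Random initialization: pick k distinct training examples as centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iterations):
            # Cluster assignment step: index of the closest centroid per point.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Move-centroid step: each centroid moves to the mean of its points
            # (a centroid with no assigned points is left where it is).
            centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        return labels, centroids

    # Toy usage: two obvious clusters in 2-D.
    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    labels, centroids = kmeans(X, k=2)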
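
Similarly, the PCA recipe from the course (mean-normalize, take the SVD of the covariance matrix, project onto the top k components) fits in a few lines; a sketch under the same caveats:

    import numpy as np

    def pca(X, k):
        # Mean normalization first, as the course recommends.
        X = X - X.mean(axis=0)
        sigma = X.T @ X / len(X)          # covariance matrix
        U, S, _ = np.linalg.svd(sigma)    # principal directions in U's columns
        Z = X @ U[:, :k]                  # compressed (projected) data
        X_approx = Z @ U[:, :k].T         # approximate reconstruction
        return Z, X_approx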

Neural networks: this forms the basis of deep learning, and the course provides an introduction to this complex area. The neural network model is loosely based on how the brain works with its billions of neurons – neurons are computational units that take electrical inputs (called “spikes”) through dendrites and channel outputs through axons.

In our model, the dendrites are like the input features x1, …, xn, and the output is the result of our hypothesis function. In this model, the x0 input node is called the “bias unit”; it is always equal to 1. In neural networks, we use the same logistic function as in classification and call it the sigmoid (logistic) activation function. In this context, the “theta” parameters are sometimes called “weights”. The input nodes (layer 1), also known as the “input layer”, feed into the next layer of nodes (layer 2), and so on until the final layer, the “output layer”, which produces the value of the hypothesis function. The intermediate layers of nodes between the input and output layers are called “hidden layers”.
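
As an illustration, forward propagation through a network with one hidden layer can be written in a few lines of NumPy; this is my own sketch with made-up layer sizes, not the course's Octave code:

    import numpy as np

    def sigmoid(z):
        # Sigmoid (logistic) activation function g(z) = 1 / (1 + e^(-z)).
        return 1.0 / (1.0 + np.exp(-z))

    def forward_propagate(x, Theta1, Theta2):
        # Input layer -> hidden layer -> output layer, adding a bias unit
        # (always 1) before each weighted step.
        a1 = np.concatenate(([1.0], x))     # input layer with bias unit x0 = 1
        a2 = sigmoid(Theta1 @ a1)           # hidden layer activations
        a2 = np.concatenate(([1.0], a2))    # bias unit for the hidden layer
        return sigmoid(Theta2 @ a2)         # output layer = h_theta(x)

    # Made-up sizes: 3 input features, 4 hidden units, 2 output classes.
    rng = np.random.default_rng(0)
    Theta1 = rng.uniform(-0.12, 0.12, size=(4, 4))   # 4 x (3 inputs + bias)
    Theta2 = rng.uniform(-0.12, 0.12, size=(2, 5))   # 2 x (4 hidden + bias)
    print(forward_propagate(np.array([0.5, -1.0, 2.0]), Theta1, Theta2))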

“Backpropagation” is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.

Steps to setup and train a neural network:

  • First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.
    • Number of input units = dimension of features x(i)
    • Number of output units = number of classes
    • Number of hidden units per layer: usually, the more the better (balanced against the cost of computation, which grows with the number of hidden units)
    • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer
  • Randomly initialize the weights
  • Implement forward propagation to get hΘ(x(i)) for any x(i)
  • Implement the cost function
  • Implement backpropagation to compute partial derivatives
  • Use gradient checking to confirm that your backpropagation works (see the sketch after these steps). Then disable gradient checking.
  • Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
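
For the gradient-checking step, the idea is to compare the backpropagation gradients against a two-sided numerical approximation of each partial derivative; a minimal sketch of that check (my own, not course code):

    import numpy as np

    def numerical_gradient(cost_fn, theta, epsilon=1e-4):
        # Two-sided approximation of each partial derivative of the cost,
        # used only to sanity-check the gradients from backpropagation.
        grad = np.zeros_like(theta)
        for j in range(len(theta)):
            plus, minus = theta.copy(), theta.copy()
            plus[j] += epsilon
            minus[j] -= epsilon
            grad[j] = (cost_fn(plus) - cost_fn(minus)) / (2 * epsilon)
        return grad

    # Compare against the backprop gradient; the difference should be tiny
    # (e.g. on the order of 1e-9) before disabling the check and training.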

Overall, an excellent course to get started with Machine Learning and get insights into the most commonly used ML algorithms.