Supervised learning refers to a family of machine learning algorithms that learn from “labeled” data. The programmers provide the “correct” answers, and the algorithm learns to map inputs to outputs based on these examples. The goal is to predict the right output for a given input.
The machine learns from a set of training data, and the extent and quality of that data determine how many comparable cases the algorithm has seen and how well it will later predict the correct output for inputs it hasn’t been trained on. This requires skilled data scientists who can provide good assumptions and effective models, so that the assumed “correct” variables remain applicable even to new data. Incorrect or poorly defined data labels can undermine the effectiveness of the whole model.
Supervised learning is mostly used for classification and regression. It lets us analyse data based on previous experience and optimise performance by learning from more and more data.
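As a minimal sketch of supervised classification, the snippet below uses a one-nearest-neighbour rule: predict the label of the closest labeled training example. The feature values and labels are invented purely for illustration.

```python
from math import dist

# Toy labeled training set: (feature vector, label).
# Features are invented for illustration -- say, weight in kg and ear length in cm.
training_data = [
    ((4.0, 7.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((25.0, 12.0), "dog"),
    ((30.0, 13.5), "dog"),
]

def classify(point):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(training_data, key=lambda pair: dist(pair[0], point))
    return label

print(classify((5.0, 7.2)))    # near the cat examples -> "cat"
print(classify((28.0, 13.0)))  # near the dog examples -> "dog"
```

Because the “correct” answers are supplied up front, the quality of those labels directly bounds the quality of the predictions, which is exactly the dependency described above.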
To take a very simple example, supervised learning is like a child learning with a teacher. The teacher shows the child images of animals and teaches them which is a cat and which is a dog. In the same way, an algorithm is given labels telling it what is what, and the segmentation is done according to these labels.
In contrast with supervised learning, unsupervised learning is a family of machine learning algorithms that learn from “unlabeled” data. It is a self-learning technique in which the system learns from the data set by identifying structure and making assumptions on its own, with no given labels or “correct” answers. There is no prior knowledge of the data or of the categories to be used, so the goal is to identify hidden patterns in a data set and form them into groups, or clusters.
The goal is to find existing trends or structures within a data set without providing labels or any other insights to the algorithm. The most common unsupervised learning task is clustering.
Unsupervised learning lets us feed in data that is very hard or impractical for humans to categorise and analyse, and have the algorithm identify trends and insights that can then be used for further testing. Unsupervised machine learning can find many different unknown patterns in data and identify features useful for categorisation. It is also much easier to obtain unlabeled data sets than labeled ones: labeling requires an experienced person, while unlabeled sets can be gathered from many different sources.
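The clustering idea can be sketched with a naive k-means loop: points are assigned to their nearest centroid, the centroids move to the mean of their assigned points, and the process repeats. The data points and starting centroids below are made up; note the algorithm is never told which group any point belongs to.

```python
from math import dist
from statistics import mean

def k_means(points, centroids, iterations=10):
    """Naive k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(mean(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Unlabeled toy data: two obvious groups, but no labels are ever provided.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.1), (8.2, 7.9), (7.9, 8.3)]
clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)  # the two groups emerge from the data alone
```

The grouping here comes entirely from distances between the points themselves, which is what “identifying hidden patterns without labels” means in practice.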
Taking again the example of a child learning to recognise animals, unsupervised learning could be compared to a different approach. Say a small child has a cat at home, and at some point the family visits friends who also have a cat. The child has never seen this animal before, it looks different from their own cat, and nobody has said that it is a cat. Yet based on familiar features such as fluffy fur, whiskers, a long tail and walking on four legs, the child can recognise that this, too, is a cat. That is unsupervised learning: in a similar way, an algorithm working on unlabeled data infers common features and forms the data into clusters.
There is also a third category that mixes supervised and unsupervised learning, known as semi-supervised learning. This is the situation where some data in the set are labeled, but most are unlabeled. Such data can be analysed using a mix of both methods: the algorithm can cluster the unlabeled data, compare the clusters against the labeled examples, and build a model for analysing new data.
A lot of real-life machine learning problems fall under semi-supervised learning because, as previously mentioned, labeled data is more complicated and expensive to acquire. By mixing both approaches, only some of the data needs to be labeled, while the rest remains unlabeled and is much easier and cheaper to obtain. The two learning methods are combined to get the best possible model performance under the existing conditions, and by processing more and more data the algorithm can be optimised further.
As an example, if you wanted to segment a very large data set of animal pictures, it would be very expensive to have humans label every single one. Instead, you could have humans classify only a small part of the pictures, labeling them as cats, dogs, birds and so on. Once that small part is labeled, the algorithm has a good base of labeled data from which to segment all the other photos. This saves both time and money, and an algorithm grounded in some labeled data can perform well and be trusted more readily than one that has to recognise the categories entirely on its own. This is the real benefit of semi-supervised learning.
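One simple way to sketch this workflow is self-training: start from a few hand-labeled examples, repeatedly pick the unlabeled point closest to any labeled one, copy that neighbour’s label, and add it to the labeled pool. The points and labels below are invented, and this is only one of several semi-supervised strategies.

```python
from math import dist

def self_train(labeled, unlabeled):
    """Self-training sketch: repeatedly label the unlabeled point closest to
    any labeled example, using that neighbour's label, until none remain."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while unlabeled:
        # Find the (unlabeled point, labeled example) pair at minimum distance.
        point, (_, label) = min(
            ((u, l) for u in unlabeled for l in labeled),
            key=lambda pair: dist(pair[0], pair[1][0]),
        )
        labeled.append((point, label))
        unlabeled.remove(point)
    return labeled

# Two hand-labeled examples and four unlabeled points (all values invented).
labeled = [((1.0, 1.0), "cat"), ((9.0, 9.0), "dog")]
unlabeled = [(1.5, 1.2), (2.0, 1.8), (8.5, 9.1), (8.0, 8.4)]
result = dict(self_train(labeled, unlabeled))
print(result)  # every point ends up labeled "cat" or "dog"
```

The small labeled base does the work a full labeling effort would otherwise require: newly labeled points join the pool and help label the rest, which is the time-and-money saving described above.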
Both supervised and unsupervised learning are very important methods, with different levels of accuracy and different ways of functioning, and with the option of combining them to create an optimal model for a given data set. The important thing to remember is that most algorithms are only as good as the data they are given, especially with regard to labeled data. So even though supervised learning may be more accurate, given bad data or wrong labels it can produce untrustworthy results and faulty assumptions. With any machine learning algorithm, we should always keep in mind how it actually functions and how we can use it in a way that ensures high-quality results.