Supervised vs Unsupervised

There are two types of learning techniques in machine learning :

Supervised learning
Unsupervised learning

Let's first recall some basic concepts of machine learning discussed in the previous chapter, the universe of data. It's important to clearly understand the difference between a statistical technique and the algorithm that uses this technique to create a model.

Reminder : a technique is an existing approach in statistics, such as linear regression, logistic regression, or automatic segmentation ( clustering ). An algorithm is the entire procedure applied to reach a final model that will perform automatic tasks on data. The algorithm therefore corresponds to the process through which a model is trained.

Each type of learning includes a set of techniques derived from statistics, and depending on the type of data provided during the creation of the algorithm and the objective of the data project, certain techniques will be selected. It is important to note that in many cases, several techniques will be used to create models that will be evaluated and selected. To create a model, the algorithm will often apply iterations with different cases or architectures to ultimately select the most efficient model.

All these techniques, implemented through the creation of algorithms that will yield a model, can be classified into two types of learning.

Supervised Learning

Supervised learning involves creating a prediction model. Based on the X variables, called explanatory variables or, more commonly, predictors ( one or more columns of a table ), the model predicts the values of the Y variable, called the target variable or predicted variable ( one column of a table ).

The particularity of supervised learning is that we have data ( records ) that are said to be « labeled », meaning that for the values of columns X ( X₁, X₂, X₃,... ), we know the corresponding values of column Y. The model can then use this information to learn a rule.

This rule can be:

Parametric : meaning that the relationship between the values of columns X ( X₁, X₂, X₃,... ) and the values of column Y is defined by one or more parameters.
Example : the price of a house based on its square footage. In the same neighborhood and for houses with a set of similar characteristics, the square footage has a given impact on the final price, and this impact is the parameter. The algorithm will analyze the information at its disposal ( values of X and known values of Y ) and calculate the impact of square footage on the price, so that for any new values of X the model predicts a value of Y that is as close as possible to the real final value.
Non-parametric : there are no parameters to explain the impact of the values of columns X ( X₁, X₂, X₃,... ) on the values of column Y ; however, the algorithm is capable, based on the data it has, of predicting the values of Y from the values of X.
Example : a bank wants to develop a system to detect fraudulent bank transactions. It has a history of transactions with characteristics such as amount, location, time, etc., as well as an indication of whether each transaction was fraudulent or not. For a new transaction, it wants to predict whether it is fraudulent or not. In this case, there is no parameter that defines a causal link between the amount of the transfer and whether the transaction is fraudulent. However, by identifying similarities between the known values of X ( X₁, X₂, X₃,... ) for fraudulent transactions and the values of X for a new transaction, the model will estimate the risk that this transaction is fraudulent.
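
The two kinds of rule can be contrasted with a minimal sketch. Everything in it is an illustrative assumption rather than part of the examples above : the data are invented, the helper names `fit_line` and `knn_predict` are hypothetical, least squares stands in for the parametric technique, and a vote among the k nearest records stands in for the non-parametric one.

```python
from collections import Counter
from math import dist

def fit_line(xs, ys):
    """Parametric rule: learn a slope and an intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def knn_predict(train_X, train_y, new_x, k=3):
    """Non-parametric rule: vote among the k most similar known records."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], new_x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical houses: square footage (X) and sale price (Y).
areas = [100, 150, 200, 250]
prices = [200_000, 300_000, 400_000, 500_000]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 180 + intercept  # the learned parameters are the model

# Hypothetical transactions: (amount, hour of day) and a label (1 = fraudulent).
transactions = [(20, 14), (35, 10), (50, 16), (900, 3), (950, 2), (1000, 4)]
is_fraud = [0, 0, 0, 1, 1, 1]
predicted_label = knn_predict(transactions, is_fraud, (980, 3), k=3)
```

In the first case the model is fully described by `slope` and `intercept`. In the second case nothing is summarized : the prediction is obtained by comparing the new record with the stored history.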

The techniques derived from statistics and used in supervised learning can therefore be classified into two categories : parametric techniques and non-parametric techniques. It is not always easy to identify in advance which technique is more appropriate. This is why several models will be evaluated, and the algorithm that produces these models will integrate multiple architectures.

Example : as a car dealer, we would like to estimate the resale price of a customer's vehicle in the context of a trade-in. We have information about the vehicle ( engine power, number of doors,... = X₁, X₂, X₃,... ) as well as the history of other vehicles and their final sale price. We cannot know in advance whether a parametric technique will yield a better model ( better predictions ) than a non-parametric technique. We will therefore prepare our data and apply both a parametric technique ( such as regression ) and a non-parametric technique ( such as K-Nearest Neighbors ). Even within the parametric technique ( regression ), we do not know whether linear regression will produce a better result than polynomial regression. Therefore, we will evaluate all cases.
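
As a rough sketch of this "evaluate all cases" step, the snippet below compares a linear regression with a K-Nearest Neighbors estimate on a single predictor. The trade-in history is invented, the helper names are hypothetical, and leave-one-out evaluation with a mean absolute error is only one possible way of scoring the candidates.

```python
def fit_linear(xs, ys):
    """Return a predictor based on a least-squares line (parametric)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=2):
    """Return a predictor that averages the k nearest known prices (non-parametric)."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def leave_one_out_mae(fit, xs, ys):
    """Mean absolute error when each record is predicted from all the others."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        errors.append(abs(model(xs[i]) - ys[i]))
    return sum(errors) / len(errors)

# Hypothetical trade-in history: engine power (kW) and final sale price (EUR).
power = [50, 70, 90, 110, 130, 150]
price = [8_000, 10_500, 12_900, 15_100, 18_000, 20_400]

candidates = {"linear regression": fit_linear, "k-nearest neighbors": fit_knn}
scores = {name: leave_one_out_mae(fit, power, price) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # technique with the lowest error
```

`scores` holds one error per technique and `best` is the name of the technique retained. Other candidates ( polynomial regression, other values of k ) could be added to `candidates` in the same way.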

Estimation

Estimation techniques aim to predict a continuous numerical target variable Y based on other variables X ( X₁, X₂, X₃,... ), also called predictors.

ClientID	X₁	X₂	X₃	X₄	Y
98	252	5	345	4	652,945 €
97	232	3	304	1	535,975 €
...	...	...	...	...	...

A real estate agency would like to predict the sale price of the new houses it is responsible for selling, in order to negotiate the best possible commission with its clients. For each house, it encodes numerous details ( Square footage [column X₁], Number of bedrooms [column X₂], Energy consumption [column X₃], and Number of bathrooms [column X₄] ). The agency has historical data with all the details of previous sales, including the final sale price [column Y].

In this example, the predictors ( explanatory variables ) would be columns X₁, X₂, X₃, and X₄, and the variable to predict ( target variable ) would be column Y.
If a new client puts their house up for sale with the following data ( ClientID 99, X₁ = 245, X₂ = 4, X₃ = 325, X₄ = 2 ), the model could determine that the sale price will be 595,000 €. It is important to note that the prediction will never be perfect, but the data scientist's role is to build a model that minimizes the errors between the predicted values and the real values.
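
To make "minimizing errors" concrete, here is a small sketch of two common error measures, the mean absolute error and the root mean squared error. The real prices come from the table above ; the predicted prices are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average size of the gap between real and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Like the mean absolute error, but large gaps are penalized more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Real sale prices (EUR) from the table, with hypothetical model predictions.
actual_prices = [652_945, 535_975]
predicted_prices = [640_000, 550_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
rmse = root_mean_squared_error(actual_prices, predicted_prices)
```

The lower these values are on data the model has not seen, the better the model estimates the sale price.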

Among the most commonly used supervised estimation techniques are :

Regressions (linear, polynomial) ;
Neural networks (MLP, CNN, RNN, LSTM, GRU) ;
Regression trees (+Random Forests, XGBoost) ;
K-Nearest Neighbors ;
...

Classification

In the context of classification, prediction techniques aim to predict a categorical target variable Y based on other variables X ( X₁, X₂, X₃,... ), also called predictors. Classification can be binary, meaning that column Y only takes the values (1) or (0), or it can be multiclass, meaning that Y takes more than two values ( such as « good », « average », « poor » ).

In binary classification, the model will return a value between (0) and (1). This value corresponds to a probability, namely the probability that the record belongs to class (1). The data scientist must define a threshold to turn this probability into a class ( generally, a value < 0.5 gives (0) and a value >= 0.5 gives (1) ).

PatientID	X₁	X₂	X₃	X₄	Contagious Y
9856	39.2	145	15/9	75%	1
9857	36.7	68	11/9	68%	0
...	...	...	...	...	...

A hospital wants to implement an algorithm to assess whether a patient arriving at the emergency room is contagious or not. Upon arrival, each patient is taken care of by a nurse, and their vital signs are recorded and encoded into a computer ( Body temperature [column X₁], Heart rate [column X₂], Blood pressure [column X₃], Muscle mass [column X₄] ). The hospital has historical data with all the details of these parameters and information on the patient's condition, i.e., whether they were contagious or not [column Y]. In this example, you will identify the variables ( the columns in the table ) that contain useful data for learning and use them as predictors to determine whether a patient is contagious (1) or non-contagious (0).

In this example, the predictors ( explanatory variables ) would be columns X₁, X₂, and X₃ ( column X₄ having no impact on whether the patient is contagious or not ), and the variable to predict ( target variable ) would be column Y.
If a new patient presents with the following data ( PatientID 9858, X₁ = 38.7, X₂ = 90, X₃ = 13/9, X₄ = 67% ), the model could determine that the probability of them being contagious is 0.69 ; since 0.69 is >= 0.5, the patient would be considered contagious. It is important to note that, by convention, the class we want to detect ( the positive class, here contagious ) is coded as (1). Because missing a contagious patient is usually more costly than isolating a non-contagious one, the threshold is chosen to favor classifying non-contagious patients as contagious rather than classifying contagious patients as non-contagious.
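
The effect of the threshold can be sketched as follows. The probabilities and patient statuses are invented, the function names are hypothetical, and the value 0.35 is only an example of a more cautious threshold.

```python
def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities of class (1) into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

def count_errors(actual, predicted):
    """Return (false positives, false negatives)."""
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return false_pos, false_neg

# Hypothetical model outputs and true status (1 = contagious) of six past patients.
probabilities = [0.69, 0.12, 0.45, 0.81, 0.38, 0.55]
contagious = [1, 0, 1, 1, 0, 0]

default_errors = count_errors(contagious, classify(probabilities, 0.5))
cautious_errors = count_errors(contagious, classify(probabilities, 0.35))
```

With these invented values, the 0.5 threshold gives one false positive and one false negative, while the 0.35 threshold gives two false positives and no false negative : fewer contagious patients are missed, at the price of more false alarms.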

Among the most commonly used supervised classification techniques are :

Logistic regression ;
Linear and quadratic discriminant analysis ;
Neural networks (MLP, CNN, RNN, LSTM, GRU) ;
Classification trees (+Random Forests, XGBoost) ;
K-Nearest Neighbors ;
...

💡 Important: We've seen that in data science, there are two types of predictions for supervised learning : estimation and classification. It's important to note that some statistical techniques can be used for both estimation and classification, while others are specific to either estimation or classification.

Unsupervised Learning

Unsupervised methods aim to understand and describe data to reveal underlying trends. Unsupervised techniques differ from supervised techniques in that (1) there is no target variable to predict, so we use all columns as X, and (2) there is no learning phase on a subset of the data, a phase we will discuss in the next chapter : all data is used.

Unsupervised models are also popular and used for many applications :

Segmentation ( clustering ) : identify records that are similar to other records and form a group distinct from other groups ( e.g., better segment the target customer base ) ;

Association rules : identify combinations of products frequently purchased together and implement actions based on this ;

Collaborative filtering ( considered unsupervised because it does not involve a learning phase on a subset of data, although it includes a step that uses a supervised technique ) : if I watched movie A and liked it, and another person with characteristics similar to mine also liked movie A and liked movie B, then I will probably like movie B as well ;

Principal component analysis : transform a data space in which some variables are correlated into a new space with fewer dimensions, in which the variables are decorrelated ;

Anomaly detection : discover abnormal records.
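
As an illustration of the first application, segmentation, here is a minimal sketch of the k-means technique, one common clustering approach that is used here as an assumption since the text does not name a specific one. The customer data, the starting centroids and the function names are all hypothetical. Note that no column plays the role of Y : every column is used as X.

```python
from math import dist

def closest(point, centroids):
    """Index of the centroid nearest to a point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))

def k_means(points, centroids, max_iterations=10):
    """Return (centroids, labels) after alternating assignment and update steps."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each record to its closest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its records.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty group keeps its previous centroid.
            updated.append(tuple(sum(v) / len(members) for v in zip(*members)) if members else old)
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    return centroids, [closest(p, centroids) for p in points]

# Hypothetical customers: (age, yearly spend in thousands of EUR).
customers = [(22, 1.0), (25, 1.2), (24, 0.9), (58, 6.0), (61, 5.5), (60, 6.3)]
centroids, labels = k_means(customers, centroids=[customers[0], customers[3]])
```

`labels` gives the group of each customer and `centroids` describes the "average" customer of each group. In practice the columns are usually put on comparable scales first, so that one variable does not dominate the distance.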
