Association Rules
Association rule mining is an unsupervised technique used to identify significant relationships (rules) between variables in a large dataset. In simplified terms, an association rule highlights relationships of the form « if A, then B » or « if A and B, then C ».
Association rules are often illustrated with the example of a supermarket shopping cart filled with the various products a consumer has purchased. Based on the data from all consumers, they help identify that a customer who buys product Z is likely to also buy product Y, or at least is more likely to buy product Y if we highlight it to them. The association rule therefore determines which products are frequently bought together.
In other cases, an association rule can help identify actions that are often performed together. Association rules are widely used today, often combined with other techniques, in applications such as fraud detection, recommendation systems, and the analysis of social media user interactions.
Data Preparation
An association rule is based on a set of transactions (or records), each containing one or more products/services (or actions) and some basic information about the operation. For efficiency, we gather the key information into a table.
For example, in a supermarket, we might have the transaction ID, customer ID, timestamp, and the list of purchased products.
TimeStamp | TransactionID | CustomerID | Items |
---|---|---|---|
04/01/2024 17:05 | 45784548 | 145741 | Orange Juice, Soda |
04/01/2024 17:06 | 45784549 | 004785 | Milk, Orange Juice, Window Cleaner |
04/01/2024 17:08 | 45784550 | 093689 | Orange Juice, Detergent |
04/01/2024 17:09 | 45784551 | 457846 | Orange Juice, Detergent, Soda |
04/01/2024 17:10 | 45784552 | 336478 | Window Cleaner, Soda |
... | ... | ... | ... |
The items in the table used to identify the rules must be binarized (an operation that can be done with one-hot encoding) to obtain the following table:
Transaction ID | Orange Juice | Soda | Milk | Window Cleaner | Detergent |
---|---|---|---|---|---|
45784548 | 1 | 1 | 0 | 0 | 0 |
45784549 | 1 | 0 | 1 | 1 | 0 |
45784550 | 1 | 0 | 0 | 0 | 1 |
45784551 | 1 | 1 | 0 | 0 | 1 |
45784552 | 0 | 1 | 0 | 1 | 0 |
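As an illustration of this preparation step, here is a minimal sketch (my own, not part of the original material) that binarizes the toy transactions above using mlxtend's TransactionEncoder; the same result can be obtained with plain pandas.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Toy transactions mirroring the table above: one list of items per transaction
transactions = [
    ['Orange Juice', 'Soda'],
    ['Milk', 'Orange Juice', 'Window Cleaner'],
    ['Orange Juice', 'Detergent'],
    ['Orange Juice', 'Detergent', 'Soda'],
    ['Window Cleaner', 'Soda'],
]
# One-hot encoding: one boolean column per item, one row per transaction
te = TransactionEncoder()
basket = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
print(basket.astype(int))  # reproduces the 0/1 table above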
Rules
Association rule mining is an unsupervised technique: the model highlights rules, and it is up to us, as data scientists, to decide whether a rule is relevant. There are three types of rules:
- Actionable rules: supermarket customers who buy Barbie dolls also buy strawberry yogurt;
- Trivial rules: customers who subscribe to a maintenance contract tend to buy large appliances;
- Inexplicable rules: in a hardware store, one of the products most frequently sold together with a screwdriver is toilet bowl cleaner.
The idea behind association rules is to examine all possible rules between products (users, transactions, etc.) according to the IF-THEN relationship and to keep only those that represent a true dependency. IF introduces the antecedent, and THEN describes the consequence, giving us « if antecedent, then consequence ».
In association analysis, the antecedent and the consequence are each a combination of items. For example, if we have four products (P1, P2, P3, P4), we get the following possible combinations: « If P1 then P2; If P1 then P3; If P1 then P4; If P1 and P2 then P3; If P1 and P3 then P2, … If P4 then …, etc. ».
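To give a sense of how quickly these IF-THEN candidates multiply, here is a small sketch of my own that enumerates every antecedent/consequent split of four products with itertools; even with only 4 items there are already 50 candidate rules.
from itertools import combinations
items = ['P1', 'P2', 'P3', 'P4']
rules = []
# Pair every non-empty antecedent with every non-empty, disjoint consequent
for r in range(1, len(items)):
    for antecedent in combinations(items, r):
        remaining = [i for i in items if i not in antecedent]
        for s in range(1, len(remaining) + 1):
            for consequent in combinations(remaining, s):
                rules.append((antecedent, consequent))
print(len(rules))   # 50 candidate rules for 4 items
print(rules[:3])    # e.g. (('P1',), ('P2',)), (('P1',), ('P3',)), ...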
Steps
Steps for defining an association rule:
- Step 1: define a minimum support threshold (support rule);
- Step 2: apply the Apriori algorithm:
  - count the number of occurrences of each item across all transactions (= support rule for an item);
  - define the associations;
- Step 3: calculate the support rule for each association (and retain only those that meet the predefined threshold);
- Step 4: measure the strength of the association (confidence rule and lift ratio).
Support Rule
In any transactional dataset, there are many combinations. For a hundred products, the number of possible combinations can quickly reach millions.
Thus, the association rule must separate « strong » combinations from « weak » ones.
A practical solution is to consider only combinations that occur often enough in the data, i.e. frequent itemsets. This is the idea behind the support rule.
Step 1, defining the support rule, consists of counting the number of transactions that contain all the items of both the antecedent and the consequence. It measures the extent to which the data « supports » the validity of the rule.
Transaction ID | Orange Juice | Soda | Milk | Window Cleaner | Detergent |
---|---|---|---|---|---|
45784548 | 1 | 1 | 0 | 0 | 0 |
45784549 | 1 | 0 | 1 | 1 | 0 |
45784550 | 1 | 0 | 0 | 0 | 1 |
45784551 | 1 | 1 | 0 | 0 | 1 |
45784552 | 0 | 1 | 0 | 1 | 0 |
In our example above, the support for the combination [Orange Juice, Soda] - if we buy orange juice, we also buy soda - is 2/5 = 0.4, i.e. 40% (out of 5 transactions, there are 2 in which the customer bought soda (consequence) together with orange juice (antecedent)).
Support Threshold
Setting a support threshold allows us to retain only the combinations above this threshold. We can define the support rule as: « the estimated probability that a randomly selected transaction in the database contains all the items listed in both the antecedent and the consequence ».
It measures the degree to which the data « supports » the validity of the rule:
Support = P(antecedent AND consequence) = (number of transactions containing both the antecedent and the consequence) / (total number of transactions).
Example: for 1000 transactions, 500 contain a bicycle pump. Among the 500 transactions with a bicycle pump, 400 contain an inner tube. What is the support of the rule?
Rule: if we buy a bicycle pump, we buy an inner tube. Support for the rule: 400 / 1000 = 0.4, i.e. 40%.
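To make the support calculation concrete, here is a minimal sketch of my own that computes the support of the combination [Orange Juice, Soda] directly from the binarized table; it assumes the basket DataFrame built in the data-preparation sketch above.
# Assumes `basket` is the one-hot DataFrame built in the data-preparation sketch
itemset = ['Orange Juice', 'Soda']
# Support = fraction of transactions containing every item of the combination
support = basket[itemset].all(axis=1).mean()
print(support)  # 0.4, i.e. 2 of the 5 toy transactions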
Apriori Algorithm
There are several algorithms to generate the most frequent associations, but the most common is the « Apriori » algorithm.
The key idea of the algorithm is to start by identifying the frequent individual items. From these frequent one-item sets, it generates candidate combinations of two items and keeps only the frequent ones, then combines those to build candidate three-item sets, and so on, until no larger frequent combination can be generated.
« If a set of items Z is not frequent, then adding another item A to set Z will not make it more frequent ». In practice, if the rule « A then B » is weak, there is no point in testing the combination « A and B then C ».
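As a quick sanity check, the sketch below (my own, again reusing the basket DataFrame from the data-preparation sketch) runs mlxtend's apriori on the toy example with an assumed minimum support of 0.4.
from mlxtend.frequent_patterns import apriori
# Assumes `basket` is the one-hot DataFrame built in the data-preparation sketch
frequent = apriori(basket, min_support=0.4, use_colnames=True)
print(frequent.sort_values('support', ascending=False))
# Items below the 0.4 threshold (e.g. Milk) are pruned, along with every
# larger combination that contains them.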
Measuring the Strength of Association
From the abundance of rules generated by the Apriori algorithm, we can identify the rules that represent a strong dependence between the antecedent and the consequence. To do this, we measure the strength of the association using two indicators: the confidence measure and the lift ratio.
Confidence Measure
The confidence measure expresses how reliable the rule « if antecedent, then consequence » is. It compares the co-occurrence of the combination of items « if antecedent, then consequence » in the database to the occurrence of the antecedent alone in that same database.
The confidence rule is the estimated conditional probability that a randomly selected transaction among transactions containing all items from the antecedent will also contain all items from both the antecedent AND the consequence.
The confidence rule is calculated as follows:
Confidence = Support(antecedent AND consequence) / Support(antecedent) = (number of transactions containing both the antecedent and the consequence) / (number of transactions containing the antecedent).
In our example of the combination [Orange Juice, Soda], the support is 2/5 = 0.4, i.e. 40% (out of 5 transactions, there are 2 in which the customer bought soda (consequence) together with orange juice (antecedent)).
Transaction ID | Orange Juice | Soda | Milk | Window Cleaner | Detergent |
---|---|---|---|---|---|
45784548 | 1 | 1 | 0 | 0 | 0 |
45784549 | 1 | 0 | 1 | 1 | 0 |
45784550 | 1 | 0 | 0 | 0 | 1 |
45784551 | 1 | 1 | 0 | 0 | 1 |
45784552 | 0 | 1 | 0 | 1 | 0 |
The confidence in this case is 2/4 = 0.5, i.e. 50%. Among my 5 transactions, 4 contain Orange Juice. Among those 4 transactions containing Orange Juice (antecedent), 2 contain both the antecedent and the consequence.
Reminder of the difference between the support rule and the confidence rule:
For 1000 transactions, 500 contain a bicycle pump. Among the 500 transactions with a bicycle pump, 400 contain an inner tube. Rule: if we buy a bicycle pump, we buy an inner tube. Support for the rule: 400 / 1000 = 0.4 (i.e. 40%). Confidence rule: 400 / 500 = 0.8 (i.e. 80%).
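Here is a small sketch of my own that computes the confidence of the rule « Orange Juice then Soda » by hand, again assuming the basket DataFrame from the data-preparation sketch.
# Assumes `basket` is the one-hot DataFrame built in the data-preparation sketch
support_both = basket[['Orange Juice', 'Soda']].all(axis=1).mean()  # 2/5 = 0.4
support_antecedent = basket['Orange Juice'].mean()                  # 4/5 = 0.8
confidence = support_both / support_antecedent
print(confidence)  # 0.5: half of the orange-juice transactions also contain soda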
Lift Ratio
A high confidence value suggests a strong association rule.
However, confidence alone is not enough: if, in our entire dataset, all our customers bought bananas and almost all customers bought ice cream, the confidence of the rule « if we buy bananas, we buy ice cream » will be high even if the antecedent and the consequence are entirely independent.
The lift ratio allows us to compare the confidence rule with the expected confidence we would get if the antecedent and consequence were independent.
To calculate the lift ratio, we must first calculate the benchmark confidence, i.e. the confidence we would expect if the antecedent and the consequence were independent:
Benchmark confidence = (number of transactions containing the consequence) / (total number of transactions) = support of the consequence.
The lift ratio is thus a comparison of the confidence rule with this expected confidence value:
Lift ratio = Confidence / Benchmark confidence.
A Lift Ratio greater than 1 suggests that the rule is useful. In other words, the level of association between the antecedent and consequence is higher than what we would expect if they were independent.
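Continuing the same toy example (my own illustration, still assuming the basket DataFrame from earlier), the benchmark confidence and lift ratio of « Orange Juice then Soda » can be computed as follows; here the lift falls below 1, which is exactly the kind of rule we would discard.
# Assumes `basket` is the one-hot DataFrame built in the data-preparation sketch
confidence = (basket[['Orange Juice', 'Soda']].all(axis=1).mean()
              / basket['Orange Juice'].mean())   # 0.5
benchmark_confidence = basket['Soda'].mean()     # support of the consequence: 0.6
lift = confidence / benchmark_confidence
print(lift)  # 0.5 / 0.6 ≈ 0.83 < 1: the rule is not useful in this toy example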
Python Code
# Requirements:
# !pip install mlxtend
# !pip install apyori   # alternative library, not used below
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
'''
# Synthetic dataset (uncomment to test without a real file):
# 10,000 transactions and 6 items already encoded as 0/1
np.random.seed(0)
dataset = pd.DataFrame(np.random.randint(0, 2, size=(10000, 6)), columns=['A', 'B', 'C', 'D', 'E', 'F'])
'''
# The input must be one-hot encoded: one row per transaction, one boolean (or 0/1) column per item
dataset = pd.read_csv('data.csv')
dataset = dataset[['Column_1', 'Column_2', 'Column_3', 'Column_n']]
dataset = dataset.astype(bool)
# Steps 1-3: frequent itemsets above the minimum support threshold (here 1%)
frequent_itemsets = apriori(dataset, min_support=0.01, use_colnames=True)
# Step 4: generate the rules and keep those with a lift ratio above 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# Keep a single rule per antecedent set to avoid near-duplicate rows
rules_symmetric = rules[~rules['antecedents'].duplicated()]
# Display the strongest rules (highest lift) first
selected_columns = ['antecedents', 'consequents', 'support', 'confidence', 'lift', 'leverage']
sorted_rules = rules_symmetric[selected_columns].sort_values(by='lift', ascending=False)
print(sorted_rules)
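Each row of the resulting table is a rule: antecedents and consequents are the item sets on each side of the rule, support is the share of transactions containing both, confidence is the conditional probability of the consequent given the antecedent, lift is the ratio of that confidence to the benchmark confidence, and leverage is the difference between the observed co-occurrence and the co-occurrence expected under independence. Rules with a lift well above 1 are the most promising candidates for action.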