Data modeling
As mentioned in the previous chapter, business intelligence refers to tools that offer ETL (Extract - Transform - Load) functionality. A BI solution - a report containing multiple dashboards - connects to one or more data sources.
A data source is a database containing all the information produced by a piece of software the company uses daily. Most databases are complex because they were not designed for analytical needs but rather to adapt to the continuous evolution of a company's operations. Consequently, a database evolves regularly, as the software it supports also evolves and undergoes numerous modifications and adaptations throughout its life.
For example, let's consider a hospital that builds a new annex for new treatments and uses various software daily: an accounting/purchasing ERP, a "patient record" application, an admission and billing application, a pharmacy application, etc. The creation of a new annex will impact all of this software, whose database architecture will need to be modified. The modification can be minor (adding a new care unit) or major (a change in legislation requiring that a database field be calculated one way before a certain date and another way afterward). It is also possible that adding the new care unit to the tables holding the old care units is not feasible because of interdependencies between tables; to allow for future evolution, all information is then directed to new tables from a certain date - the emergence of the new care units - meaning that information must be retrieved from the old tables before that date and from the new tables afterward.
It is also likely that many variables (table columns) contain values considered aberrant or erroneous, a consequence of the flexibility the software must offer to handle specific business cases.
For example, a male patient who previously received maternity care; a 30-year-old patient in a geriatric ward; a patient aged 442 years; etc. All data therefore requires a thorough examination of the available variables - univariate statistical analysis - as well as the preparation and transformation of information - feature engineering.
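As a minimal sketch of such a univariate check (the table and column names below, such as Age and Ward, are hypothetical and used only for illustration), pandas makes it easy to flag aberrant values:
import pandas as pd

# Hypothetical patient data - the column names are assumptions for illustration
df_patients = pd.DataFrame({
    'PatientID': [1, 2, 3],
    'Age': [30, 442, 87],
    'Ward': ['Geriatrics', 'Cardiology', 'Geriatrics']
})

# Univariate statistics to spot suspicious distributions
print(df_patients['Age'].describe())

# Flag ages outside a plausible range
aberrant = df_patients[(df_patients['Age'] < 0) | (df_patients['Age'] > 120)]
print(aberrant)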
In business intelligence, data modeling can be divided into two parts:
- Data transformation: transforming the values of existing columns or creating new columns from existing ones.
- Simplifying the relational model into a star schema to ensure the tool's efficiency.
Data transformation
Data transformation involves applying functions to existing variables, either to transform their values (correct them) or to create new values from existing ones.
For example, in the hospital's "Patient" table we find the columns "First Name" and "Last Name", and for specific data visualization needs we create a new column, "Patient Name", which combines the two. This operation corresponds to the creation of a new variable from existing variables. Another example: in the table that records care units (CU), one data source contains the values CUFA1 and CUFA2, while another data source only contains the value CUFA. The term CUFA resulted from the merging of care units CUFA1 and CUFA2 following new funding legislation, but one of the applications (admission) was not updated. We therefore use a formula to transform the existing values, instructing the tool to replace the CUFA1 and CUFA2 values with CUFA from a certain date.
Data transformation can be done either through a code editor - integrated into most tools - where the data architect uses a language specific to the tool to apply these modifications, or through the "low code" interfaces most tools provide, which let users make many changes via a user-friendly interface with specific selections. Modifications made at the BI tool level do not impact the data source; they are code instructions that execute in the BI tool during data reload - when the original data is read.
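In a BI tool these operations are written in the tool's own formula or script language; the following is a minimal pandas sketch of the two transformations described above (the column names and sample values are assumptions for illustration):
import pandas as pd

# Hypothetical patient table: create a new variable from existing variables
df_patient = pd.DataFrame({'First Name': ['John', 'Jane'], 'Last Name': ['Doe', 'Smith']})
df_patient['Patient Name'] = df_patient['First Name'] + ' ' + df_patient['Last Name']

# Hypothetical care-unit table: replace the old codes with the merged code CUFA
df_cu = pd.DataFrame({'CU': ['CUFA1', 'CUFA2', 'CUFB']})
df_cu['CU'] = df_cu['CU'].replace({'CUFA1': 'CUFA', 'CUFA2': 'CUFA'})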
Data transformation also includes a set of minor modifications - formatting - such as renaming a variable, changing the data type, modifying formatting, extracting part of the information, etc.
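These formatting operations also correspond to simple one-liners; as a sketch in pandas (hypothetical column names):
import pandas as pd

df = pd.DataFrame({'admission_dt': ['2024-01-01', '2024-02-15'], 'cu_code': ['CUFA1', 'CUFB2']})

# Rename a variable
df = df.rename(columns={'admission_dt': 'AdmissionDate'})

# Change the data type
df['AdmissionDate'] = pd.to_datetime(df['AdmissionDate'])

# Extract part of the information (here, the first four characters of the care-unit code)
df['CU_Prefix'] = df['cu_code'].str[:4]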
Star schema
When a data architect works on creating a business intelligence application on one or more data sources, it is their duty to ensure the efficiency of the application, and they must consequently review the relational schema of the tables. It is important to understand that, by default, when a user makes a selection in a dashboard - filters the information - the tool reacts instantly, meaning that the visuals adjust to the selection.
The reason is simple: the query formed by the user is not sent to the data source but directly to the BI tool, which, during data loading (extraction from one or more sources), creates a copy of the information. The user is therefore no longer querying the original data source but the data schema held in the tool. This copy of the information is made on a schedule that depends on how fresh the data needs to be (once a day, every hour, every 5 minutes). A user can also run the script at any time to update the information, i.e. load the new data recorded in the database since the last copy. Copying the information into the tool's memory is the default behavior, but in some cases - particularly for tables with large amounts of information (fact tables) - it is possible to stay directly connected to the data source and get continuous refreshment for that table.
The first element that promotes the responsiveness of a business intelligence tool is therefore that queries do not directly access the data source. The second is the data architect's work to simplify the relational schema between the tables and move towards what is called a datamart - a star schema.
Transactional vs dimensional tables
The star schema is the most commonly used approach in business intelligence, and also in data warehouses, to promote the efficiency of data analysis tools. It involves creating a datamart, a simplified schema in which we find:
- A fact table - a transactional table - at the center of the schema. This is the main table, containing all the events or transactions. A fact table is easily identifiable: it generally contains many rows for a limited number of columns, which are mainly key columns linked to the dimension tables, measure columns (quantitative variables), and a temporal notion - a date (or timestamp) column.
- Dimension tables, which contain one or more key columns as well as the dimensional information beyond the key. Dimension tables generally have many columns but few rows and are mainly used to filter the measures.
The relationship between a dimension table and a fact table is one-to-many (1 - *): a single record in the dimension table is linked to multiple records in the fact table.
Normalization & Denormalization
A fact table is normalized: normalization describes data stored in a way that avoids repeating information. For example, a product table contains the product key along with many columns describing that key (product name, product category, etc.). In the fact table, it is unnecessary to repeat any of this information beyond the key itself: the key alone is sufficient to link all the information. The table is therefore said to be normalized - the number of columns is reduced to the strict minimum necessary.
Conversely, a dimension table is denormalized: denormalization describes data stored with all of its information included. The product dimension table contains the product key along with all the fields beyond that key, such as the product name, category, etc.
Normalized fact table (sales)
Sales_Key | Product_Key | Customer_Key | Date | Quantity | Amount |
---|---|---|---|---|---|
1 | 101 | 1 | 2024-01-01 | 2 | 100 |
2 | 102 | 2 | 2024-02-15 | 1 | 50 |
3 | 101 | 1 | 2024-03-20 | 3 | 150 |
4 | 103 | 3 | 2024-04-10 | 5 | 250 |
Denormalized fact table (sales)
Sales_Key | Product_Key | ProductName | Category | Customer_Key | CustomerName | City | Date | Quantity | Amount |
---|---|---|---|---|---|---|---|---|---|
1 | 101 | Product A | Electronics | 1 | John Doe | New York | 2024-01-01 | 2 | 100 |
2 | 102 | Product B | Clothing | 2 | Jane Smith | London | 2024-02-15 | 1 | 50 |
3 | 101 | Product A | Electronics | 1 | John Doe | New York | 2024-03-20 | 3 | 150 |
4 | 103 | Product C | Books | 3 | Alice Brown | Paris | 2024-04-10 | 5 | 250 |
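To make the contrast concrete, here is a minimal pandas sketch (the DataFrame names df_sales, df_product and df_customer are assumptions for illustration) showing that the denormalized view above is simply the normalized fact table joined to its dimension tables:
import pandas as pd

# Normalized fact table (as above): keys, measures and a date
df_sales = pd.DataFrame({
    'Sales_Key': [1, 2, 3, 4],
    'Product_Key': [101, 102, 101, 103],
    'Customer_Key': [1, 2, 1, 3],
    'Date': ['2024-01-01', '2024-02-15', '2024-03-20', '2024-04-10'],
    'Quantity': [2, 1, 3, 5],
    'Amount': [100, 50, 150, 250]
})

# Denormalized dimension tables: descriptive columns stored next to the key
df_product = pd.DataFrame({
    'Product_Key': [101, 102, 103],
    'ProductName': ['Product A', 'Product B', 'Product C'],
    'Category': ['Electronics', 'Clothing', 'Books']
})
df_customer = pd.DataFrame({
    'Customer_Key': [1, 2, 3],
    'CustomerName': ['John Doe', 'Jane Smith', 'Alice Brown'],
    'City': ['New York', 'London', 'Paris']
})

# Joining the fact table to its dimensions reproduces the denormalized view
df_denormalized = df_sales.merge(df_product, on='Product_Key').merge(df_customer, on='Customer_Key')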
A data architect will always load tables one by one from a data source and then conceptualize the star schema, which means grouping the transactional information into a single central table, normalizing this transactional table, and adding a set of denormalized dimension tables to the model.
Table Merging
To achieve the creation of a star schema, i.e. grouping (merging) a set of tables into a limited number within a simplified model, the data architect has two options:
- Apply a join function to add the columns of a table B to a table A based on the values of one or more common key columns. This is the JOIN function, which can take different prefixes (outer, inner, left, right).
- Apply a concatenation function (stacking along the row axis - also known as UNION) to add the records of a table X below the records of a table Y, based on (generally) several (or all) common columns.
Join
Joining adds one or more columns from a table B into a table A, provided that both tables have one or more common columns and common values within those columns.
Below we find two tables - separate in the data model - the products table (products related to yoga practice) and the categories table. We want to group these tables (merge them), i.e. integrate the CategoryName from the categories table into the products table. Both tables have a common column (CategoryID) with most values in common.
Notice, in the tables below, that the products table contains a product (Yoga Perfume) linked to a CategoryID of 6 that does not exist in the categories table, and that CategoryID 5 in the categories table has no related products in the products table.
ProductID | Product | CategoryID |
---|---|---|
64 | Mat Green | 1 |
65 | Mat Blue | 1 |
66 | Mat Cleaner | 2 |
67 | Yoga Bricks | 2 |
68 | Mat Straps | 3 |
69 | Mat Bag | 3 |
70 | Trousers | 4 |
71 | Legging | 4 |
72 | Yoga Perfume | 6 |
CategoryID | CategoryName |
---|---|
1 | Yoga Mat |
2 | Yoga Accessories |
3 | Yoga Mat Bag |
4 | Women Yoga Clothes |
5 | Men Yoga Clothes |
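For the pandas snippets that follow, the two tables above are assumed to be loaded as the DataFrames df_products and df_categories (the names used in the merge examples):
import pandas as pd

# Products table (as shown above)
df_products = pd.DataFrame({
    'ProductID': [64, 65, 66, 67, 68, 69, 70, 71, 72],
    'Product': ['Mat Green', 'Mat Blue', 'Mat Cleaner', 'Yoga Bricks', 'Mat Straps',
                'Mat Bag', 'Trousers', 'Legging', 'Yoga Perfume'],
    'CategoryID': [1, 1, 2, 2, 3, 3, 4, 4, 6]
})

# Categories table (as shown above)
df_categories = pd.DataFrame({
    'CategoryID': [1, 2, 3, 4, 5],
    'CategoryName': ['Yoga Mat', 'Yoga Accessories', 'Yoga Mat Bag',
                     'Women Yoga Clothes', 'Men Yoga Clothes']
})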
Full Outer Join
If we want to merge the two tables without losing any information, the merge function includes in the resulting table (the final, single table) all the values from table A and all the values from table B; this is called a Full Outer Join - no information is lost due to the merge.
The Python function comes from the pandas library (pd.merge)
import pandas as pd
# Full outer join
df_final = pd.merge(df_products, df_categories, on='CategoryID', how='outer')
# Filling missing values with 'N/A'
df_final.fillna('N/A', inplace=True)
df_final
The result is the following table:
ProductID | Product | CategoryID | CategoryName |
---|---|---|---|
64 | Mat Green | 1 | Yoga Mat |
65 | Mat Blue | 1 | Yoga Mat |
66 | Mat Cleaner | 2 | Yoga Accessories |
67 | Yoga Bricks | 2 | Yoga Accessories |
68 | Mat Straps | 3 | Yoga Mat Bag |
69 | Mat Bag | 3 | Yoga Mat Bag |
70 | Trousers | 4 | Women Yoga Clothes |
71 | Legging | 4 | Women Yoga Clothes |
72 | Yoga Perfume | 6 | N/A |
N/A | N/A | 5 | Men Yoga Clothes |
Inner Join
The Inner prefix removes from the final table all records that are not common to both tables. This means that all CategoryID key values not present in both tables will be excluded. Consequently, if we have one or more products whose category does not exist in the categories table, or one or more categories in the categories table for which we have no CategoryID in the products table, these records will be omitted.
For example, in our case, the final table will include neither ProductID 72 - which is not linked to any existing category - nor CategoryID 5, which is not present in the products table.
ProductID | Product | CategoryID |
---|---|---|
64 | Mat Green | 1 |
65 | Mat Blue | 1 |
66 | Mat Cleaner | 2 |
67 | Yoga Bricks | 2 |
68 | Mat Straps | 3 |
69 | Mat Bag | 3 |
70 | Trousers | 4 |
71 | Legging | 4 |
72 | Yoga Perfume | 6 |
CategoryID | CategoryName |
---|---|
1 | Yoga Mat |
2 | Yoga Accessories |
3 | Yoga Mat Bag |
4 | Women Yoga Clothes |
5 | Men Yoga Clothes |
We specify the "how" argument in the Python function; this time it is an inner join.
import pandas as pd
# Inner join
df_final = pd.merge(df_products, df_categories, on='CategoryID', how='inner')
df_final
And we obtain the following table:
ProductID | Product | CategoryID | CategoryName |
---|---|---|---|
64 | Mat Green | 1 | Yoga Mat |
65 | Mat Blue | 1 | Yoga Mat |
66 | Mat Cleaner | 2 | Yoga Accessories |
67 | Yoga Bricks | 2 | Yoga Accessories |
68 | Mat Straps | 3 | Yoga Mat Bag |
69 | Mat Bag | 3 | Yoga Mat Bag |
70 | Trousers | 4 | Women Yoga Clothes |
71 | Legging | 4 | Women Yoga Clothes |
Left Join
The left prefix is the most commonly used. It loads all the records from table A but keeps only the common key values from table B. This prefix thus removes from the final table all records from table B whose CategoryID key values do not exist in table A. Consequently, if we have one or more categories that do not exist in the products table, these rows will not be included in the final table.
For example, in our case, the final table will not contain CategoryID 5, which is not linked to any product, but it will keep ProductID 72, which is not linked to any category.
ProductID | Product | CategoryID |
---|---|---|
64 | Mat Green | 1 |
65 | Mat Blue | 1 |
66 | Mat Cleaner | 2 |
67 | Yoga Bricks | 2 |
68 | Mat Straps | 3 |
69 | Mat Bag | 3 |
70 | Trousers | 4 |
71 | Legging | 4 |
72 | Yoga Perfume | 6 |
CategoryID | CategoryName |
---|---|
1 | Yoga Mat |
2 | Yoga Accessories |
3 | Yoga Mat Bag |
4 | Women Yoga Clothes |
5 | Men Yoga Clothes |
We specify the "how" argument in the Python function; this time it is a left join.
import pandas as pd
# Left join
df_final = pd.merge(df_products, df_categories, on='CategoryID', how='left')
# Filling missing values with 'N/A'
df_final.fillna('N/A', inplace=True)
df_final
And we obtain the following table:
ProductID | Product | CategoryID | CategoryName |
---|---|---|---|
64 | Mat Green | 1 | Yoga Mat |
65 | Mat Blue | 1 | Yoga Mat |
66 | Mat Cleaner | 2 | Yoga Accessories |
67 | Yoga Bricks | 2 | Yoga Accessories |
68 | Mat Straps | 3 | Yoga Mat Bag |
69 | Mat Bag | 3 | Yoga Mat Bag |
70 | Trousers | 4 | Women Yoga Clothes |
71 | Legging | 4 | Women Yoga Clothes |
72 | Yoga Perfume | 6 | N/A |
Right Join
The right prefix is the least used. It loads all the records from table B but keeps only the common key values from table A. This prefix thus removes from the final table all records from table A whose CategoryID key values do not exist in table B. Consequently, if we have one or more products that exist in the products table but are not linked to any category in the categories table, these rows will not be included in the final table.
For example, in our case, the final table will not contain ProductID 72, which is not linked to any category, but it will keep CategoryID 5, which is not linked to any product.
ProductID | Product | CategoryID |
---|---|---|
64 | Mat Green | 1 |
65 | Mat Blue | 1 |
66 | Mat Cleaner | 2 |
67 | Yoga Bricks | 2 |
68 | Mat Straps | 3 |
69 | Mat Bag | 3 |
70 | Trousers | 4 |
71 | Legging | 4 |
72 | Yoga Perfume | 6 |
CategoryID | CategoryName |
---|---|
1 | Yoga Mat |
2 | Yoga Accessories |
3 | Yoga Mat Bag |
4 | Women Yoga Clothes |
5 | Men Yoga Clothes |
We specify the "how" argument in the Python function; this time it is a right join.
import pandas as pd
# Right join
df_final = pd.merge(df_products, df_categories, on='CategoryID', how='right')
# Filling missing values with 'N/A'
df_final.fillna('N/A', inplace=True)
df_final
And we obtain the following table:
ProductID | Product | CategoryID | CategoryName |
---|---|---|---|
64 | Mat Green | 1 | Yoga Mat |
65 | Mat Blue | 1 | Yoga Mat |
66 | Mat Cleaner | 2 | Yoga Accessories |
67 | Yoga Bricks | 2 | Yoga Accessories |
68 | Mat Straps | 3 | Yoga Mat Bag |
69 | Mat Bag | 3 | Yoga Mat Bag |
70 | Trousers | 4 | Women Yoga Clothes |
71 | Legging | 4 | Women Yoga Clothes |
N/A | N/A | 5 | Men Yoga Clothes |
Concatenation
Applying a concatenation function (stacking along the row axis - also known as UNION) consists of adding new records from a table X to the existing records of a table Y. The new records are integrated into the table as new rows. Concatenation is generally applied between two tables that have an almost identical structure.
ProductID | Product |
---|---|
64 | Mat Green |
65 | Mat Blue |
66 | Mat Cleaner |
67 | Yoga Bricks |
68 | Mat Straps |
69 | Mat Bag |
70 | Trousers |
71 | Legging |
72 | Yoga Perfume |
ProductID | Product | Code |
---|---|---|
73 | Mat Red | 654983211 |
74 | Sandbag | 953625923 |
75 | Cylindrical Bolster | 278534109 |
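Assuming the two tables above are loaded as the DataFrames df1 and df2 (the names used in the concatenation snippet that follows):
import pandas as pd

# First products table (ProductID, Product)
df1 = pd.DataFrame({
    'ProductID': [64, 65, 66, 67, 68, 69, 70, 71, 72],
    'Product': ['Mat Green', 'Mat Blue', 'Mat Cleaner', 'Yoga Bricks', 'Mat Straps',
                'Mat Bag', 'Trousers', 'Legging', 'Yoga Perfume']
})

# Second products table, with an additional Code column
df2 = pd.DataFrame({
    'ProductID': [73, 74, 75],
    'Product': ['Mat Red', 'Sandbag', 'Cylindrical Bolster'],
    'Code': [654983211, 953625923, 278534109]
})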
Here is the Python code:
import pandas as pd
df_final = pd.concat([df1, df2], ignore_index=True)
# Filling missing values with 'N/A'
df_final = df_final.fillna('N/A')
df_final
And we obtain the following table:
ProductID | Product | Code |
---|---|---|
64 | Mat Green | N/A |
65 | Mat Blue | N/A |
66 | Mat Cleaner | N/A |
67 | Yoga Bricks | N/A |
68 | Mat Straps | N/A |
69 | Mat Bag | N/A |
70 | Trousers | N/A |
71 | Legging | N/A |
72 | Yoga Perfume | N/A |
73 | Mat Red | 654983211 |
74 | Sandbag | 953625923 |
75 | Cylindrical Bolster | 278534109 |