Data visualization

Data visualization is rooted in Gestalt theory, which suggests that when faced with complexity, the human mind spontaneously structures visual information to simplify its interpretation.

Human vision accounts for 80 % of the information transmitted to the brain. The principle of data visualization is therefore to provide information in a manner that is both cognitively adapted and engaging from a human perspective.

« The greatest value of a picture is when it forces us to notice what we never expected to see ». ( John Tukey - one of the most important American statisticians of the 20th century ).

Data visualization allows large amounts of raw data to be transformed into clear and accessible information through the use of charts and other visual representations. It facilitates the understanding, analysis, and communication of data within a company or organization.

Measures & Dimensions

In data visualization, we no longer speak of quantitative or qualitative variables Data Properties, but of measures - key performance indicators ( KPI's ) - ( i.e. what do we want to analyze ) and dimensions ( i.e. by what means do we want to analyze this measure ).

For example, let's take sales amount as a key performance indicator ( KPI ). A measure is an aggregation ( min, max, sum, avg, etc. ) applied to a column in a table, and it usually involves aggregating a numerical and therefore quantitative column. For the sales amount, we have a table that includes an amount for each fruit, and we aggregate this column by applying a sum, Sum[Amount].

Fruits	Amount
Apple	40
Blueberry	100
Cherry	30
Orange	55

If we decide to visually represent the KPI « sales amount » sum[Amount] in a bar chart - therefore only a measure without a dimension - we get a single bar, as this is the most aggregated form of data ; only the measure, which is $225$ .

Bar Chart measure only 500

However, if we add the dimension « fruits » : we want to analyze sales by fruit ( this could be by store, by customer, etc. ) ; we get one bar per dimension value. The measure, which was in its most aggregated form, is now disaggregated by the values of the dimension used ; in this case : fruit values.

Bar Chart measure & dimension 500

In the first case, by adding the measure, we obtain the most aggregated form ( = what do we want to analyze | answer : the sales amount ), and in the second case, by adding the dimension, we disaggregate ( break down ) the measure by as many values as exist in the dimension, resulting in four bars since there are four fruits.

warning

A measure can also apply to a qualitative variable. For example : fruits - if we want to know how many fruits we have, we apply the « count » aggregation on fruits - Count[Fruits] ( which is a nominal qualitative variable ) and we get $4$ . The count aggregation is often used to create measures from qualitative variables.

Categories of Visuals

The most complex question in data visualization concerns the choice of the chart. Which chart for which measure and / or dimension ? It's important to specify that depending on the visualization, it is possible to add multiple dimensions and / or multiple measures.

For example, a bar chart - the most popular graph because it is the most effective for the human brain - allows for several configurations : (1) horizontal vs vertical ; (2) 1 measure and 2 grouped dimensions ; (3) 1 measure and 2 stacked dimensions ; (4) 1 dimension multiple measures ; (5) stacked or non-stacked if there are multiple measures or multiple dimensions ; (6) grouped, normal stacked, or 100% stacked, etc.

The best recommendation when starting with data visualization is to keep simplicity in mind. The use of complex visualizations should only be applied when there is a certain level of data maturity within the organization. This means that when end users are accustomed to and trained in it. Otherwise, or when a project is initiated, it is recommended to focus on the most well-known ( and effective according to the situation ) visualizations, namely : bar chart, line chart, combo chart, pie chart, scatter plot, and tables ( straight & pivot ).

There are four basic types of presentations :

Comparison Visualizations

Comparison visualizations address basic questions when starting with data visualization, such as « which customer represents the highest sales ? » or « which product is the most expensive to purchase ? » or « How has the revenue evolved this year compared to last year ? ». These are used to compare the magnitude of values against each other to quickly identify the lowest and highest values or to compare current values with past values.

Bar chart 1 measure & 1 dimension ; Line Chart 1 measure & 1 dimension

BarChart500 linechart500

Bar chart 1 measure & 2 dimensions ; Line Chart 1 measure & 1 dimension

BarChart500 linechart500

Trelli chart : creation of multiple charts using the same measure and dimension but repeated by another dimension

TrelliChart500

Relationship Visualizations

Relationship visualizations are used to identify correlations, outliers, or visualize clusters, i.e. groups of points where the points within a group are close to each other, but the groups are spaced apart. The scatter plot is the primary chart in this category.

Left : Scatter plot 3 measures ( Y-axis, X-axis + bubble size ) & 2 dimensions ( bubbles & bubble color ) : identification of correlations, outliers, or clusters ; Right : Sankey chart highlighting flow relationships

ScatterPlot500 SankeyChart500

Composition Visualizations

Composition visualizations allow for comparing a value relative to a whole and often involve the concept of parts. Values are usually expressed as percentages. They are very useful for summarizing a set of nominal data or presenting the different values of a given variable ( percentage distribution ). They answer questions like « What is the sales share by region ? »

Pie chart & Donut Chart

PieChart500 donutchart500

warning

A best practice is to never use a dimension containing more than 7 different values, as interpretation becomes more difficult from a cognitive perspective, and in this case, it is better to switch to a bar chart.

The 100 % stacked bar chart is an alternative when dealing with more than 7 dimensional values. On the right, the waterfall chart is also a composition chart and can be used with either one measure and one dimension or only measures.

StackedBarChart500 Waterfall500

The treemap and the stacked Area Line chart are also two important charts for representing composition.

Treemap500 StackedAreaChart500

Distribution Visualizations

Distribution visualizations are used to highlight how quantitative values are distributed along an axis. These charts make it easy to identify the highest and lowest values, get an idea of the distribution shape, highlight basic descriptive values such as median, mean, or mode, and identify outliers.

Histogram500 DistributionLine500

Candles500

Explore more on eaqbe.com :

Learn about RPA AI solutions: https://www.eaqbe.com/technologies
Transform your processes with intelligent automation powered by RPA and AI.
Discover our consulting services: https://www.eaqbe.com/services
Tailored consulting to optimize your data and decision-making strategies.
Explore our training programs: https://www.eaqbe.com/trainings
Build your expertise with targeted training sessions.
Meet the eaqbe team: https://www.eaqbe.com/team
Discover the people driving innovation and excellence at eaqbe.

Python Codes

tip

Looking for a specific visualization in Python? Find a wealth of visuals with examples at Python Graph Gallery (Select Chart Types).

Bar Chart

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))

fruits = ['apple', 'blueberry', 'cherry', 'orange']
counts = [40, 100, 30, 55]
bar_labels = ['red', 'blue', '_red', 'orange']
colors = ['#91DCEA', '#64CDCC', '#5FBB68', '#F9D23C', '#F9A729', '#FD6F30']


combined = sorted(zip(counts, fruits, bar_labels, colors), reverse=True)

counts, fruits, bar_labels, colors = zip(*combined)
bars = ax.bar(fruits, counts, label=bar_labels, color=colors[:len(fruits)])

ax.bar_label(bars, label_type='edge')
ax.set_ylabel('Fruit supply')
ax.set_title('Fruit supply by kind and color')

plt.show()

Line Chart

import matplotlib.pyplot as plt

day = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
#sales_product1 = [250,350,265,150,320,450,200]
sales_product2 = [300, 280, 350, 200, 400, 380, 250]  
colors = ['#91DCEA', '#64CDCC', '#5FBB68',
          '#F9D23C', '#F9A729', '#FD6F30']
plt.figure(figsize=(10, 10))

#plt.plot(day, sales_product1, color='#F9A729', marker='o', label='Product 1')
plt.plot(day, sales_product2, color='#64CDCC', marker='s', label='Product 2')  


#for i, sale in enumerate(sales_product1):
     #plt.annotate(str(sale), (day[i], sale), textcoords="offset points", xytext=(0,10), ha='center')
for i, sale in enumerate(sales_product2):
     plt.annotate(str(sale), (day[i], sale), textcoords="offset points", xytext=(0,-10), ha='center') 


plt.ylabel('Sales')
plt.title('Sales by weekday')
plt.ylim(0, 500) 
plt.show()

Bar Chart Grouped

import matplotlib.pyplot as plt
import numpy as np

species = ("Adelie", "Gentoo")


penguin_means = {
    'Bill Depth': (18.35, 14.98), 
    'Bill Length': (38.79, 47.50), 
    'Flipper Length': (189.95, 217.19), 
}

x = np.arange(len(species))  
width = 0.25  
multiplier = 0

colors = ['#91DCEA', '#64CDCC', '#5FBB68', '#F9D23C', '#F9A729', '#FD6F30']


fig, ax = plt.subplots(figsize=(10, 10))


for i, (attribute, measurement) in enumerate(penguin_means.items()):
    offset = width * multiplier
    rects = ax.bar(x + offset, measurement, width, label=attribute, color=colors[i % len(colors)])
    ax.bar_label(rects, padding=3)
    multiplier += 1


ax.set_ylabel('Length (mm)')
ax.set_title('Penguin attributes by species')
ax.set_xticks(x + width, species)
ax.legend(loc='upper left', ncols=3)
#ax.set_ylim(0, 250)

plt.show()

Line Chart Grouped

# LineChart python
import matplotlib.pyplot as plt

day = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
sales_product1 = [250,350,265,150,320,450,200]
sales_product2 = [300, 280, 350, 200, 400, 380, 250]  
colors = ['#91DCEA', '#64CDCC', '#5FBB68',
          '#F9D23C', '#F9A729', '#FD6F30']
plt.figure(figsize=(10, 10))

plt.plot(day, sales_product1, color='#F9A729', marker='o', label='Product 1')
plt.plot(day, sales_product2, color='#64CDCC', marker='s', label='Product 2')  


for i, sale in enumerate(sales_product1):
     plt.annotate(str(sale), (day[i], sale), textcoords="offset points", xytext=(0,10), ha='center')
for i, sale in enumerate(sales_product2):
     plt.annotate(str(sale), (day[i], sale), textcoords="offset points", xytext=(0,-10), ha='center') 


plt.ylabel('Sales')
plt.title('Sales by weekday')
plt.ylim(0, 500) 
plt.show()

Treilli Line Chart

# Treilli Line Chart python

import matplotlib.pyplot as plt


day = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
sales_product1 = [250, 350, 265, 150, 320, 450, 200]
sales_product2 = [300, 280, 350, 200, 400, 380, 250]

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 10), sharex=True)

ax1.plot(day, sales_product1, color='#F9A729', marker='o')
ax1.set_ylabel('Sales')
ax1.set_title('Sales of Product 1')
ax1.set_ylim(0, 500)
ax1.legend()

ax2.plot(day, sales_product2, color='#64CDCC', marker='s')
ax2.set_xlabel('Day')
ax2.set_ylabel('Sales')
ax2.set_title('Sales of Product 2')
ax2.set_ylim(0, 500)
ax2.legend()

plt.tight_layout()

plt.show()

Treilli Bar Chart

# TrelliB Bar Chart Python
import matplotlib.pyplot as plt
import numpy as np


species = ("Adelie", "Gentoo")


penguin_means = {
    'Bill Depth': (18.35, 14.98),  
    'Bill Length': (38.79, 47.50), 
    'Flipper Length': (189.95, 217.19), 
}

x = np.arange(len(species))  
width = 0.25 

colors = ['#91DCEA', '#64CDCC', '#5FBB68', '#F9D23C', '#F9A729', '#FD6F30']

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 10), sharex=True)


for i, (attribute, measurement) in enumerate(penguin_means.items()):
    rects = ax1.bar(x + i * width, [measurement[0]], width, label=attribute, color=colors[i % len(colors)])
    ax1.bar_label(rects, padding=3)

ax1.set_ylabel('Length (mm)')
ax1.set_title('Penguin Attributes for Adelie')
ax1.set_xticks(x + width, species)
ax1.legend(loc='upper left', ncols=3)
ax1.set_ylim(0, 250)

for i, (attribute, measurement) in enumerate(penguin_means.items()):
    rects = ax2.bar(x + i * width, [measurement[1]], width, label=attribute, color=colors[i % len(colors)])
    ax2.bar_label(rects, padding=3)

ax2.set_xlabel('Species')
ax2.set_ylabel('Length (mm)')
ax2.set_title('Penguin Attributes for Gentoo')
ax2.set_xticks(x + width, species)
ax2.legend(loc='upper left', ncols=3)
ax2.set_ylim(0, 250)

plt.tight_layout()

plt.show()

Scatter Plot

# Scatter plot python

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import ListedColormap


np.random.seed(19680801)

N = 50
x = np.random.rand(N)
y = np.random.rand(N)

z = np.random.rand(N)

color_dimension = pd.qcut(z, 3, labels=False)

area = (30 * np.random.rand(N))**2

colors = ['#91DCEA', '#64CDCC', '#5FBB68']
cmap = ListedColormap(colors)

fig, ax = plt.subplots(figsize=(10, 10))

scatter = ax.scatter(x, y, s=area, c=color_dimension, alpha=0.5, cmap=cmap)

plt.show()

Sankey Chart

# Sankey Chart python

import plotly.graph_objects as go


colors = ['#91DCEA', '#64CDCC', '#5FBB68', '#F9D23C', '#F9A729', '#FD6F30']

fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = ["A1", "A2", "B1", "B2", "C1", "C2"],
        color = colors  
    ),
    link = dict(
        source = [0, 0, 1, 2, 3, 3], 
        target = [2, 3, 3, 4, 4, 5],
        value =  [8, 2, 4, 8, 4, 2],
    ))])


fig.update_layout(
    title_text=" Sankey Chart", 
    font_size=10,
    width=1200,  
    height=1200   
)

fig.show()

Pie Chart

# Pie Chart python

import matplotlib.pyplot as plt

labels = 'apple', 'banana', 'cherry', 'durian', 'elderberries', 'figs', 'grapes'
sizes = [32, 20, 15, 10, 10, 8, 5]
colors = ['#91DCEA', '#64CDCC', '#5FBB68',
          '#F9D23C', '#F9A729', '#FD6F30']

plt.figure(figsize=(10, 10)) 
p = plt.pie(sizes, labels=labels, colors=colors, explode=(0.07, 0, 0, 0, 0, 0, 0),
            autopct='%1.0f%%', startangle=130, shadow=True)
plt.axis('equal')

for i, (apple, banana, cherry, durian, elderberries, figs, grapes) in enumerate(p):
    if i > 0:
        apple.set_fontsize(12)
        banana.set_fontsize(12)
        cherry.set_fontsize(12)
        durian.set_fontsize(12)
        elderberries.set_fontsize(12)
        figs.set_fontsize(12)
        grapes.set_fontsize(12)

plt.show()

Donnut Chart

# Donnut Chart python

import matplotlib.pyplot as plt

labels = 'apple', 'banana', 'cherry', 'durian', 'elderberries', 'figs', 'grapes'
sizes = [32, 20, 15, 10, 10, 8, 5]

my_circle = plt.Circle((0, 0), 0.7, color='white')
colors = ['#91DCEA', '#64CDCC', '#5FBB68',
          '#F9D23C', '#F9A729', '#FD6F30']
plt.figure(figsize=(10, 10)) 
d = plt.pie(sizes, labels=labels,colors=colors, autopct='%1.0f%%',
            startangle=90, labeldistance=1.05)

plt.axis('equal')
plt.gca().add_artist(my_circle)
            
plt.show()

Bar Chart Stacked 100 %

# Stacked Bar CHart 100% python

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as mcolors  

category_names = ['Strongly disagree', 'Disagree',
                  'Neither agree nor disagree', 'Agree', 'Strongly agree']
results = {
    'Question 1': [10, 15, 17, 32, 26],
    'Question 2': [26, 22, 29, 10, 13],
    'Question 3': [35, 37, 7, 2, 19],
    'Question 4': [32, 11, 9, 15, 33],
    'Question 5': [21, 29, 5, 5, 40],
    'Question 6': [8, 19, 5, 30, 38]
}

colors = ['#91DCEA', '#64CDCC', '#5FBB68', '#F9D23C', '#F9A729']

def survey(results, category_names):
    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        rects = ax.barh(labels, widths, left=starts, height=0.5,
                        label=colname, color=color)

        r, g, b = mcolors.to_rgb(color)  
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        ax.bar_label(rects, label_type='center', color=text_color)
    ax.legend(ncols=len(category_names), bbox_to_anchor=(0, 1),
              loc='lower left', fontsize='small')

    return fig, ax

survey(results, category_names)
plt.show()

Waterfall

# waterfall python

import plotly.graph_objects as go


categories = ['Revenue', 'COGS', 'Gross Profit', 'Operating Expenses', 'Net Income']
values = [1000000, -600000, 400000, -250000,  70000]


colors = ['#91DCEA', '#64CDCC', '#5FBB68', '#F9D23C', '#F9A729', '#FD6F30']


cumulative_values = [0]
for value in values:
    cumulative_values.append(cumulative_values[-1] + value)


fig = go.Figure(go.Waterfall(
    name = "20", 
    orientation = "v",
    measure = ["relative"] * (len(values) - 1) + ["total"],
    x = categories,
    textposition = "outside",
    text = values,
    y = values,
    connector = {"line":{"color":"rgb(63, 63, 63)"}},
    increasing = {"marker":{"color": '#91DCEA'}}, 
    decreasing = {"marker":{"color": '#F9A729'}}, 
    totals = {"marker":{"color": '#5FBB68'}}  
))


fig.update_layout(
    title = "Company Income Breakdown",
    showlegend = True,
    width=1000, 
    height=1000,
    plot_bgcolor='white' 
)

fig.show()

Treemap

# Tree Map python

import matplotlib.pyplot as plt
import squarify    # pip install squarify 
import pandas as pd

colors = ['#91DCEA', '#64CDCC', '#5FBB68',
          '#F9D23C', '#F9A729', '#FD6F30']

df = pd.DataFrame({'nb_people':[8,3,4,2], 'group':["group A", "group B", "group C", "group D"] })

squarify.plot(sizes=df['nb_people'],color=colors, label=df['group'], alpha=.8 )
plt.axis('off')
plt.show()

Line Chart Stacked

# Stacked Line Chart python


import numpy as np
import matplotlib.pyplot as plt
 

x=range(1,6)
y=[ [1,4,6,8,9], [2,2,7,10,12], [2,8,5,10,6] ]
colors = ['#91DCEA', '#64CDCC', '#5FBB68',
          '#F9D23C', '#F9A729', '#FD6F30']
 

plt.figure(figsize=(10, 10)) 
plt.stackplot(x,y, labels=['A','B','C'], colors=colors)
plt.legend(loc='upper left')
plt.show()

Histogram

# Histogram python

import matplotlib.pyplot as plt
import numpy as np

from matplotlib import colors
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(19680801)

N_points = 100000
n_bins = 20

dist1 = rng.standard_normal(N_points)
dist2 = 0.4 * rng.standard_normal(N_points) + 5

fig, axs = plt.subplots(1, 2,figsize=(10, 10))

axs[0].hist(dist1, bins=n_bins, color='#FD6F30')  
axs[1].hist(dist2, bins=n_bins, color='#64CDCC') 

plt.show()

Candles

# Candles python 

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(19680801)
fruit_weights = [
    np.random.normal(130, 10, size=100),
    np.random.normal(125, 20, size=100),
    np.random.normal(120, 30, size=100),
]
labels = ['peaches', 'oranges', 'tomatoes']
colors = ['#91DCEA', '#F9D23C', '#F9A729']  

fig, ax = plt.subplots()
ax.set_ylabel('fruit weight (g)')

bplot = ax.boxplot(fruit_weights,
                   patch_artist=True)  

ax.set_xticklabels(labels)


for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)

plt.show()

Measures & Dimensions​

Categories of Visuals​

Comparison Visualizations​

Relationship Visualizations​

Composition Visualizations​

Distribution Visualizations​

Python Codes​

Bar Chart​

Line Chart​

Bar Chart Grouped​

Line Chart Grouped​

Treilli Line Chart​

Treilli Bar Chart​

Scatter Plot​

Sankey Chart​

Pie Chart​

Donnut Chart​

Bar Chart Stacked 100 %​

Waterfall​

Treemap​

Line Chart Stacked​

Histogram​

Candles​