Skip to main content

Data Sources

Every data project relies on data sources. A data source refers to the storage location, whether physical or virtual, where information is available in various formats. As data scientists, our goal is to access and leverage this data.

Software Application

It is increasingly common to carry out a data project using multiple data sources. It's important to note that a data source is originally generated by a software program of some kind, which stores information in various forms. In the context of « business » data, this typically involves one or more software applications - a program created and used to digitally perform tasks in a specific domain -. For example, a program for managing employee schedules or one for handling accounting.

Enterprise Resource Planning (ERP)

The use of domain-specific programs operating in silos presents a challenge for data utilization, particularly in terms of cross-referencing or linking data for analysis purposes. More and more companies have turned to ERP - Enterprise Resource Planning - solutions, which are also programs but integrate a suite of applications designed to work together with a unified data architecture. For example, the ERP XYZ might manage employee schedules, payroll, accounting, purchasing, etc.

API (Application Programming Interface)

Today, every action taken by a user on a computer or web platform - of any kind - automatically generates data that is stored for future use. Information can flow from one system to another either through network protocols ( FTP or HTTP ) or, more commonly, through APIs.

An API - Application Programming Interface - is a solution that allows access to the functions or data of an application remotely. With an API, it is possible to establish communication with a target software to either use its functionalities or retrieve data. A request, expressed in a common language, is sent to the target software, which processes it and returns the expected result if access conditions are met.

The following analogy helps to better understand how an API works. Consider a restaurant. We have the customer - the person who wants to receive something ( in this case, a meal ) -, the chef who has all the ingredients to make all the dishes on the menu ( and more ! ), and finally, the waiter who acts as the intermediary between the customer and the chef. The customer arrives at the restaurant, and the waiter hands them the menu. The menu includes only what the customer is entitled to ( perhaps the restaurant also makes chocolate cakes, but that is not available to this customer ). The waiter gives the menu to the customer because the customer has properly authenticated as such and has access to the restaurant ( e.g. a dress code is required ). The customer clearly states in their order what they want - vegetable lasagna - and the waiter promptly asks the chef to prepare the dish because the customer is entitled to it. The customer has no idea how things work in the kitchen, nor how the lasagna is prepared, but that doesn’t matter, as long as they get what they want and are entitled to. After a few moments, the waiter returns with the dish and gives it to the customer. The customer now has the dish and can do with it as they please !

An API makes data accessible and defines how applications can exchange it. It allows data to be retrieved from system A to be reused in system B, with the advantage being that this retrieval does not require querying an entire database but rather extracting or using a specific element.

To obtain information via an API, we first need to create a structured request in which we specify the search criteria and formatting options ( how the data should be presented ).

A structured request can be thought of as a letter in which we explain what we want.

Example of a structured request ( simplified - depending on the API, some additional parameters may need to be set ) :

{
"start": 0, # O record = 1 result
"results": 100, # Maximum number of results to return
"outputTimeZone": "UTC", # Time zone for dates

"filter": { # Filtering criteria for data
"range": {
"Date": {
"from": "08/08/2024",
"to": "14/08/2024",
"dateFormat": "dd/MM/yyyy"
}
}
},

"logsources": [ # Data sources to be queried
{
"name": "Orders",
"type": "table"
}
],

"query": "*", # Data to be extracted (* = all)
"highlight": true,
"sortKey": [ "Date" ] # Sort ascending by date
}



Next, we need to specify the address to which we want to send our request and ensure our access rights to the requested data.

https://eaqbe_server:9987/Unity/Search?searchMode=async&ABCDToken=692C0C77FBE1E21DB05B3AAD60E93293
The API is located on the server « eaqbe_server » ( which must have a functional and configured API ) and will receive the request at the endpoint : /Unity/Search. The API will execute the information extraction for us and return it in the specified format, after verifying our access rights ( ABCD Token ).

Relational Database

A relational database organizes information into tables structured with rows and columns, linked by keys - a primary key that uniquely identifies each record in a table, and a foreign key present in another table to create the link -. A relational database allows the use of a SQL ( Structured Query Language ) to query and manipulate this data.

A SQL database offers efficient information management and is a preferred choice for many applications. Although information in a relational database is the most suitable type of source for data utilization in business intelligence ( creating data visualizations ), modeling work is still necessary to make the interaction with visuals as efficient as possible.

Relational Database 500

Non-Relational Database

A non-relational database - NoSQL - is a database that does not use the tabular schema of rows and columns found in most traditional database systems. In the context of increasing data volume, a relational database is not efficient enough. In contrast, a non-relational database allows the storage of large volumes of data. This data can be distributed across multiple machines to reduce maintenance costs. The key difference between a relational and a non-relational database is the way data is stored. One stores data in tables, while the other stores it in a key-value format to accommodate larger quantities. While NoSQL addresses the current challenges of Big Data, it does not replace the relational database but rather complements it.

Data Warehouse

A data warehouse is a centralized system that integrates and stores structured data from various operational and external sources. This historical and current data is regularly updated and is accessible in read-only mode for analysis and decision-making. It is a sort of centralized data library for the company, gathering information from different heterogeneous systems. By applying the concept of ETL ( Extract - Transform - Load ), it attempts to homogenize the information as much as possible to enable further use in analytical models. The data warehouse is organized into several layers.

Data Mart

A data mart is a subset of a data warehouse that organizes data according to specific business uses or targeted domains. It provides a limited view of certain information that has been modeled according to the principles of the star schema in business intelligence to make data exploitation by analytical systems more efficient.

Data Lake

A data lake is a vast reservoir of raw data, stored in its original format, without prior modeling. It addresses the Big Data concept : the data is varied ( structured, unstructured, semi-structured, and comes from multiple sources simultaneously ), the volume, and the speed of information generation are so significant that a traditional database could not handle the queries. To store this data extremely quickly, the information is recorded in a key-value format and can later be reconstructed in other systems.

Information System

An organization's information system refers to the macro and micro « data » environment of a company, encompassing all the hardware, software, and human resources that enable the collection, management, storage, and sharing of information within the organization, as well as with its external environment. It is an organized ecosystem of technologies, processes, and people, designed to manage information efficiently and securely, from collection to dissemination, in support of the organization's goals and its interactions with the external world.