Defining data integration techniques, their use-cases, and pros and cons.
Starting in the 1970’s, the dominant way to integrate data for analysis was the extract, transform, load (ETL) method. ETL’s success partially came from how it curated and structured data, conserving space and saving on storage costs.
In the 2010’s, as historically low storage costs and growing big data capabilities emerged, users began to shift to a more agile alternative — the extract, load, transform (ELT) method.
Both ETL and ELT create a data pipeline with the same three steps — extract, transform, and load.
Extract — Raw data is copied or extracted from siloed sources. Sources can include marketing/sales data, inventory/ERP, financial/accounting data, or even data from an internal product database.
Transform — Data is processed to prepare it for analysis. This may include filtering, deduplication, enriching, cleansing, modeling, authentication, and validation. Personal identifiable information (PII) can be removed at this step to comply with privacy regulations.
Load — Data is moved or copied to the storage location, often a data warehouse or distributed file system, where it will be accessed for analysis.
Though the individual steps within ELT and ETL have the same names, altering the order of the steps leads to big differences in software, hardware, and the types of data each system can store.
What’s ETL?
For ETL data integration, data is sent from its sources to a staging area where it’s transformed before it’s shipped to storage. Which data is collected, how it’s transformed, and where it’s sent are all clearly defined before processing starts.
Once data is in storage, analysis can be conducted quickly because of curation and normalization. Users can then analyze their data to discover business intelligence (BI) insights.
Because of how they process data, ETL systems work exclusively with relational data stores and have little ability to work with unstructured data.
What’s ELT?
For ELT data integration, data is moved in bulk from sources directly into storage where it’s transformed and made ready for analysis. Instead of having separate processes for extracting and loading, ELT can use an EL tool that does both. By featuring fewer data transfers and transformation processes, ELT typically processes data faster than ETL.
ELT also comes with the benefit of supporting storage of unstructured data, which ETL cannot do. ELT systems work best with unstructured non-relational data, but are able to work with homogenous relational data.
You may see ELT sometimes referred to as ETLT because there’s a minor transformation step represented by a ‘T’ that can include basic validation and cleansing of data. In most cases, ELT and ETLT refer to the same process.
For storage, both ETL and ELT can use data warehouses and distributed file systems like Redshift, Snowflake, BigQuery, Hadoop, and Spark.
What’s Relational Data vs. Unstructured Data?
Relational data is data that’s structured so individual values are organized in a specific order. They come in tables with rows and columns like an Excel spreadsheet.
Unstructured data is any data that doesn’t fit into a table with rows and columns. It can include log files, emails, photos, and videos.
For users with modern infrastructure generating higher volumes of data, ELT’s ability to handle bulk data makes it the new standard. But that doesn’t mean it’s the best solution in all cases. As a more mature technology, ETL systems may appeal with flexibility and familiarity. If your organization has lower volumes of data or intensive data processing needs, ELT may still be the best solution.
To see more about how these processes differ, let’s examine the pros and cons of each method.
ETL and ELT Pros and Cons
ETL PROS |
ETL CONS |
ELT PROS |
ELT CONS |
Familiar with developers |
Inefficiency and bottlenecks in processing |
Scalable processing |
Developers have less experience |
Mature technology has more options for configuration |
Costs more at scale |
Costs less at scale |
Potential security issues |
Higher security and compliance standards |
Requires more setup |
Requires less setup |
Newer technology has fewer options for configuration |
Can reduce data before storage which saves on space |
More maintenance required |
Less maintenance required |
|
Better for sophisticated transformations |
|
Speed and agility of data transfer |
|
Maturity of technology
A core difference that drives several of the pros and cons is that ETL is a more mature technology, while ELT is a newer technology. Because ETL has been the industry standard for longer, it has established customizations and is more familiar with developers. However, ELT offers efficiency gains and is catching up in customizations.
Structured vs. unstructured data
Because it’s built on intensive processing and deals with relational, heavily schematized data, ETL is slower than ELT at processing data and making it available to access. Many of the operational distinctions reflect this difference. With ETL, processing is often run on schedules after-hours when demands on systems are lower. In contrast, ELT leverages scalable inexpensive storage that supports unstructured data.
Difference in setup
The ETL paradigm requires clear definitions of what data is important and needs to be stored before data collection starts. Because of the need for more configurations, designing new data models in the ETL requires intensive revisions of the data pipeline. By being configuration-heavy, ETL systems end up requiring more hours of maintenance than ELT.
In comparison, ELT requires less setup and allows users to immediately start collecting raw data in bulk. This enables analysts to create new use cases, design new data models after collection has started, and schedule transformation tasks when it’s most convenient. Because ELT systems are designed with fewer steps to configure, they generally require less overall maintenance.
Security concerns
There are some security and compliance concerns with ELT systems. It’s possible that some PII data may be viewable by system administrators when the data arrives in bulk in storage. In contrast, ETL transforms data earlier in the process, helping protect data for requirements like HIPAA, GDPR, or GLBA. An ETLT process can also work, as PII can be removed from data before it is loaded into the warehouse.
ETL Use Cases
ETL use cases are likely the best solution for these conditions:
-
Lower volumes of data and less frequent updates
-
Need for highly secure, self-hosted servers
-
Capital to pay for self hosting
-
Desired data transformations are compute intensive
-
All data comes from relational systems
-
There are pressures to reduce amount of data storage
ELT use cases
ELT use cases are likely the best solution for these conditions:
-
High volumes of data
-
Application-rich environments
-
Small to medium sized businesses with limited budget for data pipeline infrastructure
-
Need for unstructured data
-
Big data sets required for BI
-
High volumes of cloud data extracted and loaded by open source software like Hadoop
-
Need to work with data lakes
Alternatives to ETL and ELT
ETL and ELT are dominant paradigms for data integration, but there are use cases that may call for alternative techniques.
Different ways of ordering extract, transform, load
An ELTL paradigm can be found when users want to leverage the speed of ELT, but need to add a more intensive step of transformation by data scrubbing before loading in order to meet privacy concerns.
There are TEL instances in which transformation comes first. One example is token burning in blockchain operations. Transformation of data happens before extraction in the blockchain database.
Choose a data integration strategy that fits your future goals
ETL systems present more options to combine technologies and devise systems that meet specific present needs. However, ELT systems provide a more scalable and agile framework that will prove more accommodating as time goes on and data volumes grow.
Whether you choose an ETL or ELT system, Mozart Data makes it easy to extract, transform, and load your data by providing an out-of-the-box data stack. In hours, you can extract and load all of your data into a data warehouse without any engineering. Our easy-to-use transform layer gives anyone the ability to prepare their raw data for analysis and visualization. Reduce your time to insights by 76% by using Mozart Data to remove the barriers between you and the data analysis that can transform your organization.
Contact us to see how Mozart brings all of your data together to make insights easily accessible to all teams.