How to select the best ETL tool?

Preet Mehta
3 min read · Jan 31, 2022

How to judge an ETL tool?

The four most important factors to consider are environment, architecture, automation, and reliability.

Environment:

More modern approaches leverage the power of the cloud. If your data warehouse runs on the cloud, you want a cloud-native data integration tool that was architected from the start for ELT.

Architecture:

Speaking of ELT, another important consideration is the architectural difference between ETL and ELT. As we have previously discussed, ETL requires high upfront monetary and labor costs, as well as ongoing costs in the form of constant revision.

By contrast, ELT radically simplifies data integration by decoupling extraction and loading from transformations, making data modeling a more analyst-centric rather than engineering-centric activity.
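
As a rough illustration of that decoupling, here is a minimal sketch in Python (the table and field names are made up, and sqlite3 stands in for a real warehouse): extraction and loading land the raw records untouched, and the modeling happens later as a separate SQL step an analyst can own.

```python
import json
import sqlite3

# Stand-in "warehouse": sqlite3 here so the sketch runs end to end.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the raw records as-is, no business logic yet.
raw_orders = [
    {"id": 1, "amount": "19.99", "country": "DE"},
    {"id": 2, "amount": "5.00", "country": "US"},
]
warehouse.execute("CREATE TABLE raw_orders (payload TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders (payload) VALUES (?)",
    [(json.dumps(r),) for r in raw_orders],
)

# Transform: a separate, analyst-owned SQL model run inside the warehouse
# (uses SQLite's JSON1 functions, which ship with most Python builds).
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.id')                    AS order_id,
           CAST(json_extract(payload, '$.amount') AS REAL)  AS amount,
           json_extract(payload, '$.country')               AS country
    FROM raw_orders
""")
print(warehouse.execute("SELECT * FROM orders").fetchall())
```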

Automation:

The advantages of ELT and cloud computing are significantly diminished if you have to involve skilled DBAs or data engineers every time you replicate new data. Reading and understanding the source schema, and making the adaptations needed to move the data from one system to the other, should all happen automatically. The point of an ETL tool is to avoid coding.
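
To make that concrete, here is a minimal sketch (Python, with hypothetical table and column names) of the kind of schema handling a good tool automates: inferring column types from the source records and creating or extending the destination table without hand-written DDL.

```python
import sqlite3

# Map Python value types to destination column types (simplified on purpose).
TYPE_MAP = {int: "INTEGER", float: "REAL", str: "TEXT", bool: "INTEGER"}

def sync(conn, table, records):
    """Create the destination table from the source schema, add any new
    columns that appear later, then load the records."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for record in records:
        for column, value in record.items():
            if column not in existing:
                col_type = TYPE_MAP.get(type(value), "TEXT")
                if existing:
                    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
                else:
                    conn.execute(f"CREATE TABLE {table} ({column} {col_type})")
                existing.add(column)
        cols = ", ".join(record)
        marks = ", ".join("?" for _ in record)
        conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                     list(record.values()))

conn = sqlite3.connect(":memory:")
sync(conn, "users", [{"id": 1, "name": "Ada"}])
sync(conn, "users", [{"id": 2, "name": "Linus", "plan": "pro"}])  # new column handled automatically
```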

Reliability:

A reliable data pipeline has high uptime and delivers data with high fidelity. Part of the reason you choose an ETL tool is so you don’t have to worry about how your data pipeline will recover from failure. Your provider should be able to route around problems and redo replications without duplicating data or (maybe worse) missing any of it.
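
One common way a pipeline achieves that is idempotent writes. The sketch below (Python, with a hypothetical customers table) merges records by primary key, so replaying a batch after a failure neither duplicates nor drops rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

def replicate(batch):
    """Upsert by primary key, so replaying the same batch after a crash
    is harmless: no duplicates, no lost rows."""
    conn.executemany(
        "INSERT INTO customers (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        batch,
    )
    conn.commit()

batch = [(1, "a@example.com"), (2, "b@example.com")]
replicate(batch)
replicate(batch)  # simulated retry after a failure: still exactly two rows
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # -> 2
```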


Additional comparison points:

Extract:
- Check that it connects to multiple data sources (Postgres, Apache Kafka, Oracle, SAP, IBM, MySQL…)
- Does it support file-based sources in formats such as JSON and XML?
- What happens if the schema of the data source changes? (Will the pipeline be able to handle the change?)
- Update frequency (allowed data velocity; see the cursor-based extraction sketch after this list)
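
On the update-frequency point, this is a minimal sketch (Python, with a hypothetical events table and updated_at column) of the incremental, cursor-based extraction many tools use to keep sync intervals short: each run pulls only the rows changed since the last saved cursor.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER, updated_at TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(1, "2022-01-30T10:00:00"), (2, "2022-01-31T09:00:00")])

def extract_incremental(conn, cursor_value):
    """Pull only rows changed since the last run; the returned cursor is
    persisted so the next run can resume where this one left off."""
    rows = conn.execute(
        "SELECT id, updated_at FROM events WHERE updated_at > ? ORDER BY updated_at",
        (cursor_value,),
    ).fetchall()
    new_cursor = rows[-1][1] if rows else cursor_value
    return rows, new_cursor

rows, cursor = extract_incremental(source, "1970-01-01T00:00:00")
print(len(rows), cursor)              # first run: full history
rows, cursor = extract_incremental(source, cursor)
print(len(rows))                      # second run: nothing new -> 0
```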

Transform:
- The data usually needs to be changed or enriched to better serve the needs of the downstream user.
- Transformations range from simple changes to the source data to multi-step processing that interacts with multiple systems.
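
A small illustration of that range, in Python with made-up field names: a simple cleanup step followed by an enrichment step that would normally consult a second system (mocked here as an in-memory lookup).

```python
# Stand-in for a reference service or second system.
COUNTRY_NAMES = {"DE": "Germany", "US": "United States"}

def clean(record):
    """Simple change to the source data: normalise types and trim strings."""
    return {
        "order_id": int(record["id"]),
        "amount": round(float(record["amount"]), 2),
        "country_code": record["country"].strip().upper(),
    }

def enrich(record):
    """Multi-step processing: add a field derived from another system."""
    record["country_name"] = COUNTRY_NAMES.get(record["country_code"], "Unknown")
    return record

raw = {"id": "42", "amount": "19.999", "country": " de "}
print(enrich(clean(raw)))
# -> {'order_id': 42, 'amount': 20.0, 'country_code': 'DE', 'country_name': 'Germany'}
```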

Load:
- Destination: data warehouse or data lake
- Does it support bulk and/or parallel loads that dramatically reduce the total time to load the data?
- Is the destination designed for analytical workloads? (i.e., can a BI tool connect to it to visualize the data?)
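
As a rough sketch of why bulk loading matters (Python, with a hypothetical facts table and sqlite3 standing in for a warehouse): loading in large chunks inside single transactions, rather than row by row, is the basic idea behind the bulk/COPY paths warehouses expose.

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER, value REAL)")

def bulk_load(rows, chunk_size=10_000):
    """Load in large chunks, one transaction per chunk, instead of
    committing every row individually."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        with conn:  # one transaction per chunk
            conn.executemany("INSERT INTO facts (id, value) VALUES (?, ?)", chunk)

bulk_load((i, i * 0.5) for i in range(100_000))
print(conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0])  # -> 100000
```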

General points:
- Price?
- Is the tool flexible enough to allow new steps in the pipeline without breaking downstream processing?
- How does the tool behave when it hits corrupt data, network failures, or any other error condition? (Error handling; see the retry sketch after this list)
- How does total processing time grow as the data keeps growing? (Performance tuning)
- How well does the tool scale with increasing data loads, network failures, third-party service invocations, and parallelized data loads? (Scalability)
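
On the error-handling point in particular, here is a minimal sketch (Python; load_batch is a hypothetical loader) of the retry-with-backoff behaviour you would expect: transient failures are retried, and the error surfaces only once the retries are exhausted.

```python
import random
import time

def load_batch(batch):
    """Hypothetical loader that sometimes hits a transient network error."""
    if random.random() < 0.5:
        raise ConnectionError("temporary network failure")
    print(f"loaded {len(batch)} rows")

def load_with_retries(batch, attempts=5, base_delay=0.5):
    """Retry transient failures with exponential backoff instead of failing
    the whole pipeline; re-raise only once the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return load_batch(batch)
        except ConnectionError as err:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the operator
            time.sleep(base_delay * 2 ** attempt)
            print(f"retrying after error: {err}")

load_with_retries([{"id": 1}, {"id": 2}])
```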

Types of ETL tools:

1. Enterprise Software ETL:
- Informatica
- IBM
- Oracle
- SSIS
- SAP
- Fivetran

2. Open source ETL:
- Talend
- Pentaho
- Hadoop
- Airbyte

3. Custom ETL tools:
- SQL
- Python
- Java
- Spark & Hadoop

4. Cloud ETL:
- AWS EMR
- AWS Glue
- AWS Data Pipeline
- Google Cloud Dataflow
- Segment (SaaS)


Preet Mehta

I work as a Data Analyst with LinkedIn Corporation Inc. Originally from Gujarat, I'm passionate about teaching, networking, and philosophy. I'm 27 years old.