Hacker News | thibautdr's comments

I wrote an article questioning the use of Pandas for ETL. I invite you to read it: https://medium.com/@thibaut_gourdel/should-you-use-pandas-fo...


Yes, there are similarities, but Elyra allows you to develop orchestration pipelines for Python scripts and notebooks, so you still have to write your own code. With Amphi, you design your data pipelines using a graphical interface, and it generates the Python code to execute. Hope that helps.


Thanks for the great questions:

1. As far as I know, there isn't a "standard" file format for low-code pipelines.

2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.

3. Funny you ask, I actually tried to make Amphi run in the browser with WASM. I think it's still too early in terms of both performance and limitations. Performance will likely improve soon, but browser limitations currently prevent the use of sockets, which are indispensable for database connections, for example.
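On point 2, a YAML step like the one below (a hypothetical schema, not any particular tool's) is easy to read, but it illustrates the tradeoff: every option the schema doesn't expose is out of reach.

```yaml
# Readable, but you can only tune what the schema chooses to expose.
- step: read_csv
  path: input.csv
  delimiter: ","
```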


Great, thanks for sharing. I was familiar with Dask and cuDF separately, but not this one. I was planning to implement Dask support through Modin, but I'll definitely take a look at dask_cudf.


Cool. We use it a lot at work for working with large data sets on a GPU cluster.


Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

However, Amphi is a low-code solution while those two are code-based. They also focus on the ingestion part (EL), while Amphi targets different ETL use cases: file integration, data preparation, and AI pipelines.


I understand that. I'd change the title / H1 though, "Open Source Python ETL" doesn't describe what you're building very well.

Good luck! Looks cool.


Hey, not sure I get your point here. I believe the abstraction provides what you're describing. You can swap a file input with a table input without touching the rest of the components (provided you don't have major structural changes). Let me know what you meant :)
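For example, here's a minimal Python sketch of that idea (not Amphi's actual component API, just an illustration): two interchangeable input components feed the same downstream transform, so swapping one source for another leaves the rest untouched.

```python
import csv
import io

def csv_input(text):
    """File input: parses CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def table_input(rows):
    """Table input: rows already arrive structured as dicts."""
    return list(rows)

def uppercase_names(rows):
    # Downstream component: unchanged regardless of which input feeds it.
    return [{**r, "name": r["name"].upper()} for r in rows]

csv_rows = csv_input("name\nalice\nbob\n")
table_rows = table_input([{"name": "alice"}, {"name": "bob"}])

# Same transform works on either source without modification.
assert uppercase_names(csv_rows) == uppercase_names(table_rows)
```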


Thanks for pointing that out. It's actually mentioned ("Extract, transform and load ...") in the very first sentence below the tagline, but if you didn't get it, then it's not clear enough.


Thanks! Don't hesitate to give it a try and reach out if you need anything :)


Thanks for your question. Amphi generates Python code using Pandas and can scale on a single machine or even multiple machines using Modin, but the process is manual for now. Future plans include deploying pipelines on Spark clusters and other services such as Snowflake.


What about Dask?


Using Modin, deploying the pandas code on Dask should be possible: https://modin.readthedocs.io/en/stable/development/using_pan...
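As a sketch of why Modin works here: existing pandas code stays the same and only the import (plus engine selection) changes. The DataFrame below is illustrative; the commented lines assume Modin and a Dask engine are installed.

```python
# With Modin on Dask, only these lines would change:
#   import modin.config as cfg; cfg.Engine.put("dask")
#   import modin.pandas as pd
import pandas as pd  # the pandas code itself is untouched

df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
filtered = df[df["value"] > 15]  # same API under pandas or Modin
ids = filtered["id"].tolist()
```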


Thanks for your comment! I do believe it depends on who you ask, and ultimately both will co-exist. I also think low-code solutions democratize access to ETL development, offering a significant productivity advantage for smaller teams. With Amphi, I'm trying to avoid the common pitfalls of other low-code ETL tools, such as scalability issues, inflexibility, and vendor lock-in, while embracing the advantages of modern ETL-as-code:

- Pipelines are defined as JSON files (git workflows available)
- Amphi generates non-proprietary Python code, so pipelines can be deployed anywhere: AWS Lambda, EC2, on-premises, or Databricks.
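A small sketch of what "pipelines as JSON" buys you (the component names and schema below are illustrative, not Amphi's actual format): serializing with stable key ordering keeps git diffs readable.

```python
import json

# Hypothetical minimal pipeline definition: a list of components plus
# the edges connecting them.
pipeline = {
    "components": [
        {"id": "read",  "type": "csv_file_input",  "config": {"path": "in.csv"}},
        {"id": "clean", "type": "drop_duplicates", "config": {}},
        {"id": "write", "type": "csv_file_output", "config": {"path": "out.csv"}},
    ],
    "edges": [["read", "clean"], ["clean", "write"]],
}

# Stable ordering means edits produce small, reviewable diffs in git.
text = json.dumps(pipeline, indent=2, sort_keys=True)
```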


I'm very leery of low-code, but I like the idea of ETL defined as configuration.


ETL as text is good, because you can save it in version control. (Whether it's "code" or "JSON" is irrelevant to the VCS.)

Edit: saving in a VCS strongly implies usability of 'diff' and 'grep'

