Hacker News | thibautdr's comments

I wrote an article questioning the use of Pandas for ETL. I invite you to read it: https://medium.com/@thibaut_gourdel/should-you-use-pandas-fo...


Yes, there are similarities, but Elyra allows you to develop orchestration pipelines for Python scripts and notebooks, so you still have to write your own code. With Amphi, you design your data pipelines using a graphical interface, and it generates the Python code to execute. Hope that helps.


Thanks for the great questions:

1. As far as I know, there isn't a "standard" file format for low-code pipelines.

2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.

3. Funny you ask, I actually tried to make Amphi run in the browser with WASM. I think it's still too early in terms of both performance and limitations. Performance will likely improve soon, but browser limitations currently prevent the use of sockets, which are indispensable for database connections, for example.
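On point 2, a YAML step like the one below (a hypothetical schema, not any particular tool's) is easy to read, but it illustrates the tradeoff: every option the schema doesn't expose is out of reach.

```yaml
# Readable, but you can only tune what the schema chooses to expose.
- step: read_csv
  path: input.csv
  delimiter: ","
```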


Great, thanks for sharing. I was familiar with Dask and cuDF separately, but not this one. I was planning to implement Dask support through Modin, but I'll definitely take a look at dask_cudf.


Cool. We use it a lot at work for working with large data sets on a GPU cluster.


Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

However, Amphi is a low-code solution while those two are code-based. They also focus on the ingestion part (EL), while Amphi targets different ETL use cases: file integration, data preparation, and AI pipelines.


I understand that. I'd change the title / H1 though, "Open Source Python ETL" doesn't describe what you're building very well.

Good luck! Looks cool.


Hey, not sure I get your point here. I believe the abstraction provides what you're describing. You can swap a file input with a table input without touching the rest of the components (provided you don't have major structural changes). Let me know what you meant :)
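For example, here's a minimal Python sketch of that idea (not Amphi's actual component API, just an illustration): two interchangeable input components feed the same downstream transform, so swapping one source for another leaves the rest untouched.

```python
import csv
import io

def csv_input(text):
    """File input: parses CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def table_input(rows):
    """Table input: rows already arrive structured as dicts."""
    return list(rows)

def uppercase_names(rows):
    # Downstream component: unchanged regardless of which input feeds it.
    return [{**r, "name": r["name"].upper()} for r in rows]

csv_rows = csv_input("name\nalice\nbob\n")
table_rows = table_input([{"name": "alice"}, {"name": "bob"}])

# Same transform works on either source without modification.
assert uppercase_names(csv_rows) == uppercase_names(table_rows)
```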


Thanks for pointing that out. It's actually mentioned ("Extract, transform and load ...") in the very first sentence below the tagline, but if you didn't get it, then it's not clear enough.


Thanks! Don't hesitate to give it a try and reach out if you need anything :)


Thanks for your question. Amphi generates Python code using Pandas and can scale on a single machine or even multiple machines using Modin, but the process is manual for now. Future plans include deploying pipelines on Spark clusters and other services such as Snowflake.


What about Dask?


Using Modin, deploying the pandas code on Dask should be possible: https://modin.readthedocs.io/en/stable/development/using_pan...
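As a sketch of why Modin works here: existing pandas code stays the same and only the import (plus engine selection) changes. The DataFrame below is illustrative; the commented lines assume Modin and a Dask engine are installed.

```python
# With Modin on Dask, only these lines would change:
#   import modin.config as cfg; cfg.Engine.put("dask")
#   import modin.pandas as pd
import pandas as pd  # the pandas code itself is untouched

df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
filtered = df[df["value"] > 15]  # same API under pandas or Modin
ids = filtered["id"].tolist()
```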


Thanks for your comment! I do believe it depends on who you ask, and ultimately both will co-exist. I also think low-code solutions democratize access to ETL development, offering a significant productivity advantage for smaller teams. With Amphi, I'm trying to avoid the common pitfalls of other low-code ETL tools, such as scalability issues, inflexibility, and vendor lock-in, while embracing the advantages of modern ETL-as-code:

- Pipelines are defined as JSON files (git workflows available)
- Amphi generates non-proprietary Python code, so pipelines can be deployed anywhere: AWS Lambda, EC2, on-premises, or Databricks.
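A small sketch of what "pipelines as JSON" buys you (the component names and schema below are illustrative, not Amphi's actual format): serializing with stable key ordering keeps git diffs readable.

```python
import json

# Hypothetical minimal pipeline definition: a list of components plus
# the edges connecting them.
pipeline = {
    "components": [
        {"id": "read",  "type": "csv_file_input",  "config": {"path": "in.csv"}},
        {"id": "clean", "type": "drop_duplicates", "config": {}},
        {"id": "write", "type": "csv_file_output", "config": {"path": "out.csv"}},
    ],
    "edges": [["read", "clean"], ["clean", "write"]],
}

# Stable ordering means edits produce small, reviewable diffs in git.
text = json.dumps(pipeline, indent=2, sort_keys=True)
```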


I'm very leery of low-code, but I like the idea of ETL defined as configuration.


ETL as text is good, because you can save it in version control. (Whether it's "code" or "JSON" is irrelevant to the VCS.)

Edit: saving in a VCS strongly implies usability of 'diff' and 'grep'

