
Advanced proficiency in SQL and in any scripting language of your choice (C#/PowerShell, Python) is enough to be a data engineer on any technical stack: Windows/Linux, on-prem/cloud, vendor-specific/open source, literally anything.


I disagree. That's not enough these days.

If you want to build anything mildly interesting, you need a solid background in software engineering (building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL). You need to really understand your runtime (e.g. the JVM, and how to tune it when working with massive amounts of data). You also need some knowledge of infrastructure, because some of the most specialized and powerful tools do not yet have an established "way of doing things", and their stateful nature makes them different from your typical web app deployment.
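To give a taste of what "tuning the runtime" means in practice, here is an illustrative spark-submit invocation. The config keys are real Spark settings, but the values and the script name are made up; sensible values depend entirely on the workload and cluster:

```shell
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
  my_pipeline.py
```

Getting numbers like these wrong is exactly where "knows SQL" stops being enough: undersized memoryOverhead or a bad shuffle partition count shows up as OOM-killed executors, not as a SQL error.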

Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.


I believe what you described is the job of a Platform Engineer/Systems Engineer/Data Lake Architect, especially the JVM aspect of it. The interesting work is at the beginning, when you build the cluster initially or do a major extension; after that, the ops/maintenance is usually outsourced to cheap offshore labor, so that kind of job is personally not for me.

Spark has a DataFrame API which is similar to the pandas API and can be learned in a day, especially if you already know Python.
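The similarity is easy to see with a filter → group → count pipeline written three ways. The PySpark and pandas one-liners are shown as comments; the runnable part below is a plain-Python stand-in (with hypothetical sample data) just to show the shape of the operation:

```python
# PySpark:  df.filter(df.age > 30).groupBy("city").count()
# pandas:   df[df.age > 30].groupby("city").size()
# Plain Python equivalent of the same pipeline:
from collections import Counter

rows = [
    {"city": "NYC", "age": 34},
    {"city": "NYC", "age": 28},
    {"city": "LA", "age": 41},
]
counts = Counter(r["city"] for r in rows if r["age"] > 30)
print(counts)  # Counter({'NYC': 1, 'LA': 1})
```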

Same for Airflow and other frameworks: it's just a fancy scheduler that anyone can pick up in a couple of days.
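The "fancy scheduler" core is just running tasks in dependency order over a DAG. A toy sketch with the stdlib (the task names are hypothetical; a real Airflow DAG wraps each step in an operator and adds retries, scheduling, and state tracking):

```python
from graphlib import TopologicalSorter

# extract -> transform -> load -> report, expressed as {task: its dependencies}
tasks = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

ran = []
for name in TopologicalSorter(tasks).static_order():
    ran.append(name)  # in Airflow, this is where the task's operator would execute

print(ran)  # ['extract', 'transform', 'load', 'report']
```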


> building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL

What if you build your data pipelines in SQL? Curious whether you have an example of a data pipeline that actually needs Spark.
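For the sake of illustration, a pure-SQL pipeline really can be just a chain of CREATE TABLE ... AS SELECT steps. A minimal sketch using sqlite3 from the Python stdlib (table names and data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 10.0, 'paid'), (2, NULL, 'paid'), (3, 5.0, 'cancelled');

    -- step 1: clean (drop NULLs and non-paid orders)
    CREATE TABLE orders_clean AS
        SELECT id, amount FROM raw_orders
        WHERE amount IS NOT NULL AND status = 'paid';

    -- step 2: aggregate
    CREATE TABLE revenue AS
        SELECT SUM(amount) AS total FROM orders_clean;
""")
total = con.execute("SELECT total FROM revenue").fetchone()[0]
print(total)  # 10.0
```

The usual argument for Spark is not expressiveness but scale: the same chain of SELECTs, run over data that no longer fits one machine's disk or memory.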



