As I have said numerous times in this blog, my plan is to become a software/data engineer, so one of the next macro-topics I want to cover is the material in the book Deciphering Data Architectures. I need to acquire more specific knowledge about Data Engineering, and I think it’s a good starting point.

When I think about what I need to learn, it is simply overwhelming: data engineering is such a vast field that deciding where to start is hard. Should I start from this book or from classical authors like Kimball and Inmon? Or is it better to go with Martin Kleppmann and his book Designing Data-Intensive Applications? Maybe go with James Serra? Go with cloud-based platforms like Databricks, BigQuery, or AWS? Go with massive sessions on LeetCode solving SQL problems? Or maybe other directions I’m not even aware of? Honestly, I don’t know, and I have no one to ask. YouTube channels and Reddit threads are full of contradictions: everyone has their own opinion about how to start, but no one has convinced me.

What convinced me instead is what Paolo Platter (from AgileLab) said in an interview on the Datapizza YouTube channel. Paolo Platter is the co-founder of a consulting company specialized in Data Engineering, and he really seems to know what he is talking about (which, unfortunately, is not a given). I don’t know why, but I trust him, and in the interview I linked above he specifically said: “We are looking for people who primarily are good Software Engineers”. That’s the real reason why I am focusing on studying Python.

Answers to potential and legitimate objections:
- I know that software engineering is not just about knowing Python, but I think it’s a necessary, though not sufficient, skill.
- Python is not the only possible language. Maybe you will need to use Scala and not Python, but once you are familiar with the fundamental concepts underlying one programming language, it should be easy to carry them over to a new one.
- “Man, to create software you need to know Design Patterns, Apache Airflow, Apache Spark, Apache Kafka, Docker, Kubernetes, SQL, Database System Design, Lambda Functions and Step Functions on AWS, Dataform on Google Cloud Platform, Workflow tools on Databricks, CI/CD pipelines, Data Modeling, Infrastructure as Code tools, [… then an infinite list of tools]”. I know, man, but I have to start with something!
If anyone other than me, and maybe my mother, reads this post and can help me out in this phase, please contact me.