Image by author
For beginners in any data field, it’s often tough to understand what the field is really about. You can read theoretical explanations and job descriptions and watch YouTube videos explaining them, but your understanding always stays at that I-get-it-but-not-quite level.
The same is true with data engineering. Of course, you need to know what data engineering is and what data engineers do. And we’ll start with that. But you should complement this theoretical knowledge with practice; at their intersection lies real knowledge.
Practicing data engineering is quite difficult without actually working at a company as a data engineer. This is mainly because data engineering is not only about handling data but also about data architecture and building data infrastructure.
However, there’s a way, and the way is doing data engineering projects. Knowing what data engineers do will help us select suitable projects for mastering data engineering.
What is Data Engineering?
Data engineering ensures data flows – in batches or in real-time – from multiple and various data sources to data storage, where it’s available to data users. In between, data is also processed, analyzed, and transformed into a format suitable for use.
This is called a data pipeline, and the data engineer’s job is to build and maintain it.
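To make the idea concrete, here is a minimal batch-pipeline sketch in Python. It assumes a hypothetical CSV source file with made-up column names and a local SQLite database as the destination; a real pipeline would swap these for production sources and a proper warehouse, but the extract–transform–load shape stays the same.

```python
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw records from a source (here, a local CSV file)
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape the data into an analysis-ready format
    # (column names are illustrative placeholders)
    df = df.dropna(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["revenue"] = df["quantity"] * df["unit_price"]
    return df

def load(df: pd.DataFrame, db_path: str) -> None:
    # Load: write the transformed data to storage that data users can query
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract("orders.csv")), "warehouse.db")
```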
From that description, we can extract crucial aspects of data engineering:
- Data transformation & processing
- Data visualization
- Data pipelines
- Data storage
To master data engineering, your projects should focus on or include some of these topics.
Due to the nature of data engineering, it’s impossible to think of a project that deals with only one aspect of it; such is the holistic nature of a data engineer’s job. You can’t really do a project that only covers data processing – fine, but where does that data come from, and where does it end up?
So, most projects I’ve chosen are end-to-end data engineering projects that will teach you how to build a data pipeline – the essence of data engineering. However, the projects take different approaches and different technologies, so there are some aspects you can learn from one project that you can’t learn from another.
Data Engineering Project Ideas
Image by author
Doing projects teaches you what data engineering is in practice. To complete a project, you must show various technical skills, familiarity with common data engineering tools, and an understanding of the whole process.
This makes projects ideal for learning.
1. Data Pipeline Development Project
You don’t get more data engineering than building a data pipeline. Ensuring data flow from its sources to data users and, by extension, supporting data-driven decision-making is at the heart of data engineering.
By doing a data pipeline development project, you will learn about integrating data from various sources and the whole ETL process.
Project Suggestion
Link: AWS End-to-End Data Engineering by CodeWithYu (Yusuf Ganiyu)
Description: This is an excellent project whose goal is to build a data pipeline that will extract data from Reddit, transform it, and then load it into the Redshift data warehouse.
The video guides you through every step, and the project’s source code is also available on GitHub.
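For a taste of the extraction step, here is a minimal sketch using the PRAW library to pull posts from a subreddit into a DataFrame. The credentials, subreddit name, and output file are placeholders; the full project adds orchestration, S3 staging, and the Redshift load on top of this.

```python
import praw
import pandas as pd

# Reddit API credentials (placeholders -- create an app at reddit.com/prefs/apps)
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-etl-demo",
)

# Extract the newest posts from a subreddit (subreddit name is a placeholder)
posts = [
    {
        "id": submission.id,
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created_utc": submission.created_utc,
    }
    for submission in reddit.subreddit("dataengineering").new(limit=100)
]

# Stage the extracted data as a CSV, ready to be uploaded to S3 and copied into Redshift
pd.DataFrame(posts).to_csv("reddit_posts.csv", index=False)
```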
Technologies Used:
2. Data Transformation Project
Transforming data means converting it into standardized formats that are compatible with analytical tools and suitable for analysis.
Apart from enabling data analysis and decision-making, data transformation also has a vital role in improving data quality, as it involves cleaning and validating data.
Project Suggestion
Link: Chama Data Transformation by StrataScratch
Description: The assignment here is to transform Chama’s data found in three .csv files using whichever programming language you want but following specific transformation rules.
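To give a feel for the kind of transformation rules involved, here is a small pandas sketch that standardizes text fields, validates dates, and joins cleaned files into one output. The file and column names are made up for illustration, not the actual Chama data.

```python
import pandas as pd

# File and column names below are illustrative placeholders
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Standardize text fields: strip whitespace and normalize casing
customers["email"] = customers["email"].str.strip().str.lower()

# Parse dates into a single format and drop rows that fail validation
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date"])

# Join the cleaned datasets and write the transformed result
result = orders.merge(customers, on="customer_id", how="left")
result.to_csv("transformed_orders.csv", index=False)
```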
Technologies Used:
3. Data Lake Implementation Project
Data lakes are central repositories that store large amounts of data in their original format. They are essential for handling and analyzing big data. As big data becomes more common in business, data engineers must know how to implement data lakes.
Project Suggestion
Link: End-to-End Azure Data Engineering by Kaviprakash Selvaraj
Description: This Azure end-to-end data engineering project uses sales data. It covers topics such as data ingestion, processing, and storage. What makes it interesting is that it outlines the steps for setting up and managing a data lake, namely Azure Data Lake.
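As a minimal illustration of the pattern, here is a PySpark sketch that ingests a raw CSV from one zone of an Azure Data Lake Storage account and writes it to a curated zone as Parquet. The storage account, containers, paths, and column name are placeholders, and it assumes the Spark cluster is already configured with access to the storage account.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

# Placeholder ADLS Gen2 paths: replace the containers and storage account with your own
raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/sales.csv"
curated_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/sales/"

# Ingest raw data in its original format from the raw zone of the lake
sales = spark.read.option("header", "true").csv(raw_path)

# Write it to the curated zone as Parquet, partitioned for efficient downstream queries
# (assumes the data has an order_year column; adjust to your schema)
sales.write.mode("overwrite").partitionBy("order_year").parquet(curated_path)
```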
Technologies Used:
4. Data Warehousing Project
Data from data lakes is structured and then stored in data warehouses. These serve as central data repositories for business intelligence.
Implementing a data warehouse makes data retrieval more efficient and simplifies data management, along with ensuring data quality and enabling insights into data.
With a data warehousing project, you will learn about data modeling and database management.
Project Suggestion
Link: AWS Data Engineering Project by Ahmed Ali
Description: This end-to-end project uses NYC taxi data with the goal of building an ELT pipeline in AWS. It’s suitable for learning data warehousing since data is loaded into a data warehouse, namely Amazon Redshift.
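For a flavor of the warehouse side, here is a hedged sketch that creates a simple fact table in Redshift and bulk-loads staged files from S3 with the COPY command via psycopg2. The cluster endpoint, credentials, IAM role, bucket, and table layout are all placeholders, not the project’s actual schema.

```python
import psycopg2

# Connection details are placeholders for your Redshift cluster
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="YOUR_PASSWORD",
)

# A simple fact table -- columns are illustrative, not the project's real model
create_sql = """
CREATE TABLE IF NOT EXISTS fact_taxi_trips (
    trip_id         BIGINT,
    pickup_ts       TIMESTAMP,
    dropoff_ts      TIMESTAMP,
    passenger_count INT,
    fare_amount     DECIMAL(10, 2)
);
"""

# COPY bulk-loads the staged S3 files into the warehouse table
# (bucket path and IAM role are placeholders)
copy_sql = """
COPY fact_taxi_trips
FROM 's3://my-bucket/staged/taxi/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(create_sql)
    cur.execute(copy_sql)
conn.close()
```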
Technologies Used:
5. Real-Time Data Processing Project
Processing data in real-time has become increasingly important for businesses to make timely and proactive decisions. Because of that, data engineers must know how to set up a system that will effectively and efficiently process data in real-time.
Project Suggestion
Link: Real-Time Data Streaming by CodeWithYu (Yusuf Ganiyu)
Description: This CodeWithYu video gives you detailed guidance on building a pipeline for data streaming. You will learn how to set up a data pipeline, stream data in real time, and handle distributed synchronization, data processing, data storage, and containerization.
The data you will work with is generated by the randomuser.me API. Like the video of his I linked earlier, this one also has its source code on GitHub.
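As a taste of the streaming piece, here is a minimal producer sketch using the requests and kafka-python libraries: it polls the randomuser.me API and publishes each record to a Kafka topic. The broker address and topic name are assumptions for a local setup; processing and storage happen downstream in the rest of the stack.

```python
import json
import time

import requests
from kafka import KafkaProducer

# Broker address and topic name are placeholder assumptions for a local setup
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for _ in range(100):  # bounded loop for the sketch; a real producer runs continuously
    # Fetch one randomly generated user profile from the public API
    user = requests.get("https://randomuser.me/api/").json()["results"][0]
    record = {
        "first_name": user["name"]["first"],
        "last_name": user["name"]["last"],
        "email": user["email"],
        "country": user["location"]["country"],
    }
    # Publish the record to the topic; downstream consumers process and store it
    producer.send("users_created", value=record)
    time.sleep(1)

producer.flush()
```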
Technologies Used:
6. Data Visualization Project
While data visualization might not be the first thing that comes to mind when thinking about data engineering, it is an important skill for data engineers.
Visualizing data in the context of data engineering usually means creating operational dashboards that show the current state of data pipelines, e.g., the processing speed or the amount of data ingested.
Data engineers may also create dashboards for data stored in a warehouse to help business users get the information they need more easily.
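As a small illustration of the operational side, a pipeline run can record simple metrics that a dashboard later reads. Here is a minimal sketch, assuming a hypothetical SQLite metrics table; a real setup would write to a warehouse table sitting behind a BI tool.

```python
import sqlite3
import time
from datetime import datetime, timezone

def run_pipeline_step(rows: list[dict]) -> int:
    # Stand-in for real processing; returns the number of rows handled
    return len(rows)

def record_metrics(db_path: str, rows_processed: int, duration_s: float) -> None:
    # Append one row of operational metrics for the dashboard to visualize
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS pipeline_metrics (
                   run_at TEXT, rows_processed INTEGER, duration_s REAL)"""
        )
        conn.execute(
            "INSERT INTO pipeline_metrics VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), rows_processed, duration_s),
        )

start = time.perf_counter()
rows = run_pipeline_step([{"id": 1}, {"id": 2}])
record_metrics("metrics.db", rows, time.perf_counter() - start)
```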
Project Suggestion
Link: From Raw to Data Visualization – Data Engineering Project by Naufaldy Erianda
Description: The goal of this project is to extract data from various sources, transform it, and make it available for data visualization. In the end, you will create a dashboard in Looker Studio.
Technologies Used:
Conclusion
Data engineering is a complex field that might seem overwhelming, especially to beginners. The easiest way to start really understanding what data engineering is all about is by doing data engineering projects.
I suggested six projects that will teach you how to:
- Build a data pipeline
- Transform data
- Implement a data lake
- Implement a data warehouse
- Build a pipeline for real-time data processing
- Visualize data
Machine learning is becoming increasingly essential for automating various data engineering tasks. So, to avoid being left behind, look at some of these machine learning projects and data science projects that can also be used to practice data engineering skills.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.