
Labs

📄️ Data Pipeline with dbt, Airflow and Great Expectations

Data quality has become a much-discussed topic in data engineering and data science, and it's clear that data validation is crucial to the reliability of the data products and insights an organization's pipelines produce. Apache Airflow and dbt (data build tool) are among the most prominent open-source tools in the data engineering ecosystem, and while dbt offers some data testing capabilities, another open-source tool, Great Expectations, complements them with dedicated data validation that adds an extra layer of robustness to the pipeline.
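As a taste of how these tools fit together, here is a minimal sketch of an Airflow DAG that runs a dbt build and then validates the results with a Great Expectations checkpoint. The project paths and the checkpoint name (`my_checkpoint`) are placeholders, not part of the lab.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_great_expectations_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # Build the dbt models; --project-dir/--profiles-dir are assumed paths.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    # Validate the freshly built tables with a Great Expectations
    # checkpoint via the pre-1.0 GE CLI; the checkpoint name is hypothetical.
    validate = BashOperator(
        task_id="validate_with_great_expectations",
        bash_command="great_expectations checkpoint run my_checkpoint",
    )

    # Validation runs only after the dbt build succeeds, so bad data
    # is caught before it reaches downstream consumers.
    dbt_run >> validate
```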

📄️ Data Pipeline with Databricks PySpark and Superset

Put on your data engineer hat! In this project, you'll build a modern, cloud-based, three-layer data lakehouse. First, you'll set up your workspace on the Databricks platform and make use of its key features before pushing data into the first two layers of the data lake. Next, using Apache Spark, you'll build the third layer, which serves insights to different end users. Then you'll use Delta Lake to turn the existing data lake into a lakehouse. Finally, you'll deliver infrastructure that lets end users run specific queries with Apache Superset and build dashboards on top of the existing data. When you're done with the projects in this series, you'll have a complete big data pipeline for a cloud-based data lake, and you'll understand why the three-layer architecture is so popular.
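As a rough preview of that three-layer flow, here is a minimal PySpark sketch of bronze, silver, and gold Delta Lake tables on Databricks. All paths, table names, and columns (`/mnt/raw/events`, `events_bronze`, `event_id`, and so on) are hypothetical stand-ins for whatever dataset the lab uses.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided for you on Databricks

# Bronze layer: ingest the raw files as-is, preserving the source schema.
raw = spark.read.json("/mnt/raw/events")
raw.write.format("delta").mode("overwrite").saveAsTable("events_bronze")

# Silver layer: clean and deduplicate the bronze data.
silver = (
    spark.table("events_bronze")
    .dropDuplicates(["event_id"])
    .filter(F.col("event_ts").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("events_silver")

# Gold layer: aggregate into a serving table that Superset can query.
gold = (
    spark.table("events_silver")
    .groupBy("event_type")
    .agg(F.count("*").alias("event_count"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("events_gold")
```

Each layer is written as a Delta table rather than plain files, which is what upgrades the lake to a lakehouse: you get ACID writes and time travel while Superset queries the gold table like any other.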

Complete list of Labs