Labs
📄️ Workspace Setup
Install Anaconda and Jupyter
📄️ Getting Started with Postgres
How to connect to Postgres
📄️ Data Modeling with Postgres
Sparkify Data Modeling
📄️ Getting Started with Cassandra
Cassandra and Shell
📄️ Data Modeling with Cassandra
Sparkify Data Modeling
📄️ Data Warehousing with Amazon Redshift
Process taxi data and save it to Redshift using AWS Wrangler
📄️ Data Warehousing with Snowflake
Connect to Snowflake using Python
📄️ Data Lake on AWS
Build a simple data pipeline using Python and AWS Data Lake
📄️ Delta Lake Deep-dive
Setup
📄️ Converting CSV to Parquet with AWS Lambda Trigger
Create an S3 bucket and an IAM user with a user-defined policy. Create a Lambda layer and a Lambda function, and add the layer to the function. Add an S3 trigger for automatic CSV-to-Parquet transformation, and query the results with Glue.
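As a rough sketch of what the Lambda handler can look like, the snippet below assumes the AWS SDK for pandas (awswrangler) layer is attached to the function; the destination bucket and Glue database names are placeholders, not the lab's actual resources.

```python
# Minimal sketch of the CSV-to-Parquet Lambda handler, assuming the
# awswrangler (AWS SDK for pandas) layer is attached to the function.
import urllib.parse

import awswrangler as wr

DEST_BUCKET = "my-parquet-bucket"      # hypothetical destination bucket
GLUE_DATABASE = "csv_to_parquet_db"    # hypothetical Glue database


def lambda_handler(event, context):
    # The S3 trigger passes the uploaded object's bucket and key in the event.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the uploaded CSV straight from S3 into a pandas DataFrame.
    df = wr.s3.read_csv(path=f"s3://{bucket}/{key}")

    # Write it back as Parquet and register the table in the Glue catalog
    # so the result can be queried with Glue/Athena.
    table_name = key.rsplit("/", 1)[-1].replace(".csv", "").replace("-", "_")
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{DEST_BUCKET}/{table_name}/",
        dataset=True,
        database=GLUE_DATABASE,
        table=table_name,
        mode="append",
    )
    return {"status": "ok", "rows": len(df)}
```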
📄️ Data Transformation with Glue Studio ETL
You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, transform that data, and save the result set to a data target.
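For orientation, the sketch below shows the general shape of the PySpark script that Glue Studio generates behind its visual editor; the catalog database, table name, mappings, and S3 path are placeholders.

```python
# Sketch of a catalog-to-S3 Glue job script; all names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table from the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_csv"
)

# Transform: rename and cast columns with an ApplyMapping node.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Load: write the result set to the data target as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```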
📄️ Data Transformation with dbt
Step 1: Install the libraries
📄️ Data Transformation with PySpark
Building an ETL Pipeline with Databricks PySpark and AWS S3
📄️ Getting Started with Airflow
Bash echo
📄️ ACLED ETL Pipeline with Airflow
Process flow
📄️ Real-time ETL pipeline with Kafka and Spark Streaming
Process flow
📄️ Data Pipeline with dbt, Airflow and Great Expectations
Data quality has become a much-discussed topic in the fields of data engineering and data science, and it's become clear that data validation is crucial to ensuring the reliability of data products and insights produced by an organization's data pipelines. Apache Airflow and dbt (data build tool) are among the most prominent open-source tools in the data engineering ecosystem. While dbt offers some data testing capabilities, another open-source tool, Great Expectations, adds a dedicated data validation layer that makes the pipeline more robust.
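To give a taste of that validation layer, here is a minimal sketch using Great Expectations' pandas-style API (the exact API differs between Great Expectations versions); the file and column names are placeholders.

```python
# Minimal sketch of a Great Expectations validation step; names are placeholders.
import great_expectations as ge

# Wrap a pandas DataFrame so expectation methods become available.
df = ge.read_csv("data/orders.csv")

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Validate and fail the task (e.g. an Airflow task) if any expectation fails.
results = df.validate()
if not results["success"]:
    raise ValueError(f"Data validation failed: {results}")
```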
📄️ Data Pipeline with Databricks PySpark and Superset
Put on your data engineer hat! In this project, you'll build a modern, cloud-based, three-layer data lakehouse. First, you'll set up your workspace on the Databricks platform, leveraging important Databricks features, before pushing the data into the first two layers of the data lake. Next, using Apache Spark, you'll build the third layer, used to serve insights to different end users. Then, you'll use Delta Lake to turn your existing data lake into a lakehouse. Finally, you'll deliver an infrastructure that allows your end users to run specific queries with Apache Superset and build dashboards on top of the existing data. When you're done with the projects in this series, you'll have a complete big data pipeline for a cloud-based data lake, and you'll understand why the three-layer architecture is so popular.
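To make the three-layer idea concrete, here is a minimal PySpark and Delta Lake sketch of a bronze/silver/gold flow on Databricks; the source path, columns, and table names are placeholders, not the project's actual schema.

```python
# Sketch of a three-layer (bronze/silver/gold) flow; names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided on Databricks

# Bronze: land the raw files as-is in a Delta table.
raw = spark.read.option("header", True).csv("/mnt/raw/events/")
raw.write.format("delta").mode("overwrite").saveAsTable("bronze_events")

# Silver: clean and conform the bronze data.
silver = (
    spark.table("bronze_events")
    .dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .filter(F.col("event_ts").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")

# Gold: aggregate into a serving table that Superset can query.
gold = (
    spark.table("silver_events")
    .groupBy(F.to_date("event_ts").alias("event_date"))
    .agg(F.count("*").alias("event_count"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_events")
```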
📄️ Building a Data Pipeline for Sparkify Music Company
Process
📄️ Funflix Data Pipeline
Data Modeling
📄️ Sakila Music Company
The Sakila sample database is made available by MySQL and is licensed via the New BSD license. Sakila contains data for a fictitious movie rental company, and includes tables such as store, inventory, film, customer, and payment. While actual movie rental stores are largely a thing of the past, with a little imagination we could rebrand it as a movie-streaming company by ignoring the staff and address tables and renaming store to streaming_service.
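As a quick illustration of that rebranding, the sketch below renames the table over a local MySQL connection; the credentials are placeholders, and the staff and address tables are simply ignored rather than dropped.

```python
# Minimal sketch of the "rebranding" idea, assuming a local MySQL instance
# with the Sakila sample database loaded and placeholder credentials.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="sakila"
)
cursor = conn.cursor()

# Rename store to streaming_service; MySQL updates foreign key references.
cursor.execute("RENAME TABLE store TO streaming_service;")

cursor.execute("SELECT COUNT(*) FROM streaming_service;")
print(cursor.fetchone()[0])

cursor.close()
conn.close()
```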
📄️ Hospital Data Analysis with SQL
Ingestion of the data into Postgres
📄️ Ecommerce Data Analysis with SQL
Basic SQL
📄️ Order Analysis with Redshift SQL
Set up the environment
📄️ Movie Sentiment Analysis
Set up the environment
📄️ Other Projects
Data Transformation with Snowpark Python and dbt
📄️ Leetcode and HackerRank SQL
A set of Leetcode SQL questions is available here.