Labs
📄️ Workspace Setup
Install Anaconda and Jupyter
📄️ Getting Started with Postgres
How to connect to Postgres
📄️ Data Modeling with Postgres
Sparkify Data Modeling
📄️ Getting Started with Cassandra
Cassandra and Shell
📄️ Data Modeling with Cassandra
Sparkify Data Modeling
📄️ Data Warehousing with Amazon Redshift
Process taxi data and save it to Redshift using AWS Wrangler
📄️ Data Warehousing with Snowflake
Connect to Snowflake using Python
📄️ Data Lake on AWS
Build a simple data pipeline using Python and AWS Data Lake
📄️ Delta Lake Deep-dive
Setup
📄️ Converting CSV to Parquet with AWS Lambda Trigger
Create an S3 bucket and an IAM user with a user-defined policy. Create a Lambda layer and a Lambda function, and add the layer to the function. Add an S3 trigger for automatic CSV-to-Parquet transformation, and query the results with Glue.
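As a rough sketch of what the Lambda handler can look like, the snippet below assumes the AWS SDK for pandas (awswrangler) layer is attached to the function; the destination bucket and Glue database names are placeholders, not the lab's actual resources.

```python
# Minimal sketch of the CSV-to-Parquet Lambda handler, assuming the
# awswrangler (AWS SDK for pandas) layer is attached to the function.
import urllib.parse

import awswrangler as wr

DEST_BUCKET = "my-parquet-bucket"      # hypothetical destination bucket
GLUE_DATABASE = "csv_to_parquet_db"    # hypothetical Glue database


def lambda_handler(event, context):
    # The S3 trigger passes the uploaded object's bucket and key in the event.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the uploaded CSV straight from S3 into a pandas DataFrame.
    df = wr.s3.read_csv(path=f"s3://{bucket}/{key}")

    # Write it back as Parquet and register the table in the Glue catalog
    # so the result can be queried with Glue/Athena.
    table_name = key.rsplit("/", 1)[-1].replace(".csv", "").replace("-", "_")
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{DEST_BUCKET}/{table_name}/",
        dataset=True,
        database=GLUE_DATABASE,
        table=table_name,
        mode="append",
    )
    return {"status": "ok", "rows": len(df)}
```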
📄️ Data Transformation with Glue Studio ETL
You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, transform that data, and save the result set to a data target.
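For orientation, the sketch below shows the general shape of the PySpark script that Glue Studio generates behind its visual editor; the catalog database, table name, mappings, and S3 path are placeholders.

```python
# Sketch of a catalog-to-S3 Glue job script; all names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table from the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_csv"
)

# Transform: rename and cast columns with an ApplyMapping node.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Load: write the result set to the data target as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```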
📄️ Data Transformation with dbt
Step 1: Install the libraries
📄️ Data Transformation with PySpark
Building an ETL Pipeline with Databricks PySpark and AWS S3
📄️ Getting Started with Airflow
Bash echo
📄️ ACLED ETL Pipeline with Airflow
Process flow
📄️ Real-time ETL pipeline with Kafka and Spark Streaming
Process flow
📄️ Data Pipeline with dbt, Airflow and Great Expectations
Data quality has become a much-discussed topic in the fields of data engineering and data science, and it's become clear that data validation is crucial to ensuring the reliability of data products and insights produced by an organization's data pipelines. Apache Airflow and dbt (data build tool) are among the most prominent open-source tools in the data engineering ecosystem. While dbt offers some data testing capabilities, another open-source tool, Great Expectations, adds a dedicated data validation layer that makes the pipeline more robust.
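To give a taste of that validation layer, here is a minimal sketch using Great Expectations' pandas-style API (the exact API differs between Great Expectations versions); the file and column names are placeholders.

```python
# Minimal sketch of a Great Expectations validation step; names are placeholders.
import great_expectations as ge

# Wrap a pandas DataFrame so expectation methods become available.
df = ge.read_csv("data/orders.csv")

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Validate and fail the task (e.g. an Airflow task) if any expectation fails.
results = df.validate()
if not results["success"]:
    raise ValueError(f"Data validation failed: {results}")
```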
📄️ Data Pipeline with Databricks PySpark and Superset
Put on your data engineer hat! In this project, you'll build a modern, cloud-based, three-layer data lakehouse. First, you'll set up your workspace on the Databricks platform, leveraging important Databricks features, before pushing the data into the first two layers of the data lake. Next, using Apache Spark, you'll build the third layer, used to serve insights to different end users. Then, you'll use Delta Lake to turn your existing data lake into a lakehouse. Finally, you'll deliver an infrastructure that allows your end users to run specific queries with Apache Superset and build dashboards on top of the existing data. When you're done with the projects in this series, you'll have a complete big data pipeline for a cloud-based data lake, and you'll understand why the three-layer architecture is so popular.
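To make the three-layer idea concrete, here is a minimal PySpark and Delta Lake sketch of a bronze/silver/gold flow on Databricks; the source path, columns, and table names are placeholders, not the project's actual schema.

```python
# Sketch of a three-layer (bronze/silver/gold) flow; names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided on Databricks

# Bronze: land the raw files as-is in a Delta table.
raw = spark.read.option("header", True).csv("/mnt/raw/events/")
raw.write.format("delta").mode("overwrite").saveAsTable("bronze_events")

# Silver: clean and conform the bronze data.
silver = (
    spark.table("bronze_events")
    .dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .filter(F.col("event_ts").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")

# Gold: aggregate into a serving table that Superset can query.
gold = (
    spark.table("silver_events")
    .groupBy(F.to_date("event_ts").alias("event_date"))
    .agg(F.count("*").alias("event_count"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_events")
```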
📄️ Building a Data Pipeline for Sparkify Music Company
Process
📄️ Funflix Data Pipeline
Data Modeling
📄️ Sakila Music Company
The Sakila sample database is made available by MySQL and is licensed via the New BSD license. Sakila contains data for a fictitious movie rental company, and includes tables such as store, inventory, film, customer, and payment. While actual movie rental stores are largely a thing of the past, with a little imagination we could rebrand it as a movie-streaming company by ignoring the staff and address tables and renaming store to streaming_service.
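As a quick illustration of that rebranding, the sketch below renames the table over a local MySQL connection; the credentials are placeholders, and the staff and address tables are simply ignored rather than dropped.

```python
# Minimal sketch of the "rebranding" idea, assuming a local MySQL instance
# with the Sakila sample database loaded and placeholder credentials.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="sakila"
)
cursor = conn.cursor()

# Rename store to streaming_service; MySQL updates foreign key references.
cursor.execute("RENAME TABLE store TO streaming_service;")

cursor.execute("SELECT COUNT(*) FROM streaming_service;")
print(cursor.fetchone()[0])

cursor.close()
conn.close()
```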
📄️ Hospital Data Analysis with SQL
Ingestion of the data into Postgres
📄️ Ecommerce Data Analysis with SQL
Basic SQL
📄️ Order Analysis with Redshift SQL
Set up the environment
📄️ Movie Sentiment Analysis
Set up the environment
📄️ Other Projects
Data Transformation with Snowpark Python and dbt
📄️ Leetcode and HackerRank SQL
A set of Leetcode SQL questions is available here.