Concepts
📄️ Apache Airflow
Apache Airflow is an open source tool for programmatically authoring, scheduling, and monitoring data pipelines. It has over 9 million downloads per month and an active OSS community. Airflow allows data practitioners to define their data pipelines as Python code in a highly extensible and infinitely scalable way.
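In Airflow, a pipeline is a DAG of tasks defined in ordinary Python. A minimal sketch, assuming Airflow 2.4+ (for the `schedule` argument) and hypothetical DAG/task names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # Trivial task body; a real pipeline would extract, transform, or load data.
    print("hello from Airflow")


# A DAG groups tasks under a schedule; this one runs once a day.
with DAG(
    dag_id="hello_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```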
📄️ AWS Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
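To give a sense of the workflow, here is a rough sketch of submitting a query with boto3, the AWS SDK for Python; the database, table, and S3 output location are placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Queries run asynchronously: you submit, then poll for completion.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",      # placeholder query
    QueryExecutionContext={"Database": "my_database"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Use get_query_execution with this ID to check status and fetch results.
print(response["QueryExecutionId"])
```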
📄️ Amazon Redshift
Amazon Redshift is a data warehousing service optimized for online analytical processing (OLAP) applications. You can start with just a few hundred gigabytes (GB) of data and scale to a petabyte (PB) or more. Designing your database for analytical processing lets you take full advantage of Amazon Redshift's columnar architecture.
📄️ API
REST, GraphQL, and gRPC are the 3 most popular API development technologies in modern web applications. However, choosing one isn’t easy since they all have unique features.
📄️ AWS
AWS Essentials
📄️ Batch vs Stream
| | Batch | Stream |
📄️ Big Data Architectures
Lambda architecture
📄️ BigQuery
BigQuery is a serverless, highly scalable, and cost-effective data warehouse designed for Google Cloud Platform (GCP) to store and query petabytes of data. The query engine is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure and without having to create or rebuild indexes.
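For a feel of the client workflow, here is a small sketch using the `google-cloud-bigquery` package against one of Google's public sample datasets; it assumes default application credentials are configured:

```python
from google.cloud import bigquery

client = bigquery.Client()  # credentials and project come from the environment

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# result() blocks until the job finishes, then streams rows back.
for row in client.query(query).result():
    print(row["name"], row["total"])
```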
📄️ Cassandra
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, row-oriented database. Cassandra bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable, with a query language similar to SQL. Created at Facebook, it now powers cloud-scale applications across many industries.
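A minimal sketch of querying Cassandra from Python, assuming the `cassandra-driver` package, a node on localhost, and a hypothetical `users` table keyed by `user_id`:

```python
from cassandra.cluster import Cluster

# Contact points, keyspace, and table are placeholder values.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# CQL reads like SQL, but efficient queries are driven by the partition key.
rows = session.execute(
    "SELECT user_id, name FROM users WHERE user_id = %s", ("42",)
)
for row in rows:
    print(row.user_id, row.name)
```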
📄️ Data Modeling
The data model helps us design our database. When building a plane, you don't start by building the engine; you start by creating a blueprint and schematics. Creating a database is just the same: you start by modeling the data. A model is a representation of real data that provides us with the characteristics, relations, and rules that apply to our data. It doesn't actually contain any data in it.
📄️ Data Pipeline
What is a Data Pipeline?
📄️ Data Quality
Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? Have you ever been about to sign off after a long day running queries or building data pipelines only to get pinged by your head of marketing that “the data is missing” from a critical report? What about a frantic email from your CTO about “duplicate data” in a business intelligence dashboard? Or a memo from your CEO, the same one who is so bullish on data, about a confusing or inaccurate number in his latest board deck? If any of these situations hit home for you, you’re not alone. These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner.
📄️ Databricks
Databricks was born out of frustration with the Hadoop vendors, such as Cloudera, and the limitations of the Hadoop ecosystem. Hadoop does not do well with concurrency, and it has huge latency issues. Hadoop MapReduce is effectively dead, replaced by Apache Spark to remedy these limitations. Apache Spark has problems of its own, and thus Databricks was born to take Spark to the enterprise.
📄️ Data Lakes and Lakehouses
Before the cloud data lake architecture
📄️ dbt
dbt (data build tool) is an open source Python package that enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. dbt allows you to build your data transformation pipeline using SQL queries.
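dbt models themselves are just SQL `SELECT` statements, and dbt can also be driven from Python. A minimal sketch, assuming dbt-core 1.5+ (which introduced the programmatic runner) and a hypothetical model named `my_model`:

```python
from dbt.cli.main import dbtRunner

# Programmatic equivalent of running `dbt run --select my_model` from the CLI.
dbt = dbtRunner()
result = dbt.invoke(["run", "--select", "my_model"])

print(result.success)  # True if the selected models built successfully
```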
📄️ DevOps
Infrastructure as Code
📄️ DynamoDB
Getting Started with DynamoDB
📄️ ELT vs ETL
| Name | ETL | ELT |
📄️ AWS EMR
1. Introduction to EMR Serverless
📄️ Data Encoding
Programs usually work with data in (at least) two different representations: in memory, as objects and data structures optimized for efficient access by the CPU, and as a self-contained sequence of bytes when data is written to a file or sent over the network.
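A small illustration of moving between the two representations, using JSON as the encoding:

```python
import json

# In-memory representation: a Python dict with native types.
record = {"user_id": 42, "name": "Ada", "active": True}

# Encoding (serialization): translate the structure into a sequence of bytes.
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization): turn the bytes back into an in-memory structure.
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record
```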
📄️ Github
Getting Started with Github
📄️ AWS Glue
AWS Glue Studio
📄️ Apache Kafka
Apache Kafka is a data streaming platform that allows you to publish, distribute, and consume data with high performance, scalability, and reliability. It is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by thousands of data sources, which typically send data records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally.
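A minimal publish/consume sketch, assuming the `kafka-python` client and a broker on localhost; the `clicks` topic is a placeholder:

```python
from kafka import KafkaConsumer, KafkaProducer

# Publish a record to the "clicks" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", value=b'{"page": "/home", "user": 42}')
producer.flush()  # block until buffered records are delivered

# Consume the stream sequentially, starting from the earliest offset.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```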
📄️ Amazon Kinesis
Watch these videos:
📄️ AWS Lambda
Lambda Function
📄️ OLTP vs OLAP
📄️ Data Partitioning
Partitioning and bucketing are used to maximize benefits (such as parallelism and partition pruning) while minimizing adverse effects (such as data skew and small files). They can reduce the overhead of shuffling, the need for serialization, and network traffic. In the end, they improve performance, cluster utilization, and cost-efficiency.
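A sketch of both techniques with the PySpark DataFrame writer; the paths, column names, and bucket count are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
df = spark.read.parquet("s3://my-bucket/events/")  # placeholder input path

# Partitioning writes one directory per value of event_date, so queries that
# filter on it can skip whole directories (partition pruning).
df.write.partitionBy("event_date").parquet("s3://my-bucket/events_by_date/")

# Bucketing hashes rows into a fixed number of files per partition, letting
# joins and aggregations on user_id avoid a full shuffle.
df.write.bucketBy(16, "user_id").sortBy("user_id").saveAsTable("events_bucketed")
```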
📄️ Postgres
Picking the right database management system is a difficult task due to the vast number of options on the market. Depending on the business model, you can pick a commercial database or an open source database with commercial support. In addition to this, there are several technical and non-technical factors to assess. When it comes to picking a relational database management system, PostgreSQL stands at the top for several reasons. The PostgreSQL slogan, "The world's most advanced open source database," emphasizes the sophistication of its features and the high degree of community confidence.
📄️ Python
Advantages of using Python for Data engineering
📄️ Slowly Changing Dimensions
Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a customer may change. This phenomenon is called a slowly changing dimension (SCD). For historical reporting purposes, it may be necessary to keep a record of the fact that the customer has a change in address. The range of options for dealing with this involves SCD management methodologies referred to as type 0 to type 7. Type 0 is when no changes are allowed to the dimension, for example a date dimension that doesn't change. The most common types are 1 (overwrite the old value), 2 (add a new row that versions the history), and 3 (add a new column holding the previous value).
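As an illustration of type 2, the most common pattern for preserving history, here is a minimal pandas sketch that expires the current row and appends a new one; the `dim_customer` columns are hypothetical:

```python
import pandas as pd

# Current state of the dimension: one active row per customer.
dim_customer = pd.DataFrame([
    {"customer_id": 1, "address": "12 Old Rd", "valid_from": "2020-01-01",
     "valid_to": "9999-12-31", "is_current": True},
])

def scd2_update(dim, customer_id, new_address, change_date):
    # Expire the currently active row for this customer...
    mask = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[mask, ["valid_to", "is_current"]] = [change_date, False]
    # ...then append a new current row, preserving the full address history.
    new_row = {"customer_id": customer_id, "address": new_address,
               "valid_from": change_date, "valid_to": "9999-12-31",
               "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim_customer = scd2_update(dim_customer, 1, "34 New Ave", "2024-06-01")
print(dim_customer)  # two rows: the expired address and the current one
```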
📄️ Serialization and Compression
Data engineers working in the cloud are generally freed from the complexities of managing object storage systems. Still, they need to understand details of serialization and deserialization formats.
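A small round-trip sketch using `pyarrow` to serialize a table to Parquet with Snappy compression; the column names are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# An in-memory columnar table.
table = pa.table({"user_id": [1, 2, 3], "amount": [9.99, 4.50, 12.00]})

# Serialization: encode the table as Parquet, compressing column chunks.
pq.write_table(table, "payments.parquet", compression="snappy")

# Deserialization: decompress and decode back into an in-memory table.
round_tripped = pq.read_table("payments.parquet")
assert round_tripped.equals(table)
```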
📄️ Snowflake
Snowflake is the Data Cloud that enables you to build data-intensive applications without operational burden, so you can focus on data and analytics instead of infrastructure management.
📄️ Apache Spark
Apache Spark is an open source analytical processing engine for large-scale, distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project; thousands of engineers have contributed to it since, making Spark one of the most active open source projects in Apache.
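A minimal PySpark sketch; the data is a tiny inline sample rather than a real distributed dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("alice", 30)],
    ["name", "value"],
)

# Transformations are lazy: Spark builds an execution plan and only
# distributes the work when an action such as show() is called.
df.groupBy("name").agg(F.sum("value").alias("total")).show()
```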
📄️ SQL vs. NoSQL
SQL
📄️ System Design
How to Approach
📄️ Data Transformation
Data transformation involves taking source data that has been ingested into your data platform and cleansing it, combining it, and modeling it for downstream use. Historically, the most popular way to transform data has been with SQL, and data engineers have built data transformation pipelines using SQL, often with the help of ETL/ELT tools. But recently, many practitioners have also begun adopting the DataFrame API in languages like Python/Spark for this task. For the most part, a data engineer can accomplish the same data transformations with either approach, and deciding between the two is mostly a matter of preference and particular use cases. That said, there are use cases where a particular transform can't be expressed in SQL and a different approach is needed. The most popular approach for these use cases is Python/Spark along with a DataFrame API.
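As a sketch of that equivalence, here is the same aggregation expressed both ways in PySpark; the `orders` data is a hypothetical inline sample:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "shipped", 20.0), (2, "shipped", 35.0), (3, "cancelled", 10.0)],
    ["order_id", "status", "amount"],
)
orders.createOrReplaceTempView("orders")

# The SQL approach...
by_sql = spark.sql(
    "SELECT status, SUM(amount) AS revenue FROM orders GROUP BY status"
)

# ...and the DataFrame API approach produce the same result.
by_df = orders.groupBy("status").agg(F.sum("amount").alias("revenue"))

assert sorted(by_sql.collect()) == sorted(by_df.collect())
```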
📄️ Data Warehouses
Enterprises are becoming increasingly data-driven, and a key component of any enterprise's data strategy is a data warehouse: a central repository of integrated data from all across the company. Traditionally, the data warehouse was used by data analysts to create analytical reports. But it is now also increasingly used to populate real-time dashboards, answer ad hoc queries, and provide decision-making guidance through predictive analytics. Because of these business requirements for advanced analytics, and a trend toward cost control, agility, and self-service data access, many organizations are moving to cloud-based data warehouses such as Snowflake, Amazon Redshift, and Google BigQuery.