Skip to main content

Concepts

📄️ Data Quality

Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? Have you ever been about to sign off after a long day running queries or building data pipelines only to get pinged by your head of marketing that “the data is missing” from a critical report? What about a frantic email from your CTO about “duplicate data” in a business intelligence dashboard? Or a memo from your CEO, the same one who is so bullish on data, about a confusing or inaccurate number in his latest board deck? If any of these situations hit home for you, you’re not alone. These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner.

📄️ Apache Kafka

Apache Kafka is a data streaming platform which allows you to publish, distribute and consume data with high performance, scalability and reliability. It is a distributed data store optimized for ingesting and processing streaming data in real-time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records in simultaneously. A streaming platform needs to handle this constant influx of data, and process the data sequentially and incrementally.

📄️ Postgres

Picking the right database management system is a difficult task due to the vast number of options on the market. Depending on the business model, you can pick a commercial database or an open source database with commercial support. In addition to this, there are several technical and non-technical factors to assess. When it comes to picking a relational database management system, PostgreSQL stands at the top for several reasons. The PostgreSQL slogan, "The world's most advanced open source database," emphasizes the sophistication of its features and the high degree of community confidence.

📄️ Slowly Changing Dimensions

Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a customer may change. This phenomenon is called a slowly changing dimension (SCD). For historical reporting purposes, it may be necessary to keep a record of the fact that the customer has a change in address. The range of options for dealing with this involves SCD management methodologies referred to as type 1 to type 7. Type 0 is when no changes are allowed to the dimension, for example a date dimension that doesn’t change. The most common types are 1, 2 and 3:

📄️ Data Transformation

Data transformation involves taking source data which has been ingested into your data platform and cleansing it, combining it, and modeling it for downstream use. Historically the most popular way to transform data has been with the SQL language and data engineers have built data transformation pipelines using SQL often with the help of ETL/ELT tools. But recently many folks have also begun adopting the DataFrame API in languages like Python/Spark for this task. For the most part a data engineer can accomplish the same data transformations with either approach, and deciding between the two is mostly a matter of preference and particular use cases. That being said, there are use cases where a particular data transform can't be expressed in SQL and a different approach is needed. The most popular approach for these use cases is Python/Spark along with a DataFrame API.

📄️ Data Warehouses

Enterprises are becoming increasingly data driven, and a key component of any enterprise’s data strategy is a data warehouse—a central repository of integrated data from all across the company. Traditionally, the data warehouse was used by data analysts to create analytical reports. But now it is also increasingly used to populate real-time dashboards, to make ad hoc queries, and to provide decision-making guidance through predictive analytics. Because of these business requirements for advanced analytics and a trend toward cost control, agility, and self-service data access, many organizations are moving to cloud-based data warehouses such as Snowfkake, Amazon Redshift and Google BigQuery.