Data Modeling with Postgres
Sparkify Data Modeling
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the application, as well as a directory with JSON meta-data on the songs in their application.
They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. The role of this project is to create a database schema and ETL pipeline for this analysis.
We will model the data with Postgres and build an ETL pipeline using Python. The fact and dimension tables for a star database schema for a particular analytic focus is defined, and an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL was developed.
Songs dataset
Songs dataset is a subset of Million Song Dataset. Each file in the dataset is in JSON format and contains meta-data about a song and the artist of that song.
Sample record:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
Logs dataset
Logs dataset is generated by Event Simulator. These log files in JSON format simulate activity logs from a music streaming application based on specified configurations.
Sample record:
{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}
Follow this link for more information.
Funflix Data Modeling
This is assignment-oriented.
Funflix's Medical division needs your help in getting some insights from their datasets. You have to perform the following steps:
Create a schema
funflix
.Upload the 4 data files in this schema. Table names will be same as file names.
Create a star schema (Entity-Relation Diagram). Use https://dbdiagram.io/ to create it. Once created, export the diagram as png and save it as
erd.png
.Write SQL queries to answer the following questions. For each of these queries, create a view in the same schema.
Query View name How many customers visited the clinic in february 2022? customer_feb22 What was the most booked service in march 2022? service_mar22 Who were the top-5 customers who visited the most in Q1 2022 (i.e. Jan-Mar 2022)? customer_top5_q122 What are the top-3 most booked services by the most visited customer? service_top3 Which therapist is most experienced in physiotherapy? therapist_experience Push the code changes to your branch.
Create PR and add the instructor and a peer for review.
Follow this link for more information.
eWallet Data Modeling
This is assignment-oriented.
Background
One of the online retail company’s features is an e-wallet service, that holds credit that can be used to pay for products purchased on the platform.
Users can receive credit in three different ways:
- When a product purchase that is paid for is canceled, the money is refunded as cancellation credit.
- Users can receive gift card credit as a gift.
- If a user has a poor service experience, soo-sorry credit may be provided.
Credit in the e-wallet expires after 6 months if it is gift card credit and soo-sorry credit, but in 1 year if it is cancellation credit.
Requirement
The Finance department of the company would like to build reporting and analytics on the e-wallet service so they can understand the extent of the wallet liabilities the company has.
Some of the questions they would want to answer from this are like below:
- What is the daily balance of credit in the e-wallet service?
- How much credit will expire in the next month?
- What is the outcome (i.e. % used, % expired, % left) of credit given in a particular month?
Solution Design
The four key decisions made during the design of a dimensional model include:
- Select the business process. 2. Declare the grain. 3. Identify the dimensions. 4. Identify the facts.
Let’s write down this decision steps for our e-Wallet case:
Assumptions: Design is developed based on the background (Business Process) given but also keeping flexibility in mind. All the required fields are assumed to be available from the company’s transactional database.
Grain definition: Atomic grain refers to the lowest level at which data is captured by a given business process.
The lowest level of data that can be captured in this context is wallet transactions i.e., all the credit and debit transactions on e-wallet.
- Dimensions: Dimensions provide the “who, what, where, when, why, and how” context surrounding a business process event.
Dimension Tables
DimWallet:
Columns | Comment |
---|---|
Wallet_Id | Unique identifier for wallet credit |
Type | Wallet credit type ('giftcard','cancellation','goodwill') |
Start_Date | Wallet credit start date |
Expiry_Date | Wallet credit expiry date |
Wallet_price | Price of wallet credit |
DimCustomer:
Columns | Comment |
---|---|
Customer_ID | Surrogate key used to uniquely identify Customer details |
dim_Customer_ID | Unique identifier for customer |
First_Name | First name of the customer |
Last_Name | Second name of the customer |
Gender | Gender of the customer |
Birth_Date | Date of birth of customer |
Email address of the customer | |
Address | Resident address of the customer |
Start_Date | To handle Slowly Changing Dimension of customer details of Customers like address etc |
End_Date | To handle SCD |
FactWallet:
Columns | Comment |
---|---|
Transaction_Id | Standalone primary key for fact |
Customer_ID | Foreign key to DimCustomer |
Transaction_Date | Date of transaction and foriengn key to DimDate |
Wallet_Id | Foreign key to DimWallet |
Type | Type of transaction (Credit, Debit) |
Credit | Credit amount |
Debit | Debit amount |
Assignment
- Load the Tables into Database
- Create views to answer the 3 questions asked by the finance department team
- Create PR for your solution and add Instructor and a peer for review
Follow this link for more information.