Skip to main content

Data Modeling with Postgres

Sparkify Data Modeling

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the application, as well as a directory with JSON meta-data on the songs in their application.

They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. The role of this project is to create a database schema and ETL pipeline for this analysis.

We will model the data with Postgres and build an ETL pipeline using Python. The fact and dimension tables for a star database schema for a particular analytic focus is defined, and an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL was developed.

Songs dataset

Songs dataset is a subset of Million Song Dataset. Each file in the dataset is in JSON format and contains meta-data about a song and the artist of that song.

Sample record:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Logs dataset

Logs dataset is generated by Event Simulator. These log files in JSON format simulate activity logs from a music streaming application based on specified configurations.

Sample record:

{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

Follow this link for more information.

Funflix Data Modeling

info

This is assignment-oriented.

Funflix's Medical division needs your help in getting some insights from their datasets. You have to perform the following steps:

  1. Create a schema funflix.

  2. Upload the 4 data files in this schema. Table names will be same as file names.

  3. Create a star schema (Entity-Relation Diagram). Use https://dbdiagram.io/ to create it. Once created, export the diagram as png and save it as erd.png.

  4. Write SQL queries to answer the following questions. For each of these queries, create a view in the same schema.

    QueryView name
    How many customers visited the clinic in february 2022?customer_feb22
    What was the most booked service in march 2022?service_mar22
    Who were the top-5 customers who visited the most in Q1 2022 (i.e. Jan-Mar 2022)?customer_top5_q122
    What are the top-3 most booked services by the most visited customer?service_top3
    Which therapist is most experienced in physiotherapy?therapist_experience
  5. Push the code changes to your branch.

  6. Create PR and add the instructor and a peer for review.

Follow this link for more information.

eWallet Data Modeling

info

This is assignment-oriented.

Background

One of the online retail company’s features is an e-wallet service, that holds credit that can be used to pay for products purchased on the platform.

Users can receive credit in three different ways:

  1. When a product purchase that is paid for is canceled, the money is refunded as cancellation credit.
  2. Users can receive gift card credit as a gift.
  3. If a user has a poor service experience, soo-sorry credit may be provided.

Credit in the e-wallet expires after 6 months if it is gift card credit and soo-sorry credit, but in 1 year if it is cancellation credit.

Requirement

The Finance department of the company would like to build reporting and analytics on the e-wallet service so they can understand the extent of the wallet liabilities the company has.

Some of the questions they would want to answer from this are like below:

  1. What is the daily balance of credit in the e-wallet service?
  2. How much credit will expire in the next month?
  3. What is the outcome (i.e. % used, % expired, % left) of credit given in a particular month?

Solution Design

The four key decisions made during the design of a dimensional model include:

  1. Select the business process. 2. Declare the grain. 3. Identify the dimensions. 4. Identify the facts.

Let’s write down this decision steps for our e-Wallet case:

  1. Assumptions: Design is developed based on the background (Business Process) given but also keeping flexibility in mind. All the required fields are assumed to be available from the company’s transactional database.

  2. Grain definition: Atomic grain refers to the lowest level at which data is captured by a given business process.

The lowest level of data that can be captured in this context is wallet transactions i.e., all the credit and debit transactions on e-wallet.

  1. Dimensions: Dimensions provide the “who, what, where, when, why, and how” context surrounding a business process event.

Dimension Tables

DimWallet:

ColumnsComment
Wallet_IdUnique identifier for wallet credit
TypeWallet credit type ('giftcard','cancellation','goodwill')
Start_DateWallet credit start date
Expiry_DateWallet credit expiry date
Wallet_pricePrice of wallet credit

DimCustomer:

ColumnsComment
Customer_IDSurrogate key used to uniquely identify Customer details
dim_Customer_IDUnique identifier for customer
First_NameFirst name of the customer
Last_NameSecond name of the customer
GenderGender of the customer
Birth_DateDate of birth of customer
EmailEmail address of the customer
AddressResident address of the customer
Start_DateTo handle Slowly Changing Dimension of customer details of Customers like address etc
End_DateTo handle SCD

FactWallet:

ColumnsComment
Transaction_IdStandalone primary key for fact
Customer_IDForeign key to DimCustomer
Transaction_DateDate of transaction and foriengn key to DimDate
Wallet_IdForeign key to DimWallet
TypeType of transaction (Credit, Debit)
CreditCredit amount
DebitDebit amount

Assignment

  1. Load the Tables into Database
  2. Create views to answer the 3 questions asked by the finance department team
  3. Create PR for your solution and add Instructor and a peer for review

Follow this link for more information.