Skip to main content

SQL vs. NoSQL

SQL

SQL was initially created to be the language for generating, manipulating, and retrieving data from relational databases, which have been around for more than 40 years. Over the past decade or so, however, other data platforms such as Hadoop, Spark, and NoSQL have gained a great deal of traction, eating away at the relational database market. However, the SQL language has been evolving to facilitate the retrieval of data from various platforms, regardless of whether the data is stored in tables, documents, or flat files.

Whether you will be using a relational database or not, if you are working in data engineering, business intelligence, or some other facet of data analysis, you will likely need to know SQL, along with other languages/platforms such as Python and R. Data is everywhere, in huge quantities, and arriving at a rapid pace, and people who can extract meaningful information from all this data are in big demand.

Aggregation, Grouping, Ordering & Filtering

The aggregation functions perform calculations on multiple rows and return the result as one value, collapsing the individual rows in the process. When an aggregation function is used, you need to group the data on fields which are not aggregated by using a GROUP BY clause.

Filtering is used when you need to extract a subset of data by using a specific condition. You can use WHERE and HAVING clauses to filter the data as per your needs.

Joins and Unions

Data Engineers are typically responsible for collating different data sources together and providing a cleaner source of information for the Data Scientists or Data Analysts. In order to collate data from different sources, they need a good understanding of SQL Joins and Unions.

CTEs, Sub Queries, and Window Functions

As the queries become more complex, it’s not always possible to write a single query and do all the analysis. Sometimes, it’s necessary to create a temporary table and use the results in a new query to perform analysis at different levels. We can use a subquery, a CTE (Common Table Expression), or a TEMP (Temporary) table for this. As a Data Engineer, you need to understand these concepts to write better and optimized SQL queries.

Command Categories

Query breakdown

StatementHow to Use ItOther Details
SELECTSELECT Col1, Col2, ...Provide the columns you want
FROMFROM TableProvide the table where the columns exist
LIMITLIMIT 10Limits based number of rows returned
ORDER BYORDER BY ColOrders table based on the column. Used with DESC.
WHEREWHERE Col > 5A conditional statement to filter your results
LIKEWHERE Col LIKE '%me%'Only pulls rows where column has 'me' within the text
INWHERE Col IN ('Y', 'N')A filter for only rows with column of 'Y' or 'N'
NOTWHERE Col NOT IN ('Y', 'N')NOT is frequently used with LIKE and IN
ANDWHERE Col1 > 5 AND Col2 < 3Filter rows where two or more conditions must be true
ORWHERE Col1 > 5 OR Col2 < 3Filter rows where at least one condition must be true
BETWEENWHERE Col BETWEEN 3 AND 5Often easier syntax than using an AND

Common types of NoSQL

Key-value stores

  • Array of key-value pairs. The "key" is an attribute name.
  • Redis, Vodemort, Dynamo.

Document databases

  • Data is stored in documents.
  • Documents are grouped in collections.
  • Each document can have an entirely different structure.
  • CouchDB, MongoDB.

Wide-column / columnar databases

  • Column families - containers for rows.
  • No need to know all the columns up front.
  • Each row can have different number of columns.
  • Cassandra, HBase.

Graph database

  • Data is stored in graph structures
    • Nodes: entities
    • Properties: information about the entities
    • Lines: connections between the entities
  • Neo4J, InfiniteGraph

Differences between SQL and NoSQL

Storage

  • SQL: store data in tables.
  • NoSQL: have different data storage models.

Schema

  • SQL
    • Each record conforms to a fixed schema.
    • Schema can be altered, but it requires modifying the whole database.
  • NoSQL:
    • Schemas are dynamic.

Querying

  • SQL
    • Use SQL (structured query language) for defining and manipulating the data.
  • NoSQL
    • Queries are focused on a collection of documents.
    • UnQL (unstructured query language).
    • Different databases have different syntax.

Scalability

  • SQL
    • Vertically scalable (by increasing the horsepower: memory, CPU, etc) and expensive.
    • Horizontally scalable (across multiple servers); but it can be challenging and time-consuming.
  • NoSQL
    • Horizontablly scalable (by adding more servers) and cheap.

ACID

  • Atomicity, consistency, isolation, durability
  • SQL
    • ACID compliant
    • Data reliability
    • Gurantee of transactions
  • NoSQL
    • Most sacrifice ACID compliance for performance and scalability.

Which one to use?

SQL

  • Ensure ACID compliance.
    • Reduce anomalies.
    • Protect database integrity.
  • Data is structured and unchanging.

NoSQL

  • Data has little or no structure.
  • Make the most of cloud computing and storage.
    • Cloud-based storage requires data to be easily spread across multiple servers to scale up.
  • Rapid development.
    • Frequent updates to the data structure.

Resources

  1. SQL Cheat Sheet
  2. What I realized after solving 100 leetcode SQL questions
  3. Udacity Nanodegree on SQL
  4. SQL Leetcode All questions set
  5. LeetCode SQL Problem Solving Questions With Solutions
  6. HackerRank SQL Problem Solving Questions With Solutions
  7. SQL Interview Questions
  8. Data Engineer Interview Questions
  9. SQL Case Studies
  10. SQL Commands Categories
  11. 10 SQL Queries You Should Know as a Data Engineer
  12. SQL JOIN Interview Questions
  13. The Ultimate Guide to SQL Window Functions