SQL & Database Engineering


Overview

Every application that stores data depends on its database layer for correctness, performance, and reliability. A schema that correctly models the business domain, queries that retrieve data efficiently at production volumes, indexes that make those queries fast, transactions that maintain data integrity under concurrent access, backup processes that protect against data loss, and a migration strategy that evolves the schema safely as requirements change: these are the engineering decisions that determine whether a data layer serves its application well or becomes the bottleneck and risk point that limits the application's capabilities.

Database engineering is not a secondary concern that gets addressed after the application code works. Schema design decisions made early in a project have consequences that accumulate over the application's lifetime — the missing index that causes acceptable query performance at development data volumes to become a production incident at scale, the absent foreign key constraint that allows referential integrity to erode over months of production use, the denormalisation that seemed like a performance optimisation but became a consistency problem, the schema design that does not model the domain correctly and requires convoluted queries to answer the business questions the application actually needs to answer.

We design and implement database layers across the projects we build — using PostgreSQL and MySQL as our primary relational databases, with Redis for caching, session storage, and real-time data, and with the database-specific features (JSON columns, full-text search, time-series extensions, window functions) that specific application requirements call for.


What Database Engineering Covers

Schema design. The database schema is the formal model of the application's data — the tables that represent domain entities, the columns that represent their attributes, the constraints that enforce data integrity, and the relationships between tables that reflect the structure of the domain.

Normalisation: the process of organising the schema to eliminate data redundancy and update anomalies. Third Normal Form (3NF) as the baseline for relational schema design — each non-key column dependent on the primary key, the whole primary key, and nothing but the primary key. Practical normalisation decisions that balance theoretical correctness with query performance — when to accept controlled denormalisation to avoid expensive joins on hot query paths, and what integrity mechanisms to put in place when denormalisation is chosen.

Data types: choosing the correct data type for each column — the precision implications of NUMERIC versus FLOAT for financial data where exact decimal arithmetic matters, the timezone handling differences between TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE, the TEXT versus VARCHAR tradeoffs in PostgreSQL (where they are largely equivalent in storage but carry different semantic intent), the JSONB column for semi-structured data that needs some indexing while retaining schema flexibility.

Constraints: the database-enforced rules that maintain data integrity independently of application code. NOT NULL constraints that prevent null values in columns that must always have a value. UNIQUE constraints that enforce uniqueness beyond primary keys. CHECK constraints that enforce domain invariants — a price column that cannot be negative, a status column that must be one of a defined set of values. FOREIGN KEY constraints that enforce referential integrity between tables, preventing orphaned records that reference non-existent parents.

Primary key design: surrogate integer keys (SERIAL or BIGSERIAL in PostgreSQL, AUTO_INCREMENT in MySQL) for simplicity and join efficiency, UUIDs (UUID v4 for random, UUID v7 for time-sortable) for distributed generation without coordination, natural keys where the business domain has a natural unique identifier. The primary key choice that is appropriate for the application's scale, distribution requirements, and join patterns.
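A minimal sketch of these schema-design choices in PostgreSQL DDL (the orders and customers tables, column names, and status values are illustrative assumptions, not from any real project):

```sql
-- Hypothetical orders table showing type, constraint, and primary key choices.
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,                      -- surrogate integer key
    customer_id BIGINT NOT NULL REFERENCES customers (id),  -- referential integrity
    status      TEXT NOT NULL
                CHECK (status IN ('pending', 'paid', 'shipped')),  -- domain invariant
    total       NUMERIC(12, 2) NOT NULL CHECK (total >= 0), -- exact decimal, not FLOAT
    metadata    JSONB,                                      -- semi-structured attributes
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()          -- timezone-aware timestamp
);
```

The CHECK and FOREIGN KEY constraints enforce these rules in the database itself, independently of any application code path.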

Indexing strategy. Indexes are the mechanism that makes queries fast — the data structures that allow the database to find rows matching query predicates without scanning every row in the table. The right indexes make queries fast; missing indexes cause full table scans; wrong indexes consume space and slow writes without benefiting reads.

B-tree indexes: the default index type — appropriate for equality and range predicates on columns with reasonable cardinality. The single-column index that speeds up a single filter condition. The composite index that speeds up queries with multiple filter conditions — with column ordering that matches the most selective filter first, and that places equality conditions before range conditions.

Partial indexes: indexes that include only the rows matching a WHERE clause — the index on WHERE status = 'pending' that makes queries for pending records fast without including the much larger set of completed records. Partial indexes that are smaller, faster to maintain, and more selective than a full index on the same column for columns with highly skewed value distributions.

Expression indexes: indexes on computed expressions — LOWER(email) for case-insensitive email lookups, DATE(created_at) for date-level grouping, (metadata->>'field') for JSONB field access. The index that makes a frequently used expression fast without requiring the expression to be materialised in a separate column.

Index analysis: using EXPLAIN ANALYZE to understand how queries are executed — whether the planner is using the available indexes, whether the estimated row counts are accurate (if not, a sign that statistics need to be updated), and what the actual execution time of each step is. The query plan analysis that identifies why a query is slow and what index or schema change would address it.

Query optimisation. The SQL that retrieves data efficiently from the schema, written to allow the query planner to use the available indexes and avoid unnecessary work.

Join strategies: INNER JOIN, LEFT JOIN, and the other join types — choosing the correct join type for each relationship, understanding that unnecessary outer joins prevent join elimination optimisations, and structuring multi-table queries so the query planner has the statistics to choose an efficient join order.

Subquery versus CTE versus join: the tradeoffs between correlated subqueries (which may execute once per outer row), CTEs (which materialise their results in some databases, providing isolation but potentially preventing optimisations), and joins (which the planner can optimise across). The query structure that gives the planner the most freedom to optimise.
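In PostgreSQL 12 and later, CTEs are inlined by default, and the materialisation behaviour can be stated explicitly (query is an illustrative sketch):

```sql
-- MATERIALIZED forces the CTE to be computed once as an optimisation fence;
WITH recent AS MATERIALIZED (
    SELECT * FROM orders
    WHERE created_at >= now() - interval '1 day'
)
SELECT * FROM recent WHERE customer_id = 42;
-- NOT MATERIALIZED (or omitting the keyword) lets the planner push the
-- customer_id filter down into the CTE and use an index on orders.
```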

Window functions: the SQL standard analytical functions — ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER(), AVG() OVER() — that compute values across related rows without collapsing them into aggregates. The running total, the moving average, the previous row value, the percentile rank — calculations that window functions express cleanly and that without them require either self-joins or application-level processing of query results.

Aggregation and grouping: GROUP BY with the appropriate aggregate functions, HAVING for filtering on aggregate values, the grouping sets (ROLLUP, CUBE, GROUPING SETS) for multi-dimensional aggregations. The aggregation query that directly answers the reporting question rather than retrieving raw data for application-level aggregation.

Common Table Expressions: the WITH clause that breaks complex queries into named, readable steps — the recursive CTE that traverses hierarchical data (organisation trees, bill of materials, threaded comments), the CTE that computes an intermediate result referenced multiple times in the outer query.

Transactions and concurrency. The mechanisms that maintain data integrity when multiple concurrent operations access the database simultaneously.

Transaction isolation levels: Read Committed (the PostgreSQL default) where a transaction sees only committed data but may see different values if other transactions commit between reads, Repeatable Read where a transaction sees a consistent snapshot throughout its duration, and Serializable where transactions execute as if they ran sequentially. The isolation level appropriate to each operation — most read operations are fine with Read Committed, financial operations that read and write related records may require Repeatable Read or Serializable.

Explicit locking: SELECT FOR UPDATE to lock rows that will be modified within the transaction, preventing concurrent transactions from modifying the same rows and causing lost updates. SELECT FOR SHARE for shared locks that allow concurrent reads but prevent concurrent writes. Advisory locks for application-level mutual exclusion implemented in the database.

Deadlock prevention: structuring operations to acquire locks in consistent order across all transactions, minimising the duration of transactions, and understanding the deadlock detection and retry logic that production applications require.

Optimistic concurrency: version columns (updated_at timestamp or integer version counter) that allow concurrent reads without locking and detect conflicts at write time — the pattern for applications where lock contention would be a performance problem but conflicts are infrequent.

Database migrations. The process of evolving the database schema alongside the application code, with the safety and reversibility that production schema changes require.

Migration tooling: Flyway, Liquibase, Alembic (Python), sqlx-cli (Rust), EF Core migrations (C#), and the database-native tools for schema management. The migration file that captures each schema change with a timestamp or version number, the migration history table that records which migrations have been applied, and the migration runner that applies pending migrations in order.

Safe migration patterns: the schema changes that are safe to apply without downtime (adding nullable columns, adding indexes concurrently, adding tables, adding constraints that all existing data satisfies) versus the changes that require downtime or careful multi-step execution (removing columns only after all code references are removed, renaming columns via a multi-step process that avoids breaking running application code, changing column types with data transformation and compatibility verification).

Zero-downtime migrations: the multi-step approach for schema changes that must not cause downtime — the expand-contract pattern that adds the new schema element, migrates data, updates the application to use both old and new elements, then removes the old element in a subsequent migration once all instances are updated.

PostgreSQL-specific features. The capabilities beyond standard SQL that PostgreSQL provides and that production applications use.

JSONB: the binary JSON column type that allows storing semi-structured data with GIN indexing on JSON paths, containment operators, and the full Postgres JSON path query language. The JSONB column for attributes that vary between records, for storing third-party API responses, for configuration data — with the right GIN indexes for the access patterns the application uses.

Full-text search: PostgreSQL's built-in full-text search — the tsvector column that stores the preprocessed text representation, the tsquery that expresses search terms, the to_tsvector() and to_tsquery() functions, and the GIN index that makes full-text search fast. The full-text search that works for most application search requirements without requiring a separate Elasticsearch deployment.

TimescaleDB: the PostgreSQL extension that adds time-series capabilities — automatic time-based partitioning through hypertables, compression for historical time-series data, continuous aggregates that materialise aggregations over recent time windows, and the time-series specific query functions. TimescaleDB for applications that store high-volume time-series data alongside relational data — trading tick data, IoT sensor data, application metrics — in the same database without the operational overhead of a separate time-series database.

LISTEN/NOTIFY: PostgreSQL's pub/sub mechanism for notifying connected clients when database events occur — the trigger that sends a NOTIFY when a record is inserted or updated, and the application connection that LISTENs for those notifications to process changes in real time. The lightweight event mechanism for change notification without a separate message broker.

Table partitioning: RANGE, LIST, and HASH partitioning for tables that grow to sizes where full table operations become expensive — partitioning by date for time-series tables where old partitions are queried infrequently, partitioning by tenant ID for multi-tenant tables where per-tenant data volume is large.

Connection pooling and database operations. The configuration and infrastructure that makes the database layer perform reliably under load.

Connection pooling: PgBouncer for PostgreSQL connection pooling — the proxy that maintains a smaller pool of actual database connections that it reuses across a larger number of application connections. Session mode, transaction mode, and statement mode pooling — the tradeoffs between compatibility with session-level features (prepared statements, advisory locks, transaction state) and connection pool efficiency. The PgBouncer configuration that allows a database to serve many more concurrent application clients than the max_connections setting would otherwise permit.

Connection pool configuration in application code: the maximum pool size, the connection timeout, the idle connection lifetime — the pool settings that prevent connection exhaustion under peak load while not holding too many connections open during quiet periods.

Database monitoring: the pg_stat_statements extension for query-level performance statistics — the queries that consume the most total execution time, the queries with the highest average execution time, the queries with the highest call frequency. The monitoring that identifies performance problems before they become production incidents.


Redis for Caching and Real-Time Data

Redis is our primary in-memory data store for the use cases where a relational database is not the right fit — the data that needs to be accessed in microseconds rather than milliseconds, the data that expires after a defined time, the data that is computed once and cached to avoid repeated expensive computation.

Caching patterns. Cache-aside: the application checks Redis before querying the database, populates the cache on miss, and invalidates the cache on write. Write-through: the application writes to the cache and the database simultaneously, keeping the cache always current. The appropriate caching strategy for each data type based on how frequently it changes and how expensive it is to recompute.

Cache key design: structured cache keys that prevent collisions between different data types — user:{id}:profile, product:{id}:details, search:{hash}:results. The key naming convention that makes cache content identifiable and allows targeted cache invalidation.

TTL management: the expiry time that matches the data's freshness requirements — short TTLs for data that changes frequently, longer TTLs for data that changes rarely. The cache stampede prevention pattern that stops a burst of concurrent cache misses from all hitting the database at once.

Session storage. Server-side session storage in Redis — the session data stored in Redis with the session ID as the key, retrieved by the session middleware on every request. The TTL that expires sessions after the configured inactive period. The Redis session storage that works correctly across multiple application instances in a way that in-memory session storage cannot.

Real-time data structures. Redis data structures for real-time application features.

Sorted sets for leaderboards and rankings: the ZADD and ZRANGE operations that maintain ordered sets with scores — the user ranking, the high score list, the queue prioritised by urgency.

Pub/sub for real-time messaging: the Redis publish/subscribe that delivers messages to all subscribers of a channel — the real-time notification, the live update broadcast, the lightweight event bus for inter-service communication.

Lists as queues: LPUSH/BRPOP for a reliable work queue — the job pushed onto the queue by the producer and popped by the worker, with blocking pop that waits efficiently when the queue is empty.

HyperLogLog for cardinality estimation: approximate distinct count of large sets without storing all the values — the daily active user count, the unique visitor count, the distinct value count for a high-cardinality dimension.


Technologies Used

  • PostgreSQL — primary relational database for production applications
  • MySQL / MariaDB — relational database for applications requiring MySQL compatibility
  • Redis — in-memory data store for caching, session storage, and real-time data
  • TimescaleDB — PostgreSQL extension for time-series data
  • PgBouncer — PostgreSQL connection pooler
  • Flyway / Liquibase — database migration management (Java/multi-language projects)
  • Alembic — Python database migration management
  • sqlx — Rust async SQL with compile-time query verification
  • EF Core — C# ORM with code-first migrations
  • Prisma / Drizzle — TypeScript ORM and query builder
  • pg_stat_statements — PostgreSQL query performance monitoring
  • EXPLAIN ANALYZE — query execution plan analysis
  • AWS RDS / Aurora — managed database hosting on AWS
  • Docker — containerised database instances for development and testing

Database Engineering as Application Infrastructure

The database is not an implementation detail that can be addressed after the application is working. Schema design, indexing strategy, query design, transaction management, and migration planning are engineering decisions that shape the application's capabilities, performance envelope, and operational reliability. Getting these decisions right at the start of a project is significantly easier than correcting them in a production system that has accumulated data and deployed code that depends on the current schema.

Database engineering done correctly produces a data layer that serves the application's current needs efficiently, that can evolve safely as requirements change, and that does not become the source of the performance problems and data integrity issues that underengineered database layers produce over time.


Data Layers Built to Last

Database design, indexing, query optimisation, transaction management, and migration management — the engineering that produces data layers that are correct, fast, and safe to evolve over the application's lifetime.

Our Track Record

SemBricks has successfully delivered numerous projects using SQL and database engineering. Our team continuously invests in staying current with the latest features and best practices.