Database modelling is the systematic process of defining data structures, constraints, and relationships to represent real-world information. It is executed in three distinct, progressive phases to ensure alignment between business requirements and technical implementation:
Conceptual Data Model: This is the highest level of abstraction, focusing purely on business entities and their relationships without technical details. It identifies what data is required (e.g., "Customer," "Order") and is designed for communication with non-technical stakeholders. It deliberately defers decisions about database platforms and storage mechanisms.
Logical Data Model: This phase refines the conceptual model by defining the structure of data elements (attributes), keys, and normalization rules (typically up to 3NF). It introduces generic, platform-independent data types (e.g., string, integer) and explicit business rules (cardinality, optionality) so that integrity can be validated before engineering begins.
Physical Data Model: This phase translates logical entities into tables for a specific database platform. It defines platform-specific column data types, creates indexes for performance, and handles storage parameters, partitioning, and security constraints.
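As a rough illustration of how the phases narrow down, the sketch below carries a hypothetical Customer/Order pair from the conceptual level (a comment), through the logical level (a comment listing attributes and keys), to a physical model in SQLite via Python's standard `sqlite3` module. The table names, column names, and the index are assumptions invented for this example, not a prescribed design.

```python
import sqlite3

# Conceptual: a Customer places Orders (entities and relationship only).
# Logical:    Customer(customer_id PK, name: string)
#             Order(order_id PK, customer_id FK -> Customer, placed_on: date)
# Physical:   the concrete SQLite tables, column types, and index below.

PHYSICAL_DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    placed_on   TEXT NOT NULL  -- ISO-8601 date stored as text
);
-- Physical-only decision: an index to speed up lookups by customer.
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""


def build_physical_model() -> sqlite3.Connection:
    """Create an in-memory database holding the example physical model."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHYSICAL_DDL)
    return conn


conn = build_physical_model()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
indexes = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
```

Note that the index appears only at the physical level: it changes how the data is accessed, not what the data means.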
While a "join" logically combines rows from two or more tables based on a related column, the database engine executes it using a specific algorithm chosen according to data size, sort order, and available indexes:
Nested Loop Join: Ideal for small datasets or when one table is significantly smaller than the other. It iterates through every row of the outer table and compares it against every row of the inner table. Its cost is roughly proportional to the product of the row counts, O(N×M), making it inefficient for large, unindexed inputs.
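The row-by-row comparison described above (commonly called a nested loop join) can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular engine implements it; the `customers` and `orders` rows are made-up sample data, and the comparison counter exists only to make the N×M cost visible.

```python
from typing import Callable

Row = dict


def nested_loop_join(outer: list[Row], inner: list[Row],
                     predicate: Callable[[Row, Row], bool]) -> tuple[list[Row], int]:
    """Compare every outer row with every inner row; return matches and comparison count."""
    joined = []
    comparisons = 0
    for outer_row in outer:
        for inner_row in inner:
            comparisons += 1
            if predicate(outer_row, inner_row):
                joined.append({**outer_row, **inner_row})
    return joined, comparisons


# Hypothetical sample data.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]

rows, comparisons = nested_loop_join(
    customers, orders, lambda c, o: c["customer_id"] == o["customer_id"])
# comparisons equals len(customers) * len(orders), i.e. 2 * 3 = 6
```

Because the predicate is an arbitrary function, this strategy also handles non-equality conditions such as ranges, which is one reason engines keep it available.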
Hash Join: Optimized for large, unsorted datasets with equality conditions (=). The engine performs a build phase, creating an in-memory hash table from the smaller table, followed by a probe phase, in which it scans the larger table and looks up matches in the hash table. If the hash table exceeds available memory, it spills to disk, partitioning the data so that subsets can be processed one at a time.
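The build and probe phases described above can be sketched as follows. This is a simplified in-memory version under stated assumptions: rows are plain dictionaries, the join is an equality match on a single shared column name, and spilling to disk is not modeled at all. The sample rows are hypothetical.

```python
from collections import defaultdict

Row = dict


def hash_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Equality join on `key`: build a hash table on the smaller input, probe with the larger."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)

    # Build phase: bucket the smaller input by join key.
    buckets = defaultdict(list)
    for row in build:
        buckets[row[key]].append(row)

    # Probe phase: one hash lookup per row of the larger input.
    joined = []
    for row in probe:
        for match in buckets.get(row[key], ()):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]

hash_joined = hash_join(customers, orders, "customer_id")
```

Each input is read once, so the work grows roughly with N + M rather than N×M, at the price of holding the build side in memory.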
Merge Join (Sort-Merge Join): Highly efficient for large datasets that are already sorted on the join keys. The engine scans both tables simultaneously in key order, matching rows as it progresses. It avoids the memory overhead of hashing but requires sorted input, so it incurs an additional sorting cost when no suitable index or existing ordering is available.
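The simultaneous ordered scan described above can be sketched as a two-pointer walk over both inputs. The version below assumes both lists are already sorted ascending on the join key (it does not sort them itself) and handles duplicate keys on either side; the function name and sample rows are illustrative only.

```python
Row = dict


def merge_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Equality join on `key`; both inputs must already be sorted ascending by `key`."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == left_key:
                for right_row in right[j:j_end]:
                    joined.append({**left[i], **right_row})
                i += 1
            j = j_end
    return joined


# Hypothetical sample data, pre-sorted by customer_id.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]

merge_joined = merge_join(customers, orders, "customer_id")
```

If an input is not already sorted, calling `sorted(rows, key=lambda r: r[key])` first would restore correctness, which corresponds to the extra sorting cost mentioned above.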
Relationships are governed by structural constraints that define exactly how entities interact, going beyond simple connections to enforce business logic:
Cardinality Ratio: Specifies the maximum number of relationship instances an entity can participate in.
- One-to-One (1:1): An instance in Table A relates to at most one instance in Table B, and vice versa.
- One-to-Many (1:N): An instance in Table A can relate to many instances in Table B, while each instance in Table B relates to at most one instance in Table A.
- Many-to-Many (M:N): Instances in both tables can relate to multiple instances in the other, requiring a junction table to resolve.
Participation Constraint (Optionality): Specifies the minimum number of relationship instances required, determining whether an entity's existence depends on the relationship.
- Total Participation (Mandatory): Every entity instance must participate in the relationship (denoted by a double line in ER diagrams). Example: Every Order must have a Customer.
- Partial Participation (Optional): Entity instances may participate but are not required to (denoted by a single line). Example: A Customer may exist without placing an Order.
Degree of Relationship: The number of entity sets involved in a relationship (e.g., a binary relationship involves two, a ternary relationship involves three). Recursive (unary) relationships occur when an entity relates to itself (e.g., an Employee supervises other Employees).
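One possible way to express the constraints above in a relational schema is sketched below, using SQLite through Python's `sqlite3` module. All table and column names are hypothetical. The mapping used here is a common convention rather than the only option: a `NOT NULL` foreign key for total participation, a nullable foreign key for partial participation, a key shared between two tables for 1:1, a junction table for M:N, and a self-referencing foreign key for a recursive relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- 1:1: the profile reuses the customer's key, so at most one profile per customer.
CREATE TABLE customer_profile (
    customer_id INTEGER PRIMARY KEY REFERENCES customer(customer_id),
    bio         TEXT
);

-- 1:N with total participation of orders: every order must name a customer.
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);

-- M:N between orders and products, resolved by a junction table.
CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE order_item (
    order_id   INTEGER NOT NULL REFERENCES customer_order(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    PRIMARY KEY (order_id, product_id)
);

-- Recursive relationship with partial participation: supervisor is optional.
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    supervisor_id INTEGER REFERENCES employee(employee_id)
);
""")


def is_rejected(sql: str) -> bool:
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Partial participation: a customer can exist without any order.
customer_alone_ok = not is_rejected("INSERT INTO customer VALUES (1, 'Ada')")
# Total participation: an order without a customer, or with an unknown one, is refused.
order_without_customer = is_rejected("INSERT INTO customer_order VALUES (10, NULL)")
order_unknown_customer = is_rejected("INSERT INTO customer_order VALUES (11, 99)")
# 1:1: a second profile for the same customer is refused.
first_profile_ok = not is_rejected("INSERT INTO customer_profile VALUES (1, 'first')")
second_profile = is_rejected("INSERT INTO customer_profile VALUES (1, 'second')")
# Recursive: a top-level employee has no supervisor; another reports to them.
top_employee_ok = not is_rejected("INSERT INTO employee VALUES (1, NULL)")
report_ok = not is_rejected("INSERT INTO employee VALUES (2, 1)")
```

Relational constraints of this kind cannot directly force the "one" side of a 1:N relationship to have at least one child row; that direction of total participation usually needs triggers or application logic.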
A schema is not merely a collection of tables but a multi-layered definition of data organization, formally described by the ANSI-SPARC three-schema architecture to ensure data independence:
External Schema (View Level): Represents the user-specific view. It defines only the data relevant to a particular user or application, hiding the rest of the database for security and simplicity.
Conceptual Schema (Logical Level): Describes what data is stored, along with its relationships, constraints, and semantics, independent of physical storage details. It acts as the intermediary between external views and internal storage.
Internal Schema (Physical Level): Describes how data is physically stored on the storage medium. It defines storage structures, access paths (indexes), data compression, encryption, and record placement.
**Data Independence** is the core benefit of this architecture:
- Logical Data Independence: The ability to change the Conceptual Schema (e.g., adding a column) without affecting External Schemas or application programs.
- Physical Data Independence: The ability to change the Internal Schema (e.g., adding an index or changing file organization) without affecting the Conceptual Schema.
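Both kinds of independence can be demonstrated on a small scale. In the sketch below, a SQL view stands in for an external schema, the base table for the conceptual schema, and an index for an internal-schema detail. The names (`customer`, `customer_directory`, `loyalty_tier`, `idx_customer_name`) are assumptions made for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema: the full definition of what is stored.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT
);
INSERT INTO customer VALUES (1, 'Ada', 'ada@example.com');

-- External schema: a view exposing only what one application needs.
CREATE VIEW customer_directory AS
    SELECT customer_id, name FROM customer;
""")


def read_directory() -> list[tuple]:
    """The application's query, written only against the external view."""
    return conn.execute(
        "SELECT customer_id, name FROM customer_directory ORDER BY customer_id"
    ).fetchall()


before = read_directory()

# Logical data independence: the conceptual schema gains a column,
# yet the view and the application query are untouched.
conn.execute("ALTER TABLE customer ADD COLUMN loyalty_tier TEXT")
after_logical_change = read_directory()

# Physical data independence: an access path is added at the internal level,
# yet the table definition and the query results are unchanged.
conn.execute("CREATE INDEX idx_customer_name ON customer(name)")
after_physical_change = read_directory()
```

The application code in `read_directory` is never edited, which is the practical meaning of independence here. Changes that remove or rename something the view depends on would, of course, still break it.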