Describing Data Migrations vs. Schema Migrations: What Your PR Needs to Know
As engineers, we spend a lot of time interacting with databases. And inevitably, those databases evolve. Whether you're adding a new feature or fixing a bug, you'll often find yourself needing to change the database structure or its contents. This is where migrations come in.
But not all migrations are created equal. There's a fundamental difference between changing the structure of your database (schema migration) and changing the data within it (data migration). Understanding this distinction isn't just academic; it profoundly impacts how you write your code, test your changes, and, crucially, how you describe them in a pull request. A well-crafted PR description for a migration can be the difference between a smooth deployment and a production nightmare.
Let's break down these two types of migrations and why they demand different approaches in your PRs.
What is a Schema Migration?
A schema migration involves altering the structure or definition of your database. You're changing the blueprint, not the contents of the house. These are typically DDL (Data Definition Language) operations.
Common schema migration examples include:
- Adding, dropping, or renaming tables.
- Adding, dropping, or modifying columns (e.g., changing data types, adding constraints like
NOT NULLorUNIQUE). - Creating, dropping, or modifying indexes.
- Adding or removing foreign key constraints.
- Changing default values for columns.
When you perform a schema migration, you're essentially telling the database system to reconfigure itself. The data already present in your tables might be affected by these structural changes (e.g., a new column might be NULL by default for existing rows), but the primary goal is to change the definition.
Implications for Your PR and Deployment
Schema migrations often go hand-in-hand with application code changes. For instance, if you add a new email_verified_at column to a users table, your application code will likely need to start writing to and reading from that column.
Key considerations for PRs involving schema migrations:
- Backward Compatibility: This is paramount. Can the old version of your application code still run successfully against the new schema, and vice-versa (if you need to roll back the code without rolling back the schema immediately)? Adding a
NULL-able column is usually backward-compatible; dropping a column or making oneNOT NULLoften is not. - Downtime: While modern database systems and careful planning can minimize this, some schema changes (especially large ones or those requiring table rewrites) can cause locking or temporary downtime, particularly in busy production environments.
- Rollback Strategy: Reverting a schema migration usually means running an inverse DDL operation. For example,
ALTER TABLE ADD COLUMNis reversed byALTER TABLE DROP COLUMN. This is generally straightforward, assuming no data was irrevocably lost.
Real-world Example: Adding a New Column in Rails
Let's say you're using Ruby on Rails and need to add a status column to an existing orders table.
# db/migrate/20231027100000_add_status_to_orders.rb
class AddStatusToOrders < ActiveRecord::Migration[7.0]
def change
add_column :orders, :status, :string, default: 'pending', null: false
end
end
This migration changes the orders table schema. In your PR description, you'd want to highlight:
- What: Adds a
statuscolumn toorders. - Why: To track the state of an order (e.g., 'pending', 'shipped', 'delivered').
- Backward Compatibility: The
default: 'pending', null: falseensures that existing rows get a default value, making it easier for older code to potentially interact without immediate crashes (though the old code won't know aboutstatus). New code will rely on this column. - Deployment: This migration would typically run before or during the deployment of the application code that uses the new
statuscolumn. - Test Plan: Verify that
Order.createcorrectly sets the default status, and that existing orders now have 'pending' status after the migration. Check that the application code using this new column functions as expected.
Pitfalls with Schema Migrations
- Blocking Operations:
ALTER TABLEstatements can acquire locks, potentially blocking reads/writes on large tables for extended periods. Tools likept-online-schema-change(for MySQL) orgh-ostare often used to mitigate this. - Non-nullable Columns: Adding a
NOT NULLcolumn without a default value to an existing table with data will fail unless you first add it asNULL-able, backfill data, and then add theNOT NULLconstraint. - Data Type Changes: Changing a column's data type (e.g.,
INTtoVARCHAR) can lead to data loss or truncation if not handled carefully, and can also be a blocking operation.
What is a Data Migration?
A data migration, on the other hand, involves altering the actual data within your database, without necessarily changing the schema. You're reorganizing the furniture and contents inside the house, not changing the house's structure. These are typically DML (Data Manipulation Language) operations.
Common data migration examples include:
- Populating a new column with values derived from existing data.
- Cleaning up inconsistent data (e.g., standardizing country codes, fixing typos).
- Merging duplicate records.
- Transforming data from one format to another within the same or different columns.
- Archiving old data by moving it to a separate table or flagging it.
Data migrations often occur after a schema migration (to backfill data into a new column) or independently (to correct data quality issues).
Implications for Your PR and Deployment
Data migrations are less about structural compatibility and more about data integrity, correctness, and performance.
Key considerations for PRs involving data migrations:
- Idempotency: Can the migration be run multiple times without causing incorrect results? This is crucial for safety and recovery. If your script updates records, running it again shouldn't re-update records that