Describing Data Migrations vs. Schema Migrations: What Your PR Needs to Know

As engineers, we spend a lot of time interacting with databases. And inevitably, those databases evolve. Whether you're adding a new feature or fixing a bug, you'll often find yourself needing to change the database structure or its contents. This is where migrations come in.

But not all migrations are created equal. There's a fundamental difference between changing the structure of your database (schema migration) and changing the data within it (data migration). Understanding this distinction isn't just academic; it profoundly impacts how you write your code, test your changes, and, crucially, how you describe them in a pull request. A well-crafted PR description for a migration can be the difference between a smooth deployment and a production nightmare.

Let's break down these two types of migrations and why they demand different approaches in your PRs.

What is a Schema Migration?

A schema migration involves altering the structure or definition of your database. You're changing the blueprint, not the contents of the house. These are typically DDL (Data Definition Language) operations.

Common schema migration examples include:

  • Adding, dropping, or renaming tables.
  • Adding, dropping, or modifying columns (e.g., changing data types, adding constraints like NOT NULL or UNIQUE).
  • Creating, dropping, or modifying indexes.
  • Adding or removing foreign key constraints.
  • Changing default values for columns.

When you perform a schema migration, you're essentially telling the database system to reconfigure itself. The data already present in your tables might be affected by these structural changes (e.g., a new column might be NULL by default for existing rows), but the primary goal is to change the definition.

Implications for Your PR and Deployment

Schema migrations often go hand-in-hand with application code changes. For instance, if you add a new email_verified_at column to a users table, your application code will likely need to start writing to and reading from that column.

Key considerations for PRs involving schema migrations:

  • Backward Compatibility: This is paramount. Can the old version of your application code still run successfully against the new schema, and vice-versa (if you need to roll back the code without rolling back the schema immediately)? Adding a NULL-able column is usually backward-compatible; dropping a column or making one NOT NULL often is not.
  • Downtime: While modern database systems and careful planning can minimize this, some schema changes (especially large ones or those requiring table rewrites) can cause locking or temporary downtime, particularly in busy production environments.
  • Rollback Strategy: Reverting a schema migration usually means running an inverse DDL operation. For example, ALTER TABLE ADD COLUMN is reversed by ALTER TABLE DROP COLUMN. This is generally straightforward, assuming no data was irrevocably lost.

Real-world Example: Adding a New Column in Rails

Let's say you're using Ruby on Rails and need to add a status column to an existing orders table.

# db/migrate/20231027100000_add_status_to_orders.rb
class AddStatusToOrders < ActiveRecord::Migration[7.0]
  def change
    add_column :orders, :status, :string, default: 'pending', null: false
  end
end

This migration changes the orders table schema. In your PR description, you'd want to highlight:

  • What: Adds a status column to orders.
  • Why: To track the state of an order (e.g., 'pending', 'shipped', 'delivered').
  • Backward Compatibility: The default: 'pending', null: false ensures that existing rows get a default value, making it easier for older code to potentially interact without immediate crashes (though the old code won't know about status). New code will rely on this column.
  • Deployment: This migration would typically run before or during the deployment of the application code that uses the new status column.
  • Test Plan: Verify that Order.create correctly sets the default status, and that existing orders now have 'pending' status after the migration. Check that the application code using this new column functions as expected.

Pitfalls with Schema Migrations

  • Blocking Operations: ALTER TABLE statements can acquire locks, potentially blocking reads/writes on large tables for extended periods. Tools like pt-online-schema-change (for MySQL) or gh-ost are often used to mitigate this.
  • Non-nullable Columns: Adding a NOT NULL column without a default value to an existing table with data will fail unless you first add it as NULL-able, backfill data, and then add the NOT NULL constraint.
  • Data Type Changes: Changing a column's data type (e.g., INT to VARCHAR) can lead to data loss or truncation if not handled carefully, and can also be a blocking operation.

What is a Data Migration?

A data migration, on the other hand, involves altering the actual data within your database, without necessarily changing the schema. You're reorganizing the furniture and contents inside the house, not changing the house's structure. These are typically DML (Data Manipulation Language) operations.

Common data migration examples include:

  • Populating a new column with values derived from existing data.
  • Cleaning up inconsistent data (e.g., standardizing country codes, fixing typos).
  • Merging duplicate records.
  • Transforming data from one format to another within the same or different columns.
  • Archiving old data by moving it to a separate table or flagging it.

Data migrations often occur after a schema migration (to backfill data into a new column) or independently (to correct data quality issues).

Implications for Your PR and Deployment

Data migrations are less about structural compatibility and more about data integrity, correctness, and performance.

Key considerations for PRs involving data migrations:

  • Idempotency: Can the migration be run multiple times without causing incorrect results? This is crucial for safety and recovery. If your script updates records, running it again shouldn't re-update records that