As you work with a growing team and codebase, occasionally database changes will cause some surprises and issues during deploys. This seems to be especially true in the Rails world, but isn’t exclusively a Rails app problem. Rails has a simple way to handle schema changes, which works well for small additive changes, but struggles on larger changes to an existing DB.
Having worked through this with a few teams, it is worth calling out some of the issues and the lessons learned about how to avoid them.
At Offgrid Electric, we have shared GitHub wikis with documentation, processes, and policies. Over time the team has built up processes related to making database changes to our application. These policies help us avoid some of the pitfalls, while encouraging good communication. We adjust them over time, but they might be helpful for other organizations as they begin to run into issues around their own database change process. Below is a partial, scrubbed version of our documentation.
Changes to the data model (adding or removing tables and columns) should involve active communication with the data team. The following processes should help us ensure that happens.
- Notify the `data-infrastructure` team prior to merging.
- Avoid adding columns with a `json` data type. Our data warehouse uses AWS Redshift, which does not have robust support for JSON column types, which means any data that starts off as `json` in our DB becomes incredibly hard to query. Since it's hard to know what we will want to query in the data warehouse, it's safest to assume "all of it." If that's the case, then we should strongly avoid `json` column types in Surge (see the sketch below).
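A minimal sketch of what that policy looks like in practice; the table and column names here are hypothetical, not from our schema:

```ruby
# Hypothetical migration illustrating the policy; names are made up for the example.
class AddPaymentDetails < ActiveRecord::Migration
  def change
    # Avoid: a json column is nearly unqueryable once it lands in Redshift.
    # add_column :payments, :details, :json

    # Prefer: explicit columns the warehouse can ingest and query directly.
    add_column :payments, :provider, :string
    add_column :payments, :reference_id, :string
    add_column :payments, :amount_cents, :integer
  end
end
```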
Creating an index on a large table can be slow; if you follow the process recommended in postgres concurrent indexing in Rails, it won't lock the tables during the indexing process.
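A minimal sketch of that pattern, assuming a hypothetical table and column; Postgres can only build the index concurrently if the migration runs outside a transaction:

```ruby
# Hypothetical migration; the table and column are examples, not our real schema.
class IndexAccountTransactionsOnTransactionDate < ActiveRecord::Migration
  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
  disable_ddl_transaction!

  def change
    add_index :account_transactions, :transaction_date, algorithm: :concurrently
  end
end
```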
Modifying Large Tables

Why we need a process to modify large tables: this isn't handled well by automatically running Rails migrations on deploy, since a migration that rewrites or locks a huge table can block the app and hold up the deploy while it runs.
Solution:
To avoid this we commit the migration but run it manually from a console.

Commit the migration with a `_` prepended to the file name so it won't run automatically on deploy: `db/migrate/_20160817125700_add_transaction_date.rb` vs `db/migrate/20160817125700_add_transaction_date.rb`.

When you are ready to run it, remove the `_` and run the migration manually from a console:

```bash
mv db/migrate/_20160817125700_add_transaction_date.rb db/migrate/20160817125700_add_transaction_date.rb
bundle exec rake db:migrate
```
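If you only want to run that one migration rather than everything pending, Rails can also target it by version (the timestamp below is the example one from above):

```bash
bundle exec rake db:migrate:up VERSION=20160817125700
```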
Alternate Complex Solution:
Rarely, there is a reason to bypass migrations entirely and make a change manually against the DB. In these special cases, which normally means a DBA is going to perform some magic to migrate something without locking, the process is slightly different.
Basically the same as above, BUT: still follow the `_` naming part of the process mentioned above, and guard the migration so it can't make the change on production:

```ruby
class AddTransactionDate < ActiveRecord::Migration
  def up
    unless Rails.env.production?
      add_column :account_transactions, :transaction_date, :datetime
      execute "UPDATE account_transactions SET transaction_date = created_at"
      # ...
    end
  end
end
```

Once the DBA has made the change manually on production, mark the migration as run by inserting its version into `schema_migrations` from a console:

```ruby
ActiveRecord::Base.connection.execute("INSERT INTO schema_migrations (version) VALUES('20130817125700')")
```
The `unless Rails.env.production?` guard should also protect against the migration accidentally running.

Data Migrations

Data migrations can often take far over 10 minutes; if you are iterating through and modifying a large table, we have had some that take hours. This isn't a good fit for the standard Rails migration or standard deployment processes, so we looked for some other options.
We previously ran release tasks, kind of following the thoughtbot process. These are basically managed rake tasks, but occasionally folks weren’t sure what had been run by whom and when. While this worked OK, we thought we could do better.
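For context, a release task in that style is just a Rake task run once after a deploy and deleted later. A minimal sketch, assuming a hypothetical backfill of `transaction_date` on a hypothetical `AccountTransaction` model, and using Rails 5's `in_batches` so the work is chunked rather than one giant transaction:

```ruby
# lib/tasks/release.rake -- hypothetical one-off release task.
namespace :release do
  desc "Backfill transaction_date from created_at in batches"
  task backfill_transaction_dates: :environment do
    AccountTransaction.where(transaction_date: nil).in_batches(of: 1_000) do |batch|
      # update_all issues one UPDATE per batch and skips callbacks/validations.
      batch.update_all("transaction_date = created_at")
    end
  end
end
```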
Our team ended up creating rails-data-migrations, which is a lightweight wrapper around standard Rails migrations for creating data migrations. These use the same mechanism to store the timestamp into the `schema_migrations` table as a standard migration, but they don't run on the default `db:migrate`. The data migrations live in their own `db/data_migrations` directory, and the gem includes a generator to create new migrations (`rails generate data_migration migration_name`).
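As a rough usage sketch (the migration name is hypothetical, and the gem's rake task for running data migrations is something like `rake data:migrate`; check its README for the exact interface):

```bash
# Generate a new data migration; the file lands in db/data_migrations/.
rails generate data_migration backfill_transaction_dates

# Data migrations are skipped by db:migrate and run via the gem's own rake task.
bundle exec rake data:migrate
```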