Assuming you have ruby and Bundler going, you’ll need a Gemfile to include db_subsetter. We’re working against MySQL for this example, so we’ll load that gem too.

#Gemfile
source "https://rubygems.org"

gem "db_subsetter"
gem "mysql2"

This is a minimal export script to export a subset and see what we get.

#!/usr/bin/env ruby
require 'db_subsetter'

ActiveRecord::Base.establish_connection(
        adapter:  "mysql2",
        host:      "127.0.0.1",
        username:  "joe",
        database:  "big_db"
  )

exporter = DbSubsetter::Exporter.new
exporter.export("demo-export.sqlite3")

Upon running, db_subsetter will verify exportability on all tables and let us know what problems it finds.

$ bundle exec ruby ./export.rb
Verifying table exportability ...

Verifying: blog_articles
Verifying: campaign_tracker_campaigns
Verifying: campaign_tracker_page_views
Verifying: campaign_tracker_visitors
Verifying: charges
Verifying: components
Verifying: customer_types
Verifying: customers
Verifying: departments
Verifying: features
Verifying: invoice_statuses
Verifying: invoices
Verifying: number_maps
Verifying: pages
Verifying: permissions
Verifying: picture_mappings
Verifying: pictures
Verifying: products
Verifying: schema_migrations
Verifying: sessions
Verifying: states
Verifying: videos
ERROR: Too many rows in: campaign_tracker_page_views (3825837)
ERROR: Too many rows in: campaign_tracker_visitors (1233616)
ERROR: Too many rows in: charges (141628)
ERROR: Too many rows in: customers (59351)
ERROR: Too many rows in: invoices (103001)
<snip>/exporter.rb:35:in `verify_exportability': Some tables are not exportable (ArgumentError)
        from <snip>/lib/db_subsetter/exporter.rb:42:in `export'
        from ./export.rb:18:in `<main>'

So let’s start a quick subset filter to get this under control. First things first, let’s just ignore the campaign tracking tables completely, we don’t need that for development.

# demo_subset_filter.rb
class DemoSubsetFilter < DbSubsetter::Filter
  def ignore_tables
    %w( campaign_tracker_page_views campaign_tracker_visitors )
  end
end

# export.rb
#!/usr/bin/env ruby
require 'db_subsetter'
require './demo_subset_filter'

ActiveRecord::Base.establish_connection(
        adapter:  "mysql2",
        host:      "127.0.0.1",
        username:  "joe",
        database:  "big_db"
  )

exporter = DbSubsetter::Exporter.new
exporter.filter = DemoSubsetFilter.new # <- Configure our new filter
exporter.export("demo-export.sqlite3")

Let’s run that and see what we get now:

$ bundle exec ruby ./export.rb
Verifying table exportability ...

Verifying: blog_articles
Verifying: campaign_tracker_campaigns
Verifying: charges
Verifying: components
Verifying: customer_types
Verifying: customers
Verifying: departments
Verifying: features
Verifying: invoice_statuses
Verifying: invoices
Verifying: number_maps
Verifying: pages
Verifying: permissions
Verifying: picture_mappings
Verifying: pictures
Verifying: products
Verifying: schema_migrations
Verifying: sessions
Verifying: states
Verifying: videos
ERROR: Too many rows in: charges (141628)
ERROR: Too many rows in: customers (59351)
ERROR: Too many rows in: invoices (103001)
<snip>/exporter.rb:35:in `verify_exportability': Some tables are not exportable (ArgumentError)
        from <snip>/lib/db_subsetter/exporter.rb:42:in `export'
        from ./export.rb:14:in `<main>'

Great! The two giant tables don’t show up at all anymore. Now let’s get a handle on the big tables.

# demo_subset_filter.rb
class Customer < ActiveRecord::Base; end

class DemoSubsetFilter < DbSubsetter::Filter
  def customer_ids
    Customer.order(:id => :desc).limit(1000).pluck(:id)
  end
  
  def filter_customers(query)
    query.where(query[:id].in(customer_ids))
  end
end

Which runs and removes the error on customers.

$ bundle exec ruby ./export.rb
Verifying table exportability ...

Verifying: blog_articles
Verifying: campaign_tracker_campaigns
Verifying: charges
Verifying: components
Verifying: customer_types
Verifying: customers
Verifying: departments
Verifying: features
Verifying: invoice_statuses
Verifying: invoices
Verifying: number_maps
Verifying: pages
Verifying: permissions
Verifying: picture_mappings
Verifying: pictures
Verifying: products
Verifying: schema_migrations
Verifying: sessions
Verifying: states
Verifying: videos
ERROR: Too many rows in: charges (141628)
ERROR: Too many rows in: invoices (103001)
<snip>/exporter.rb:35:in `verify_exportability': Some tables are not exportable (ArgumentError)
        from <snip>/lib/db_subsetter/exporter.rb:42:in `export'
        from ./export.rb:14:in `<main>'

Excactly what we expect. So now let’s do the other two tables:

# demo_subset_filter.rb
class Customer < ActiveRecord::Base; end
class Invoice < ActiveRecord::Bas; end
class Charge < ActiveRecord::Base; end

class DemoSubsetFilter < DbSubsetter::Filter
  def customer_ids
    Customer.order(:id => :desc).limit(1000).pluck(:id)
  end
  
  def invoice_ids
    Invoice.where(:customer_id => customer_ids).pluck(:id)
  end
  
  def charge_ids
    Charge.where(:invoice_id => invoice_ids).pluck(:id)
  end

  def filter_customers(query)
    query.where(query[:id].in(customer_ids))
  end

  def filter_invoices(query)
    query.where(query[:id].in(invoice_ids))
  end
  
  def filter_charges(query)
    query.where(query[:id].in(charge_ids))
  end

  def ignore_tables
    %w( campaign_tracker_page_views campaign_tracker_visitors )
  end
end

And run that:

$ bundle exec ruby ./export.rb
Verifying table exportability ...

Verifying: blog_articles
Verifying: campaign_tracker_campaigns
Verifying: charges
Verifying: components
Verifying: customer_types
Verifying: customers
Verifying: departments
Verifying: features
Verifying: invoice_statuses
Verifying: invoices
Verifying: number_maps
Verifying: pages
Verifying: permissions
Verifying: picture_mappings
Verifying: pictures
Verifying: products
Verifying: schema_migrations
Verifying: sessions
Verifying: states
Verifying: videos


Exporting data...

Exporting: blog_articles (1 pages).
Exporting: campaign_tracker_campaigns (1 pages).
Exporting: charges (1 pages).
Exporting: components (1 pages).
Exporting: customer_types (1 pages).
Exporting: customers (1 pages).
Exporting: departments (1 pages).
Exporting: features (1 pages).
Exporting: invoice_statuses (1 pages).
Exporting: invoices (1 pages).
Exporting: number_maps (0 pages)
Exporting: pages (1 pages).
Exporting: permissions (1 pages).
Exporting: picture_mappings (1 pages).
Exporting: pictures (1 pages).
Exporting: products (1 pages).
Exporting: schema_migrations (1 pages).
Exporting: sessions (0 pages)
Exporting: states (1 pages).
Exporting: videos (1 pages).

Success! Now to show off some more functionality, let’s say we want to only export active products and their associated charges.

class Customer < ActiveRecord::Base; end
class Invoice < ActiveRecord::Base; end
class Charge < ActiveRecord::Base; end
class Product < ActiveRecord::Base; end

class DemoSubsetFilter < DbSubsetter::Filter
  def customer_ids
    Customer.order(:id => :desc).limit(1000).pluck(:id)
  end
  
  def invoice_ids
    Invoice.where(:customer_id => customer_ids).pluck(:id)
  end
  
  def charge_ids
    Charge.where(:invoice_id => invoice_ids).pluck(:id)
  end

  def product_ids
    Product.where(:discontinued => false).limit(500).pluck(:id)
  end

  def filter_customers(query)
    query.where(query[:id].in(customer_ids))
  end

  def filter_invoices(query)
    query.where(query[:id].in(invoice_ids))
  end
  
  def filter_charges(query)
    query.where(query[:id].in(charge_ids)).where(query[:product_id].in(product_ids))
  end

  def ignore_tables
    %w( campaign_tracker_page_views campaign_tracker_visitors )
  end
end

Sometimes you’re going to want more than the default limit of rows. We can override that simply. I would like to make this more granular so you can set it really high on a few tables but leave it low elsewhere to keep data sizes from creeping up.

#!/usr/bin/env ruby
require 'db_subsetter'
require './demo_subset_filter5'

ActiveRecord::Base.establish_connection(
        adapter:  "mysql2",
        host:      "127.0.0.1",
        username:  "joe",
        database:  "big_db"
  )

exporter = DbSubsetter::Exporter.new
exporter.filter = DemoSubsetFilter.new
exporter.max_filtered_rows = 5000
exporter.export("demo-export.sqlite3")

Now it’d be awful handy to reset everybody’s password at export time and assign them a random name to protect their privacy.

This API is weak - I’m still not sure what it should look like. Expect this to change.

# demo_subset_scrambler.rb
class DemoSubsetScrambler < DbSubsetter::Scrambler
  def scramble_customers(row)
    scramble_column(:customers, :password, row, 'hashthisfirst')
    row
  end
end

# export.rb
#!/usr/bin/env ruby
require 'db_subsetter'
require './demo_subset_filter6'
require './demo_subset_scrambler6'


ActiveRecord::Base.establish_connection(
        adapter:  "mysql2",
        host:      "127.0.0.1",
        username:  "joe",
        database:  "big_db"
  )

exporter = DbSubsetter::Exporter.new
exporter.filter = DemoSubsetFilter.new
exporter.add_scrambler DemoSubsetScrambler.new # <- tell our exporter
exporter.max_filtered_rows = 5000
exporter.export("demo-export.sqlite3")

We also realize it’s horrible practice to have customer PII on developer laptops, so let’s give everybody a random name:

# demo_subset_scrambler.rb
class DemoSubsetScrambler < DbSubsetter::Scrambler
  def scramble_customers(row)
    new_name = RandomWord.adjs.next.capitalize + " " + RandomWord.nouns.next.capitalize
    scramble_column(:customers, :name, row, new_name )

    scramble_column(:customers, :password, row, 'hashthisfirst')
    row
  end
end

To import we need to setup an importer just like our exporter.

#!/usr/bin/env ruby
require 'db_subsetter'

ActiveRecord::Base.establish_connection(
        adapter:  "mysql2",
        host:      "127.0.0.1",
        username:  "joe",
        database:  "little_db"
  )


importer = DbSubsetter::Importer.new("demo-export.sqlite3", DbSubsetter::Dialect::MySQL)
importer.import

We run that and check to check on the data.


mysql> select name, password from customers limit 10;
+----------------------------+---------------+
| name                       | password      |
+----------------------------+---------------+
| Anoperineal Charadrii      | hashthisfirst |
| Branchless Lace_making     | hashthisfirst |
| Viscid Pitressin           | hashthisfirst |
| Daughterly Ice_rink        | hashthisfirst |
| Elect Whey                 | hashthisfirst |
| Saccadic Showpiece         | hashthisfirst |
| Faux Blinks                | hashthisfirst |
| Investigative Instilment   | hashthisfirst |
| Haywire Adansonia_gregorii | hashthisfirst |
| Grumous Pushup             | hashthisfirst |
+----------------------------+---------------+
10 rows in set (0.00 sec)


Bonus feature. We get a report of a production user with problems we can’t easily replicate. Let’s grab that user as part of the subset.

# demo_subset_filter.rb

  def customer_ids
    Customer.order(:id => :desc).limit(1000).pluck(:id) +
    Customer.where(:email => 'problem@user.com').pluck(:id)
  end