tweeting about DB phone number formats
Update (06-27-2017): See the update & round 2 section below for improved and corrected data.
After some frustration cleaning up some of our information architecture related to phone numbers, I posted the tweet at the top of this article.
It actually led to far more responses, questions, and discussion about the best way to store numbers… It made me think that perhaps our quick attempts to extract the best process out of multiple production table cleanups could be missing something. I decided it would be pretty easy to dig in a bit deeper for a more data-backed answer. I definitely don’t claim to be a DB performance expert, so perhaps we had reached some wrong conclusions.
a portion of the sub-discussion from the original tweet
a pretty strong statement saying we got this wrong
The premise we ended up with was that it was worth a tradeoff of some complexity to store phone numbers as bigint vs varchar in our Postgres DB. Everyone admits it is easier to work with a string number, and it avoids the leading-zero problem. Is there a valid reason to use bigint vs varchar?
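As a quick aside, a minimal illustration of the leading-zero problem (not from the original discussion):
# An integer column simply cannot represent a leading zero, so a number
# such as 0712345678 silently loses information on the way in and out.
"0712345678".to_i # => 712345678
712345678.to_s    # => "712345678", the leading 0 is gone for good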
The thread brought up a few open questions:
- Is there a valid reason to use bigint vs varchar?
- What is the difference in DB table storage size?
- What is the performance difference on insertion time?
- What is the performance difference on query time?
OK, we should be able to answer these fairly easily.
Create Simple Tables with our unique index
class DemoTables < ActiveRecord::Migration[5.1]
  def change
    create_table :demo_int_phone_numbers do |t|
      t.bigint :phone_number, limit: 8, null: false
    end
    add_index :demo_int_phone_numbers, [:phone_number], unique: true

    create_table :demo_string_phone_numbers do |t|
      t.string :phone_number, limit: 8, null: false
    end
    add_index :demo_string_phone_numbers, [:phone_number], unique: true
  end
end
Create our simple AR models
class DemoIntPhoneNumber < ActiveRecord::Base
end
class DemoStringPhoneNumber < ActiveRecord::Base
end
Build a simple benchmark script
namespace :demo do
  task :perf => :environment do
    DemoStringPhoneNumber.destroy_all
    DemoIntPhoneNumber.destroy_all

    GC.start
    puts "benchmarking insertions"
    Benchmark.bm(7) do |bm|
      bm.report do
        (0...200_000).each do |val|
          DemoStringPhoneNumber.create!(phone_number: val)
        end
      end
    end
    GC.start
    Benchmark.bm(7) do |bm|
      bm.report do
        (0...200_000).each do |val|
          DemoIntPhoneNumber.create!(phone_number: val)
        end
      end
    end

    GC.start
    puts "benchmarking queries"
    Benchmark.bm(7) do |bm|
      bm.report do
        (0...200_000).each do |val|
          DemoStringPhoneNumber.where(phone_number: val).first
        end
      end
    end
    GC.start
    Benchmark.bm(7) do |bm|
      bm.report do
        (0...200_000).each do |val|
          DemoIntPhoneNumber.where(phone_number: val).first
        end
      end
    end
  end
end
I was wrong on table size… Looking back, as we moved phone numbers from many tables to a single phone_numbers table, a number of them were using different types, including varchar, which did make the tables bigger… but that was because a number of the old tables hadn’t set a limit, meaning it defaulted to varchar(255). Once you set a limit, bigint and string with limit: 8 will result in the same size of table.
select pg_size_pretty(pg_relation_size('demo_int_phone_numbers'));
=> 8656 kB
select pg_size_pretty(pg_relation_size('demo_string_phone_numbers'));
=> 8656 kB
I actually didn’t expect a difference on this, but someone mentioned there could be a slower insertion time… It looks like there is nothing noticeable on insertion, which is nice.
# benchmarking insertions
# String table (varchar)
user system total real
385.130000 35.700000 420.830000 (568.695180)
# Int table (bigint)
user system total real
371.370000 34.110000 405.480000 (543.861259)
While numerous performance blog posts say that a Postgres btree index is more performant for integers than strings, the sample data doesn’t show as big of a difference as I was expecting. The performance benefit is questionable: it is there, but doesn’t seem worth the effort. Since this is a quick and dirty benchmark, one would have to be more systematic to prove out a real difference. I think my simplification of just inserting sequential numbers could not realistically simulate how a production index ends up.
# 200,000 queries against String Table
user system total real
194.320000 15.400000 209.720000 (265.271056)
# 200,000 queries against Int Table
user system total real
160.390000 13.010000 173.400000 (210.774260)
In the end this does show a small difference, but nothing to be impressed about.
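Since pg_relation_size above only measures the table heap, a natural follow-up is to compare the btree index sizes as well. A quick sketch, assuming the default index names Rails generates for the add_index calls above:
%w[
  index_demo_int_phone_numbers_on_phone_number
  index_demo_string_phone_numbers_on_phone_number
].each do |index_name|
  # pg_relation_size on an index returns the on-disk size of that index
  size = ActiveRecord::Base.connection.select_value(
    "SELECT pg_size_pretty(pg_relation_size('#{index_name}'))"
  )
  puts "#{index_name}: #{size}"
end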
From our cleanup of multiple tables we had multiple issues and conditions which led me to conflate some of the issues:
- Some of the old tables used varchar(255) as opposed to using a correct limit. This entirely accounts for the table size difference.
- An oversized varchar field may still impact query performance; I am not sure about this.
- Some tables only enforced uniqueness with validates :field_name, uniqueness: true, and sometimes did not enforce uniqueness at all, which I think could impact query time more significantly when there are duplicates. Proving this out would require more benchmarking (see the sketch below).
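A hedged sketch of that last point; the LegacyPhoneNumber model is hypothetical and only illustrates application-level uniqueness versus the database-level unique index used in the demo tables:
# Application-level uniqueness: Rails issues an extra SELECT before every
# save, and duplicates can still slip in under concurrent writes.
class LegacyPhoneNumber < ActiveRecord::Base
  validates :phone_number, uniqueness: true
end

# Database-level uniqueness, as the demo tables above use: the unique btree
# index both enforces the constraint and keeps lookups fast.
# add_index :demo_int_phone_numbers, [:phone_number], unique: true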
Let’s revisit our questions:
Is there a valid reason to use bigint vs varchar?
Not really, there may be slight query performance wins, but far less than I thought.
What is the difference in DB table storage size?
None, I was wrong.
What is the performance difference on insertion time?
None, which I expected, though some other folks expected to see a difference.
What is the performance difference on query time?
A slight difference; more detailed benchmarking would be good… Initial results show this to be far less significant than I initially thought. I will say this one is dependent on the performance characteristics you see in production. That being said, my assumptions here were still a bit flawed, and I was surprised by the results.
Update & Round 2
OK, so this post made sense to folks, but then people were asking why I was limiting the phone number to 8, which wouldn’t work in various countries… in fact, it wouldn’t work for our countries either… I will need to rerun my benchmarks, and I believe I will get different results.
We have historically put a limit on our int fields when doing something other than the default. The limit has a different meaning for numeric and string types in Rails DB migrations.
The :limit on bigint or int in Rails refers to the byte size, not a length limit. bigint limit: 8 is redundant, as 8 bytes is the max. Prior to Rails supporting the bigint type, the way to get a bigint was to actually use int limit: 8. In fact, if you look at the schema after creating a bigint limit: 6, the limit is just ignored by Postgres (and by MySQL, it seems). The DB storage type ends up as a bigint with 8 bytes of storage. This post explains the oddness of int limits well.
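As a rough illustration (the table and column names here are made up), the PostgreSQL adapter maps an integer :limit to a column type by byte size:
class LimitExamples < ActiveRecord::Migration[5.1]
  def change
    create_table :limit_examples do |t|
      t.integer :tiny,   limit: 2 # maps to smallint (2 bytes)
      t.integer :normal, limit: 4 # maps to integer  (4 bytes)
      t.integer :big,    limit: 8 # maps to bigint   (8 bytes)
    end
  end
end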
The limit on a string does what you expect and limits the length of the string. The way I was generating phone numbers never tested this, as I simply generated numbers between (0...200_000).
# storing a typical TZ number for us
DemoIntPhoneNumber.create!(phone_number: 2551234567890)
# storing an actual max, which is far larger than we need
DemoIntPhoneNumber.create!(phone_number: 9223372036854775807)
# let's try those with our string limit version
DemoStringPhoneNumber.create!(phone_number: 2550677939180)
ActiveRecord::ValueTooLong: PG::StringDataRightTruncation: ERROR: value too long for type character varying(8)
# and clearly the max won't work at all...
My benchmark isn’t valid, as varchar(8) isn’t really the same as bigint. The fact that the table size was previously the same was a bad coincidence, which I think shows that bigint is actually more efficient.
Let’s try again: varchar actually needs to hold 13 characters for our max-length number, and 15 for German numbers. So we will compare bigint to varchar(15), even though bigint can technically hold numbers of length 19.
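A quick sanity check on those digit counts, assuming a signed 64-bit Postgres bigint:
(2**63 - 1)               # => 9223372036854775807, the bigint maximum
(2**63 - 1).to_s.length   # => 19 digits
2551234567890.to_s.length # => 13 digits, our typical number from above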
Setup Changes
Let’s create two new tables. The first verifies that with the type bigint the limit option is ignored; it just creates a table with an 8 byte bigint Postgres type. The second creates a new string table with a varchar(15).
class DemoTablesTwo < ActiveRecord::Migration[5.1]
  def change
    create_table :demo_smallint_phone_numbers do |t|
      t.bigint :phone_number, limit: 6, null: false
    end
    add_index :demo_smallint_phone_numbers, [:phone_number], unique: true

    create_table :demo_bigstring_phone_numbers do |t|
      t.string :phone_number, limit: 15, null: false
    end
    add_index :demo_bigstring_phone_numbers, [:phone_number], unique: true
  end
end
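To double-check that the limit: 6 really is ignored, one could inspect the resulting column from a Rails console; a small sketch using the table created above:
# Look up the phone_number column and ask Postgres what type it actually got
column = ActiveRecord::Base.connection
  .columns("demo_smallint_phone_numbers")
  .find { |c| c.name == "phone_number" }
column.sql_type # => "bigint"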
OK then we will need a quick AR class
class DemoBigstringPhoneNumber < ActiveRecord::Base
end
Then the benchmark improvements
namespace :demo do
  task :perf => :environment do
    DemoBigstringPhoneNumber.delete_all
    DemoIntPhoneNumber.delete_all

    base_number = 2550000000000
    max_number  = (base_number + 150_000)

    GC.start
    puts "benchmarking insertions"
    Benchmark.bm(7) do |bm|
      bm.report do
        (base_number...max_number).each do |val|
          DemoBigstringPhoneNumber.create!(phone_number: val)
        end
      end
    end
    GC.start
    Benchmark.bm(7) do |bm|
      bm.report do
        (base_number...max_number).each do |val|
          DemoIntPhoneNumber.create!(phone_number: val)
        end
      end
    end

    # let's not just have sequential numbers; get some better data by
    # inserting arbitrary numbers as well
    (75_000).times do
      val = max_number + rand(90_000_000)
      begin
        DemoBigstringPhoneNumber.create!(phone_number: val)
        DemoIntPhoneNumber.create!(phone_number: val)
      rescue => err
        # ignore duplicates that occur because of rand
        puts err
      end
    end

    GC.start
    puts "benchmarking queries"
    Benchmark.bm(7) do |bm|
      bm.report do
        DemoBigstringPhoneNumber.find_each do |record|
          DemoBigstringPhoneNumber.where(phone_number: record.phone_number).first
        end
      end
    end
    GC.start
    Benchmark.bm(7) do |bm|
      bm.report do
        DemoIntPhoneNumber.find_each do |record|
          DemoIntPhoneNumber.where(phone_number: record.phone_number).first
        end
      end
    end
  end
end
After running a number of tests, the table size issue does become more clear. This now matches my original expectation.
select count(*) from demo_int_phone_numbers; => 339669
select pg_size_pretty(pg_relation_size('demo_int_phone_numbers'));
=> 14 MB
select count(*) from demo_bigstring_phone_numbers; => 339669
select pg_size_pretty(pg_relation_size('demo_bigstring_phone_numbers'));
=> 17 MB
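A rough back-of-the-envelope for why the gap appears, assuming an 8-byte bigint versus a 13-digit string with Postgres's 1-byte short varlena header (ignoring per-row overhead and alignment padding):
bigint_bytes  = 8
varchar_bytes = 1 + 13       # 1-byte header + 13 characters => 14
varchar_bytes - bigint_bytes # => roughly 6 extra bytes per stored number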
On insertion time, again no real difference.
Results are starting to show a bit more of a difference. They are likely more valid, as this time we aren’t just querying sequential numbers but have some randomness introduced.
# queries against String Table varchar(15)
user system total real
377.940000 28.550000 406.490000 (550.554212)
# queries against Int Table
user system total real
302.260000 23.320000 325.580000 (403.738332)
Let’s revisit our questions:
Is there a valid reason to use bigint vs varchar?
Perhaps. The query improvements are smaller than I thought, but noticeable. The table size difference is real, but also not that large. It would really depend on your needs and the size of your tables. In the end, Google says never to store phone numbers as numeric data (rule #24), but given a known set of countries and query performance needs, you can examine the tradeoffs.
What is the difference in DB table storage size?
Yes, there is a difference. The type bigint is a more efficient way to store 15-digit numbers.
What is the performance difference on insertion time?
None, which I expected, though some other folks expected to see a difference.
What is the performance difference on query time?
The larger set of test data and the randomness introduced into the data seem to show an increasing improvement for bigint queries. While it is still smaller than I anticipated, it does seem to have a real impact.