Optimising Ruby on Rails database query
ActiveRecord makes it easy to write queries but understanding how it works behind the hood is important for performance.
I was asked to investigate a very slow RoR application and determined that it was due to the way the ActiveRecord was written. These were the 2 models causing the problems:
# app/models/event.rb
class Event < ApplicationRecord
scope :final, -> { where final: true }
validates :event_number, uniqueness: { scope: :country_id }
belongs_to :country
has_many :races
has_many :trials
has_many :race_timings
def calculate_completion runner
pids = races.where.not run_type: 'trial'
best_timings = pids.pluck(:id).map { |pid| runner.get_best_time nil, rid, false }
best_timings << runner.get_best_time(id, nil, true)
best_timings = best_timings.map { |e| e ? e.time : 0.0 }
best_timings = best_timings.map { |e| !e.nil? && e>1.0 ? 1.0 : e }
pids.count==0 || runner.nil? ? 0.0 : (100.0 * best_timings.compact.sum / best_timings.count.to_f).ceil
end
end
# app/models/runner.rb
class Runner < ApplicationRecord
has_many :trials
has_many :race_timings
validates :uid, uniqueness: true
validates :email, uniqueness: true
def get_best_time event_id, race_id, trial
s = race_timings.where event_id: event_id, trial: true, best_time: true if trial==true
s = race_timings.where race_id: race_id, trial: false, best_time: true if trial==false
s.first
end
end
It seems too easy to write database queries in Ruby on Rails, that some developers forgot about how it works behind the hood. The above code were taking approximately 28 seconds to complete. I've made some comments in the code to explain why:
def calculate_completion runner
# At this point, the query is not made yet.
pids = races.where.not run_type: 'trial'
# Inefficient 1: the .pluck() method triggers the query but the result is passed immediately to .map() method.
# Inefficient 2: each id then triggers a call to runner.get_best_time().
best_timings = pids.pluck(:id).map { |pid| runner.get_best_time nil, rid, false }
best_timings << runner.get_best_time(id, nil, true)
best_timings = best_timings.map { |e| e ? e.time : 0.0 }
best_timings = best_timings.map { |e| !e.nil? && e>1.0 ? 1.0 : e }
# Inefficient 3: since .pluck() result was retained, the .count() method triggers another query.
pids.count==0 || runner.nil? ? 0.0 : (100.0 * best_timings.compact.sum / best_timings.count.to_f).ceil
end
It should be noted that the query is not called until some trigger. The method
.pluck()
is that trigger which then returns an array and is immediately chained to the method.map()
. This means the result is not retained for use in the last method.count()
.Within each
.map()
there were multiple calls torunner.get_best_time()
.
I re-wrote the methods:
class Event < ApplicationRecord
scope :final, -> { where final: true }
validates :event_number, uniqueness: { scope: :country_id }
belongs_to :country
has_many :races
has_many :trials
has_many :race_timings
def calculate_completion runner
# No change.
end
# Added new method for optimization.
def calculate_completion_v2 runner
# The .pluck() method is called and results retained.
pids = races.where.not(run_type: 'trial').pluck(:id)
# A single combined query instead.
best_timings = runner.get_best_times(id, pids).to_a
# No change to calculation.
best_timings << runner.get_best_time(id, nil, true)
best_timings = best_timings.map { |e| e ? e.time : 0.0 }
best_timings = best_timings.map { |e| !e.nil? && e>1.0 ? 1.0 : e }
# The new multiple queries does not return results for some pids, thus have to pad it.
([pids.length + 1 - best_timings.length, 0].max).times.each{|i| best_timings << 0}
# Using .length instead of .count() method.
pids.length==0 || runner.nil? ? 0.0 : (100.0 * best_timings.compact.sum / best_timings.length.to_f).ceil
end
end
class Runner < ApplicationRecord
has_many :trials
has_many :race_timings
validates :uid, uniqueness: true
validates :email, uniqueness: true
def get_best_time event_id, race_id, trial
s = race_timings.where event_id: event_id, trial: true, best_time: true if trial==true
s = race_timings.where race_id: race_id, trial: false, best_time: true if trial==false
s.first
end
# Added a new method for combining queries.
def get_best_times event_id, race_ids
# Using a PostgreSQL specific syntax to combine the queries.
race_timings.select('distinct on(race_id) *').where(race_id: race_ids, trial: false, best_time: true).order(:race_id, :id)
end
end
The new code now takes about 8 seconds. The next optimisation is to create indexes:
CREATE INDEX race_timings_runner_best_time_idx ON public.race_timings
(
runner_id,
best_time
);
Now it is less than 3 seconds.