Optimizing Solr and Rails – Index in the background


Update: 2008-02-21 We’re looking into using ActiveMessaging and Amazon SQS to help with the workflow for background processing. Stay tuned for an updated post.

With before_save and after_save filters being so easy to use, it’s tempting to add more and more pre- and post-processing to saving an ActiveRecord model. For Obsidian Portal, we update permissions, set timestamps of associated objects, and do all sorts of other things. Unfortunately, all this extra work takes time, and can significantly slow down your application. The more work you do on the main execution thread, the longer Mongrel is tied up and unable to service other requests. Worse, if something goes wrong in any of the filters, Rails will roll back the database transaction, and *poof* it’s all gone!

A while back, we started seeing ‘rbuf_fill’ timeout errors in the server logs. From what we could see, calls to acts_as_solr indexing were timing out, interrupting the save. For us, this was really bad. People would spend lots of time painstakingly crafting their perfect blog posting or wiki page, only to have it evaporate into nothing. All they saw was our default “Internal Server Error” page. Sure, it looks nice, but no one wants to see that ;)

Tracing the timeout back to Solr was not hard, and the solution was clear: take the indexing out of the main execution thread and move it to a background process. Luckily, acts_as_solr made this a fairly easy refactoring process. Here’s what we did:

Add an :if clause to your acts_as_solr macro call

acts_as_solr supports an :if clause that will be used to determine whether or not the record will be indexed when save is called. We want this to always evaluate to false, except when we explicitly set it to true during off-line processing. Below is an example from one of our models:

    acts_as_solr :fields => [:name, :body, :post_title, :post_tagline, :slug],
                 :if => :solr_index?

    def solr_index?
      @solr_index
    end
    attr_writer :solr_index
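
To make the effect of the flag concrete, here is a plain-Ruby sketch of the gating idea, with no Rails or acts_as_solr involved. `FakePage`, its `save` method, and the `indexed` list are hypothetical stand-ins invented for illustration; the real plugin performs the equivalent check inside its own save callback.

```ruby
# Hypothetical stand-in for a model; this is NOT the acts_as_solr implementation.
class FakePage
  attr_writer :solr_index
  attr_reader :name

  # Collects the names that were "sent to the indexer".
  def self.indexed
    @indexed ||= []
  end

  def initialize(name)
    @name = name
  end

  # Same predicate as in the model above: nil (falsy) unless explicitly set.
  def solr_index?
    @solr_index
  end

  # Mimics a save whose indexing hook only fires when the predicate is truthy.
  def save
    self.class.indexed << name if solr_index?
    true
  end
end

page = FakePage.new("wiki-home")
page.save                 # normal web request: nothing is indexed
page.solr_index = true
page.save                 # background job: the record is indexed
puts FakePage.indexed.inspect
```

Because the instance variable starts out as nil, every ordinary save skips indexing without any extra code; only a caller that deliberately sets the flag pays the indexing cost.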

Use rake/cron to do the indexing in the background

Now that indexing does not happen on save, we need to make sure it happens at some point. Our solution was to move it to a rake task that gets executed by a periodic cron job. Rake + cron has worked well for us in the past, so we’ll stick with it.

The task itself is very simple. Find all the objects that have been updated since the last indexing, and push them to Solr.

Below is the rake task that I wrote. If I were more clever, I would probably come up with a neat trick for automatically finding all the models that support Solr indexing. Now that I’m an official committer on acts_as_solr, maybe I’ll try to figure something out and get it into the trunk. Still…I’m lazy :)

    namespace :solr do
      namespace :index do
        desc "Indexes campaigns"
        task :campaigns => :environment do
          index_class(Campaign)
        end

        desc "Indexes wiki pages"
        task :wiki_pages => :environment do
          index_class(WikiPage)
        end

        desc "Indexes game contents"
        task :game_contents => :environment do
          index_class(GameContent)
        end

        desc "Indexes users"
        task :users => :environment do
          index_class(User)
        end

        desc "Indexes everything that we're storing in solr"
        task :all => [:campaigns, :wiki_pages, :game_contents, :users]

        def index_class(klass)
          # If REBUILD is set to "true" then we rebuild the entire index
          rebuild = ENV["REBUILD"] == "true"

          # A 100-year window effectively matches every record
          interval = rebuild ? 100.years : 30.minutes

          # Fetch the recently updated records, 20 at a time
          objects = klass.find(:all,
            :conditions => ["updated_at > ?", Time.now - interval],
            :page => {:size => 20, :auto => true}
          )

          objects.each do |o|
            puts("Indexing #{klass}: #{o.id}")
            # Flip the flag so the :if clause lets this save through to Solr
            o.solr_index = true
            o.solr_save
          end
          klass.solr_optimize
        end
      end
    end
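
The window logic in index_class can be tried out on its own. The sketch below is plain Ruby with no Rails: `index_window`, `stale_records`, and the `Record` struct are hypothetical names made up for illustration, and the in-memory `select` only stands in for the `updated_at > ?` SQL condition.

```ruby
# Hypothetical helpers; only the 30-minute and 100-year windows come from the task above.
THIRTY_MINUTES = 30 * 60
HUNDRED_YEARS  = 100 * 365 * 24 * 60 * 60

# Returns the look-back window in seconds for a given REBUILD value.
def index_window(rebuild_flag)
  rebuild_flag == "true" ? HUNDRED_YEARS : THIRTY_MINUTES
end

Record = Struct.new(:id, :updated_at)

# Keeps the records touched after the cutoff, as the SQL condition would.
def stale_records(records, now, window)
  cutoff = now - window
  records.select { |r| r.updated_at > cutoff }
end

now = Time.at(1_200_000_000)
records = [
  Record.new(1, now - 5 * 60),       # edited 5 minutes ago
  Record.new(2, now - 2 * 60 * 60)   # edited 2 hours ago
]

incremental = stale_records(records, now, index_window(nil)).map(&:id)
full        = stale_records(records, now, index_window("true")).map(&:id)
puts incremental.inspect
puts full.inspect
```

One thing the sketch makes visible: the window is measured back from the moment the job runs, so if two runs end up more than thirty minutes apart, edits made in the gap would likely be missed until the next REBUILD=true run.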

Set up a cron job to run this every thirty minutes or so. For most sites, a half hour will be a good balance between keeping the load down and keeping search results fairly up to date.
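
A crontab entry along these lines should do it. Treat it as a sketch: the application path and the RAILS_ENV value are placeholders, not details from the setup described here. Only the task name and the REBUILD variable come from the rake file above.

```
# Hypothetical path; runs the incremental index at minute 0 and 30 of every hour.
*/30 * * * * cd /path/to/app && rake solr:index:all RAILS_ENV=production
```

A full rebuild can then be kicked off by hand with `rake solr:index:all REBUILD=true`.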

By moving the indexing off the main thread, we’ve noticed a significant reduction in the number of Solr-related exceptions. That means our users have seen far fewer “Sorry, we lost all your data” errors, and that is exactly what we were hoping for.


Build a crappy website, make a million dollars


I saw an article in the New York Times the other day about working 10 hours a week and making $10 million a year. The article profiles Markus Frind, who started an online dating website, Plenty of Fish. The main thrust of the article is that Markus barely works at all and just sits back raking in cash. Are there any other web entrepreneurs out there who are sick and tired of hearing this story over and over? This is all I ever hear, and I secretly dream about it while working on my projects, but it doesn’t match up well with the real world of late nights and no money.

Hmm, let’s look through the NY Times article and see what the keys are to creating a cash-cow website:

  • Just sit down and create a site as an experiment in teaching yourself a new programming language.
  • Ignore the interface and usability. If stuff looks bad, just say, “Users don’t care about that.”
  • Forget about customer service and moderation. Crowdsource it to the forums.
  • Don’t charge users for anything. Rely 100% on advertisements.
  • Lie in your hammock and collect checks.

Ok, I’ll admit, this is the life I want to live. This is the dream of all us web guys. However, I’m starting to suspect that the vast majority of “successful” web entrepreneurs invest a lot more effort and reap much smaller rewards. How many of us out there are pulling in $50, $100, or dare I say it, $500 a month from our projects? If I could get Obsidian Portal to generate $100 a month, I would be ecstatic. After paying off the hosting fees and whatnot, my take-home would probably be less than $1/hr. Still, it’s a goal to shoot for.

We can’t all be Digg, YouTube, or Facebook, but that’s OK. Success has different levels, and if we can just generate enough income to justify the amount of time we spend working on the sites we love, then that’s success. I just wish the media would profile a few more of us who live on the wrong side of profitability, work until 3:00 in the morning, and jump for joy with every subscription or CafePress T-shirt we sell.

Are you like me, barely scraping by (or not, as it were)? Drop a comment when you take a break from furiously writing code or begging someone not to cancel their account…

Update (Feb 2010): We’re now making a decent amount of money from Obsidian Portal. Not enough to live on, but that goal is starting to become visible on the horizon. We’ve been at it for 3 years now, and maybe, maybe we’ll be ramen profitable in another 2-3. Still, from what I’ve seen of other web entrepreneurs, that’s a smashing success. So, to all you others out there in the same boat: It’s possible, but it’s slow and it’s damn hard.
