Optimizing Solr and Rails – Index in the background


Update (2008-02-21): We’re looking into using ActiveMessaging and Amazon SQS to help with the workflow for background processing. Stay tuned for an updated post.

With before_save and after_save filters being so easy to use, it’s tempting to add more and more pre- and post-processing to saving an ActiveRecord model. For Obsidian Portal, we update permissions, set timestamps of associated objects, and do all sorts of other work. Unfortunately, all this extra work takes time and can significantly slow down your application. The more work you do on the main execution thread, the longer Mongrel is tied up doing things unrelated to servicing requests. Worse, if something goes wrong in any of the filters, Rails will roll back the database transaction, and *poof* it’s all gone!

A while back, we started seeing ‘rbuf_fill’ timeout errors in the server logs. From what we could see, calls to acts_as_solr indexing were timing out, interrupting the save. For us, this was really bad. People would spend lots of time painstakingly crafting their perfect blog posting or wiki page, only to have it evaporate into nothing. All they saw was our default “Internal Server Error” page. Sure, it looks nice, but no one wants to see that ;)

Tracing the timeout back to Solr was not hard, and the solution was clear: take the indexing out of the main execution thread and move it to a background process. Luckily, acts_as_solr made this a fairly easy refactoring process. Here’s what we did:

Add an :if clause to your acts_as_solr macro call

acts_as_solr supports an :if clause that will be used to determine whether or not the record will be indexed when save is called. We want this to always evaluate to false, except when we explicitly set it to true during off-line processing. Below is an example from one of our models:

acts_as_solr :fields => [:name, :body, :post_title, :post_tagline, :slug],
             :if => :solr_index?

def solr_index?
  @solr_index
end
attr_writer :solr_index
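
To see how the flag behaves in isolation, here is a minimal plain-Ruby sketch that needs neither Rails nor Solr. FakePage and its index_calls counter are hypothetical stand-ins; they only mimic the idea that the indexing step is skipped unless solr_index has been set, and do not reproduce acts_as_solr’s internals.

```ruby
# Hypothetical stand-in for an ActiveRecord model that uses acts_as_solr.
class FakePage
  attr_writer :solr_index
  attr_reader :index_calls

  def initialize
    @index_calls = 0
  end

  # Same predicate as in the model above, coerced to a strict boolean.
  def solr_index?
    !!@solr_index
  end

  # Simulates a save whose indexing step is guarded by the :if condition.
  def save
    @index_calls += 1 if solr_index?
    true
  end
end

page = FakePage.new
page.save               # flag unset: the indexing step is skipped
page.solr_index = true
page.save               # flag set: the indexing step runs
puts page.index_calls   # prints 1
```

Because the instance variable starts out as nil, every ordinary save skips indexing without any extra setup.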

Use rake/cron to do the indexing in the background

Now that indexing does not happen on save, we need to make sure it happens at some point. Our solution was to move it to a rake task that gets executed by a periodic cron job. Rake + cron has worked well for us in the past, so we’ll stick with it.

The task itself is very simple. Find all the objects that have been updated since the last indexing, and push them to Solr.

Below is the rake task that I wrote. If I were more clever, I would probably come up with a neat trick for automatically finding all the models that support Solr indexing. Now that I’m an official committer on acts_as_solr, maybe I’ll try to figure something out and get it into the trunk. Still…I’m lazy :)

namespace :solr do
  namespace :index do
    desc "Indexes campaigns"
    task :campaigns => :environment do
      index_class(Campaign)
    end

    desc "Indexes wiki pages"
    task :wiki_pages => :environment do
      index_class(WikiPage)
    end

    desc "Indexes game contents"
    task :game_contents => :environment do
      index_class(GameContent)
    end

    desc "Indexes users"
    task :users => :environment do
      index_class(User)
    end

    desc "Indexes everything that we're storing in solr"
    task :all => [:campaigns, :wiki_pages, :game_contents, :users]

    def index_class(klass)
      # If REBUILD is set to "true" then we rebuild the entire index
      rebuild = ENV["REBUILD"] == "true"

      # A rebuild looks back far enough to cover every record
      interval = rebuild ? 100.years : 30.minutes

      # The :page option is not core ActiveRecord; it presumably comes from a
      # pagination plugin and loads the results 20 records at a time
      objects = klass.find(:all,
        :conditions => ["updated_at > ?", Time.now - interval],
        :page => {:size => 20, :auto => true}
      )

      objects.each do |o|
        puts("Indexing #{klass}: #{o.id}")
        # Flip the flag so the :if clause lets this record through
        o.solr_index = true
        o.solr_save
      end
      klass.solr_optimize
    end
  end
end

Set up a cron job to run this every thirty minutes or so. For most sites, a half hour will be a good balance between keeping the load down and making sure the searching is fairly up to date.
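
The task above approximates “updated since the last indexing” with a fixed 30-minute window, so a late or skipped cron run could miss records. One possible variation, sketched here in plain Ruby, is to persist the start time of the previous run and use that as the cutoff. IndexCheckpoint, stale_records, and the Record struct are hypothetical names invented for this illustration; they are not part of acts_as_solr or of our code base.

```ruby
require "time"

# Hypothetical stand-in for a model row; only updated_at matters here.
Record = Struct.new(:id, :updated_at)

# Hypothetical helper that remembers when the last indexing run started.
class IndexCheckpoint
  def initialize(path)
    @path = path
  end

  # Time of the previous run, or the epoch when nothing has been recorded
  # yet (which makes the first run pick up every record).
  def last_run
    File.exist?(@path) ? Time.iso8601(File.read(@path).strip) : Time.at(0).utc
  end

  # Stores the given time in UTC, at one-second precision.
  def record(time)
    File.write(@path, time.getutc.iso8601)
  end
end

# Returns the records updated after the previous run.
def stale_records(records, checkpoint)
  cutoff = checkpoint.last_run
  records.select { |r| r.updated_at > cutoff }
end
```

In a real task you would capture the time before querying, index the stale records, and only then call record with that captured time, so that anything saved during the run is picked up next time.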

By moving the indexing off the main thread, we’ve noticed a significant reduction in the number of Solr related exceptions. That means our users have seen a significant reduction in the number of “Sorry, we lost all your data” errors, and that is exactly what we were hoping for.


subversion vendor branches in action – going from 0.7 to 0.8.5 of acts_as_solr


As we have mentioned numerous times, we here at Aisle Ten prefer vendor branches for installing our plugins. We have found that most plugins require some tweaking or modification before we can put them to use exactly as we want. Rather than being a strike against the plugin architecture, I count this as one of its greatest strengths. Plugins are usually simple enough that it is easy to understand what they do, and then modify them to support exactly what you need.

In my previous post, I gave directions on how to create a vendor branch for a plugin. What was missing was a good explanation of how to actually upgrade when a new release comes from the vendor. Today I’ll cover that, using the real-life example of acts_as_solr, where the version we currently have is 0.7 and the new version we want is 0.8.5.

Update: Thanks to Chris in the comments, there may be a way to do all of this “the correct way.” Skip down to the bottom to see.

What the SVN book recommends

The definitive source for info regarding subversion is the book, Version Control with Subversion. It’s available online for free. If you haven’t skimmed through it, now’s the time.

In the section on vendor branches, they give a very terse explanation of how to go about this:

To perform this upgrade, we checkout a copy of our vendor branch, and replace the code in the current directory with the new libcomplex 1.1 source code. We quite literally copy new files on top of existing files, perhaps exploding the libcomplex 1.1 release tarball atop our existing files and directories. The goal here is to make our current directory contain only the libcomplex 1.1 code, and to ensure that all that code is under version control. Oh, and we want to do this with as little version control history disturbance as possible.

After replacing the 1.0 code with 1.1 code, svn status will show files with local modifications as well as, perhaps, some unversioned or missing files. If we did what we were supposed to do, the unversioned files are only those new files introduced in the 1.1 release of libcomplex–we run svn add on those to get them under version control. The missing files are files that were in 1.0 but not in 1.1, and on those paths we run svn delete. Finally, once our current working copy contains only the libcomplex 1.1 code, we commit the changes we made to get it looking that way.

Why it doesn’t work

The main problem with this approach has to do with files and directories that are deleted between versions. At the end of the branch upgrade process, two things you want to have are:

  • current contains all and only the code from 0.8.5. In other words, current is an identical copy of 0.8.5 from the acts_as_solr repository
  • Our repository contains the history of the transition from 0.7 to 0.8.5.

Getting the second one is a little tough, and getting them both together is very tricky. From the subversion book:

We quite literally copy new files on top of existing files, perhaps exploding the libcomplex 1.1 release tarball atop our existing files and directories. The goal here is to make our current directory contain only the libcomplex 1.1 code

What about deleted files? Copying the new files in or exploding the tarball will cover changed and added files, but any files or directories that were deleted will show up in svn as unmodified. Using this method as described, it is impossible to know whether a file listed as unmodified is actually present in 0.8.5 without manually verifying its existence. For acts_as_solr this would be an annoyance. For a project with thousands of files, it would be a nightmare, and the human operator would surely make mistakes, meaning what you have in your repository would not be an exact copy of the vendor’s release. That’s about the worst possible outcome.
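
The manual verification could in principle be scripted. Here is a minimal Ruby sketch, assuming you still have a pristine checkout of the old version and an unpacked copy of the new release side by side. relative_paths and removed_paths are hypothetical helper names; the script only reports candidates for svn delete and does not touch the repository.

```ruby
require "find"
require "pathname"

# Lists every file and directory under root as a path relative to root,
# skipping Subversion's .svn administrative directories.
def relative_paths(root)
  base = Pathname.new(root)
  paths = []
  Find.find(root) do |path|
    Find.prune if File.basename(path) == ".svn"
    next if path == root
    paths << Pathname.new(path).relative_path_from(base).to_s
  end
  paths.sort
end

# Paths present in the old working copy but absent from the new release:
# the candidates for svn delete.
def removed_paths(working_copy, release_dir)
  relative_paths(working_copy) - relative_paths(release_dir)
end
```

Note that the comparison has to run before the tarball is exploded over the working copy; afterwards the two trees can no longer be told apart.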

Solution 1: Strip out all except directories and .svn

One possible solution is to “clean” your working copy before exploding the tarball. In this case, you write a script that walks your working copy deleting everything except the directories and their .svn subdirectories. Now, when you explode the tarball and get a list of changes, it will tell you all the files that are now missing (and therefore are not part of the release). Then it’s easy to follow the book’s instructions and remove them from svn.
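
A minimal sketch of such a cleaning script in Ruby might look like the following. strip_working_copy is a hypothetical name, and the script is destructive, so treat it as an illustration to try on a scratch copy rather than a tested tool.

```ruby
require "find"
require "fileutils"

# Deletes every versioned file in a working copy while keeping the directory
# skeleton and all .svn administrative directories intact.
# Returns the sorted list of removed paths.
def strip_working_copy(root)
  removed = []
  Find.find(root) do |path|
    if File.directory?(path)
      # Leave .svn directories (and everything inside them) untouched.
      Find.prune if File.basename(path) == ".svn"
      next
    end
    FileUtils.rm(path)
    removed << path
  end
  removed.sort
end
```

After running it, exploding the tarball over the skeleton should make svn status report each file that did not come back as missing.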

Why it doesn’t work

Besides the fact that I have not been able to find a script that does what I’m describing, it doesn’t solve the problem of deleted directories. If, during the change in versions, entire directories have been deleted, this method will not detect that. After running the script, you’re left with an empty skeleton of your current version. Exploding the tarball will overlay the directory structure of the new version, but any deleted directories will still be present. They’ll be empty, but they’ll still be there. Again, it is impossible, without manual intervention, to determine if any particular empty directory still belongs. This is a much better situation than before, where we had to check every unmodified file, but it’s still an annoyance, and allows for human error. On to the next solution…

Solution 2: Merge against the vendor’s repository

When I thought of this one, I was very excited. It should conceivably allow me to use svn to handle all the dirty work, which is what it does best. If your vendor allows you read access to their svn repository, you should be able to do a merge (diff) between your current version, and their newest version, and apply that to the current. It’s a standard merge operation saying “What do I have to do to make my copy look like theirs?”

Why it doesn’t work

Simply put, svn does not allow you to merge between two different repositories. Why this is, I have no idea, but they must have their reasons. So, this idea takes a bullet to the head.

Solution 3: Import, then merge against your repository

When I realized that Solution 2 would not work, it occurred to me that I could just import the newest release into my repository and then merge with that.

Step by step:

  1. Do an export from the vendor’s repository (or explode the tarball) of the latest version to somewhere local (let’s say temp_latest)
  2. Import this code into your repository: svn import temp_latest http://your/repo/here/vendor/some_package/latest
  3. Check out current to somewhere: svn co http://your/repo/here/vendor/some_package/current temp_current
  4. Merge and apply: svn merge http://your/repo/here/vendor/some_package/current http://your/repo/here/vendor/some_package/latest temp_current
  5. Commit the changes: svn ci temp_current
  6. Bonus points – verify the integrity: svn diff http://your/repo/here/vendor/some_package/current http://your/repo/here/vendor/some_package/latest. If there are no differences, then current now contains an exact copy of the new release. Good job!
  7. Tag the new current: svn copy http://your/repo/here/vendor/some_package/current http://your/repo/here/vendor/some_package/2.0

At this point, you can copy the new tag to your trunk and deal with any conflicts. This is all covered in the SVN book.
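
If you do this often, the svn steps above could be collected in a small Ruby helper. The sketch below only builds and prints the commands for steps 2 through 7 (step 1, the export, depends on the vendor). vendor_upgrade_commands, its parameters, and the commit messages are assumptions made for illustration; the URLs follow the placeholder layout used in the steps.

```ruby
require "shellwords"

# Builds the svn commands for the import-then-merge steps as argument arrays.
# repo_base, package, and tag are placeholders supplied by the caller.
def vendor_upgrade_commands(repo_base, package, tag,
                            latest_dir: "temp_latest", current_dir: "temp_current")
  current = "#{repo_base}/vendor/#{package}/current"
  latest  = "#{repo_base}/vendor/#{package}/latest"
  tagged  = "#{repo_base}/vendor/#{package}/#{tag}"

  [
    ["svn", "import", latest_dir, latest, "-m", "Import #{package} #{tag}"],
    ["svn", "co", current, current_dir],
    ["svn", "merge", current, latest, current_dir],
    ["svn", "ci", current_dir, "-m", "Merge #{package} #{tag} into current"],
    ["svn", "diff", current, latest],
    ["svn", "copy", current, tagged, "-m", "Tag #{package} #{tag}"]
  ]
end

# Print the commands for review instead of running them.
vendor_upgrade_commands("http://your/repo/here", "some_package", "2.0").each do |cmd|
  puts Shellwords.join(cmd)
end
```

To actually execute a step you could pass each array to system(*cmd) and stop at the first failure; printing first makes it easy to check the URLs before anything touches the repository.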

Why it works…but kind of sucks

This is my preferred solution, but it’s not perfect. My main issue is that each time it’s done, you have to import a brand new copy of the code into your repository in order to do the merge. This is pretty wasteful from a storage perspective, and can be prohibitive if the import is sizeable and/or the vendor puts out new releases on a fast schedule. It won’t be a problem for (most) plugins, but for larger projects it is probably impractical.

Update: In hindsight, I don’t think this actually works the way I intended. When you import the new files, svn does not realize that they’re in any way related to the files in current. Therefore, when you do the merge, rather than applying changes to the old files, it simply deletes them and adds the new ones. In other words, the merge simply replaces all the old files with the new ones, rather than inspecting and recording the changes that occurred. This process will discard any changes that you’ve made, making the whole process fairly worthless.

Going from 0.7 to 0.8.5

We chose solution 3, import and merge, for handling the move from 0.7 to 0.8.5 of acts_as_solr. I am happy with the result, as I noticed during the process that the plugin had been seriously re-organized, resulting in the deleting and moving of many files and directories. Had we tried any of the other solutions, it would have been a painstaking process figuring out where things had moved. With the route we took, svn handled all of that.

Once I decided to go that route, the entire process took about 15 minutes. There were no conflicts to deal with when merging to trunk, but that is probably due to the fact that the only changes I made resulted in patches that were accepted back into the acts_as_solr trunk.

Think you can do better?

These are the only solutions I could come up with. Some flat-out don’t work, while others have their pluses and minuses. If you have a suggestion for a better way, please post a comment and I’ll try to include your idea in the list.

Update: svn_load_dirs

What I haven’t mentioned up until now is svn_load_dirs. This is a Perl script that is supposed to help manage vendor branches. I have fumbled with it in the past and simply gotten frustrated. I just couldn’t find good documentation.

However, Chris in the comments has posted a link to an excellent post detailing svn_load_dirs, and I recommend everyone check it out. The next time I have to do a vendor branch update, I will try following these directions.
