acts_as_sphinx plugin

Posted on March 23, 2007

We can't imagine a web site without at least rudimentary search functionality. How frustrating it is when we shop for a particular product, or look for an answer on a community web site, and can't find exactly what we are looking for. Search is paramount for any online business. If a customer can't find the product she's looking for, she will go somewhere else and you lose a sale.

Unfortunately, most databases either don't provide full-text search capabilities at all or provide only minimal support, which is not enough in most cases.

Ruby developers have several options to choose from. The first is the wonderful Ferret. This library rocks! I've already written an article on how to use it to index your ri database. It provides a very rich API modeled on the Apache Lucene library, and it is considered one of the fastest IR library implementations out there. So, if you really need functionality that goes beyond plain search, such as highlighting matches in your search results, using a specific analyzer, or keeping the real data along with your index, you should definitely consider this library. Integration is even simpler thanks to the work done by Jens Kramer on his acts_as_ferret plugin. There is also a very informative and detailed article by the guys from the RailsEnvy blog.

But there is another solution to the same problem and, from my perspective, it is much easier to install and use.

Introducing sphinx

Sphinx is being developed by Andrew Aksyonoff, a Russian programmer. It's a general purpose IR library. What makes it so special is the incredible, jaw-dropping speed of searching and indexing. The developers behind the MySQL Performance Blog have written several posts that recommend sphinx as a powerful search engine, and in one of them they describe how sphinx is used in one of their commercial projects. I've been using it in one of my own projects for several months already, and it has proved to be very fast and stable.

What makes it so fast, I think, besides the genius of Andrew of course, is that sphinx trades functionality for speed. There is just not that much you can tweak with the provided API. But, mind you, the API covers more than 95% of the cases you will meet when adding general purpose search to a typical web site. The latest versions also come with an "extended" search mode that moves this library closer to its competitors.

OK, in order to use it, you need to install it first. Let's see how we do it:

Installation procedure

  1. You can download it from Andrew's website.

    $ wget http://www.sphinxsearch.com/downloads/sphinx-0.9.7-rc2.tar.gz
    
  2. Use the standard installation procedure.

    $ ./configure && make && sudo make install
    

    By completing these steps you have installed three applications:

    • searchd is a daemon process that serves incoming search requests.
    • indexer is an application that builds your index.
    • search is an application for testing your queries.
  3. In order to make it all work you need to create a configuration file.

Sphinx configuration file

I'm going to demonstrate how to create an example configuration file based on a simple database called mydatabase that has two tables: products and brands. The products table has a foreign key to the brands table.

First of all, we define a data source for our index:

    source products
    {
        type                = mysql
        sql_host            = localhost
        sql_user            = root
        sql_pass            = myrootpwd
        sql_db              = mydatabase
        sql_sock            = /tmp/mysql.sock

        sql_query           = \
         SELECT products.id, products.name, brands.name, products.description \
         FROM products INNER JOIN brands ON products.brand_id = brands.id
    }

You can use either a MySQL or a PostgreSQL database. There is also an option to use plain files as your data source; please refer to the sphinx documentation. The type parameter defines the type of database you are using, and the remaining sql_* parameters define how to access it. The sql_query parameter provides the query that sphinx uses for indexing. As you can see, our query returns four fields. The first field must always be the document id, which is what sphinx returns for each match. The other three fields are used in your searches, and you can assign a specific relevance weight to each of them.

Now we define our index itself:

    index products
    {
        source          = products
        path            = ../sphinx/products

        # morphology
        morphology          = stem_en
    }

Here we specify the data source for our index that we previously defined and where to store index files. Optionally, you can set several other attributes. Here, for example, I decided to use a particular morphology algorithm. For more options, consult the sphinx documentation.

In our last step, we provide options for the indexer application and searchd server:

    indexer
    {
        # memory limit
        mem_limit           = 32M
    }

    searchd
    {
        address             = 127.0.0.1
        port                = 3312
        log                 = ../log/searchd.log
        query_log           = ../log/searchd_query.log
        read_timeout        = 5
        max_children        = 30
        pid_file            = ../log/searchd.pid
        max_matches         = 1000
    }

These are more or less self-explanatory. You can find the whole configuration file here: sphinx.conf. Move this file to your <rails-app>/config directory.

Introducing acts_as_sphinx plugin

You can install this plugin by running this command from the top directory of your Rails application:

    $ ./script/plugin install http://svn.datanoise.com/acts_as_sphinx

This plugin extends your ActiveRecord model classes by adding sphinx search methods. It uses the sphinx.rb API library developed by Dmytro Shteflyuk, which is now included with the sphinx package. Read Dmytro's post for more details about it.

Here is how you use this plugin:

  1. We create a model class for each of our tables:

    $ ./script/generate model product
    $ ./script/generate model brand
    
  2. Since we defined the products index in our configuration file, we are going to extend the corresponding Product model. Open app/models/product.rb and modify it as follows:

    class Product < ActiveRecord::Base
       belongs_to :brand
       acts_as_sphinx
    end
    

    The acts_as_sphinx macro modifies our Product class and introduces two methods: Product.ask_sphinx and Product.find_with_sphinx. By default, it uses the name of the table associated with the model class as the name of the index, so in this case Product.find_with_sphinx will use our products index. Refer to the acts_as_sphinx source code for detailed documentation of these methods.

  3. Now we need to prepare our index data. acts_as_sphinx introduces several rake tasks in the sphinx: namespace:

    • sphinx:index - creates/rebuilds all indexes defined in <rails-app>/config/sphinx.conf file. Note that searchd daemon must be stopped when you run this task!

    • sphinx:start - starts searchd daemon using settings from <rails-app>/config/sphinx.conf file

    • sphinx:stop - stops searchd daemon

    • sphinx:rotate - rebuilds all indexes and sends searchd daemon a signal to read new index files. Note that searchd must be running when using this task!

    This way, in order to build our initial index data, we use these commands:

    $ mkdir sphinx
    $ rake sphinx:index
    
  4. OK, we are ready to start sphinx server:

    $ rake sphinx:start
    
  5. Now let's test our new search engine:

    $ ./script/console
    >> res = Product.find_with_sphinx 'ipod'
    => ...
    >> res.total
    => 3
    >> res.time
    => "0.00"
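
Sphinx itself only hands back document ids (the first column of sql_query), so the model objects have to be loaded from the database and put back into relevance order, because a database query for a list of ids does not promise any particular order. The snippet below is a minimal sketch of that reordering step and not the plugin's actual code: Row is a stand-in for an ActiveRecord object and in_sphinx_order is a hypothetical helper.

```ruby
# Stand-in for an ActiveRecord row; only the id matters for this sketch.
Row = Struct.new(:id, :name)

# ranked_ids: document ids in the order sphinx returned them (best first).
# rows:       the matching records in whatever order the database chose.
def in_sphinx_order(ranked_ids, rows)
  by_id = rows.each_with_object({}) { |row, index| index[row.id] = row }
  # Ids whose row has since been deleted from the database are skipped.
  ranked_ids.map { |id| by_id[id] }.compact
end

rows = [Row.new(1, 'iPod case'), Row.new(2, 'iPod nano'), Row.new(3, 'iPod dock')]
p in_sphinx_order([2, 3, 1], rows).map(&:name)
# prints ["iPod nano", "iPod dock", "iPod case"]
```

Skipping missing ids matters in practice, since the index is rebuilt only periodically and may still mention rows that no longer exist.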
    

Basic query functionality

The find_with_sphinx method takes all the parameters that you can pass to the ActiveRecord::Base#find method, plus a special :sphinx key. This key points to a hash of sphinx-specific parameters. These are some of them:

  • :mode defines the search mode (:all, :any, :boolean, :extended)

  • :limit restricts result to a specified number of objects, default is 20

  • :offset returns results from a specific offset, default is 0

  • :page can be used instead of :offset option to specify the page number

  • :index overrides the default index name

  • :weight is an array of weights for each index component (used in the relevance algorithm)

For example, to make product and brand names more preferable than the product description:

    Product.find_with_sphinx query, :sphinx => {:weight => [100, 100, 50]}
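The weights are positional: they follow the order of the searchable columns in sql_query (product name, brand name, description), which is easy to get wrong once the query changes. As a minimal sketch, the array can be built from named fields instead. FIELD_ORDER, the field names, and sphinx_weights are all made up for this illustration and are not part of the plugin.

```ruby
# Searchable columns in the order they appear in sql_query, after the id.
# These names are labels chosen for this sketch only.
FIELD_ORDER = [:product_name, :brand_name, :description].freeze

# Turns {:field => weight} into the positional array, filling gaps with
# a default weight and rejecting fields that are not in the index.
def sphinx_weights(by_field, default = 1)
  unknown = by_field.keys - FIELD_ORDER
  raise ArgumentError, "unknown fields: #{unknown.inspect}" unless unknown.empty?
  FIELD_ORDER.map { |field| by_field.fetch(field, default) }
end

p sphinx_weights(:product_name => 100, :brand_name => 100, :description => 50)
# prints [100, 100, 50]
```

The resulting array is what you would pass as the weights option inside the :sphinx hash.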

Pagination with acts_as_sphinx

It's quite easy to use Rails pagination with this plugin. Following the example from the acts_as_ferret article:

add to your app/controllers/application.rb:

    PER_PAGE = 10 unless defined? PER_PAGE

    def pages_for(size, options = {})
      default_options = {:per_page => PER_PAGE}
      options = default_options.merge(options)
      Paginator.new self, size, options[:per_page], (options[:page] || 1)
    end

add to app/controllers/product_controller.rb:

    @products = Product.find_with_sphinx query, :include => :brand,
      :sphinx => {:limit => PER_PAGE, :page => @page}
    @product_pages = pages_for @products.total, :page => @page

and you are good to go.
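The controller code above assumes that @page is already a sane page number. As a rough sketch of the arithmetic involved, independent of Rails and of the plugin, the hypothetical helper below clamps a requested page into the valid range and derives the offset and page count from the total number of matches.

```ruby
PER_PAGE = 10 unless defined? PER_PAGE

# Returns the page actually used, the matching offset, and the page count.
# page may be nil or a String (e.g. params[:page]); anything that does not
# convert to a positive integer falls back to page 1.
def page_window(total, page, per_page = PER_PAGE)
  page_count = [(total.to_f / per_page).ceil, 1].max
  page = [[page.to_i, 1].max, page_count].min # clamp into 1..page_count
  { :page => page, :offset => (page - 1) * per_page, :page_count => page_count }
end

p page_window(23, '3')
p page_window(23, nil)
```

With 23 matches and 10 per page there are three pages, so page 3 starts at offset 20 and a missing page parameter falls back to page 1 at offset 0.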

Index live updates

It is probably not a good idea to update your index on every change to your model objects. That's why acts_as_sphinx doesn't provide any callback methods. Instead, I would recommend scheduling your index updates once or several times a day. How often depends on your particular case: the size of your database and the frequency of updates. If you have a huge database, you can use the main+delta update scheme described in the sphinx documentation. Personally, I use a cron job to rotate my index every three hours using

    $ rake sphinx:rotate

command.
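A scheduled job has to respect the rule from the rake task list: sphinx:index expects searchd to be stopped, while sphinx:rotate expects it to be running. The sketch below picks the task by checking for the pid file. It rests on two assumptions: that the pid file ends up in log/searchd.pid under the application root (as the searchd section above suggests), and that the file exists only while the daemon is alive, which a crashed searchd could violate. sphinx_reindex_task is a hypothetical helper, not something the plugin provides.

```ruby
# Picks the rake task that is safe to run, judging only by the pid file.
# Assumption: the pid file exists exactly while searchd is running.
def sphinx_reindex_task(pid_file)
  File.exist?(pid_file) ? 'sphinx:rotate' : 'sphinx:index'
end

# Assumed location of the pid file, relative to the application root.
task = sphinx_reindex_task('log/searchd.pid')
puts "would run: rake #{task}"
```

A wrapper script called from cron could then run the chosen task, for example with system('rake', task).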

Sphinx also provides a very nice feature that lets you partition your index and serve each part from a separate server. In order to query all these parts at the same time, you define an index of a special distributed type. This way you can have the main, infrequently modified index on one server and keep the deltas on another. Once again, refer to the sphinx documentation for more information.

Conclusion

The sphinx library is a very powerful tool and a useful addition to your website.

Comments
  1. evan, March 23, 2007 @ 03:43 AM

    Damn. I have a Sphinx plugin in hiding, too; we should have collaborated.

    NOW THEY WILL FIGHT.

  2. Kent, March 23, 2007 @ 11:36 AM

    Oh, man, I had no idea. I'll give up on mine if yours is better.

  3. evan, March 23, 2007 @ 04:34 PM

    Mine introspects the database to build the Sphinx configuration for you; it also lets you merge every indexed model into a unified fieldset for use in a complete index. And it abuses the stopword list to build a custom Aspell dictionary for query suggestion. It's not in a distributable state quite yet--we're scheduled to deploy it in real life in two weeks.

    What's the license for your work? I might steal it. Some of your query API seems better.

  4. evan, March 23, 2007 @ 04:37 PM

    Oh, by the way, you probably want this patch: http://www.sphinxsearch.com/forum/view.html?id=294#1720

  5. Kent, March 24, 2007 @ 01:18 PM

    evan,

    I've been thinking about the auto-generation of sphinx.conf file, but I figured that most of the time I need to tweak this file manually anyway. It would be nice to see how you did it though.

    I'm using MIT license so you can 'steal' anything you want. :-)

    I am aware of this patch. Actually I'm running of the CVS version that contains several other important bug fixes, like the nasty bug related to index rotation. I hope Andrew will release the next version soon.

  6. evan, March 24, 2007 @ 04:01 PM

    CVS version; even better. I'll request it from Andrew.

    I'll send you an email about all this...

  7. Henry, March 26, 2007 @ 10:17 PM

    Kent,

    I'm currently using acts_as_ferret and Ferret has a potential to corrupt the indices in a multiple server configuration(e.g. multiple mongrel's). A DrbServer is being develop to solve the corrupt problem. I assume sphinx doesn't have any problem in a multi-server config because the index isn't updated at the same time ActiveRecord data is updated? Thanks.

  8. Kent, March 26, 2007 @ 11:16 PM

    Henry,

    With sphinx you can corrupt your index, if you start rebuilding your indexes while searchd is serving requests. That's why you give --rotate option to the indexer application. It makes indexer rebuild your index using a new set of files. When it's done, it will send a signal to searchd to forget about old files and switch to new ones. This way you avoid any index corruption.

  9. Zed A. Shaw, April 02, 2007 @ 01:25 PM

    Hey, I've got a patch to the Sphinx ruby api that fixes a bunch of stuff and cleans up the API so you configure everything. Hit me up if you want it before I send it to the author.

  10. Dmytro Shteflyuk, April 05, 2007 @ 11:35 AM

    Heh :-) "Author" has updated his Ruby API (it's me)

  11. Nima, April 25, 2007 @ 09:04 AM

    Great article!

  12. Avi, May 18, 2007 @ 12:31 AM

    Evan,

    Did you ever get that sphinx plugin with the model introspection for generating the sphinx.conf file? I'm converting from ferret and would really like to avoid writing that sql. I guess AR spoils me.