It is hard to imagine a web site without at least rudimentary search functionality. How frustrating it is to shop for a particular product, or to look for an answer on a community web site, and not find exactly what we are looking for. Search is paramount for any online business. If a customer can't find the product she's looking for, she will go somewhere else and you lose a sale.
Unfortunately, most databases either don't provide full-text search capabilities or provide minimal support that is not enough in most cases.
There are several options for a Ruby developer to choose from. First of all, there is the wonderful Ferret. This library rocks! I've already written an article on how to use it to index your ri database. It provides a very rich API that is based on the Apache Lucene library, and it is considered one of the fastest IR library implementations out there. So, if you really need functionality that goes beyond plain search, like highlighting in your search results, using a specific analyzer, or keeping the real data along with your index, you should definitely consider this library. The integration becomes even simpler thanks to the work done by Jens Kramer on his acts_as_ferret plugin. There is also a very informative and detailed article by the guys from the RailsEnvy blog.
But there is another solution to the same problem and, from my perspective, it is much easier to install and use.
Introducing sphinx
Sphinx is being developed by Andrew Aksyonoff, a Russian programmer. It's a general-purpose IR library. What makes it so special is the incredible, jaw-dropping speed of searching and indexing. There are several posts written by the developers of the MySQL Performance Blog that advertise sphinx as a powerful search engine. In this post they describe how sphinx is being used in one of their commercial projects. I've been using it in one of my projects for several months already, and it has proved to be very fast and stable.
What makes it so fast, I think, besides of course the genius of Andrew, is that sphinx trades functionality for speed. There is just not that much you can tweak through the provided API. But, mind you, the API covers more than 95% of the cases that a general-purpose search on most web sites needs. Also, the latest versions come with an "extended" search mode that moves this library closer to its competitors.
OK, in order to use it, you need to install it first. Let's see how we do it:
Installation procedure
You can download it from Andrew's website:
$ wget http://www.sphinxsearch.com/downloads/sphinx-0.9.7-rc2.tar.gz
Use the standard installation procedure.
$ ./configure && make && sudo make install
By completing these steps you have installed three applications:
searchd is a daemon process that serves incoming search requests.
indexer is an application that builds your index.
search is an application for testing your queries.
In order to make it all work you need to create a configuration file.
Sphinx configuration file
I'm going to demonstrate how to create an example configuration file based on a simple database called mydatabase that has two tables: products and brands. The products table has a foreign key to the brands table.
First of all, we define a data source for our index:
source products
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = myrootpwd
sql_db = mydatabase
sql_sock = /tmp/mysql.sock
sql_query = \
SELECT products.id, products.name, brands.name, products.description \
FROM products INNER JOIN brands ON products.brand_id = brands.id
}
You can use either a MySQL or a PostgreSQL database. There is also an option to use plain files as your data source; please refer to the sphinx documentation. The type parameter defines the type of database you are using, and the rest of the sql_* parameters define how to access it. The sql_query parameter provides the query that sphinx uses for indexing. As you can see, our data has four fields. The first field must always be the document id, which sphinx returns as the result of a query. The other three fields are used in your searches, and you can assign a specific relevance weight to each of them.
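To make the role of the document id concrete: a search returns ids in relevance order, and the caller then loads the matching rows. The following is a minimal sketch of that mapping step in plain Ruby, assuming the rows come back from the database in arbitrary order. The Row struct and the order_by_relevance helper are hypothetical names used only for illustration; this is not the plugin's actual code.

```ruby
# Hypothetical stand-in for a database row; a real application would use a model.
Row = Struct.new(:id, :name)

# Reorder rows so they follow the order of the ids returned by the search.
# Ids with no matching row (for example, rows deleted since the last
# index build) are skipped.
def order_by_relevance(ids, rows)
  by_id = {}
  rows.each { |row| by_id[row.id] = row }
  ids.map { |id| by_id[id] }.compact
end

rows = [Row.new(1, "case"), Row.new(7, "ipod nano"), Row.new(3, "ipod dock")]
ranked = order_by_relevance([7, 3, 42], rows)
puts ranked.map { |row| row.name }.inspect  # prints ["ipod nano", "ipod dock"]
```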
Now we define our index itself:
index products
{
source = products
path = ../sphinx/products
# morphology
morphology = stem_en
}
Here we specify the data source for our index that we previously defined and where to store index files. Optionally, you can set several other attributes. Here, for example, I decided to use a particular morphology algorithm. For more options, consult the sphinx documentation.
In our last step, we provide options for the indexer application and searchd server:
indexer
{
# memory limit
mem_limit = 32M
}
searchd
{
address = 127.0.0.1
port = 3312
log = ../log/searchd.log
query_log = ../log/searchd_query.log
read_timeout = 5
max_children = 30
pid_file = ../log/searchd.pid
max_matches = 1000
}
These are more or less self-explanatory. You can find the whole configuration file here: sphinx.conf.
Move this file to your <rails-app>/config directory.
Introducing acts_as_sphinx plugin
You can install this plugin by running this command from the top directory of your Rails application:
$ ./script/plugin install http://svn.datanoise.com/acts_as_sphinx
This plugin extends your ActiveRecord model classes by adding sphinx search methods. It uses the sphinx.rb API library developed by Dmytro Shteflyuk, which is now included with the sphinx package. Read Dmytro's post for more details about it.
Here is how you use this plugin:
We create a model class for each of our tables:
$ ./script/generate model product
$ ./script/generate model brand
Since we defined the products index in our configuration file, we are going to extend the corresponding Product model. Open app/models/product.rb and modify it as follows:
class Product < ActiveRecord::Base
  belongs_to :brand
  acts_as_sphinx
end
The acts_as_sphinx macro modifies our Product class and introduces two methods: Product.ask_sphinx and Product.find_with_sphinx. By default, it uses the name of the table associated with the model class as the name of the index. So in this case, Product.find_with_sphinx will use our products index. Refer to the acts_as_sphinx source code for detailed documentation of these methods.
Now we need to prepare our index data.
acts_as_sphinx introduces several rake tasks in the sphinx: namespace:
sphinx:index - creates/rebuilds all indexes defined in the <rails-app>/config/sphinx.conf file. Note that the searchd daemon must be stopped when you run this task!
sphinx:start - starts the searchd daemon using settings from the <rails-app>/config/sphinx.conf file
sphinx:stop - stops the searchd daemon
sphinx:rotate - rebuilds all indexes and sends the searchd daemon a signal to read the new index files. Note that searchd must be running when using this task!
This way, in order to build our initial index data, we use these commands:
$ make sphinx
$ rake sphinx:index
OK, we are ready to start sphinx server:
$ rake sphinx:start
Now let's test our new search engine:
$ ./script/console
>> res = Product.find_with_sphinx 'ipod'
=> ...
>> res.total
=> 3
>> res.time
=> "0.00"
Basic query functionality
The find_with_sphinx method takes all the parameters that you can pass to the ActiveRecord::Base.find method, plus a special :sphinx key. This key points to a hash of sphinx-specific parameters. These are some of them:
:mode defines the search mode (:all, :any, :boolean, :extended)
:limit restricts the result to a specified number of objects, default is 20
:offset returns results from a specific offset, default is 0
:page can be used instead of the :offset option to specify the page number
:index overrides the default index name
:weight is an array of weights for each index component (used in the relevance algorithm)
For example, to give product and brand names more weight than the product description:
Product.find_with_sphinx query, :sphinx => {:weight => [100, 100, 50]}
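Sphinx applies field weights positionally, in the order the fields appear in the index, which here is the order of the columns that follow the id in sql_query. Below is a small sketch, assuming that ordering, of building the array from named preferences so the positions cannot drift. FIELD_ORDER and weights_for are illustrative names, not part of acts_as_sphinx.

```ruby
# Fields in the order they follow the document id in sql_query.
# These labels are illustrative; sphinx only sees the position.
FIELD_ORDER = [:product_name, :brand_name, :description]

# Build a positional weight array from named weights.
# Fields that are not mentioned get the default weight.
def weights_for(named, default = 1)
  FIELD_ORDER.map { |field| named.fetch(field, default) }
end

weights = weights_for({ :product_name => 100, :brand_name => 100, :description => 50 })
puts weights.inspect  # prints [100, 100, 50]
```

The resulting array can then be passed as the :weight entry of the :sphinx hash.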
Pagination with acts_as_sphinx
It's quite easy to use Rails pagination with this plugin. Following the example from the acts_as_ferret article:
add to your app/controllers/application.rb:
PER_PAGE = 10 unless defined? PER_PAGE
def pages_for(size, options = {})
  default_options = {:per_page => PER_PAGE}
  options = default_options.merge(options)
  Paginator.new self, size, options[:per_page], (options[:page] || 1)
end
add to app/controllers/product_controller.rb:
@products = Product.find_with_sphinx query, :include => :brand,
  :sphinx => {:limit => PER_PAGE, :page => @page}
@product_pages = pages_for @products.total, :page => @page
and you are good to go.
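The arithmetic behind :page and the paginator is simple enough to spell out. This is a sketch of the assumed relationship between the page number, :limit, and :offset, plus the page count derived from the total. search_window and page_count are hypothetical helpers, and the plugin may compute these values differently.

```ruby
# Translate a 1-based page number into a :limit/:offset pair.
# Pages below 1 (or nil) are treated as the first page.
def search_window(page, per_page)
  page = [page.to_i, 1].max
  { :limit => per_page, :offset => (page - 1) * per_page }
end

# Number of pages needed to show `total` results, rounding up.
def page_count(total, per_page)
  (total + per_page - 1) / per_page
end

window = search_window(3, 10)
puts window[:offset]     # prints 20
puts page_count(25, 10)  # prints 3
```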
Index live updates
It is probably not a good idea to update your index on every change to your model objects. That's why acts_as_sphinx doesn't provide any callback methods. Instead, I would recommend scheduling your index updates once or several times a day. It all depends on your particular case: the size of your database and the frequency of updates. If you have a huge database, you can use the main+delta update scheme described in the sphinx documentation. Personally, I use a cron job to rotate my index every three hours using the
$ rake sphinx:rotate
command.
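For reference, rotation boils down to running indexer with its --rotate option while searchd is up. Here is a minimal sketch of assembling that command line from Ruby, for example inside a script that cron calls. The indexer_command helper and the config path are assumptions for illustration, and the plugin's own rake task may be implemented differently.

```ruby
# Build the argument list for a full reindex.
# --config, --all and --rotate are standard indexer options.
def indexer_command(config_path, rotate = true)
  command = ["indexer", "--config", config_path, "--all"]
  command << "--rotate" if rotate
  command
end

command = indexer_command("config/sphinx.conf")
puts command.join(" ")  # prints indexer --config config/sphinx.conf --all --rotate
# To actually run it: system(*command)
```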
Also, sphinx provides a very nice feature: you can partition your index and serve each part from a separate server. To query all these parts at the same time, you define an index of a special distributed type. This way you can have the main, infrequently modified index on one server and keep the deltas on another. Once again, refer to the sphinx documentation for more information.
Conclusion
The sphinx library is a very powerful tool and a useful addition to your website.
Damn. I have a Sphinx plugin in hiding, too; we should have collaborated.
NOW THEY WILL FIGHT.
Oh, man, I had no idea. I'll give up on mine if yours is better.
Mine introspects the database to build the Sphinx configuration for you; it also lets you merge every indexed model into a unified fieldset for use in a complete index. And it abuses the stopword list to build a custom Aspell dictionary for query suggestion. It's not in a distributable state quite yet--we're scheduled to deploy it in real life in two weeks.
What's the license for your work? I might steal it. Some of your query API seems better.
Oh, by the way, you probably want this patch: http://www.sphinxsearch.com/forum/view.html?id=294#1720
evan,
I've been thinking about the auto-generation of sphinx.conf file, but I figured that most of the time I need to tweak this file manually anyway. It would be nice to see how you did it though.
I'm using MIT license so you can 'steal' anything you want. :-)
I am aware of this patch. Actually, I'm running off the CVS version, which contains several other important bug fixes, like the nasty bug related to index rotation. I hope Andrew will release the next version soon.
CVS version; even better. I'll request it from Andrew.
I'll send you an email about all this...
Kent,
I'm currently using acts_as_ferret, and Ferret has the potential to corrupt the indices in a multiple-server configuration (e.g. multiple mongrels). A DrbServer is being developed to solve the corruption problem. I assume sphinx doesn't have any problem in a multi-server config because the index isn't updated at the same time ActiveRecord data is updated? Thanks.
Henry,
With sphinx you can corrupt your index if you start rebuilding your indexes while searchd is serving requests. That's why you give the --rotate option to the indexer application. It makes indexer rebuild your index using a new set of files. When it's done, it will send a signal to searchd to forget about the old files and switch to the new ones. This way you avoid any index corruption.
Hey, I've got a patch to the Sphinx ruby api that fixes a bunch of stuff and cleans up the API so you configure everything. Hit me up if you want it before I send it to the author.
Heh :-) "Author" has updated his Ruby API (it's me)
Great article!
Evan,
Did you ever get that sphinx plugin with the model introspection for generating the sphinx.conf file? I'm converting from ferret and would really like to avoid writing that sql. I guess AR spoils me.