Advanced Sphinx Configuration
Thinking Sphinx provides a good set of defaults out of the box, and for some people, those options are exactly what they need. Sometimes, though, you may need to customise how Sphinx works – and this can usually be done by adding some settings to a file named sphinx.yml in your config directory. Much like database.yml, settings are defined for each environment. Here’s an example:
    development:
      port: 9312
    test:
      port: 9313
    production:
      port: 9312
Now, Sphinx has a lot of different settings you can play with, and they’re pretty much all supported by Thinking Sphinx as well. Documentation will be added here for them over time, but in a pinch, it should be pretty easy to guess the syntax for the YAML file for each setting.
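To make the per-environment lookup concrete, here is a minimal Ruby sketch of how such a file could be read. It is not Thinking Sphinx's own loading code: the helper name sphinx_settings_for is hypothetical, and the YAML is inlined as a string (rather than read from config/sphinx.yml) so the example stands on its own.

```ruby
require "yaml"

# Inline stand-in for config/sphinx.yml, using the same values as the example.
SPHINX_YML = <<~YAML
  development:
    port: 9312
  test:
    port: 9313
  production:
    port: 9312
YAML

# Hypothetical helper: return the settings hash for one environment,
# or an empty hash when that environment has no entry in the file.
def sphinx_settings_for(environment, source = SPHINX_YML)
  all_settings = YAML.safe_load(source) || {}
  all_settings.fetch(environment.to_s, {})
end

puts sphinx_settings_for("test")["port"]      # prints 9313
puts sphinx_settings_for("staging").inspect   # prints {}
```

An environment without an entry simply gets no custom settings, which is the situation where the defaults would apply.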
Index File Location
You can customise the location of your Sphinx index files using the searchd_file_path setting. Thinking Sphinx defaults to putting these files in db/sphinx/ENVIRONMENT – which makes life easier if you’re running integration tests with a live Sphinx setup. It’s worth keeping this in mind and ensuring your file locations are unique for each environment when they share a machine. Indeed, you’ll probably only want to change this value on your production machine.
    production:
      searchd_file_path: "/var/www/latest_web20_craze/shared/sphinx"
    # ... repeat for other environments if necessary
Configuration, PID and Log File Locations
In the same vein as the above setting, you can nominate custom locations for your configuration, log and pid files.
Here’s some example syntax, using Thinking Sphinx’s defaults. Uppercase words are placeholders for system variables – you can’t actually use them in your YAML file.
    development:
      config_file: "RAILS_ROOT/config/ENVIRONMENT.sphinx.conf"
      searchd_log_file: "RAILS_ROOT/log/searchd.log"
      query_log_file: "RAILS_ROOT/log/searchd.query.log"
      pid_file: "RAILS_ROOT/log/searchd.ENVIRONMENT.pid"
    # ... repeat for other environments
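Since the placeholders can't go into the YAML file as-is, it can help to see what they expand to for a real application. The following is only a sketch: default_sphinx_paths is a hypothetical helper, and /var/www/my_app is an assumed application root.

```ruby
# Hypothetical helper: build the default locations listed above for a given
# application root (RAILS_ROOT) and environment name (ENVIRONMENT).
def default_sphinx_paths(rails_root, environment)
  {
    "config_file"      => File.join(rails_root, "config", "#{environment}.sphinx.conf"),
    "searchd_log_file" => File.join(rails_root, "log", "searchd.log"),
    "query_log_file"   => File.join(rails_root, "log", "searchd.query.log"),
    "pid_file"         => File.join(rails_root, "log", "searchd.#{environment}.pid")
  }
end

# Print each setting in a form that can be pasted under an environment key.
paths = default_sphinx_paths("/var/www/my_app", "production")
paths.each { |setting, path| puts "#{setting}: \"#{path}\"" }
```

The printed lines show the literal paths you would write into sphinx.yml if you wanted to spell out the defaults for that environment.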
Daemon Address and Port
If your Sphinx Daemon (also known as searchd) is running on a different machine or port, you’re going to need to tell Thinking Sphinx the critical details:
    production:
      address: 10.0.0.4
      port: 3200
    # ... repeat for other environments if necessary
Indexer Memory Usage
Sphinx indexes your data using the indexer command-line tool. This tool runs with a fixed memory limit – defaulting to 64 megabytes. You can change this to something else if you’d like – the more memory, the faster your indexes will be processed.
    development:
      mem_limit: 128M
    # ... repeat for other environments
Word Stemming / Morphology
By default, Sphinx and Thinking Sphinx don’t get too smart about the words you’re searching for – they assume you know exactly what you’re after. However, sometimes you may want them to recognise that certain words share pretty much the same meaning, for example think and thinking.
To enable this kind of behaviour, you need to specify a morphology (or stemming library) to Sphinx. It comes with English (stem_en) and Russian (stem_ru) built-in. You can also use other stemmers via Snowball’s libstemmer library. Have a read of Sphinx’s documentation for more clues.
    development:
      morphology: stem_en
    # ... repeat for other environments
Wildcard/Star Syntax
By default, Sphinx does not pay any attention to wildcard searching using an asterisk character. You can turn it on, though:
    development:
      enable_star: true
    # ... repeat for other environments
You’ll almost certainly want to enable infix or prefix indexing as well, though (read the next section).
Infix and Prefix Indexing
If you want partial word matching, then you’re going to need to tell Sphinx to either index prefixes (the beginnings of words) or infixes (substrings of words). You cannot enable both at once, though.
You need to tell Sphinx what the minimum infix or prefix length is – the smaller the number is, the larger your index gets. If you set it to zero, though, that disables this feature. If you want absolutely everything, down to the last character, then set min_infix_len to 1 – but be prepared for the performance hit.
    development:
      min_infix_len: 3
      # OR
      min_prefix_len: 3
    # ... repeat for other environments
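To get a feel for why a smaller minimum length means a larger index, here is a rough counting sketch. It only enumerates the prefixes and substrings of a single word; it says nothing about how Sphinx actually stores them, and the helper names prefixes and infixes are made up for illustration.

```ruby
# All prefixes of `word` that are at least `min_len` characters long.
# A minimum length of zero means the feature is off, so nothing is returned.
def prefixes(word, min_len)
  return [] if min_len < 1
  (min_len..word.length).map { |len| word[0, len] }
end

# All distinct substrings of `word` that are at least `min_len` characters long.
def infixes(word, min_len)
  return [] if min_len < 1
  (0...word.length).flat_map do |start|
    (min_len..(word.length - start)).map { |len| word[start, len] }
  end.uniq
end

puts prefixes("thinking", 3).inspect
# prints ["thi", "thin", "think", "thinki", "thinkin", "thinking"]
puts infixes("thinking", 3).size   # prints 21
puts infixes("thinking", 1).size   # prints 33
```

Even for one eight-letter word, infixes produce several times more entries than prefixes, and dropping the minimum length to 1 pushes the count higher again.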
Character Sets and Tables
By default, Thinking Sphinx uses the UTF-8 character set. If you wish to use Sphinx’s inbuilt sbcs encoding, you’ll need to specify it via the charset_type setting:
    development:
      charset_type: sbcs
    # ... repeat for other environments
This changes the default character mappings, which you can read about in the Sphinx documentation. You can also set your own character mappings – which is recommended when using UTF-8 – to include other characters. James Healy has posted his extensive settings which cover most (if not all) accented characters. If you don’t want to click through, it’s all done via the charset_table setting:
    development:
      charset_table: "0..9, A..Z->a..z, _, a..z, \
        U+410..U+42F->U+430..U+44F, U+430..U+44F"
    # ... repeat for other environments
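If the table syntax looks cryptic, the following Ruby sketch may help. It reads the table above into a plain hash of code points and applies it to a string. It is a simplified illustration, not Sphinx's parser: it handles only the forms used in this example (single characters, a..z ranges, U+ code points and -> mappings), and the helper names are hypothetical.

```ruby
# Convert one token ("A", "_" or "U+410") into a Unicode code point.
def code_point(token)
  token = token.strip
  match = token.match(/\AU\+(\h+)\z/i)
  match ? match[1].to_i(16) : token.ord
end

# Expand "a..z" (or a single token) into an array of code points.
def expand_range(spec)
  first, last = spec.split("..", 2)
  (code_point(first)..code_point(last || first)).to_a
end

# Hypothetical helper: build a { source code point => indexed code point } hash.
def parse_charset_table(table)
  table.split(",").each_with_object({}) do |entry, mapping|
    source, target = entry.strip.split("->", 2)
    next if source.nil? || source.empty?
    sources = expand_range(source)
    targets = target ? expand_range(target) : sources
    raise ArgumentError, "range sizes differ in #{entry.strip}" unless sources.size == targets.size
    sources.zip(targets) { |from, to| mapping[from] = to }
  end
end

# Apply the table to a string; characters missing from the table become
# spaces, since Sphinx treats them as word separators.
def fold(text, mapping)
  text.each_char.map { |char|
    mapped = mapping[char.ord]
    mapped ? mapped.chr(Encoding::UTF_8) : " "
  }.join
end

TABLE = "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F"
mapping = parse_charset_table(TABLE)
puts mapping.size   # prints 127
puts fold("Thinking Сфинкс!", mapping)
```

In this table, A..Z->a..z folds uppercase Latin letters to lowercase, and the U+410..U+42F->U+430..U+44F entry does the same for the basic Cyrillic alphabet.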
Large Result Sets
To keep searching fast, Sphinx has a default limit of 1000 records being available via pagination, even if there are more matches than that. The reasons for this limit are discussed in the Sphinx documentation.
However, you can change this value. Firstly, in your config/sphinx.yml file, you need to set max_matches to your upper limit:
    development:
      max_matches: 10000
    # ... repeat for other environments
Don’t forget to rebuild your Sphinx indexes so the daemon is aware of the change.
And you also need to specify it in your searches (Sphinx doesn’t assume you want the higher number by default):
    Article.search 'pancakes', :max_matches => 10_000
This does not mean you will get 10,000 results returned in one request, but you can paginate up to the ten-thousandth result. If you want them all at once (which will be slow, because you’re asking Rails to instantiate 10,000 records), use the :per_page option as well:
    Article.search 'pancakes', :max_matches => 10_000, :per_page => 10_000
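The relationship between max_matches, the page size and how far you can paginate is simple arithmetic, sketched below. The helper reachable_pages is hypothetical, and the figures (25,000 matching records, 20 results per page) are assumptions picked for illustration.

```ruby
# Number of pages reachable through pagination, given the total number of
# matching records, the max_matches ceiling and the page size.
def reachable_pages(total_matches, max_matches, per_page)
  available = [total_matches, max_matches].min
  (available + per_page - 1) / per_page   # integer ceiling division
end

puts reachable_pages(25_000, 1_000, 20)       # prints 50
puts reachable_pages(25_000, 10_000, 20)      # prints 500
puts reachable_pages(25_000, 10_000, 10_000)  # prints 1
```

With the default limit, only the first 1,000 of the 25,000 matches can be reached. Raising max_matches extends that to 10,000, and matching :per_page to it returns them in a single page.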
Indexed Models
While not related to Sphinx, this setting is to provide faster loading of the indexed models by Thinking Sphinx. Normally, Thinking Sphinx has to load all models to determine which ones are indexed. This is not ideal, so if you like, you can explicitly list the relevant models in your config/sphinx.yml file:
    development:
      indexed_models:
        - Article
        - Company
        - User
Given that a standard production environment does not re-initialize the app on every request, this is only useful in development. And make sure you remember to update it if you add index definitions to models!
Word Forms, Exceptions, and Stop Words
To configure Thinking Sphinx for any of these features, simply specify the path to the appropriate file in your config/sphinx.yml file:
    development:
      wordforms: "/full/path/to/wordforms.txt"
      exceptions: "/full/path/to/exceptions.txt"
      stopwords: "/full/path/to/stopwords.txt"
    # ... repeat for other environments
For full details on what these features actually do, please refer to the Sphinx documentation.