Rewriting the full-text indexer

The current RDBMS indexer is problematic if you want to import a lot of issues. I've almost finished an import from SourceForge into a Roundup PostgreSQL database, and the __words table now has 2,237,103 rows. Richard already added a new index in 0.7.7 to speed up adding text, but hey, it's more than 2 million rows! That's problematic, new index or no new index.

So, what's my alternative? I suggest we use MySQL's and PostgreSQL's built-in full-text indexers, and fall back to our current scheme if those aren't available. I have many ambitions in life, but writing full-text indexers is not one of them. Also, I don't think we could get it anywhere near as fast as the MySQL/PostgreSQL hackers can.
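A minimal sketch of that fallback, to make the idea concrete. All names here (GenericIndexer, NativeFullTextIndexer, choose_indexer, native_available) are hypothetical and are not Roundup's actual API; the probe that decides whether the database's own full-text support is usable is assumed to be supplied by the backend.

```python
# Hypothetical names throughout: none of these are Roundup's real classes.

class GenericIndexer:
    """Stand-in for the existing word-table indexer (the fallback)."""
    name = "generic"


class NativeFullTextIndexer:
    """Stand-in for an indexer that delegates to the database's own
    full-text search."""
    name = "native"


def choose_indexer(backend, native_available):
    """Pick the native indexer when the backend offers one, else fall back.

    backend: backend name, e.g. "postgresql" or "mysql".
    native_available: callable taking the backend name and returning True
        when the database's built-in full-text indexing can be used
        (assumed to be provided by the backend).
    """
    if backend in ("mysql", "postgresql") and native_available(backend):
        return NativeFullTextIndexer()
    # Any other backend, or a database without full-text support enabled,
    # keeps the current scheme.
    return GenericIndexer()
```

With this shape, a tracker whose database lacks the full-text extension keeps working exactly as it does today, and only installations that have it enabled get the faster path.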

That would mean we should explicitly state that full-text search won't behave the same as the rest of the searches. This is already the case, as http://sourceforge.net/tracker/index.php?func=detail&aid=724708&group_id=31577&atid=402788 illustrates, but I'd like it to be explicit.


comments:

full-text indexing --richard, Wed, 13 Oct 2004 10:07:23 +1000
The indexer is abstracted out and we could replace it wholly in the backends that provide their own full-text indexing. I'd be perfectly happy to do so, BTW, I just never found the time to look into it.

As for the searches being slightly different - I really don't care - it's not like we're going to have people comparing different backends' full-text searching :)

Indexer goals and non-goals:

richard: I'm leery of asking users to install more software...

johannes: That's why I don't want to get rid of the old-style full-text indexer, but only provide tsearch2 as an option. That's the way Roundup usually does it: installation is very simple, but if you need more scalability, you can add a new backend/mod_python/tsearch2.

Note, by the way, that tsearch2 is part of the PostgreSQL distribution; it just isn't enabled by default. If we choose to use tsearch2, I'll provide installation instructions for source distributions and Debian (apt-get install postgresql-contrib ;).

Indexer internal design issues:

johannes: I would choose the first option, if only because I have a clue as to how to move text into the database, and don't have a clue how to extend tsearch2.

richard: hmmm, I can't recall why it was added.

richard: yes, the current way indexer_rdbms extends indexer_dbms is wrong.

johannes: I think this is the full proposal. I'll implement it if you want to see it in Roundup 0.8.

richard: I think it's worthwhile.

Modifications to standard indexer

These may not be needed if we can hook in the built-in indexing in MySQL and PostgreSQL, but it might be worth investigating an adaptive stop-words list (anything over 1000 or so entries) and using a stemmer (http://www.tartarus.org/~martin/PorterStemmer/) ... though the latter will only help English, and may screw up other languages.
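One possible reading of the adaptive stop-words idea, sketched in Python: any word that shows up in more than some threshold of indexed texts is treated as a stop word and no longer indexed. The function names, the whitespace tokenizer, and the interpretation of "entries" as "number of texts containing the word" are all assumptions made for illustration; the existing indexer may count and tokenize differently.

```python
from collections import Counter


def adaptive_stop_words(texts, threshold=1000):
    """Return the set of words that occur in more than `threshold` texts.

    Assumption: a word counts once per text, however often it is repeated
    inside that text. Tokenizing is a plain lowercase whitespace split.
    """
    doc_freq = Counter()
    for text in texts:
        # set() so a word repeated within one text is counted only once
        doc_freq.update(set(text.lower().split()))
    return {word for word, count in doc_freq.items() if count > threshold}


def words_to_index(text, stop_words):
    """Return the words of `text` that should still go into the index."""
    return [w for w in text.lower().split() if w not in stop_words]
```

For example, with a threshold of 3, a word appearing in four messages would be dropped from the index while a word appearing in three would stay. The stemmer is left out of this sketch.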