Skip to:
Content
Pages
Categories
Search
Top
Bottom

Fulltext searching very slow


  • Mark Barnes
    Member

    @mark8barnes

    I have a large forum, and full-text searching is very slow. Looking at the slow queries log, 30-45 seconds seems about ‘normal’. Any suggestions about what could be done? tina.nl (which is bigger than mine has an almost instant search response)

Viewing 20 replies - 1 through 20 (of 20 total)

  • Mark Barnes
    Member

    @mark8barnes

    On a related note, I realised that my forum home page was taking a long time to load. My queries log showed that getting the recent posts query was taking 3 seconds to run. (Naturally enough if you have tens of thousands of topics.) An index on topic_sticky on {$bbdp->prefix}topics solved that problem.

    Even worse, going into the admin pages would take anything upto 100 seconds, I think just displaying the “Recently Moderated” panel. An index on post_status on {$bbdp->prefix}posts was needed there.

    Perhaps the devs should think about those two indexes as standard?

    My fulltext problem remains, however.


    _ck_
    Participant

    @_ck_

    I’m curious about this issue as performance problems always intrigue me. You must be using 0.9 as 1.0 has an index on stickies by default.

    As far as fulltext search there is a trick I use because of the multiple issues with fulltext (not only speed but fulltext can’t do words less than 4 characters until you customize and rebuild mysql). The trick is to use regex and do a two pass query where you first exclude all the posts without the words and then allow mysql to do a regular scan of the remaining posts.

    Query example from my Super-Search plugin:

    WHERE post_text  LIKE '%".$term."%'" AND post_text  REGEXP ':<:".$term.":>:[^']' "

    Compare the performance of that against a fulltext search that uses "MATCH post_text AGAINST $term" I don’t have enough data to do a huge benchmark but some simple tests with the cache off shows 0.4 seconds for the trick and 0.9 for the fulltext.

    The only downside to the trick is you cannot do partial word searches that way. ie. $term="cat" will only return posts with the exact word “cat” and not “cats” or “category”. But it should be way faster.

    Searching a huge number of posts is a non-trivial problem. It’s been known to crush other forums like vbulletin which has fancy code to prevent users from searching too often/too quickly and even disable search temporarily on high server load. Sites like Wikipedia have to go through several technically complex tricks to keep the search fast on that much data.

    Many large sites end up using sphinxsearch to replace fulltext search. You could interface it to bbPress via their PHP api.

    You can read more workarounds on the mysql fulltext search page (with far more knowledgeable people than me) http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html


    Mark Barnes
    Member

    @mark8barnes

    I tested this on a very large database under very low load. The standard bbpress search creates two queries, namely:

    SELECT p.*, 0 AS search_score, MAX(post_time) AS post_time FROM bb_posts AS p JOIN bb_topics as t ON ( t.topic_id = p.topic_id ) WHERE p.post_text LIKE '%test%' AND p.post_status = '0' AND t.topic_status = '0' GROUP BY t.topic_id ORDER BY p.post_time DESC LIMIT 5;

    and

    SELECT t.*, MIN(p.post_id) as post_id, GROUP_CONCAT(p.post_text SEPARATOR ' ') AS post_text, 0 AS search_score FROM bb_topics AS t JOIN bb_posts as p ON ( t.topic_id = p.topic_id ) WHERE t.topic_status = '0' AND p.post_status = '0' AND ( (t.topic_title LIKE '%test%') OR p.post_text LIKE '%test%' ) GROUP BY t.topic_id ORDER BY t.topic_time DESC LIMIT 30

    I ran both queries five times with different search terms. On my very large database, with no optimisation, the first query took (167s, 104s, 36s, 29.7s, 24s), the second (33s, 21s, 22s, 20s, 21s).


    Mark Barnes
    Member

    @mark8barnes

    In other words, bbpress is not using MATCH … AGAINST at all, it’s using the horribly slow LIKE %…%

    You couldn’t tell me the hook for manipulating the search facility, could you?


    _ck_
    Participant

    @_ck_

    Actually, the best thing to do is completely replace the search facility. That’s what I do in Super Search. bbPress’s search is very weak, so weak that there wasn’t even a link to it on any page in 0.9 (it’s similar to WordPress’s search, which also sucks, it’s the ugly truth no one seems to talk about).

    I assume you are using the bb-benchmark plugin to watch those queries happen (if not, you should be).

    Stupid question but you DO have the mysql cache turned on? I only ask because on many server configs (like CPANEL) it’s turned off by default. What does your my.cnf look like? (do a cat /etc/my.cnf in your shell)

    Try going into phpmyadmin (or command line) and test that first test query against adding AND post_text REGEXP ':<:%test%:>:' to the query like so:

    SELECT p.*, 0 AS search_score, MAX(post_time) AS post_time FROM bb_posts AS p JOIN bb_topics as t ON ( t.topic_id = p.topic_id ) WHERE p.post_text LIKE '%test%' AND p.post_text REGEXP ':<:%test%:>:' AND p.post_status = '0' AND t.topic_status = '0' GROUP BY t.topic_id ORDER BY p.post_time DESC LIMIT 5;

    I suspect in the end due to your huge db size you are going to need to completely replace the search functions with something like this http://sphinxsearch.com which has a PHP api. Fortunately in bbPress it’s very simple to hook the internal search and bypass it entirely without even template hacks. A quick google shows that a few WordPress plugins have sphinxsearch support so that would be easy to copy over to bbPress. If you find the keyword “sphinxsearch” on this source for example, you can see how it’s done: http://svn.scriblio.net/plugin/trunk/scriblio.php


    Sam Bauers
    Participant

    @sambauers

    @_ck_

    If you want we can work together on improving the existing search to make it better. We should be offering a “super” search in core.


    _ck_
    Participant

    @_ck_

    Super-Search is currently over 20k and only about halfway done in what it needs to do. All the bbPress core needs is a proper way to and/or words and disable partial matching instead of defaulting to just searching for every single fraction of each word.


    Mark Barnes
    Member

    @mark8barnes

    @sambauers:

    It seems to me that the most important thing that you can do very easily is to actually utilise the full-text index by using MATCH AGAINST rather than LIKE %…%. That’s an easy win. This also makes relevancy searching possible.

    The next thing would be to ensure that post_status is updated when topic_status is changed. That way, you could ignore topic_status when searching posts, and lose the join.


    _ck_
    Participant

    @_ck_

    But keeping a full-text index can get rather huge, which is probably why bbPress doesn’t use it by default.

    BTW, MySQL 5.1 apparently has some nice improvements in fulltext search.

    If you have your own dedicated or vps server you can tinker with some mysql settings to speed things up. (You never did answer if you have mysql cache turned on)


    _ck_
    Participant

    @_ck_

    Oh and looking at a MySQL performance guide for text search, apparently bbPress’s search is using ALL of the slowdown triggers:

    order by

    group by

    and worst of all SQL_CALC_FOUND_ROWS

    How long does this query take without the modifiers:

    SELECT * FROM bb_posts WHERE post_text LIKE '%test%' LIMIT 5;

    because that’s as fast as it can possibly get without an index and should be used as a baseline.

    Looking at the original query, I believe they are causing a FULL scan of the entire table to rank and group it first. That’s going to be insanely slow.

    You might be able to just use LIKE with a simple sort by post date DESC.

    The largest bbPress install I have access to has nearly 100,000 rows and still only takes 0.04 seconds for that trimmed query. The original queries you posted take 0.5 seconds so that’s 10x slower.


    Mark Barnes
    Member

    @mark8barnes

    @_ck_:

    I currently have:

    have_query_cache: YES

    query_cache_limit 1048576

    query_cache_min_res_unit 4096

    query_cache_size 16777216

    query_cache_type ON

    query_cache_wlock_invalidateOFF

    though I’ve got plenty of RAM to tweak this. I don’t want to just throw RAM at poorly optimised queries though.

    The simple query you mentioned takes 2 seconds for me (uncached). MATCH AGAINST with a full-text index takes 0.3 seconds. Keeping a full-text index only increased the database size by 30% in my case. That’s well worth while in the days where disk space is so cheap. The regular expression executes even quicker (0.2s), but as you say does not do partial words.

    There’s no reason why the core can’t have an option for full-text in the settings page, with two different search methods depending on what is chosen. It would only take 20 or so lines of code.


    _ck_
    Participant

    @_ck_

    Remember, the problem with MATCH AGAINST is it will not do partial words or words under 4 characters.

    It’s not too hard to replace the search, you just have to decide which way you want to go. The regular expression will at least do 3 character words which I find is more common than you’d imagine.

    The problem is that the time for any way without an index is going to increase dramatically once you start adding sorting and other options that cause full table scans. You can see this happen if you try to add a simple option to the regex demo like sorting by reverse post_id (which is a trick that should be a little faster than sorting by date).

    SELECT * FROM bb_posts WHERE post_text LIKE '%test%' LIMIT 5 ORDER BY post_id DESC;

    and

    SELECT * FROM bb_posts WHERE post_text LIKE '%test%' AND REGEXP ':<:%test%:>:' LIMIT 5 ORDER BY post_id DESC;

    You might want to test a worse case scenario by using three character nonsense words that will cause a cache-miss like “zzz” and “aaa”. Change them each time so mysql cannot cache the results and give you faster times.

    If the above example returns in an acceptable amount of time you can just replace bbpress’s built in search with that simple method. By parsing a query you can also AND words together instead of bbPress’s default OR which to me is incredibly annoying and useless (you’ll notice no major search engine like Yahoo or Google does OR by default).


    Mark Barnes
    Member

    @mark8barnes

    @sambauers: Let me just give you a couple of observations. Firstly, regarding the first search performed.

    SELECT p.*, 0 AS search_score, MAX(post_time) AS post_time FROM bb_posts AS p JOIN bb_topics as t ON ( t.topic_id = p.topic_id ) WHERE p.post_text LIKE '%test%' AND p.post_status = '0' AND t.topic_status = '0' GROUP BY t.topic_id ORDER BY p.post_time DESC LIMIT 5;

    The first thing that can be done here is remove the join and replace it with a sub-query. This way the sub-query can be cached, and speed is much improved. Here’s the idential query without the join:

    SELECT p.*, 0 AS search_score, MAX(post_time) AS post_time FROM bb_posts AS p WHERE p.post_text LIKE '%test%' AND p.post_status = '0' AND NOT p.topic_id IN (SELECT t.topic_id FROM bb_topics AS t WHERE t.topic_status <> '0') GROUP BY p.topic_id ORDER BY p.post_time DESC LIMIT 5

    That cuts execution time by 55% on my setup. If you have lots of deleted topics this might not apply, but if you have a huge database you ought to be cleaning the deletes out of it regularly anyway.

    _ck_ is also right that ordering by post_id is quicker than by post_time. 17% quicker in my case.


    Mark Barnes
    Member

    @mark8barnes

    Full text search is of course still quicker and getting rid of the GROUP BY helps even further. 66% quicker in my case. I’m going to use the following query as my search.

    SELECT p . * , MATCH (p.post_text) AGAINST ('test') AS
    search_score FROM bb_posts AS p WHERE MATCH
    (p.post_text) AGAINST ('test') AND p.post_status = '0'
    AND NOT p.topic_id IN (SELECT t.topic_id FROM bb_topics
    AS t WHERE t.topic_status <> '0') ORDER BY search_score
    DESC LIMIT 30;

    The downside is that it doesn’t include topic titles in the search. You can solve this adding this search to a temporary table, running another search on topic_title, and merging the two tables together. You need quite a bit of PHP code to manage this (particularly getting the NEXT buttons to work correctly), but it’s quite possible.


    _ck_
    Participant

    @_ck_

    Off the top of my head this looks like a very bad idea performance-wise:

    AND NOT p.topic_id IN (SELECT t.topic_id FROM bb_topics
    AS t WHERE t.topic_status <> '0')

    Though mysql may optimize it on the fly. Technically there could be thousands of deleted topics on an active site which makes quite a few items to search for (though in your case, there are NO deleted topics, yet, so this won’t be problem early on). However I think topic_status has an index? If not, it’s even worse.

    Alternatively you could “over-search” the results (ie. ask for 100 instead of 30) and reverse the check of the 100 topic_id’s to make sure topic_status=0.

    Since you’d end up using two queries, you could merge the two needs, to check titles AND to check topic status for the previously returned list of posts.

    You’re dead right about pagination becoming insanely complicated in this case. This could be solved by limiting result lengths to just 100 items max, and then use my “over-query” idea to ask for 200 just incase it hits alot of deletions.


    Mark Barnes
    Member

    @mark8barnes

    @_ck_

    Whilst “technically” there may be thousands of deleted/spam topics, most administrators of large databases would regularly prune them. (I’m writing a plugin to solve that, now cron is in 1.0alpha 2 – should be released in October.)


    _ck_
    Participant

    @_ck_

    Just remember that the cron in bbPress/WordPress is fake and not parallel tasking. Meaning that the user waits while the task is completed when the task is triggered. Unless they are using url fetching via php to do a pseudo multi-task, which I doubt.

    Sometimes (most times) a real cron is better. I asked them way-back-when cron was first introduced in WordPress to allow a real cron job to be substituted but still work with the API and I don’t believe they’ve ever done that.


    Mark Barnes
    Member

    @mark8barnes

    The blog says, “To enable cron I’ve included the very new WP_Http class in BackPress.” This class is documented here: http://codex.wordpress.org/Version_2.7#HTTP_API which I read as providing exactly what you suggested – URL fetching to do a pseudo multi-task.


    _ck_
    Participant

    @_ck_

    Ah, now I remember when they changed/fixed that in WordPress. Because my logs became full of the cron entries with WordPress pinging back to itself through the server. I think it polls like every few minutes too which was annoying. Hopefully that’s changed too or it will have to be deleted asap (along with xmlrpc).

    1.0 is going to be so much more bloated over 0.9

    Snoopy/http class alone is 40k.


    Sam Bauers
    Participant

    @sambauers

    We aren’t including Snoopy, only WP_Http (which is only around 24k of code if you discount the comments)

    The cron job is triggered by pageviews and then only if there are jobs to run.

    So on a visit to the site, the cron job list is checked (just an option in the meta table) then if there is something there to do, bbPress sends an HTTP request to itself, which the user doesn’t need to wait for.

Viewing 20 replies - 1 through 20 (of 20 total)

You must be logged in to reply to this topic.