koha-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Koha-devel] DB design (MARC structure)


From: Paul POULAIN
Subject: Re: [Koha-devel] DB design (MARC structure)
Date: Mon Jun 21 02:51:02 2004
User-agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.6) Gecko/20040115

Joshua Ferraro a écrit :

Paul a écrit,

I'm still showing some problems with the searching speed with the latest
CVS ... maybe there are still some index problems. I tried an author search on "o'brian, patrick" and the result took about 10-15 seconds to return. Directly from mysql I get:

mysql> select distinct m1.bibid from biblio,biblioitems,marc_biblio,marc_word 
as m1,marc_word as m2,marc_word as m3 where 
biblio.biblionumber=marc_biblio.biblionumber and 
biblio.biblionumber=biblioitems.biblionumber and m1.bibid=marc_biblio.bibid and 
(m1.bibid=m2.bibid and m1.bibid=m3.bibid) and ((m1.word  like 'o%' and 
m1.tagsubfield in ('100a','110a', '700a', '710a'))and (m2.word like 'brian%' and 
m2.tagsubfield in('100a','110a', '700a', '710a'))and (m3.word like 'patrick%' and 
m3.tagsubfield in('100a','110a', '700a', '710a'))) order by biblio.title;

77 rows in set (5.34 sec)

here's the explain on that query:

mysql> explain select distinct m1.bibid from 
biblio,biblioitems,marc_biblio,marc_word as m1,marc_word as m2,marc_word as m3 
where biblio.biblionumber=marc_biblio.biblionumber and 
biblio.biblionumber=biblioitems.biblionumber and m1.bibid=marc_biblio.bibid and 
(m1.bibid=m2.bibid and m1.bibid=m3.bibid) and ((m1.word  like 'o%' and 
m1.tagsubfield in ('100a','110a', '700a', '710a'))and (m2.word like 'brian%' and 
m2.tagsubfield in('100a','110a', '700a', '710a'))and (m3.word like 'patrick%' and 
m3.tagsubfield in('100a','110a', '700a', '710a'))) order by biblio.title;
+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
| table       | type   | possible_keys          | key         | key_len | ref   
                   | rows | Extra                                        |
+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
| m2          | range  | bibid,word,Search_Marc | Search_Marc |     259 | NULL  
                   |  366 | Using where; Using temporary; Using filesort |
| m1          | ref    | bibid,word,Search_Marc | bibid       |       8 | 
m2.bibid                 |   56 | Using where                                  |
| marc_biblio | eq_ref | PRIMARY,biblionumber   | PRIMARY     |       8 | 
m1.bibid                 |    1 | Using where; Distinct                        |
| biblio      | eq_ref | PRIMARY,blbnoidx       | PRIMARY     |       4 | 
marc_biblio.biblionumber |    1 | Distinct                                     |
| biblioitems | ref    | bibnoidx               | bibnoidx    |       4 | 
biblio.biblionumber      |   12 | Using index; Distinct                        |
| m3          | ref    | bibid,word,Search_Marc | bibid       |       8 | 
m1.bibid                 |   20 | Using where; Distinct                        |
+-------------+--------+------------------------+-------------+---------+--------------------------+------+-----------------------------------------------+
6 rows in set (0.04 sec)

mysql>

So it still looks like we're using temp and filesort--which I assume is
causing the hangup ... if that is the best we can do we may need to start
thinking about breaking up marc_word into sections (e.g., marc_word_title;
marc_word_author, etc.).  The search is really accurate but just too slow
for a production database as large as ours.  What do folks think, would that
speed things up?
I think it's too complex to code. The DB would need a big rewritte & multi-marc support would be a pain.
Some ideas to continue speeding things :
* analyse the table : http://www.databasejournal.com/features/mysql/article.php/10897_1382791_3 * is your my.cnf correctly set ? (ie with big caches : "Eliminating the filesort to speed things up is best done by calculating how big your result set of this operation can become and then increase the server variable sort_buffer_size which is the maximum size that mysqld will keep in memory before using a filesort instead. Note that if your resultset is extremely big then you might consume more memory than is advisable and then things might slow down because of swaping" hint given here : http://forums.devshed.com/showthread.php?t=149288. Setting sort_buffer_size=16M seems to make my harddisk silent ;-) ) * change the index to make it "unique". The problem being to change the table definition (with a lot of values in it, is will be quite hard). Not sure of the speed improvement
* move temporary index to RAM (how ?)

* if we add I "limit 0,200" to the query, things are faster. Maybe we could add a "0,50" or some systempref value. * how many lines have you in your marc_word table ? mine is 9 000 000, and answers are faster than for you (around 5seconds), except when the result is >200 entries. * do you have in marc_word only tags that are interesting (= did you discard, for example, 0xx tags). I think yes, even if, for instance, it's not a standard Koha hack ;-) * multi-word is slower than single word (in your case o'brian is a 2 word seach). (note I'm still thinking it's a not a good idea to index 1 letter words). * An idea could be to split marc_word into 10 tables, one for each nXX tag (0xx, 1xx, 2xx...) I'm not sure we had a big improvement here, because for some searches, we could have to query a lot of differents tables. another problem is that some tables would be almost empty (like 0xx), and some would be huge (like 7xx probably). * Remove ordering of the result. It will remove the "using filesort", which is bad. which improvement does it give Joshua ? maybe we could order the result with a perl script AFTER building the result (I'm not sure it would be a good idea, as it's not compatible with any limit clause (& supposes to retrieve all the result list)

HTH

--
Paul POULAIN
Consultant indépendant en logiciels libres
responsable francophone de koha (SIGB libre http://www.koha-fr.org)




reply via email to

[Prev in Thread] Current Thread [Next in Thread]