Roundup Tracker - Issues

Issue 2550583

classification
Title: xapian search yields too few results
Type: behavior Severity: normal
Components: Web interface Versions: 1.4
process
Status: new Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ThomasAH, ber, bruce, jvstein, olly, wolever
Priority: Keywords: patch

Created on 2009-09-01 09:54 by ThomasAH, last changed 2010-10-25 00:43 by bruce.

Files
File name Uploaded Description
patch.diff jvstein, 2010-03-08 04:33 Patch to change Xapian indexer to use lowercase
patch.diff jvstein, 2010-03-08 04:44
Messages
msg3863 Author: [hidden] (ThomasAH) Date: 2009-09-01 09:54
When using Xapian for full text search, various problems show:
- only a few hits are found
  indexer_xapian.py uses:
    matches = enquire.get_mset(0, 10)
  so at most 10 results are found.
  Additionally when looking at the Xapian API docs, I think the
  "checkatleast" parameter should be used:

  | the minimum number of items to check. Because the matcher optimises,
  | it won't consider every document which might match, so the total
  | number of matches is estimated. Setting checkatleast forces it to
  | consider at least this many matches and so allows for reliable paging
  | links.

  I read this as: search is unreliable if checkatleast is too low.

- it seems as if issue titles are not searched
  (though maybe this is just a symptom of above)

http://tracker.xemacs.org/XEmacs/its/issue501 describes this problem,
too, but I noticed it in our own installation.
msg3873 Author: [hidden] (olly) Date: 2009-09-10 09:41
No, checkatleast doesn't affect which matches are returned (or at least
if it does, that's a bug).  What it does is provide a way to improve the
accuracy of the estimated number of matches (in exchange for doing a bit
more work).

The wording isn't very clear - I'll improve it.

If it helps, there are some tips for debugging why matches you expect to
see aren't found here: http://trac.xapian.org/wiki/FAQ/NoMatches
msg4036 Author: [hidden] (jvstein) Date: 2010-03-08 04:33
I noticed that Xapian has some problems stemming uppercase strings.

>>> indexer = xapian.TermGenerator()
>>> stemmer = xapian.Stem("english")
>>> stemmer("SILENTLY")
'SILENTLi'
>>> stemmer("silently")
'silent'
>>> stemmer("organization")
'organ'
>>> stemmer("ORGANIZATION")
'ORGANIZATION'

This is probably contributing to the low number of search results. A patch is attached that
switches the index to lowercase.
msg4037 Author: [hidden] (jvstein) Date: 2010-03-08 04:44
Uploaded newer version of patch that doesn't break the stop word list.
msg4057 Author: [hidden] (ber) Date: 2010-05-10 13:36
Jeff, thanks for your patch and the idea.
I've tried to construct a test that shows that the stemming will
make searches break. However, as all words get uppercased before
going into xapian, searching for "silently" and "SILENTLY" gives
the same result, no matter if you index "SILENTLY" or "silently" 
because the word getting in the index and getting asked from it 
is "SILENTLi". So it matches okay.

Of course it does not match searches like "silent" because stemming
does not work. 

Could you create a test that fails if we do not switch to lowercase?
Does Xapian recognise this stemming issue? Or do they recommend 
switching to lowercase always?
msg4058 Author: [hidden] (jvstein) Date: 2010-05-10 22:15
Bernhard,

From what I understand, Roundup uses the Porter2 stemming algorithm exposed by Xapian.
    http://snowball.tartarus.org/algorithms/english/stemmer.html

The original Porter algorithm requires lowercase input. Take a look at some of the reference 
implementations here.
    http://tartarus.org/~martin/PorterStemmer/

The only Xapian reference I found was on their intro page and is hardly prescriptive 
(http://xapian.org/docs/intro_ir.html).
   "Usually they are converted to lower case, and often a stemming algorithm is applied"

The problem is that stemming doesn't work properly. "Silently" should stem to "silent", not 
"SILENTLi". A search for "silently" should return pages that contain the word "silent" and vice 
versa.

A simple test would be to index a document containing the word "silently" and ensure that a 
search on the term "silent" returns the same document.

--Jeff
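
The test described above can be sketched without Xapian, using a toy stemmer that, like the behaviour reported in this issue, only strips lowercase suffixes. `toy_stem` and `TinyIndex` are hypothetical names for illustration and are not Roundup or Xapian code; the real Porter2 stemmer does far more than strip "ly".

```python
from collections import defaultdict


def toy_stem(word):
    """Toy stand-in for a Porter-style stemmer (assumption, not Porter2).

    Only a lowercase "ly" suffix is stripped, so uppercase input passes
    through unstemmed.
    """
    return word[:-2] if word.endswith("ly") else word


class TinyIndex:
    """In-memory index mapping stemmed terms to document ids."""

    def __init__(self, normalize):
        # normalize is applied before stemming, e.g. str.upper or str.lower
        self.normalize = normalize
        self.postings = defaultdict(set)

    def _term(self, word):
        return toy_stem(self.normalize(word))

    def add(self, doc_id, text):
        for word in text.split():
            self.postings[self._term(word)].add(doc_id)

    def search(self, word):
        # .get avoids creating empty postings for unknown terms
        return set(self.postings.get(self._term(word), set()))
```

With `TinyIndex(str.upper)`, a search for "silent" misses a document containing "silently" while exact searches still match; with `TinyIndex(str.lower)`, both searches find it.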
msg4059 Author: [hidden] (ber) Date: 2010-05-11 07:47
Jeff, 
thanks for the pointers. I see your point about stemming not working:
Thus a search for "silent" will not match an indexed word "silently"
like it should when stemming is used. 

Maybe this issue needs to be clarified to come up with examples
that we could fix. I see two classes of problems:

a) exact matching of words does not happen like it should with 
Xapian. (I believe this is what Thomas wanted to report about.)

b) matching stemmed words does not work like it should with Xapian
and its stemmer.

My point is that a) is not affected by the upper case stemming defect
as this will happen with all words and thus the exact match works.
(I tested this.)

Maybe we should open another issue about b)?
What I am sure about is that we should also file this with Xapian,
I guess they should add documentation to their API reference that 
stemmers are only supposed to work with lower case, e.g. here
http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html#e4b7a74ac5bd468beb4e6c55d776fba0
Otherwise it would be a defect.
msg4064 Author: [hidden] (wolever) Date: 2010-06-25 18:57
+1 on this. I've been frustrated by this bug for a while, and I came to the same implementation 
when trying to fix it.

It would be really nice if this could get put into mainline.
msg4065 Author: [hidden] (wolever) Date: 2010-06-25 19:00
However, this particular implementation isn't quite perfect either, as it doesn't preserve 
capitalization (and Xapian won't stem words with a leading capital — see "proper names", here: 
http://xapian.org/docs/queryparser.html).
msg4066 Author: [hidden] (wolever) Date: 2010-06-25 19:27
Additionally, it doesn't seem like the problem of "at most 10 results are found" has been 
addressed.
I can confirm that changing:
    matches = enquire.get_mset(0, 10)

To, for example:
    matches = enquire.get_mset(0, 100)

Will return more results.
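
Raising the second argument only moves the cap. One way to avoid a fixed limit is to fetch in pages until a short page comes back. This is a minimal sketch, assuming a hypothetical `get_page(first, maxitems)` callable that mirrors the argument order of `enquire.get_mset`; the helper name `fetch_all` is also made up here and is not part of Roundup or Xapian.

```python
def fetch_all(get_page, page_size=100):
    """Collect every match by requesting pages until one comes back short.

    get_page(first, maxitems) is a hypothetical callable with the same
    argument order as enquire.get_mset(first, maxitems); it must return
    a list of at most maxitems matches starting at offset first.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    results = []
    first = 0
    while True:
        page = get_page(first, page_size)
        results.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return results
        first += page_size
```

With the real bindings this could presumably be driven by something like `lambda first, n: list(enquire.get_mset(first, n))`, assuming the returned MSet can be iterated; that part is untested here.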
msg4068 Author: [hidden] (ber) Date: 2010-06-28 08:41
Okay, I've split out Issue2550653 (xapian search, stemming is not 
working) now.

David, thanks for your remarks.
Can you be more explicit about which implementation you are suggesting?
I am willing to check stuff in, best would be to have a unit test first
that fails and then works with the fix.
I am also not quite sure about your comment in msg4065.
You are saying that capitals will not be preserved and that this
is correct?
History
Date User Action Args
2010-10-25 00:43:24 bruce set nosy: + bruce
2010-06-28 08:41:02 ber set messages: + msg4068
2010-06-25 19:27:47 wolever set messages: + msg4066
2010-06-25 19:00:12 wolever set messages: + msg4065
2010-06-25 18:57:42 wolever set nosy: + wolever; messages: + msg4064
2010-05-11 07:47:24 ber set messages: + msg4059
2010-05-10 22:15:48 jvstein set messages: + msg4058
2010-05-10 13:36:30 ber set messages: + msg4057
2010-03-08 04:44:48 jvstein set messages: + msg4037
2010-03-08 04:44:04 jvstein set files: + patch.diff
2010-03-08 04:33:33 jvstein set files: + patch.diff; keywords: + patch; messages: + msg4036; nosy: + jvstein
2009-09-11 12:42:17 ber set nosy: + ber
2009-09-10 09:41:58 olly set nosy: + olly; messages: + msg3873
2009-09-01 09:54:10 ThomasAH create