Roundup Tracker - Issues

Issue 2550583

classification
Title: xapian search yields too few results
Type: behavior Severity: normal
Components: Web interface Versions: 1.4
process
Status: fixed Resolution: fixed
Nosy List: ThomasAH, ber, bruce, jvstein, olly, wolever
Priority: high Keywords: patch

Created on 2009-09-01 09:54 by ThomasAH, last changed 2013-10-21 11:20 by ThomasAH.

Files
File name Uploaded Description
patch.diff jvstein, 2010-03-08 04:33 Patch to change Xapian indexer to use lowercase
patch.diff jvstein, 2010-03-08 04:44
Messages
msg3863 Author: [hidden] (ThomasAH) Date: 2009-09-01 09:54
When using Xapian for full text search, several problems show up:
- only a few hits are found
  indexer_xapian.py uses:
    matches = enquire.get_mset(0, 10)
  so at most 10 results are found.
  Additionally when looking at the Xapian API docs, I think the
  "checkatleast" parameter should be used:

  | the minimum number of items to check. Because the matcher optimises,
  | it won't consider every document which might match, so the total
  | number of matches is estimated. Setting checkatleast forces it to
  | consider at least this many matches and so allows for reliable paging
  | links.

  I read this as: search is unreliable if checkatleast is too low.

- it seems as if issue titles are not searched
  (though maybe this is just a symptom of the above)

http://tracker.xemacs.org/XEmacs/its/issue501 describes this problem,
too, but I noticed it in our own installation.
msg3873 Author: [hidden] (olly) Date: 2009-09-10 09:41
No, checkatleast doesn't affect which matches are returned (or at least
if it does, that's a bug).  What it does is provide a way to improve the
accuracy of the estimated number of matches (in exchange for doing a bit
more work).

The wording isn't very clear - I'll improve it.

If it helps, there are some tips for debugging why matches you expect to
see aren't found here: http://trac.xapian.org/wiki/FAQ/NoMatches
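
To illustrate the distinction drawn above, here is a minimal sketch. It assumes an `enquire` object following the call shape of Xapian's `Enquire.get_mset(first, maxitems, checkatleast)` and an MSet offering `get_matches_estimated()`. The helper name `page_with_estimate` and the `Fake*` classes are hypothetical and exist only so the snippet runs without Xapian installed; the way the stub computes its estimate is a toy model, not Xapian's behaviour.

```python
def page_with_estimate(enquire, first, page_size, check_at_least):
    """Fetch one page of matches plus the estimated total.

    check_at_least only asks the matcher to examine more candidates so
    that the estimate is tighter; the page itself is still bounded by
    page_size.
    """
    mset = enquire.get_mset(first, page_size, check_at_least)
    return list(mset), mset.get_matches_estimated()


# Hypothetical stand-ins so the sketch runs without Xapian installed.
class FakeMSet(list):
    def __init__(self, items, estimated):
        super().__init__(items)
        self._estimated = estimated

    def get_matches_estimated(self):
        return self._estimated


class FakeEnquire:
    def __init__(self, docids):
        self._docids = docids

    def get_mset(self, first, maxitems, checkatleast=0):
        page = self._docids[first:first + maxitems]
        # Toy model only: the estimate grows with the number of
        # documents examined. Real Xapian estimates work differently.
        examined = max(first + maxitems, checkatleast)
        return FakeMSet(page, min(len(self._docids), examined))
```

In this toy model, raising `check_at_least` changes the estimate but never the returned page, which is the property described above.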
msg4036 Author: [hidden] (jvstein) Date: 2010-03-08 04:33
I noticed that Xapian has some problems stemming uppercase strings.

>>> import xapian
>>> indexer = xapian.TermGenerator()
>>> stemmer = xapian.Stem("english")
>>> stemmer("SILENTLY")
'SILENTLi'
>>> stemmer("silently")
'silent'
>>> stemmer("organization")
'organ'
>>> stemmer("ORGANIZATION")
'ORGANIZATION'

This is probably contributing to the low number of search results. A patch is attached that
switches the index to lowercase.
msg4037 Author: [hidden] (jvstein) Date: 2010-03-08 04:44
Uploaded newer version of patch that doesn't break the stop word list.
msg4057 Author: [hidden] (ber) Date: 2010-05-10 13:36
Jeff, thanks for your patch and the idea.
I've tried to construct a test that shows that the stemming will
make searches break. However, as all words get uppercased before
going into Xapian, searching for "silently" and "SILENTLY" gives
the same result, no matter whether you index "SILENTLY" or "silently",
because the word going into the index and the word queried from it
is "SILENTLi". So it matches okay.

Of course it does not match searches like "silent" because stemming
does not work. 

Could you create a test that fails if we do not switch to lowercase?
Does Xapian recognise this stemming issue? Or do they recommend 
switching to lowercase always?
msg4058 Author: [hidden] (jvstein) Date: 2010-05-10 22:15
Bernhard,

From what I understand, Roundup uses the Porter2 stemming algorithm exposed by Xapian.
    http://snowball.tartarus.org/algorithms/english/stemmer.html

The original Porter algorithm requires lowercase input. Take a look at some of the reference 
implementations here.
    http://tartarus.org/~martin/PorterStemmer/

The only Xapian reference I found was on their intro page and is hardly prescriptive 
(http://xapian.org/docs/intro_ir.html).
   "Usually they are converted to lower case, and often a stemming algorithm is applied"

The problem is that stemming doesn't work properly. "Silently" should stem to "silent", not 
"SILENTLi". A search for "silently" should return pages that contain the word "silent" and vice 
versa.

A simple test would be to index a document containing the word "silently" and ensure that a 
search on the term "silent" returns the same document.

--Jeff
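
The test described above could look roughly like the following sketch. To stay runnable without Xapian, it uses a tiny in-memory index and a hypothetical `toy_stem` function that strips a lowercase "ly" suffix only. Both are stand-ins, not Roundup's indexer or the Porter2 algorithm, and a real test would go through Roundup's Xapian indexer instead.

```python
import unittest


def toy_stem(word):
    """Hypothetical stand-in for a stemmer that expects lowercase input."""
    if word.endswith("ly"):
        return word[:-2]
    return word


class ToyIndex:
    """Minimal inverted index; `lowercase` selects the case normalisation."""

    def __init__(self, lowercase):
        self.lowercase = lowercase
        self.postings = {}

    def _term(self, word):
        # Normalise case first, then stem, for both indexing and searching.
        word = word.lower() if self.lowercase else word.upper()
        return toy_stem(word)

    def add(self, docid, text):
        for word in text.split():
            self.postings.setdefault(self._term(word), set()).add(docid)

    def search(self, word):
        return self.postings.get(self._term(word), set())


class StemmingSearchTest(unittest.TestCase):
    def test_uppercase_terms_defeat_stemming(self):
        index = ToyIndex(lowercase=False)
        index.add("doc1", "it failed silently")
        # The exact word still matches, the stemmed form does not.
        self.assertEqual(index.search("silently"), {"doc1"})
        self.assertEqual(index.search("silent"), set())

    def test_lowercase_terms_allow_stemmed_match(self):
        index = ToyIndex(lowercase=True)
        index.add("doc1", "it failed silently")
        self.assertEqual(index.search("silent"), {"doc1"})
        self.assertEqual(index.search("SILENTLY"), {"doc1"})
```

The first test mirrors the behaviour reported in this thread (exact matches work, stemmed matches do not); the second is the one that should fail until terms are lowercased before stemming.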
msg4059 Author: [hidden] (ber) Date: 2010-05-11 07:47
Jeff, 
thanks for the pointers. I see your point about stemming not working:
Thus a search for "silent" will not match an indexed word "silently"
like it should when stemming is used. 

Maybe this issue needs to be clarified to come up with examples
that we could fix. I see two classes of problems:

a) exact matching of words does not happen like it should with 
Xapian. (I believe this is what Thomas wanted to report about.)

b) matching stemmed words does not work like it should with Xapian
and its stemmer.

My point is that a) is not affected by the upper case stemming defect
as this will happen with all words and thus the exact match works.
(I tested this.)

Maybe we should open another issue about b)?
What I am sure about is that we should also file this with Xapian.
I guess they should add documentation to their API reference that
stemmers are only supposed to work with lower case, e.g. here
http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html#e4b7a74ac5bd468beb4e6c55d776fba0
Otherwise it would be a defect.
msg4064 Author: [hidden] (wolever) Date: 2010-06-25 18:57
+1 on this. I've been frustrated by this bug for a while, and I came to the same implementation 
when trying to fix it.

It would be really nice if this could get put into mainline.
msg4065 Author: [hidden] (wolever) Date: 2010-06-25 19:00
However, this particular implementation isn't quite perfect either, as it doesn't preserve 
capitalization (and Xapian won't stem words with a leading capital — see "proper names", here: 
http://xapian.org/docs/queryparser.html).
msg4066 Author: [hidden] (wolever) Date: 2010-06-25 19:27
Additionally, it doesn't seem like the problem of "at most 10 results are found" has been 
addressed.
I can confirm that changing:
    matches = enquire.get_mset(0, 10)

To, for example:
    matches = enquire.get_mset(0, 100)

Does return more results.
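
Rather than picking another arbitrary cap such as 100, one option is to ask for as many matches as there are documents in the database. The sketch below assumes the database handle is available next to the enquire object and relies on Xapian's `Database.get_doccount()` and `Enquire.get_mset(first, maxitems)`. The name `fetch_all_matches` and the `Fake*` classes are hypothetical, included only so the snippet runs without Xapian; this is not necessarily how Roundup handles it.

```python
def fetch_all_matches(enquire, database):
    """Return every match instead of only the first ten."""
    # The document count is an upper bound on how many documents can match.
    limit = database.get_doccount()
    return enquire.get_mset(0, limit)


# Hypothetical stand-ins so the sketch runs without Xapian installed.
class FakeDatabase:
    def __init__(self, docids):
        self.docids = list(docids)

    def get_doccount(self):
        return len(self.docids)


class FakeEnquire:
    def __init__(self, database):
        self._database = database

    def get_mset(self, first, maxitems):
        # Toy model: every document in the database matches the query.
        return self._database.docids[first:first + maxitems]
```

With 25 matching documents, `get_mset(0, 10)` on the stub yields 10 entries while `fetch_all_matches` yields all 25.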
msg4068 Author: [hidden] (ber) Date: 2010-06-28 08:41
Okay, I've split out Issue2550653 (xapian search, stemming is not 
working) now.

David, thanks for your remarks.
Can you be more explicit about which implementation you are suggesting?
I am willing to check stuff in, best would be to have a unit test first
that fails and then works with the fix.
I am also not quite sure about your comment in msg4065.
You are saying that capitals will not be preserved and that this
is correct?
msg4612 Author: [hidden] (ber) Date: 2012-08-21 08:07
Hmm, Olly wrote that if changing matches = enquire.get_mset(0, 10)
to a higher number leads to more results, that is a defect in Xapian.
So I guess this needs a retest; best would be an automated test case
with specific Xapian versions.

And then we may need to file an issue with Xapian.
msg4940 Author: [hidden] (ThomasAH) Date: 2013-10-21 11:13
https://sourceforge.net/p/roundup/code/ci/3ff1a288fb9c
changeset:   4841:3ff1a288fb9c
tag:         tip
user:        Thomas Arendsen Hein <thomas@intevation.de>
date:        Mon Oct 21 12:56:28 2013 +0200
summary:     issue2550583, issue2550635 Do not limit results with Xapian indexer

(closing here, stemming is discussed in issue2550653)
History
Date User Action Args
2013-10-21 11:20:34 ThomasAH set status: new -> fixed
2013-10-21 11:13:34 ThomasAH set priority: high
resolution: fixed
messages: + msg4940
2013-10-21 11:10:32 ThomasAH link issue2550635 superseder
2012-08-21 08:07:45 ber set messages: + msg4612
2010-10-25 00:43:24 bruce set nosy: + bruce
2010-06-28 08:41:02 ber set messages: + msg4068
2010-06-25 19:27:47 wolever set messages: + msg4066
2010-06-25 19:00:12 wolever set messages: + msg4065
2010-06-25 18:57:42 wolever set nosy: + wolever
messages: + msg4064
2010-05-11 07:47:24 ber set messages: + msg4059
2010-05-10 22:15:48 jvstein set messages: + msg4058
2010-05-10 13:36:30 ber set messages: + msg4057
2010-03-08 04:44:48 jvstein set messages: + msg4037
2010-03-08 04:44:04 jvstein set files: + patch.diff
2010-03-08 04:33:33 jvstein set files: + patch.diff
keywords: + patch
messages: + msg4036
nosy: + jvstein
2009-09-11 12:42:17 ber set nosy: + ber
2009-09-10 09:41:58 olly set nosy: + olly
messages: + msg3873
2009-09-01 09:54:10 ThomasAH create