Issue 2550583
Created on 2009-09-01 09:54 by ThomasAH, last changed 2013-10-21 11:20 by ThomasAH.
File name | Uploaded | Description |
patch.diff | jvstein, 2010-03-08 04:33 | Patch to change Xapian indexer to use lowercase |
patch.diff | jvstein, 2010-03-08 04:44 | |
msg3863 |
Author: [hidden] (ThomasAH) |
Date: 2009-09-01 09:54 |
|
When using Xapian for full text search, various problems show:
- only a few hits are found
indexer_xapian.py uses:
matches = enquire.get_mset(0, 10)
so at most 10 results are found.
Additionally when looking at the Xapian API docs, I think the
"checkatleast" parameter should be used:
| the minimum number of items to check. Because the matcher optimises,
| it won't consider every document which might match, so the total
| number of matches is estimated. Setting checkatleast forces it to
| consider at least this many matches and so allows for reliable paging
| links.
I read this as: search is unreliable if checkatleast is too low.
- it seems as if issue titles are not searched
(though maybe this is just a symptom of the above)
http://tracker.xemacs.org/XEmacs/its/issue501 describes this problem,
too, but I noticed it in our own installation.
|
msg3873 |
Author: [hidden] (olly) |
Date: 2009-09-10 09:41 |
|
No, checkatleast doesn't affect which matches are returned (or at least
if it does, that's a bug). What it does is provide a way to improve the
accuracy of the estimated number of matches (in exchange for doing a bit
more work).
The wording isn't very clear - I'll improve it.
If it helps, there are some tips for debugging why matches you expect to
see aren't found here: http://trac.xapian.org/wiki/FAQ/NoMatches
|
msg4036 |
Author: [hidden] (jvstein) |
Date: 2010-03-08 04:33 |
|
I noticed that Xapian has some problems stemming uppercase strings.
>>> import xapian
>>> indexer = xapian.TermGenerator()
>>> stemmer = xapian.Stem("english")
>>> stemmer("SILENTLY")
'SILENTLi'
>>> stemmer("silently")
'silent'
>>> stemmer("organization")
'organ'
>>> stemmer("ORGANIZATION")
'ORGANIZATION'
This probably contributes to the low number of search results. A patch is attached that switches
the index to lowercase.
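The effect of normalizing case before stemming can be sketched without Xapian. In the sketch below, toy_stem is a hypothetical stand-in that, like the stemmer output quoted above, only recognizes lowercase suffixes; it is not the Porter2 algorithm, and index_terms is an illustrative name, not the function touched by the attached patch.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strips a lowercase 'ly' suffix only."""
    return word[:-2] if word.endswith("ly") and len(word) > 4 else word


def index_terms(text, stem=toy_stem, lowercase=True):
    """Split text into words and stem them, optionally lowercasing first."""
    words = re.findall(r"\w+", text)
    if lowercase:
        words = [w.lower() for w in words]
    return [stem(w) for w in words]


# Uppercase input is left unstemmed; lowercased input is reduced to its stem.
print(index_terms("SILENTLY", lowercase=False))  # ['SILENTLY']
print(index_terms("SILENTLY"))                   # ['silent']
```

With the real bindings, a xapian.Stem("english") object could presumably be passed as stem, since the session above calls it the same way.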
|
msg4037 |
Author: [hidden] (jvstein) |
Date: 2010-03-08 04:44 |
|
Uploaded newer version of patch that doesn't break the stop word list.
|
msg4057 |
Author: [hidden] (ber) |
Date: 2010-05-10 13:36 |
|
Jeff, thanks for your patch and the idea.
I've tried to construct a test that shows that the stemming will
make searches break. However, as all words get uppercased before
going into Xapian, searching for "silently" and "SILENTLY" gives
the same result, no matter whether you index "SILENTLY" or "silently",
because the word that goes into the index and the word that is looked
up are both "SILENTLi". So it matches okay.
Of course it does not match searches like "silent" because stemming
does not work.
Could you create a test that fails if we do not switch to lowercase?
Does Xapian recognise this stemming issue? Or do they recommend
switching to lowercase always?
|
msg4058 |
Author: [hidden] (jvstein) |
Date: 2010-05-10 22:15 |
|
Bernhard,
From what I understand, Roundup uses the Porter2 stemming algorithm exposed by Xapian.
http://snowball.tartarus.org/algorithms/english/stemmer.html
The original Porter algorithm requires lowercase input. Take a look at some of the reference
implementations here.
http://tartarus.org/~martin/PorterStemmer/
The only Xapian reference I found was on their intro page and is hardly prescriptive
(http://xapian.org/docs/intro_ir.html).
"Usually they are converted to lower case, and often a stemming algorithm is applied"
The problem is that stemming doesn't work properly. "Silently" should stem to "silent", not
"SILENTLi". A search for "silently" should return pages that contain the word "silent" and vice
versa.
A simple test would be to index a document containing the word "silently" and ensure that a
search on the term "silent" returns the same document.
--Jeff
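
The test described above could take roughly the following shape. This is a sketch under assumptions: the indexer is taken to expose add_text(identifier, text) and find(wordlist), with find returning the matching identifiers; assert_stemmed_match and the sample identifier are hypothetical names.

```python
def assert_stemmed_match(indexer, identifier=("issue", "1", "title")):
    """Index 'silently' and check that searching for 'silent' finds it.

    Assumes `indexer` offers add_text(identifier, text) and find(wordlist),
    with find() returning an iterable of matching identifiers.
    """
    indexer.add_text(identifier, "The process fails silently")
    hits = list(indexer.find(["silent"]))
    # With working stemming, 'silently' and 'silent' share the same term.
    assert identifier in hits, "stemmed search for 'silent' missed %r" % (identifier,)
    return hits
```

Such a check should fail against an indexer that uppercases words before stemming and pass once terms are lowercased first.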
|
msg4059 |
Author: [hidden] (ber) |
Date: 2010-05-11 07:47 |
|
Jeff,
thanks for the pointers. I see your point about stemming not working:
Thus a search for "silent" will not match an indexed word "silently"
like it should when stemming is used.
Maybe this issue needs to be clarified to come up with examples
that we could fix. I see two classes of problems:
a) exact matching of words does not happen like it should with
Xapian. (I believe this is what Thomas wanted to report about.)
b) matching stemmed words does not work like it should with Xapian
and its stemmer.
My point is that a) is not affected by the upper case stemming defect
as this will happen with all words and thus the exact match works.
(I tested this.)
Maybe we should open another issue about b)?
What I am sure about is that we should also file this with Xapian.
I guess they should add documentation to their API reference that
stemmers are only supposed to work with lower case, e.g. here
http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html#e4b7a74ac5bd468beb4e6c55d776fba0
Otherwise it would be a defect.
|
msg4064 |
Author: [hidden] (wolever) |
Date: 2010-06-25 18:57 |
|
+1 on this. I've been frustrated by this bug for a while, and I came to the same implementation
when trying to fix it.
It would be really nice if this could get put into mainline.
|
msg4065 |
Author: [hidden] (wolever) |
Date: 2010-06-25 19:00 |
|
However, this particular implementation isn't quite perfect either, as it doesn't preserve
capitalization (and Xapian won't stem words with a leading capital — see "proper names", here:
http://xapian.org/docs/queryparser.html).
|
msg4066 |
Author: [hidden] (wolever) |
Date: 2010-06-25 19:27 |
|
Additionally, it doesn't seem like the problem of "at most 10 results are found" has been
addressed.
I can confirm that changing:
matches = enquire.get_mset(0, 10)
To, for example:
matches = enquire.get_mset(0, 100)
Will return more results.
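The hard-coded limit can be avoided by paging until a short batch comes back. The following is a minimal sketch, not the change that was eventually committed; fetch_all_docids and page_size are hypothetical names, and the only Xapian behaviour assumed is Enquire.get_mset(first, maxitems) and the docid attribute of the items an MSet yields when iterated.

```python
def fetch_all_docids(enquire, page_size=100):
    """Collect the docids of every match by paging through the results.

    `enquire` is expected to behave like a configured xapian.Enquire:
    get_mset(first, maxitems) returns an iterable of items that each
    have a `docid` attribute. Illustrative helper, not Roundup code.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    docids = []
    offset = 0
    while True:
        batch = [item.docid for item in enquire.get_mset(offset, page_size)]
        docids.extend(batch)
        # A short (or empty) batch means there are no further matches.
        if len(batch) < page_size:
            return docids
        offset += page_size
```

An alternative that avoids the loop is to pass the database's document count (Database.get_doccount()) as maxitems, at the cost of requesting every match in one call.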
|
msg4068 |
Author: [hidden] (ber) |
Date: 2010-06-28 08:41 |
|
Okay, I've split out Issue2550653 (xapian search, stemming is not
working) now.
David, thanks for your remarks.
Can you be more explicit about which implementation you are suggesting?
I am willing to check stuff in, best would be to have a unit test first
that fails and then works with the fix.
I am also not quite sure about your comment in msg4065.
You are saying that capitals will not be preserved and that this
is correct?
|
msg4612 |
Author: [hidden] (ber) |
Date: 2012-08-21 08:07 |
|
Hmm, Olly wrote that if changing matches = enquire.get_mset(0, 10)
to a higher number leads to more results, that is a defect in Xapian.
So I guess this needs a retest; best would be an automatic test case
with specific Xapian versions.
And then we may need to file an issue with Xapian.
|
msg4940 |
Author: [hidden] (ThomasAH) |
Date: 2013-10-21 11:13 |
|
https://sourceforge.net/p/roundup/code/ci/3ff1a288fb9c
changeset: 4841:3ff1a288fb9c
tag: tip
user: Thomas Arendsen Hein <thomas@intevation.de>
date: Mon Oct 21 12:56:28 2013 +0200
summary: issue2550583, issue2550635 Do not limit results with Xapian
indexer
(closing here, stemming is discussed in issue2550653)
|
Date | User | Action | Args |
2013-10-21 11:20:34 | ThomasAH | set | status: new -> fixed |
2013-10-21 11:13:34 | ThomasAH | set | priority: high, resolution: fixed, messages: + msg4940 |
2013-10-21 11:10:32 | ThomasAH | link | issue2550635 superseder |
2012-08-21 08:07:45 | ber | set | messages: + msg4612 |
2010-10-25 00:43:24 | bruce | set | nosy: + bruce |
2010-06-28 08:41:02 | ber | set | messages: + msg4068 |
2010-06-25 19:27:47 | wolever | set | messages: + msg4066 |
2010-06-25 19:00:12 | wolever | set | messages: + msg4065 |
2010-06-25 18:57:42 | wolever | set | nosy: + wolever, messages: + msg4064 |
2010-05-11 07:47:24 | ber | set | messages: + msg4059 |
2010-05-10 22:15:48 | jvstein | set | messages: + msg4058 |
2010-05-10 13:36:30 | ber | set | messages: + msg4057 |
2010-03-08 04:44:48 | jvstein | set | messages: + msg4037 |
2010-03-08 04:44:04 | jvstein | set | files: + patch.diff |
2010-03-08 04:33:33 | jvstein | set | files: + patch.diff, keywords: + patch, messages: + msg4036, nosy: + jvstein |
2009-09-11 12:42:17 | ber | set | nosy: + ber |
2009-09-10 09:41:58 | olly | set | nosy: + olly, messages: + msg3873 |
2009-09-01 09:54:10 | ThomasAH | create | |