Issue 2550788
Created on 2013-01-16 07:04 by jerome, last changed 2022-11-25 04:46 by rouilj.
File name | Uploaded | Description |
indexer_xapian.patch | ollydbg, 2016-02-25 05:21 | patch to enable index chinese words by mmseg |
msg4755 |
Author: [hidden] (jerome) |
Date: 2013-01-16 07:04 |
|
Hi,
Roundup is a good issue tracking tool and it does what we need, thank
you all. Here is an issue with the search page.
The "All text" search filter does not support non-ASCII characters, but
the "title" filter does.
We can't force others to submit issues in English only, so we often
search for Chinese phrases. The title filter supports Chinese, but
"All text" does not.
Here are some Chinese words, just for testing:
你好
谢谢
欢迎
|
msg4756 |
Author: [hidden] (jerome) |
Date: 2013-01-16 07:09 |
|
I tested the search function on your site and it works well there.
Even so, could you give some hints on how to troubleshoot this issue?
Thanks!
|
msg4757 |
Author: [hidden] (ber) |
Date: 2013-01-16 09:27 |
|
Hi Jerome,
thanks for your report!
Let us try together to make it a reproducible report in the next steps;
this is a precondition for the developers to provide a solution.
Which version of roundup are you using? Platform? Backend and
Libraries? (This determines which search code is used.)
Can you try to give us some contents that should be searchable?
Maybe you could write a test case? (It is very easy; look for the search
and index tests in the "test" subdirectory.)
What is the behaviour you are getting? Some error message?
Regards,
Bernhard
|
msg4758 |
Author: [hidden] (jerome) |
Date: 2013-01-16 09:54 |
|
Hi ber,
thanks for your reply.
OS: ubuntu10.04
roundup 1.4.11-1ubuntu1 an issue-tracking system
db: sqlite
python2.6 2.6.5-1ubuntu6
> Can you try to give us some contents that should be searchable?
> Maybe you could write a test case? (It is very easy; look for the search
> and index tests in the "test" subdirectory.)
I could not understand what you said;
we just search for Chinese.
> What is the behaviour you are getting? Some error message?
No error information is displayed, there are just no results.
We have some Chinese phrases in the content, but searching for them in
"All text" doesn't work.
Thanks
|
msg4759 |
Author: [hidden] (ber) |
Date: 2013-01-16 10:55 |
|
Can you try a newer version of Roundup,
just to see whether the problem still occurs there?
That should be easy: just download it and try the demo tracker; see the
documentation or ask on the users list.
|
msg5131 |
Author: [hidden] (yanqian) |
Date: 2014-08-17 11:04 |
|
This is a Chinese sentence sample, just for test purpose.
大家好我来自中国。
I think I have found a way to reproduce the search bug: maybe I can't
search for a Chinese word from the sample sentence.
|
msg5132 |
Author: [hidden] (yanqian) |
Date: 2014-08-17 11:13 |
|
OK, it behaves as I expected.
These words can only be found when they are separated by spaces, which
is why I get results when I search for "你好", "谢谢" or "欢迎".
Usually we don't put unnecessary spaces in a normal sentence, so the
bug is that Roundup is not able to find a Chinese word inside a
sentence with the "All text" search.
But as I tested, it works well with the "title" search: it does
find a match when I search for a Chinese word from a sentence in the
title (without spaces).
So, how can we make the Roundup "All text" search behave in the same way
as the "Title" search?
Thanks!
|
msg5134 |
Author: [hidden] (ber) |
Date: 2014-09-01 09:57 |
|
Hi,
the all text search works by using an index.
The algorithm that creates the index first separates the words by spaces
and then puts them in the index. Maybe Chinese sentences need special
treatment because they do not separate words with whitespace.
To make life more difficult: there are several index "backends" that
roundup may be using. You could try using a different one to see if the
situation changes. E.g. try using xapian.
Best,
Bernhard
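
A minimal sketch of why whitespace splitting misses words inside a Chinese sentence. It is a simplified model of the behaviour described in this message, not Roundup's actual indexer code; the function name whitespace_terms is made up for illustration.

```python
def whitespace_terms(text):
    """Split text into index terms on whitespace only (simplified model)."""
    return set(text.split())


spaced = "你好 谢谢 欢迎"          # words separated by spaces
sentence = "大家好我来自中国。"     # a normal sentence without spaces

# A word separated by spaces becomes its own term and can be found.
print("你好" in whitespace_terms(spaced))      # True

# The whole sentence becomes one single term, so a word inside it
# never matches an exact term lookup.
print("中国" in whitespace_terms(sentence))    # False
```

In this model the only term stored for the sentence is the full string, which matches the report that words are found only when they are separated by spaces.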
|
msg5404 |
Author: [hidden] (ThomasAH) |
Date: 2016-01-06 08:24 |
|
Roundup 1.5 with Xapian index and Postgres is still affected by this issue.
|
msg5405 |
Author: [hidden] (ThomasAH) |
Date: 2016-01-06 08:28 |
|
but ... searching for the strings mentioned in this tracker (e.g. 大家好
我来自中国) finds this issue!
Adding some additional non-ascii text to test search here:
äöüßə
|
msg5406 |
Author: [hidden] (ThomasAH) |
Date: 2016-01-06 08:34 |
|
Searching for "äöüßə" works fine here, too, while it doesn't in our
installation (Changeset 55aef7ab35a8, after 1.5) and I don't see any
relevant changes after this changeset.
So what is the difference?
|
msg5407 |
Author: [hidden] (ThomasAH) |
Date: 2016-01-06 08:39 |
|
In an older installation with roundup 1.4.11 and Postgres (but no Xapian
index) this works fine, too.
|
msg5408 |
Author: [hidden] (ThomasAH) |
Date: 2016-01-06 08:42 |
|
and in an even older installation (Roundup 1.3.x with sqlite backend, no
Xapian) it works fine, too.
Maybe this just happens with Xapian enabled?
|
msg5409 |
Author: [hidden] (ThomasAH) |
Date: 2016-01-06 09:14 |
|
I have just disabled Xapian for our new trackers, now searching for
non-ASCII characters works perfectly.
Of course the search might be slower now and Xapian might be needed for
other reasons, so this is just a workaround.
(additionally the only way to disable using Xapian is uninstalling the
python bindings for it, or changing Roundup's code)
|
msg5464 |
Author: [hidden] (ollydbg) |
Date: 2016-02-24 06:36 |
|
Hi, guys, I'm not sure exactly what "non-ascii" issue you have. For
me, the need is just to handle Chinese chars (English words still need
to be handled, too). I use the mmseg 1.3.0 : Python Package Index -
https://pypi.python.org/pypi/mmseg/1.3.0 to parse all the text, and
add the terms to the xapian database. (Note that not only mmseg, but
also fxsjy/jieba: 结巴中文分词 - https://github.com/fxsjy/jieba should
work OK, because they just cut a long Chinese sentence into several
Chinese words)
Now, I can search the Chinese words correctly.
If you are interested to use mmseg, I can upload the patch against
roundup's source code.
|
msg5466 |
Author: [hidden] (ber) |
Date: 2016-02-24 15:50 |
|
Hi,
On Wednesday 24 February 2016 at 07:36:55, ollydbg wrote:
> If you are interested to use mmseg, I can upload the patch against
> roundup's source code.
it is always good to publish the code, even if it does not
get merged in the end.
Thanks! :)
|
msg5467 |
Author: [hidden] (ollydbg) |
Date: 2016-02-25 05:21 |
|
Hi, Bernhard Reiter, see the attached patch file; this patch is
against the roundup 1.5.1 release source.
I just extract the Chinese substrings from the whole text and split
them into several Chinese words with mmseg. Then I add a term for each
Chinese word. The whole text is still parsed by the default English
indexer, so that we can index all the Chinese and English words.
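
A rough sketch of the extraction step described in this paragraph, not the code from the patch. The regular expression (basic CJK Unified Ideographs block only) and the bigram stand-in segmenter are assumptions for illustration; the patch uses mmseg for segmentation, and the function names here are hypothetical.

```python
import re

# Assumption: the basic CJK Unified Ideographs block is enough for a sketch.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def bigram_segment(run):
    """Stand-in segmenter: overlapping two-character terms (not mmseg)."""
    if len(run) < 2:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]


def chinese_terms(text, segment=bigram_segment):
    """Collect extra index terms from every run of Chinese characters."""
    terms = []
    for run in CJK_RUN.findall(text):
        terms.extend(segment(run))
    return terms


# The English words would still go through the default indexer;
# only the Chinese run is segmented here.
print(chinese_terms("Search bug: 大家好我来自中国。 thanks"))
```

The segment parameter is there so that a real word segmenter such as mmseg or jieba could be passed in instead of the bigram stand-in.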
Please note that in the HTML file issue.item.html, the wrap option
(on line 89) should be
wrap="soft"
Otherwise, Chinese sentences will be split incorrectly by the added
line breaks. See my report here: [Roundup-users] hard line breaks were
automatically added between Chinese chars when I add a message -
https://sourceforge.net/p/roundup/mailman/message/34879502/
|
msg5469 |
Author: [hidden] (ber) |
Date: 2016-02-25 08:09 |
|
Ollydbg,
thanks for publishing the patch!
At least it will be useful to Chinese users of roundup!
Best Regards,
Bernhard
|
msg5819 |
Author: [hidden] (rouilj) |
Date: 2016-07-10 19:05 |
|
If we want to add this to the core it looks like I need to:
change the
    from mmseg.search import seg_txt_search,seg_txt_2_dict
to
    try:
        MissingMmseg=False
        from mmseg.search import seg_txt_search,seg_txt_2_dict
    except ImportError:
        MissingMmseg=True
and then wrap the second section in if not MissingMmseg
Then add to installation.doc the package info and why people
would want it.
Does anybody see anything else I would need to do to incorporate this patch?
Does anybody believe this patch should not go into core?
Note that I have no idea how to write a test for it, and the test would
be conditional on whether the mmseg module is installed.
-- rouilj
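
One possible shape for such a conditional test, as a sketch only. It reuses the guarded import from this message and the standard unittest skip decorators; the class name MmsegIndexingTest and both test methods are hypothetical and do not exist in Roundup's test suite.

```python
import unittest

try:
    # Import names as given in the patch.
    from mmseg.search import seg_txt_search, seg_txt_2_dict
    MissingMmseg = False
except ImportError:
    MissingMmseg = True


class MmsegIndexingTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(MissingMmseg, "mmseg is not installed")
    def test_segmenter_available(self):
        # Runs only when mmseg could be imported.
        self.assertTrue(callable(seg_txt_search))

    @unittest.skipUnless(MissingMmseg, "mmseg is installed")
    def test_flag_set_without_mmseg(self):
        # Runs only when mmseg is missing; the indexer would then skip
        # the Chinese-specific section.
        self.assertTrue(MissingMmseg)
```

Whichever way the environment is set up, exactly one of the two tests runs and the other is reported as skipped, so the suite passes both with and without mmseg.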
|
msg5847 |
Author: [hidden] (ollydbg) |
Date: 2016-07-14 01:51 |
|
Hi, rouilj, thanks, it would be great if the official roundup could
support Chinese language indexing.
If you would like to support indexing the Chinese words, I would
strongly consider another change, see my comments in msg5467 -
http://issues.roundup-tracker.org/msg5467
Without this change, the edit box will cut the long Chinese sentence badly.
|
msg5848 |
Author: [hidden] (rouilj) |
Date: 2016-07-14 02:06 |
|
Ollydbg said:
> If you would like to support indexing the Chinese words, I would
> strongly consider another change, see my comments in msg5467 -
> http://issues.roundup-tracker.org/msg5467
We will probably need to have some sort of tracker config setting for
this. There is a related ticket to wrap long lines in emails.
http://issues.roundup-tracker.org/issue2550902
Setting wrap=soft would mean that the same long lines would be created
by the web interface.
|
msg7681 |
Author: [hidden] (rouilj) |
Date: 2022-11-25 04:46 |
|
The current 2.2.0 release also includes:
# Used to determine what language should be used by the
# indexer above. Applies to Xapian and PostgreSQL native-fts
# indexer. It sets the language for the stemmer, and PostgreSQL
# native-fts stopwords and other dictionaries.
# Possible values: must be a valid language for the indexer,
# see indexer documentation for details.
# Default: english
indexer_language = english
in config.ini. My suspicion is that this problem would be handled by using
the Chinese language/locale for xapian, but it might not be enough.
|
|
Date | User | Action | Args |
2022-11-25 04:46:17 | rouilj | set | messages: + msg7681 |
2016-07-14 02:06:48 | rouilj | set | messages: + msg5848 |
2016-07-14 01:51:53 | ollydbg | set | messages: + msg5847 |
2016-07-10 19:05:30 | rouilj | set | nosy: + rouilj; messages: + msg5819 |
2016-06-26 19:15:42 | rouilj | link | issue1238984 superseder |
2016-02-25 08:09:48 | ber | set | messages: + msg5469 |
2016-02-25 05:21:50 | ollydbg | set | files: + indexer_xapian.patch; keywords: + patch; messages: + msg5467 |
2016-02-24 15:50:38 | ber | set | messages: + msg5466 |
2016-02-24 06:36:55 | ollydbg | set | nosy: + ollydbg; messages: + msg5464 |
2016-02-02 10:42:50 | pefu | set | nosy: + pefu |
2016-01-06 09:14:38 | ThomasAH | set | messages: + msg5409; title: Does not support non-ascii chars for All text search -> Does not support non-ascii chars for All text search (with Xapian) |
2016-01-06 08:42:10 | ThomasAH | set | messages: + msg5408 |
2016-01-06 08:39:19 | ThomasAH | set | messages: + msg5407 |
2016-01-06 08:34:46 | ThomasAH | set | messages: + msg5406 |
2016-01-06 08:28:33 | ThomasAH | set | messages: + msg5405 |
2016-01-06 08:24:59 | ThomasAH | set | nosy: + ThomasAH; messages: + msg5404; versions: + 1.5 |
2014-09-01 09:57:06 | ber | set | messages: + msg5134 |
2014-08-17 11:13:31 | yanqian | set | messages: + msg5132 |
2014-08-17 11:04:24 | yanqian | set | nosy: + yanqian; messages: + msg5131 |
2013-01-16 10:55:09 | ber | set | messages: + msg4759 |
2013-01-16 09:54:20 | jerome | set | messages: + msg4758 |
2013-01-16 09:27:23 | ber | set | nosy: + ber; messages: + msg4757 |
2013-01-16 07:09:38 | jerome | set | messages: + msg4756 |
2013-01-16 07:04:05 | jerome | create | |
|