Roundup Tracker - Issues

Issue 2550788

classification
Does not support non-ascii chars for All text search (with Xapian)
Type: behavior Severity: normal
Components: Web interface Versions: 1.5, 1.4
process
Status: new
Nosy list: ThomasAH, ber, jerome, ollydbg, pefu, rouilj, yanqian
Priority: Keywords: patch

Created on 2013-01-16 07:04 by jerome, last changed 2022-11-25 04:46 by rouilj.

Files
File name Uploaded Description
indexer_xapian.patch ollydbg, 2016-02-25 05:21 patch to enable index chinese words by mmseg
Messages
msg4755 Author: [hidden] (jerome) Date: 2013-01-16 07:04
Hi,
    Roundup is a good issue tracking tool and it does what we need. Thank
you all. Here is an issue with the search page.

    The "All text" search filter does not support non-ASCII characters, but
the "title" filter does.
    We can't force others to submit issues in English only, so we often
search for Chinese phrases. The title filter supports Chinese, but
"All text" does not.
    Here are some Chinese words, just for testing:
    你好
    谢谢
    欢迎
msg4756 Author: [hidden] (jerome) Date: 2013-01-16 07:09
I tested the search function on your site and it works well there.
Even so, could you give some hints on how to troubleshoot this issue?

Thanks!
msg4757 Author: [hidden] (ber) Date: 2013-01-16 09:27
Hi Jerome,

thanks for your report!
Let us try together to turn it into a reproducible report in the next steps;
this is a precondition for the developers to provide a solution.

Which version of roundup are you using? Platform? Backend and
Libraries? (This determines which search code is used.)

Can you try to give us some content that should be searchable?
Maybe you could write a test case? (It is very easy: look for the search
and index tests in the "test" subdirectory.)

What is the behaviour you are getting? Some error message?

Regards,
Bernhard
msg4758 Author: [hidden] (jerome) Date: 2013-01-16 09:54
Hi ber,

Thanks for your reply.

OS:          Ubuntu 10.04
roundup      1.4.11-1ubuntu1 an issue-tracking system
db:          sqlite
python2.6    2.6.5-1ubuntu6

> Can you try to give us some content that should be searchable?
> Maybe you could write a test case? (It is very easy: look for the search
> and index tests in the "test" subdirectory.)

I could not understand what you said. We just search for Chinese.

> What is the behaviour you are getting? Some error message?

No error information is displayed; the search just returns nothing.

We have some Chinese phrases in the content, but searching for them in
"All text" doesn't work.

Thanks
msg4759 Author: [hidden] (ber) Date: 2013-01-16 10:55
Can you try a new version of roundup, 
just to see if this causes your problem?

Should be easy. Just download and try in the demo tracker, see the 
documentation or ask on the users- list.
msg5131 Author: [hidden] (yanqian) Date: 2014-08-17 11:04
This is a Chinese sentence sample, just for test purpose.

大家好我来自中国。

I think I have found a way to reproduce the search bug: I may not be able to
search for a Chinese word inside the sample sentence.
msg5132 Author: [hidden] (yanqian) Date: 2014-08-17 11:13
OK, the behavior is what I expected.
These words can only be found when they are separated by spaces, which is
why I get results when I search for "你好", "谢谢" or "欢迎".

Usually we don't put unnecessary spaces in a normal sentence, so the
bug is that Roundup is not able to find a Chinese word inside a
sentence with the "All text" search function.

But as I tested, it works well with the "title" search function: it
does find a match when I search for a Chinese word from a sentence in the
title field (without spaces).

So, how can we make Roundup's "All text" search behave the same way as the
"Title" search?

Thanks!
msg5134 Author: [hidden] (ber) Date: 2014-09-01 09:57
Hi,
the all text search works by using an index.

The algorithm that creates the index first separates the words by spaces
and then puts them in the index. Maybe whitespace needs special treatment
in Chinese sentences.

To make life more difficult: there are several index "backends" that
Roundup may be using. You could try using a different one to see if the
situation changes, e.g. try using Xapian.

Best,
Bernhard
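
To illustrate the point above about whitespace-based indexing, here is a minimal sketch. It assumes a simplified tokenizer built on Python 3's `re` module, not Roundup's actual indexer code; the names `tokenize` and `index_contains` are hypothetical.

```python
import re

def tokenize(text):
    """Split text into index terms on non-word characters (simplified)."""
    # In Python 3, \w matches Unicode letters, including CJK ideographs,
    # so a Chinese sentence without spaces comes out as one long term.
    return re.findall(r"\w+", text)

def index_contains(text, word):
    """Return True if word is one of the whole terms indexed for text."""
    return word in tokenize(text)

spaced = "你好 谢谢 欢迎"
sentence = "大家好我来自中国。"

print(tokenize(spaced))                  # three separate terms
print(tokenize(sentence))                # a single term for the whole sentence
print(index_contains(spaced, "谢谢"))     # True
print(index_contains(sentence, "中国"))   # False: "中国" is not a term by itself
```

A plain substring match, which the "title" filter appears to use, would still find 中国 inside the sentence. That would be consistent with the difference between the two filters reported earlier in this issue.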
msg5404 Author: [hidden] (ThomasAH) Date: 2016-01-06 08:24
Roundup 1.5 with Xapian index and Postgres is still affected by this issue.
msg5405 Author: [hidden] (ThomasAH) Date: 2016-01-06 08:28
but ... searching for the strings mentioned in this tracker (e.g. 大家好
我来自中国) finds this issue!

Adding some additional non-ascii text to test search here:
äöüßə
msg5406 Author: [hidden] (ThomasAH) Date: 2016-01-06 08:34
Searching for "äöüßə" works fine here, too, while it doesn't in our
installation (Changeset 55aef7ab35a8, after 1.5) and I don't see any
relevant changes after this changeset.

So what is the difference?
msg5407 Author: [hidden] (ThomasAH) Date: 2016-01-06 08:39
In an older installation with roundup 1.4.11 and Postgres (but no Xapian
index) this works fine, too.
msg5408 Author: [hidden] (ThomasAH) Date: 2016-01-06 08:42
and in an even older installation (Roundup 1.3.x with sqlite backend, no
Xapian) it works fine, too.

Maybe this just happens with Xapian enabled?
msg5409 Author: [hidden] (ThomasAH) Date: 2016-01-06 09:14
I have just disabled Xapian for our new trackers, now searching for
non-ASCII characters works perfectly.

Of course the search might be slower now and Xapian might be needed for
other reasons, so this is just a workaround.
(additionally the only way to disable using Xapian is uninstalling the
python bindings for it, or changing Roundup's code)
msg5464 Author: [hidden] (ollydbg) Date: 2016-02-24 06:36
Hi, guys, I'm not sure exactly which "non-ascii" issue you have. For
me, the need is just to handle Chinese characters (English words are
still needed too). I use mmseg 1.3.0 (Python Package Index -
https://pypi.python.org/pypi/mmseg/1.3.0) to parse all the text and
add the terms to the Xapian database. (Note that not only mmseg but also
other segmenters such as fxsjy/jieba: 结巴中文分词 -
https://github.com/fxsjy/jieba should work OK, because they just cut a
long Chinese sentence into several Chinese words.)

Now I can search for Chinese words correctly.

If you are interested in using mmseg, I can upload the patch against
Roundup's source code.
msg5466 Author: [hidden] (ber) Date: 2016-02-24 15:50
Hi,

On Wednesday 24 February 2016 at 07:36:55, ollydbg wrote:
> If you are interested to use mmseg, I can upload the patch against
> roundup's source code.

it is always good to publish the code, even if it does not
get merged in the end.

Thanks! :)
msg5467 Author: [hidden] (ollydbg) Date: 2016-02-25 05:21
Hi, Bernhard Reiter, see the attached patch file; this patch is
against the Roundup 1.5.1 release source.

I just extract the Chinese substrings from the whole text and split them
into several Chinese words with mmseg. Then I add a term for each Chinese
word. The whole text is still parsed by the default English indexer, so
that we can index all the Chinese and English words.
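
The approach described above can be sketched roughly as follows. This is not the attached patch: the names `CJK_RUN`, `bigrams` and `extra_cjk_terms` are hypothetical, the character range covers only the basic CJK Unified Ideographs block, and overlapping bigrams stand in for mmseg so that the sketch runs without extra packages.

```python
import re

# Runs of CJK Unified Ideographs (basic block only; an assumption for brevity).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def bigrams(run):
    """Stand-in segmenter: overlapping two-character terms.

    The patch uses mmseg for real word segmentation; bigrams are a
    swapped-in placeholder, not what the patch does.
    """
    if len(run) < 2:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]

def extra_cjk_terms(text, segment=bigrams):
    """Return additional index terms for the Chinese parts of text.

    The full text would still go through the default indexer; these
    terms would be added on top of that.
    """
    terms = []
    for run in CJK_RUN.findall(text):
        terms.extend(segment(run))
    return terms

terms = extra_cjk_terms("Hello 大家好我来自中国。 bye")
print(terms)
print("中国" in terms)  # True: the word is now a term by itself
```

A real segmenter such as mmseg or jieba could be passed in through the `segment` parameter, provided it takes a string and returns a list of words.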

Please note that in the HTML file issue.item.html, the wrap option
(on line 89) should be

wrap="soft"

Otherwise, the Chinese sentence will be divided mistakenly by the added
line breaks. See my report here: [Roundup-users] hard line breaks were
automatically added between Chinese chars when I add a message -
https://sourceforge.net/p/roundup/mailman/message/34879502/
msg5469 Author: [hidden] (ber) Date: 2016-02-25 08:09
Ollydbg,
thanks for publishing the patch!

At least it will be useful to Chinese users of roundup!

Best Regards,
Bernhard
msg5819 Author: [hidden] (rouilj) Date: 2016-07-10 19:05
If we want to add this to the core it looks like I need to:

change the 

from mmseg.search import seg_txt_search,seg_txt_2_dict

to

try:
  MissingMmseg=False
  from mmseg.search import seg_txt_search,seg_txt_2_dict
except ImportError:
  MissingMmseg=True

and then wrap the second section in if not MissingMmseg

Then add to installation.doc the package info and why people
would want it.

Does anybody see anything else I would need to incorporate this patch?

Does anybody believe this patch should not go into core?

Note I have no idea how to write a test for it and the test is
conditional on having and not having the mmseg module.

-- rouilj
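
One possible shape for the optional import and a test that depends on whether mmseg is present, as a sketch only. The helper `chinese_terms` and the test class are hypothetical, and the assumption that `seg_txt_search(text)` yields the segmented words is not confirmed in this issue.

```python
import unittest

try:
    # Import path and function name as given in the message above.
    from mmseg.search import seg_txt_search
    MissingMmseg = False
except ImportError:
    MissingMmseg = True


def chinese_terms(text):
    """Hypothetical helper: segmented terms, or an empty list without mmseg."""
    if MissingMmseg:
        return []
    # Assumption: seg_txt_search(text) yields the segmented words.
    return list(seg_txt_search(text))


class ChineseTermsTest(unittest.TestCase):
    @unittest.skipIf(MissingMmseg, "mmseg is not installed")
    def test_with_mmseg(self):
        # A real test would compare against the expected segmentation.
        self.assertIsInstance(chinese_terms("大家好我来自中国"), list)

    @unittest.skipUnless(MissingMmseg, "mmseg is installed")
    def test_without_mmseg(self):
        # Without mmseg no extra terms are produced, so only the
        # default indexing would apply.
        self.assertEqual(chinese_terms("大家好我来自中国"), [])
```

With this layout exactly one of the two tests runs in any given environment and the other is reported as skipped, so the suite passes with or without the module.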
msg5847 Author: [hidden] (ollydbg) Date: 2016-07-14 01:51
Hi, rouilj, thanks, it would be great if the official Roundup could support
Chinese language indexing.

If you would like to support indexing the Chinese words, I would
strongly consider another change, see my comments in msg5467 -
http://issues.roundup-tracker.org/msg5467

Without this change, the edit box will cut the long Chinese sentence badly.

msg5848 Author: [hidden] (rouilj) Date: 2016-07-14 02:06
Ollydbg said:

> If you would like to support indexing the Chinese words, I would
> strongly consider another change, see my comments in msg5467 -
> http://issues.roundup-tracker.org/msg5467

We will probably need to have some sort of tracker config setting for
this. There is a related ticket to wrap long lines in emails.

 http://issues.roundup-tracker.org/issue2550902

Setting wrap=soft would mean that the same long lines would be created
by the web interface.

msg7681 Author: [hidden] (rouilj) Date: 2022-11-25 04:46
The current 2.2.0 release also includes:

# Used to determine what language should be used by the
# indexer above. Applies to Xapian and PostgreSQL native-fts
# indexer. It sets the language for the stemmer, and PostgreSQL
# native-fts stopwords and other dictionaries.
# Possible values: must be a valid language for the indexer,
# see indexer documentation for details.
# Default: english
indexer_language = english

in config.ini. My suspicion is that this problem would be handled by using
the Chinese language/locale for xapian, but it might not be enough.
History
Date User Action Args
2022-11-25 04:46:17 rouilj set messages: + msg7681
2016-07-14 02:06:48 rouilj set messages: + msg5848
2016-07-14 01:51:53 ollydbg set messages: + msg5847
2016-07-10 19:05:30 rouilj set nosy: + rouilj, messages: + msg5819
2016-06-26 19:15:42 rouilj link issue1238984 superseder
2016-02-25 08:09:48 ber set messages: + msg5469
2016-02-25 05:21:50 ollydbg set files: + indexer_xapian.patch, keywords: + patch, messages: + msg5467
2016-02-24 15:50:38 ber set messages: + msg5466
2016-02-24 06:36:55 ollydbg set nosy: + ollydbg, messages: + msg5464
2016-02-02 10:42:50 pefu set nosy: + pefu
2016-01-06 09:14:38 ThomasAH set messages: + msg5409, title: Does not support non-ascii chars for All text search -> Does not support non-ascii chars for All text search (with Xapian)
2016-01-06 08:42:10 ThomasAH set messages: + msg5408
2016-01-06 08:39:19 ThomasAH set messages: + msg5407
2016-01-06 08:34:46 ThomasAH set messages: + msg5406
2016-01-06 08:28:33 ThomasAH set messages: + msg5405
2016-01-06 08:24:59 ThomasAH set nosy: + ThomasAH, messages: + msg5404, versions: + 1.5
2014-09-01 09:57:06 ber set messages: + msg5134
2014-08-17 11:13:31 yanqian set messages: + msg5132
2014-08-17 11:04:24 yanqian set nosy: + yanqian, messages: + msg5131
2013-01-16 10:55:09 ber set messages: + msg4759
2013-01-16 09:54:20 jerome set messages: + msg4758
2013-01-16 09:27:23 ber set nosy: + ber, messages: + msg4757
2013-01-16 07:09:38 jerome set messages: + msg4756
2013-01-16 07:04:05 jerome create