Roundup Tracker - Issues

Issue 780733

classification
full-text indexer isn't locale-aware
Type: rfe Severity: normal
Components: None Versions:
process
Status: closed fixed
Assigned To: richard
Nosy List: a1s, baranenko, richard
Priority: normal

Created on 2003-07-31 09:25 by baranenko, last changed 2005-05-18 05:44 by richard.

Messages
msg3236 Author: [hidden] (baranenko) Date: 2003-07-31 09:25
Full-text search does not find words containing international
(non-ASCII) characters in the contents of a message.
I'm using 0.6.0b4.
msg3237 Author: [hidden] (richard) Date: 2003-08-28 05:10
Yep, the text tokeniser is being dumb :( 
 
msg3238 Author: [hidden] (baranenko) Date: 2004-03-22 12:52
Is there any chance this will be improved?
msg3239 Author: [hidden] (richard) Date: 2004-03-22 20:53
Sure, but I guess to do it we'd have to make the text 
tokeniser locale-aware. 
 
At the moment, the tokeniser splits text using  
 
re.findall(r'\b\w{2,25}\b', text) 
 
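i.e. a word is any run of 2 to 25 characters that match the RE
module's \w, which without the LOCALE or UNICODE flag covers only
ASCII letters, digits and the underscore.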
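
To illustrate the effect described above, here is a minimal sketch, not the tracker's actual indexer code. It is written for Python 3, where str patterns are Unicode-aware by default, so the re.ASCII flag is used to imitate the ASCII-only \w of the old default behaviour. The function names and the sample text are made up for the example.

```python
import re

# Same pattern as quoted in the message: "words" of 2 to 25 \w characters.
WORD_RE = r'\b\w{2,25}\b'

def tokenize_ascii(text):
    # re.ASCII limits \w (and \b) to [a-zA-Z0-9_], imitating a tokeniser
    # that is not locale- or Unicode-aware.
    return re.findall(WORD_RE, text, re.ASCII)

def tokenize_unicode(text):
    # In Python 3, \w on a str pattern matches letters of any script.
    return re.findall(WORD_RE, text)

sample = "поиск по тексту works"

ascii_tokens = tokenize_ascii(sample)      # ['works']
unicode_tokens = tokenize_unicode(sample)  # ['поиск', 'по', 'тексту', 'works']
```

With the ASCII-only pattern the Cyrillic words never reach the index, which would explain why a search for them finds nothing.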
msg3240 Author: [hidden] (a1s) Date: 2004-10-14 09:10
\w can match letters of any language when the RE is compiled with
the LOCALE or UNICODE flag.

I'd prefer to use UNICODE.  To do that, we'll probably need
to allow a 'charset' parameter in mime_type and have an
encoding (character set) preference that is used when
charset is not specified in mime_type.
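
A rough sketch of the idea above, under stated assumptions: the content is stored as bytes, the MIME type string may carry a charset parameter, and a configurable fallback encoding applies otherwise. DEFAULT_CHARSET, charset_of and tokenize are hypothetical names for illustration, not existing Roundup settings or functions.

```python
import re
from email.message import Message

# Hypothetical tracker-wide preference, used when mime_type has no charset.
DEFAULT_CHARSET = "utf-8"

def charset_of(mime_type, default=DEFAULT_CHARSET):
    # Let the standard library parse the Content-Type parameters.
    msg = Message()
    msg["Content-Type"] = mime_type
    return msg.get_content_charset(default)

def tokenize(raw_bytes, mime_type):
    charset = charset_of(mime_type)
    try:
        text = raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the configured default.
        text = raw_bytes.decode(DEFAULT_CHARSET, errors="replace")
    # Decode first, then tokenise the text with a Unicode-aware \w.
    return re.findall(r'\b\w{2,25}\b', text, re.UNICODE)

tokens = tokenize("поиск".encode("koi8-r"), "text/plain; charset=koi8-r")
# tokens == ['поиск']
```

Decoding to text before tokenising keeps the indexer independent of the process locale, which is the main practical difference from the LOCALE flag.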
msg3241 Author: [hidden] (richard) Date: 2005-05-18 05:44
See bug 1195739 
History
Date User Action Args
2003-07-31 09:25:35 baranenko create