Issue 780733
Created on 2003-07-31 09:25 by baranenko, last changed 2005-05-18 05:44 by richard.
msg3236
Author: [hidden] (baranenko)
Date: 2003-07-31 09:25

Full-text search does not find international characters in
the contents of a message.
I'm using 0.6.0b4.

msg3237
Author: [hidden] (richard)
Date: 2003-08-28 05:10

Yep, the text tokeniser is being dumb :(

msg3238
Author: [hidden] (baranenko)
Date: 2004-03-22 12:52

Is there any chance this could be improved?

msg3239
Author: [hidden] (richard)
Date: 2004-03-22 20:53

Sure, but I guess to do it we'd have to make the text
tokeniser locale-aware.
At the moment the tokeniser splits text using
re.findall(r'\b\w{2,25}\b', text)
i.e. a word is any run of 2 to 25 characters matching the re
module's \w, which without flags only covers ASCII letters,
digits and underscore.
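
A rough sketch of that behaviour (a Python 3 transcription; re.ASCII
stands in for the old ASCII-only default of \w, and the Cyrillic
sample word is only for illustration):

    # -*- coding: utf-8 -*-
    import re

    text = u"поиск search"   # one Cyrillic word, one ASCII word

    # ASCII-only \w (the old default): the Cyrillic word is never
    # returned, so it is never indexed and can never be found.
    print(re.findall(r'\b\w{2,25}\b', text, re.ASCII))
    # -> ['search']

    # Unicode-aware \w: the same pattern also captures the Cyrillic
    # word, so it would be indexed like any other word.
    print(re.findall(r'\b\w{2,25}\b', text, re.UNICODE))
    # -> ['поиск', 'search']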

msg3240
Author: [hidden] (a1s)
Date: 2004-10-14 09:10

\w can match letters of any language with the LOCALE and
UNICODE flags.
I'd prefer to use UNICODE. To do that we'll probably need
to allow a 'charset' parameter on mime_type, plus some
encoding (character set) preference to use when no
charset is specified in the mime_type.
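
A minimal sketch of that idea, with hypothetical names (tokenise and
default_charset are not Roundup's actual API): decode the content
using the charset from its mime_type, fall back to a configured
encoding preference, then split with a Unicode-aware \w.

    # -*- coding: utf-8 -*-
    import re

    def tokenise(content, charset=None, default_charset='utf-8'):
        # Decode byte content with the declared charset, falling back
        # to the configured default; then index words in any alphabet.
        if isinstance(content, bytes):
            content = content.decode(charset or default_charset, 'replace')
        return re.findall(r'\b\w{2,25}\b', content, re.UNICODE)

    # e.g. a KOI8-R encoded body whose mime_type carries charset=koi8-r
    raw = u'пример example'.encode('koi8-r')
    print(tokenise(raw, charset='koi8-r'))
    # -> ['пример', 'example']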

msg3241
Author: [hidden] (richard)
Date: 2005-05-18 05:44

See bug 1195739

Date                | User      | Action | Args
2003-07-31 09:25:35 | baranenko | create |