Issue 780733: full-text indexer isn't locale-aware - Roundup tracker

classification

Title:	full-text indexer isn't locale-aware
Type:	rfe	Severity:	normal
Components:	None	Versions:

process

Status:	closed	Resolution:	fixed
Dependencies		Superseder:
Assigned To:	richard	Nosy List:	a1s, baranenko, richard
Priority:	normal	Keywords:

Created on 2003-07-31 09:25 by baranenko, last changed 2005-05-18 05:44 by richard.

Messages
msg3236	Author: [hidden] (baranenko)	Date: 2003-07-31 09:25
Full-text search does not find international characters in message contents. I'm using 0.6.0b4.
msg3237	Author: [hidden] (richard)	Date: 2003-08-28 05:10
Yep, the text tokeniser is being dumb :(
msg3238	Author: [hidden] (baranenko)	Date: 2004-03-22 12:52
Is there any chance this will be improved?
msg3239	Author: [hidden] (richard)	Date: 2004-03-22 20:53
Sure, but I guess to do it we'd have to make the text tokeniser locale-aware. At the moment, the tokeniser splits text using re.findall(r'\b\w{2,25}\b', text), i.e. a word is any run of 2 to 25 characters that match the re module's \w.
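A minimal sketch of the behaviour described above, assuming Python 3, where re.ASCII is used to reproduce a \w that only knows ASCII letters, digits and underscore. The helper name tokenize_ascii and the sample sentence are illustrative and not taken from Roundup's source.

```python
import re

# Pattern quoted in the message above: a "word" is 2 to 25 \w characters.
WORD_PATTERN = r'\b\w{2,25}\b'

def tokenize_ascii(text):
    """Hypothetical helper: tokenise with an ASCII-only notion of a word."""
    # re.ASCII restricts \w and \b to [a-zA-Z0-9_], which mimics a
    # tokeniser that is not locale- or Unicode-aware.
    return re.findall(WORD_PATTERN, text, re.ASCII)

tokens = tokenize_ascii("search for привет and na\u00efve words")
print(tokens)
# ['search', 'for', 'and', 'na', 've', 'words']
```

The Cyrillic word produces no token at all, and the accented word is split into two fragments, so neither can be found by a full-text search for the original word.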
msg3240	Author: [hidden] (a1s)	Date: 2004-10-14 09:10
\w can match letters of any language with the LOCALE or UNICODE flag. I'd prefer to use UNICODE. To do that, we'll probably need to allow a 'charset' parameter for mime_type and have some encoding (character set) preference that is used when charset is not specified in mime_type.
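One way the UNICODE approach could look, as a rough sketch only: decode the raw content with the declared charset, fall back to a default when none is given, then tokenise with a Unicode-aware \w. The names tokenize and DEFAULT_CHARSET are assumptions for illustration and do not reflect how Roundup actually resolved this.

```python
import re

# Same pattern as the existing tokeniser, compiled with re.UNICODE.
# (On Python 3 str patterns this flag is already the default; it is
# spelled out here to match the suggestion in the message above.)
WORD_RE = re.compile(r'\b\w{2,25}\b', re.UNICODE)

# Assumption: stands in for the "encoding preference" used when the
# mime_type carries no charset parameter.
DEFAULT_CHARSET = 'utf-8'

def tokenize(raw, charset=None):
    """Hypothetical helper: decode bytes, then split into Unicode words."""
    # Undecodable bytes become U+FFFD, which \w does not match.
    text = raw.decode(charset or DEFAULT_CHARSET, 'replace')
    return WORD_RE.findall(text)

# Content with an explicit charset, then content relying on the default.
cyrillic = tokenize("привет мир".encode('koi8-r'), 'koi8-r')
accented = tokenize("na\u00efve caf\u00e9".encode('utf-8'))
```

Decoding before matching is the key design choice here: the regular expression only sees text, so the charset question is settled once, at the boundary where the content enters the indexer.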
msg3241	Author: [hidden] (richard)	Date: 2005-05-18 05:44
See bug 1195739

History
Date	User	Action	Args
2003-07-31 09:25:35	baranenko	create