PIM: hard-code collation (Pinyin, phonebook)

Patrick Ohly @pohly said:

Here are four names, one per line:

Adams
Jeffries
江
Meadows

江 has Jiang as its Pinyin representation, so a collation based on Pinyin should sort as shown above (江 = Jiang after Jeffries and before Meadows). At least that's my understanding.

Unfortunately, I cannot reproduce this with the ICU web tool: http://demo.icu-project.org/icu-bin/locexp?_=zh&d_=en&x=col&collation=pinyin

To reproduce, replace the "Source" text with the names above and hit "sort". I get: 江 Adams Jeffries Meadows

Selecting and deselecting "Pinyin" as sort order has an effect. With the default sort order, 江 comes last.

Either the expected ordering above is wrong, ICU doesn't work as expected, or there is a bug in it (not likely?!).

Patrick Ohly @pohly said:

Need help from a localization expert. I've contacted some colleagues at Intel who work on that.

Patrick Ohly @pohly said:

(In reply to comment 1)

> Here are four names, one per line:
> Adams
> Jeffries
> 江
> Meadows
>
> 江 has Jiang as its Pinyin representation, so a collation based on Pinyin should sort as shown above (江 = Jiang after Jeffries and before Meadows). At least that's my understanding.

A Chinese colleague confirmed that this is indeed what he expects.

From the icu-support mailing list:

From: Mark Davis mark@macchiato.com
Reply-to: ICU support mailing list icu-support@lists.sourceforge.net
To: ICU support mailing list icu-support@lists.sourceforge.net
Subject: Re: [icu-support] pinyin sorting in zh_CN.UTF-8
Date: Mon, 13 May 2013 13:02:11 +0200

People have different expectations for pinyin. Some possibilities are:

1. Sort Chinese characters in pinyin order, but separate from Latin.
2. Sort them interleaved with Latin, by the first character.
3. Sort them fully interleaved with Latin.

For #2, the easiest way to do it is with the Alphabetic index. For #3, the best is to use a Han-Latin transliterator to get a key, then sort by that key.

We now know that ICU implements option 1, so implementing the expected outcome will be more work. We also need to determine whether #2 or #3 is expected.
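To make option 3 concrete, here is a minimal sketch of "transliterate to a key, then sort by that key". It is not the SyncEvolution or ICU implementation: the lookup table `kToyPinyin` and the functions `sortKey` and `sortInterleaved` are hypothetical names, and the one-entry table merely stands in for a real Han-Latin transliterator. Case folding is ASCII-only, which is enough for the four example names.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a Han->Latin transliterator. A real
// implementation would ask ICU instead of using a fixed table.
static const std::map<std::string, std::string> kToyPinyin = {
    {"江", "Jiang"},
};

// Build the sort key for option 3: the Latin (Pinyin) spelling if the
// name is in the table, otherwise the name itself, lower-cased (ASCII only).
std::string sortKey(const std::string &name)
{
    auto it = kToyPinyin.find(name);
    std::string key = (it != kToyPinyin.end()) ? it->second : name;
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return key;
}

// Sort names by their key, so Chinese names interleave with Latin ones.
std::vector<std::string> sortInterleaved(std::vector<std::string> names)
{
    std::stable_sort(names.begin(), names.end(),
                     [](const std::string &a, const std::string &b) {
                         return sortKey(a) < sortKey(b);
                     });
    return names;
}
```

With the four names from comment 1 as input, the key for 江 is "jiang", which compares after "jeffries" and before "meadows", so the result is the ordering the Chinese colleague expects. The source file must be saved as UTF-8 for the 江 literal to match.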

Murray Cumming said:

> A Chinese colleague confirmed that this is indeed what he expects. [snip]

It would be nice if we could base this on some standard that's written down somewhere, or more thoroughly documented as being de-facto common.

> We now know that ICU implements option 1, so implementing the expected outcome will be more work. We also need to determine whether #2 or #3 is expected.

It seems a little odd that ICU doesn't do something that is apparently so common.

Patrick Ohly @pohly said:

(In reply to comment 4)

> > A Chinese colleague confirmed that this is indeed what he expects. [snip]
>
> It would be nice if we could base this on some standard that's written down somewhere, or more thoroughly documented as being de-facto common.

I suspect that there is no such document.

> > We now know that ICU implements option 1, so implementing the expected outcome will be more work. We also need to determine whether #2 or #3 is expected.
>
> It seems a little odd that ICU doesn't do something that is apparently so common.

My understanding is that all three options are valid, so ICU simply picked one. Perhaps they didn't pick the most popular one.

Patrick Ohly @pohly said:

LocaleFactoryBoost::genLocale() implements a hard-coded list of languages where "phonebook" collation is desirable. Currently these are "de" and "fi". We could use it in all cases, except that ICU has a bug where it does not fall back properly to the base collation. See http://sourceforge.net/mailarchive/message.php?msg_id=30802924 and http://bugs.icu-project.org/trac/ticket/10149

In addition, fully interleaved Pinyin-based sorting is used for "zh". This requires an extra transliteration of Han->Latin, because ICU itself sorts Chinese characters after Latin ones when using the "Pinyin" collation.
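The selection logic described above can be sketched roughly as follows. This is an illustration, not the code in LocaleFactoryBoost::genLocale(): `CollationChoice` and `chooseCollation` are hypothetical names. The `@collation=...` suffix is ICU's locale keyword syntax, but whether the real code builds its locale exactly this way, and whether it also requests the pinyin collation for "zh", is an assumption.

```cpp
#include <set>
#include <string>

// Hypothetical result type: which ICU locale ID to request and whether
// names need a Han->Latin transliteration before they are compared.
struct CollationChoice {
    std::string localeId;
    bool transliterateHan;
};

CollationChoice chooseCollation(const std::string &language)
{
    // Hard-coded list of languages for which "phonebook" is wanted.
    static const std::set<std::string> phonebookLanguages = {"de", "fi"};

    if (language == "zh") {
        // Pinyin alone keeps Han separate from Latin, so the caller
        // must also transliterate to get fully interleaved sorting.
        return {"zh@collation=pinyin", true};
    }
    if (phonebookLanguages.count(language)) {
        return {language + "@collation=phonebook", false};
    }
    // All other languages: leave the locale alone, because of the
    // ICU fallback bug mentioned above.
    return {language, false};
}
```

Keeping the list explicit means a language only gets "phonebook" once someone has checked that ICU handles it correctly, which is the conservative choice while the fallback bug is open.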

EDS implements the same logic in the new ECollator utility class, scheduled for EDS 3.10 and included in the openismus-work-3-8 branch. SyncEvolution's PIM Manager should use these classes.

Patrick Ohly @pohly said:

(In reply to comment 6)

> EDS implements the same logic in the new ECollator utility class, scheduled for EDS 3.10 and included in the openismus-work-3-8 branch. SyncEvolution's PIM Manager should use these classes.

The current EDS APIs lead to a slight performance degradation: ICU uses std::string, EDS copies into a string, and SyncEvolution recreates a std::string. A C++ API in EDS using std::string would be more useful.

For performance reasons I kept the code which uses ICU directly.

Submitted by Patrick Ohly `@pohly`
