Strange typing data results in mixed Traditional / Simplified variants
There are some strange mappings in the existing data that lead to mixed Traditional / Simplified output.
For example, although "后" is the simplified variant of "後", users would expect:
- "hovie" / "竹人女戈水" to map only to "後"
- "hmr" / "竹一口" to map only to "后"
However, in the current `table.txt`:
- "hovie" / "竹人女戈水" is mapped to both "後" and "后"
- "hmr" / "竹一口" is mapped to both "後" and "后"
I'd suggest that these strange mappings come from the data source used to compose `table.txt`. There were 8 source data files in the original libcangjie repository, 4 for each Cangjie version:
- 三倉繁體 (Cangjie 3 Traditional): `cj3-13053.txt`, later `cj3-tc.txt`
- 三仓简体 (Cangjie 3 Simplified): `cj3-6763.txt`, later `cj3-sc.txt`
- 三倉通用 (Cangjie 3 Common): `cj3-20902.txt`, later `cj3-cc.txt`
- 三倉世紀 (Cangjie 3 Century): `cj3-20902.txt`, later `cj3-cjk.txt`
- 五倉繁體 (Cangjie 5 Traditional): `cj5-13053.txt`, later `cj5-tc.txt`
- 五仓简体 (Cangjie 5 Simplified): `cj5-8300.txt`, later `cj5-sc.txt`
- 五倉通用 (Cangjie 5 Common): `cj5-20902.txt`, later `cj5-cc.txt`
- 五倉世紀 (Cangjie 5 Century): `cj5-20902.txt`, later `cj5-cjk.txt`
These files come from the 倉頡平台2012 (which can be translated as "Cangjie Input Platform 2012"; let's call it "CIP2012" for now). CIP2012 provides 8 Cangjie variant input methods, and each of the above files corresponds directly to one of these variants. The mappings in these data files are closely tied to the features CIP2012 aims to provide.
The features of CIP2012 are listed both in a post in CBF's guestbook and in a post on ChineseCJ.com. One of the listed features deserves highlighting:
支持打簡出繁及打繁出簡
(translation: support "typing Simplified output Traditional" and "typing Traditional output Simplified")
Apparently, this feature is achieved by adding cross-mapping data: some Cangjie codes are mapped not only to their direct match, but also to the Traditional / Simplified variants of the direct match.
After a brief check:
- The "Traditional mapping files"
cj3-13053.txt
andcj5-13053.txt
(or later ascj3-tc.txt
andcj5-tc.txt
) do not contain cross-mapping data. - All other 6 files contain cross-mapping data.
This cross-mapping data is the cause of this issue, and we need to fix the data.
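One possible cleanup strategy is sketched below. It rests on the observation above that the Traditional mapping files contain no cross-mapping data, so they can serve as a reference: a `(code, char)` pair from another file is dropped when its character exists in the reference but the pair itself does not, i.e. the pair is likely a cross-mapping artifact. The in-memory pair representation and the tiny sample data are assumptions for illustration; the real files would need parsing first.

```python
# Sketch of one possible cleanup, assuming each file is a list of
# (code, char) pairs. The Traditional files (cj3-tc.txt / cj5-tc.txt)
# contain no cross-mappings, so they act as the reference here.
def strip_cross_mappings(pairs, reference_pairs):
    ref_pairs = set(reference_pairs)
    ref_chars = {char for _, char in ref_pairs}
    kept = []
    for code, char in pairs:
        # Keep the pair if the reference confirms it, or if the
        # character is simply absent from the reference (e.g. a
        # Simplified-only character the Traditional file never lists).
        if (code, char) in ref_pairs or char not in ref_chars:
            kept.append((code, char))
    return kept

# Illustration with the reported 後/后 case:
reference = [("hovie", "後"), ("hmr", "后")]        # no cross-mappings
polluted  = [("hovie", "後"), ("hovie", "后"),
             ("hmr", "後"), ("hmr", "后")]
print(strip_cross_mappings(polluted, reference))
# → [('hovie', '後'), ('hmr', '后')]
```

Whether to strip the cross-mappings entirely, or keep them behind an opt-in flag (since CIP2012 ships them deliberately as a feature), is a separate design decision.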