Testing 金木水火土 gives a lot of "unknown" and WINDOWS-1252/1253 results
I created a simple test script to test uchardet:
#!/bin/sh
BASE_TEST_FILE=test_0base.txt
BASE_TEST_CONTENT='金木水火土'
BASE_TEST_CHARSET=''
# List from uchardet readme
SUPPORTED_CHARSET="
ASCII
BIG5
EUC-JP
EUC-KR / UHC
EUC-TW
GB18030
HZ-GB-2312
IBM852
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
KOI8-R
MAC-CENTRALEUROPE
MAC-CYRILLIC
SHIFT_JIS
TIS-620
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
UTF-8
VISCII
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
WINDOWS-1256
Windows-1257
Windows-1258
X-ISO-10646-UCS-4-21431
X-ISO-10646-UCS-4-34121
"
# Version
uchardet -v
echo ===
iconv -V
echo ===
# Create file in UTF8 with following char
echo ${BASE_TEST_CONTENT} >${BASE_TEST_FILE}
echo Base test file: ${BASE_TEST_FILE}
echo Base test file content: $(cat ${BASE_TEST_FILE})
# charset should be utf8
BASE_TEST_CHARSET=$(uchardet ${BASE_TEST_FILE})
echo Base test file charset: ${BASE_TEST_CHARSET}
echo ===
#for CS in $(echo $(iconv -l)); do
for CS in ${SUPPORTED_CHARSET}; do
    # "EUC-KR / UHC" is split into three words; the bare "/" yields an empty name, so skip it
    TO_CHARSET=$(echo "${CS}" | cut -d/ -f1)
    [ -n "${TO_CHARSET}" ] || continue
    TEST_FILE="test_${TO_CHARSET}.txt"
    # Create iconv file
    iconv -f "${BASE_TEST_CHARSET}" -t "${TO_CHARSET}" "${BASE_TEST_FILE}" >"${TEST_FILE}" 2>/dev/null
    ICONV_RESULT=$?
    # Only do test if iconv successful
    if [ "${ICONV_RESULT}" -eq 0 ]; then
        # uchardet
        TEST_RESULT=$(uchardet "${TEST_FILE}")
        # output
        echo "iconv to: ${TO_CHARSET}"
        echo "uchardet: ${TEST_RESULT}"
        # make sure iconv backward is successful
        echo "iconv back from \"to charset\": $(iconv -t "${BASE_TEST_CHARSET}" -f "${TO_CHARSET}" "${TEST_FILE}" 2>/dev/null)"
        echo "iconv back from uchardet charset: $(iconv -t "${BASE_TEST_CHARSET}" -f "${TEST_RESULT}" "${TEST_FILE}" 2>/dev/null)"
        echo ---
    fi
done
The result is as follows:
uchardet Command Line Tool
Version 0.0.6
Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet
===
iconv (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
===
Base test file: test_0base.txt
Base test file content: 金木水火土
Base test file charset: UTF-8
===
iconv to: BIG5
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ª÷¤ì¤ô¤õ¤g
---
iconv to: EUC-JP
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-KR
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UHC
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-TW
uchardet: KOI8-R
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: оземеуеждх
---
iconv to: GB18030
uchardet: WINDOWS-1253
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ½πΔΎΛ®»πΝΑ
---
iconv to: ISO-2022-CN
uchardet: ASCII
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: =pD>K.;pMA
---
iconv to: ISO-2022-JP
uchardet: ISO-2022-JP
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: ISO-2022-KR
uchardet: ISO-2022-KR
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: SHIFT_JIS
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UTF-16BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-16LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-32BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-32LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-8
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
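To make the misses in the output above easier to spot, each test could print a verdict next to the detected name. The following is only a sketch: `same_charset` and `verdict` are helper names made up for illustration, and the comparison merely ignores case, so aliases of the same charset (for example UHC and CP949) would still be reported as a miss.

```shell
#!/bin/sh
# same_charset A B: succeed when the two charset names match, ignoring case.
# (Hypothetical helper; it does not know about charset aliases.)
same_charset() {
    a=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    b=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    [ "${a}" = "${b}" ]
}

# verdict EXPECTED DETECTED: print OK or MISS for one test case.
verdict() {
    if same_charset "$1" "$2"; then
        echo "OK   $1"
    else
        echo "MISS $1 (detected: $2)"
    fi
}

# Three cases taken from the output above.
verdict BIG5 WINDOWS-1252
verdict ISO-2022-JP ISO-2022-JP
verdict UTF-8 utf-8
```

Inside the test loop this would be called as `verdict "${TO_CHARSET}" "${TEST_RESULT}"`.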
I chose 金木水火土 because these characters are written the same way in Simplified Chinese, Traditional Chinese, Korean and Japanese.
However, the results show a lot of misses. SHIFT_JIS, GB18030 and BIG5 are the three most notable ones, as they are in common use.
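One thing that may be worth ruling out is input length: the test file holds only five characters, and a statistical detector may simply have too little data to work with. The sketch below repeats the same sample to build a longer file and runs uchardet on it. This is an assumption to test, not a confirmed cause. `repeat_text`, the file name `test_long_BIG5.txt` and the count of 200 are arbitrary choices for illustration, and repeating five characters is still not natural text.

```shell
#!/bin/sh
# repeat_text TEXT N: print TEXT N times, one copy per line.
# (Hypothetical helper, used only to build a longer sample.)
repeat_text() {
    i=0
    while [ "${i}" -lt "$2" ]; do
        printf '%s\n' "$1"
        i=$((i + 1))
    done
}

# Only run the check when both tools are installed.
if command -v iconv >/dev/null 2>&1 && command -v uchardet >/dev/null 2>&1; then
    # 200 copies of the sample, converted to BIG5 as in the original test.
    repeat_text '金木水火土' 200 | iconv -f UTF-8 -t BIG5 >test_long_BIG5.txt
    uchardet test_long_BIG5.txt
fi
```

If the longer file is still misdetected, input length alone does not explain the result.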
I also added UTF-16 and UTF-32, although they are not mentioned in README.md, and they are detected correctly. However, their BE/LE variants fail.
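One plausible difference between the two groups is the byte order mark: glibc iconv writes a BOM for plain UTF-16 and UTF-32, but not for the explicit BE/LE variants. I have not confirmed that this is what uchardet keys on, so treat it as an assumption. The sketch below prints the first four bytes of each conversion so they can be compared. `hex_head` is a helper name made up for illustration.

```shell
#!/bin/sh
# hex_head N: print the first N bytes of stdin as lowercase hex, without spaces.
# (Hypothetical helper built on POSIX od.)
hex_head() {
    od -An -tx1 -N "$1" | tr -d ' \n'
    echo
}

# Show the leading bytes of the sample in each UTF-16/UTF-32 flavour.
for cs in UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE; do
    printf '%s: ' "${cs}"
    printf '金木水火土\n' | iconv -f UTF-8 -t "${cs}" | hex_head 4
done
```

A leading `fffe` or `feff` (UTF-16), or `fffe0000` or `0000feff` (UTF-32), indicates that a BOM was written.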