Issue created Jun 03, 2020 by John Siu (@john.sd.siu)

Testing 金木水火土 gives a lot of unknown and windows-1252/3

I created a simple test script to test uchardet:

#!/bin/sh

BASE_TEST_FILE=test_0base.txt
BASE_TEST_CONTENT='金木水火土'
BASE_TEST_CHARSET=''

# List from uchardet readme (duplicate entries removed)
SUPPORTED_CHARSET="
ASCII
BIG5
EUC-JP
EUC-KR / UHC
EUC-TW
GB18030
HZ-GB-2312
IBM852
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
KOI8-R
MAC-CENTRALEUROPE
MAC-CYRILLIC
SHIFT_JIS
TIS-620
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
UTF-8
VISCII
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
WINDOWS-1256
Windows-1257
Windows-1258
X-ISO-10646-UCS-4-21431
X-ISO-10646-UCS-4-34121
"

# Version
uchardet -v
echo ===
iconv -V
echo ===

# Create the base test file in UTF-8 with the test characters
echo "${BASE_TEST_CONTENT}" >"${BASE_TEST_FILE}"
echo "Base test file: ${BASE_TEST_FILE}"
echo "Base test file content: $(cat "${BASE_TEST_FILE}")"
# The detected charset should be UTF-8
BASE_TEST_CHARSET=$(uchardet "${BASE_TEST_FILE}")
echo "Base test file charset: ${BASE_TEST_CHARSET}"

echo ===

#for CS in $(echo $(iconv -l)); do
# SUPPORTED_CHARSET is left unquoted on purpose so that it splits into words
for CS in ${SUPPORTED_CHARSET}; do
	TO_CHARSET=$(echo "${CS}" | cut -d/ -f1)
	# "EUC-KR / UHC" splits into three words; skip the bare "/"
	[ -n "${TO_CHARSET}" ] || continue
	TEST_FILE="test_${TO_CHARSET}.txt"

	# Create iconv file
	iconv -f "${BASE_TEST_CHARSET}" -t "${TO_CHARSET}" "${BASE_TEST_FILE}" >"${TEST_FILE}" 2>/dev/null
	ICONV_RESULT=$?

	# Only do test if iconv successful
	if [ "${ICONV_RESULT}" -eq 0 ]; then
		# uchardet
		TEST_RESULT=$(uchardet "${TEST_FILE}")
		# output
		echo "iconv to: ${TO_CHARSET}"
		echo "uchardet: ${TEST_RESULT}"
		# make sure iconv backward is successful
		echo "iconv back from \"to charset\": $(iconv -t "${BASE_TEST_CHARSET}" -f "${TO_CHARSET}" "${TEST_FILE}" 2>/dev/null)"
		echo "iconv back from uchardet charset: $(iconv -t "${BASE_TEST_CHARSET}" -f "${TEST_RESULT}" "${TEST_FILE}" 2>/dev/null)"

		echo ---
	fi
done

The result is as follows:

uchardet Command Line Tool
Version 0.0.6

Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet

===
iconv (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
===
Base test file: test_0base.txt
Base test file content: 金木水火土
Base test file charset: UTF-8
===
iconv to: BIG5
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ª÷¤ì¤ô¤õ¤g
---
iconv to: EUC-JP
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-KR
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UHC
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-TW
uchardet: KOI8-R
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: оземеуеждх
---
iconv to: GB18030
uchardet: WINDOWS-1253
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ½πΔΎΛ®»πΝΑ
---
iconv to: ISO-2022-CN
uchardet: ASCII
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: =pD>K.;pMA
---
iconv to: ISO-2022-JP
uchardet: ISO-2022-JP
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: ISO-2022-KR
uchardet: ISO-2022-KR
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: SHIFT_JIS
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UTF-16BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-16LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-32BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-32LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-8
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---

金木水火土 were chosen because these characters are the same in Simplified Chinese, Traditional Chinese, Korean and Japanese.

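However, the results show many misses: only 3 of the 15 successful conversions (ISO-2022-JP, ISO-2022-KR and UTF-8) were detected correctly. SHIFT_JIS, GB18030 and BIG5 are the three most notable misses, as they are commonly used encodings.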
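To make the hit/miss count easier to reproduce, the test output can be piped through a small filter. This is only a sketch: `summarize_results` is a hypothetical helper name, and it assumes the output keeps the exact `iconv to:` / `uchardet:` line format printed by the script above.

```sh
#!/bin/sh
# summarize_results (hypothetical helper): read the test script's output on
# stdin, print one line per mismatch and a final hit count.
summarize_results() {
	awk '
		# Remember the charset iconv converted to
		/^iconv to: / { expected = toupper($3) }
		# Compare it (case-insensitively) with what uchardet reported
		/^uchardet: / {
			detected = toupper($2)
			total++
			if (detected == expected) hits++
			else print "MISS " expected " -> " detected
		}
		END { printf "%d of %d detected correctly\n", hits, total }
	'
}

# Example with two entries in the same format as the output above
printf 'iconv to: BIG5\nuchardet: WINDOWS-1252\n---\niconv to: UTF-8\nuchardet: UTF-8\n---\n' | summarize_results
# prints:
# MISS BIG5 -> WINDOWS-1252
# 1 of 2 detected correctly
```

The comparison is case-insensitive because charset names appear in mixed case in the list (for example `Windows-1257`).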

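I also added UTF-16 and UTF-32, although they are not mentioned in README.md, and they are detected correctly. However, their BE/LE variants fail: UTF-16BE and UTF-32BE are reported as WINDOWS-1252, and UTF-16LE and UTF-32LE as UTF-8.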
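One possible explanation for the BE/LE failures is that a detector can recognise UTF-16 and UTF-32 from a leading byte order mark (BOM), and that files converted to an explicit BE/LE variant carry no BOM. This is an assumption, not something established in this report. The sketch below only shows how to check whether a file starts with a BOM; `bom_of` is a hypothetical helper, and the demo files are built with `printf` rather than `iconv`.

```sh
#!/bin/sh
# bom_of FILE (hypothetical helper): print which Unicode BOM, if any, starts FILE.
bom_of() {
	# First four bytes as lowercase hex without separators, e.g. "fffe0000"
	head_hex=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
	case "${head_hex}" in
		0000feff) echo UTF-32BE ;;
		fffe0000) echo UTF-32LE ;; # must be tested before the UTF-16LE prefix
		efbbbf*)  echo UTF-8 ;;
		feff*)    echo UTF-16BE ;;
		fffe*)    echo UTF-16LE ;;
		*)        echo none ;;
	esac
}

# 金 is U+91D1, i.e. bytes D1 91 in UTF-16LE (octal 321 221)
printf '\377\376\321\221' >bom_demo_with.txt # BOM FF FE, then 金
printf '\321\221' >bom_demo_without.txt      # 金 only, no BOM

bom_of bom_demo_with.txt    # prints: UTF-16LE
bom_of bom_demo_without.txt # prints: none

rm -f bom_demo_with.txt bom_demo_without.txt
```

Running `bom_of` on the `test_UTF-16BE.txt` and `test_UTF-16LE.txt` files produced by the test script would show whether they contain a BOM.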
