uchardet issues
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
2024-02-19T10:42:33Z

Performance issue with version 0.0.8 versus 0.0.7
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/38
2024-02-19T10:42:33Z, Paul Breen

Version 0.0.8 (built from source) is orders of magnitude slower than version 0.0.7. I note that version 0.0.8 now does language detection in addition to the encoding; is this slowdown a consequence of that feature?
I was considering updating our version, but the slower performance makes that prohibitive. Is there anything that can be done to improve the performance and make it more comparable to that of version 0.0.7?
Here's some quantitative evidence.
OS version:
```bash
$ uchardet --version
uchardet Command Line Tool
Version 0.0.7
Authors: BYVoid, Jehan
Bug Report: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
```
Latest version built from source:
```bash
$ ./src/tools/uchardet --version
uchardet Command Line Tool
Version 0.0.8
Authors: BYVoid, Jehan
Bug Report: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
```
Reasonably large files to test:
```bash
$ ls -sh *.csv
76M data.csv
76M utf8.data.csv
```
Time to run for 0.0.7:
```bash
$ time uchardet *.csv
data.csv: ISO-8859-15
utf8.data.csv: UTF-8
real 0m0.427s
user 0m0.419s
sys 0m0.009s
```
Time to run for 0.0.8:
```bash
$ time ./src/tools/uchardet *.csv
data.csv: ISO-8859-15
utf8.data.csv: UTF-8
real 2m29.847s
user 2m29.466s
sys 0m0.093s
```

add cmake-ninja-msvc build
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/37
2023-11-19T05:38:26Z, telppa

[src-CMakeLists.patch](/uploads/2da4f1f03fb1ce64497e69daf8053e88/src-CMakeLists.patch)
Two changes were made to the file `uchardet-0.0.8\src\CMakeLists.txt`:
- Change 1 fixes the error raised by VS2019 (menu bar: Project, Generate Cache):
```
1> [CMake] -- Configuring done
1> [CMake] -- Generating done
1> [CMake] CMake Error:
1> [CMake] Running
1> [CMake]
1> [CMake] 'C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/Ninja/ninja.exe' '-C' 'X:/uchardet-0.0.8/uchardet-0.0.8/out/build/x64-Debug' '-t' 'recompact'
1> [CMake]
1> [CMake] failed with:
1> [CMake]
1> [CMake] ninja: error: build.ninja:1063: multiple rules generate src/uchardet.lib [-w dupbuild=err]
1> [CMake]
1> [CMake]
1> [CMake]
1> [CMake]
1> [CMake]
1> [CMake] CMake Generate step failed. Build files cannot be regenerated correctly.
```
- Change 2 fixes the error raised by VS2019 (menu bar: Build, Build All):
```
Error C1083 Cannot open include file: 'getopt.h': No such file or directory X:\uchardet-0.0.8\uchardet-0.0.8\out\build\x64-Debug\uchardet-0.0.8 X:\uchardet-0.0.8\uchardet-0.0.8\src\tools\uchardet.cpp 38
```

A crafted sequence of bytes triggers memory write past the bounds of a heap allocated buffer.
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/36
2023-11-13T20:10:38Z, Jaroslav Lobačevski

Hi, creating the public issue here since the issue affects unreleased master branch only. Originally it was reported as a private issue https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/33
## Tested Version
Master branch, post [0.0.8](https://gitlab.freedesktop.org/uchardet/uchardet/-/releases/v0.0.8).
## Details
### Heap buffer write overflow in nsUTF8Prober::HandleData
~~The out-of-bounds write happens in [`nsUTF8Prober::HandleData`](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/bdd71d88f8347f73f87174028faa25894b1201d8/src/nsUTF8Prober.cpp#L83) [1] when `*codePointBufferIdx` becomes bigger than the size of the buffer `*codePointBuffer`. The buffer can be zero or [`1024`](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/bdd71d88f8347f73f87174028faa25894b1201d8/src/nsMBCSGroupProber.cpp#L256). A dedicated [`codePointBufferSize`](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/bdd71d88f8347f73f87174028faa25894b1201d8/src/nsMBCSGroupProber.cpp#L255) is allocated for each codepoint prober. However, it is not passed to the function as an argument [2], and while the index is incremented [1] in a loop [3], the function has no knowledge of the true size of the buffer.~~
```cpp
nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
                                        int** codePointBuffer,
                                        int* codePointBufferIdx) // [2]
{
  PRUint32 codingState;
  for (PRUint32 i = 0; i < aLen; i++) // [3]
  {
    ...
    if (codingState == eStart)
    {
      ...
      (*codePointBuffer)[(*codePointBufferIdx)++] = currentCodePoint; // [1]
      currentCodePoint = 0;
    }
    else
    {
      currentCodePoint = ((0xff & aBuf[i]) & 0x3fu) | (currentCodePoint << 6);
    }
  }
  ...
}
```
**Update:** after https://gitlab.freedesktop.org/uchardet/uchardet/-/commit/ab1d2f1120297af6537f2a0d09dca589d4c3ea3b the overflow happens in a slightly different place: the [`if (keepNext) {` block (line 383)](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/ab1d2f1120297af6537f2a0d09dca589d4c3ea3b/src/nsMBCSGroupProber.cpp#L383).
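For illustration only, here is a minimal sketch of the kind of bounds check the analysis above points to, assuming the buffer's capacity travels together with the buffer and its index. The names `CodePointBuffer` and `AppendCodePoint` are hypothetical and are not part of uchardet; the fix actually applied in the project may look quite different.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration, not uchardet code: a code point buffer that
// carries its own capacity so the writer can refuse out-of-bounds writes.
struct CodePointBuffer {
    std::vector<int> data;   // storage, sized once at construction
    std::size_t idx = 0;     // next free slot

    explicit CodePointBuffer(std::size_t capacity) : data(capacity) {}
};

// Appends one code point if there is room. Returns false when the buffer is
// full (the caller must then flush or stop) instead of writing past the end.
bool AppendCodePoint(CodePointBuffer& buf, int codePoint)
{
    if (buf.idx >= buf.data.size())
        return false;
    buf.data[buf.idx++] = codePoint;
    return true;
}
```

With a zero-sized buffer, one of the two sizes mentioned above, `AppendCodePoint` rejects the very first write rather than corrupting the heap.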
#### Impact
This issue may lead to arbitrary code execution.
#### Resources
[crash-6c48314f16273784c008ae3c580a8e636a5a17a1-minimized](/uploads/c6d8dcac35b37ac66d6a9686cb2a5e51/crash-6c48314f16273784c008ae3c580a8e636a5a17a1-minimized)
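To reproduce the issue:
1. Build uchardet with [ASAN](https://github.com/google/sanitizers/wiki/AddressSanitizer).
2. Run the following program to hit the breakpoint or the out-of-bounds access with ASAN:
```cpp
#include <stdio.h>
#include <stdlib.h>
#include <uchardet/uchardet.h>
int main(void)
{
  FILE *f = fopen("crash-6c48314f16273784c008ae3c580a8e636a5a17a1-minimized", "rb");
  fseek(f, 0, SEEK_END);
  long fsize = ftell(f); // size of the crash input in bytes
  fseek(f, 0, SEEK_SET);
  char *buf = (char *) malloc(fsize);
  fread(buf, fsize, 1, f); // read the whole file into buf
  fclose(f);
  uchardet_t ud = uchardet_new();
  uchardet_handle_data(ud, buf, fsize); // the out-of-bounds write is triggered here
  uchardet_delete(ud);
  free(buf);
}
```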
The output when built with [ASAN](https://github.com/google/sanitizers/wiki/AddressSanitizer):
```
==12==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x621000003900 at pc 0x000000593cf3 bp 0x7fff9662e510 sp 0x7fff9662e508
WRITE of size 4 at 0x621000003900 thread T0
SCARINESS: 36 (4-byte-write-heap-buffer-overflow)
#0 0x593cf2 in nsUTF8Prober::HandleData(char const*, unsigned int, int**, int*) /src/uchardet/src/nsUTF8Prober.cpp:83:51
#1 0x584e95 in nsMBCSGroupProber::HandleData(char const*, unsigned int, int**, int*) /src/uchardet/src/nsMBCSGroupProber.cpp:383:27
#2 0x57e3cd in nsUniversalDetector::HandleData(char const*, unsigned int) /src/uchardet/src/nsUniversalDetector.cpp:275:34
#3 0x5786ae in uchardet_handle_data /src/uchardet/src/uchardet.cpp:220:63
#4 0x57842c in LLVMFuzzerTestOneInput /src/uchardet/fuzz/fuzz_uchardet.c:6:2
#5 0x449e23 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#6 0x435582 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#7 0x43ae2c in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#8 0x464362 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#9 0x7ff3b4264082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#10 0x42b74d in _start (/out/fuzz_uchardet+0x42b74d)
DEDUP_TOKEN: nsUTF8Prober::HandleData(char const*, unsigned int, int**, int*)--nsMBCSGroupProber::HandleData(char const*, unsigned int, int**, int*)--nsUniversalDetector::HandleData(char const*, unsigned int)
0x621000003900 is located 0 bytes to the right of 4096-byte region [0x621000002900,0x621000003900)
allocated by thread T0 here:
#0 0x575f0d in operator new[](unsigned long) /src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:98:3
#1 0x583c2b in nsMBCSGroupProber::Reset() /src/uchardet/src/nsMBCSGroupProber.cpp:256:30
#2 0x5820a5 in nsMBCSGroupProber::nsMBCSGroupProber(unsigned int) /src/uchardet/src/nsMBCSGroupProber.cpp:141:3
#3 0x57e132 in nsUniversalDetector::HandleData(char const*, unsigned int) /src/uchardet/src/nsUniversalDetector.cpp:205:36
#4 0x5786ae in uchardet_handle_data /src/uchardet/src/uchardet.cpp:220:63
#5 0x57842c in LLVMFuzzerTestOneInput /src/uchardet/fuzz/fuzz_uchardet.c:6:2
#6 0x449e23 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#7 0x435582 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#8 0x43ae2c in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#9 0x464362 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#10 0x7ff3b4264082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
DEDUP_TOKEN: operator new[](unsigned long)--nsMBCSGroupProber::Reset()--nsMBCSGroupProber::nsMBCSGroupProber(unsigned int)
SUMMARY: AddressSanitizer: heap-buffer-overflow /src/uchardet/src/nsUTF8Prober.cpp:83:51 in nsUTF8Prober::HandleData(char const*, unsigned int, int**, int*)
Shadow bytes around the buggy address:
0x0c427fff86d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c427fff86e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c427fff86f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c427fff8700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c427fff8710: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c427fff8720:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c427fff8730: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c427fff8740: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c427fff8750: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c427fff8760: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c427fff8770: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==12==ABORTING
```

ASCII should also have language recognition
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/35
2023-11-12T15:13:46Z, Jehan Pagès

I think that text recognized as ASCII is still basically bypassing/shortcutting other tests. This should not happen anymore for language recognition.
Milestone: 0.1.0

Update the man page
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/34
2023-11-12T15:12:28Z, Jehan Pagès

The man page is still highly incomplete and should be updated for uchardet 0.1.0.
Milestone: 0.1.0

EBCDIC: How to contribute?
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/31
2023-07-17T18:19:34Z, Hannibal

I would like to create some charset tables to enable uchardet to guess EBCDIC encodings.
Is there a tutorial or a short guide that can help me to understand how to contribute?
Is this [README](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/script/README) the right starting point?

add cmake-ninja-msvc build and msvc solution build
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/30
2022-12-15T14:10:15Z, hongnod

[0001-Add-Cmake-ninja-msvc-build.patch](/uploads/ad0a2ba047e872758da47e084061e621/0001-Add-Cmake-ninja-msvc-build.patch)
[0002-build-with-msvc-solution.patch](/uploads/dd3e6849d65d44232c022321cbbd727e/0002-build-with-msvc-solution.patch)

add support for IBM 880 code page (also known as CP 20880, "IBM EBCDIC Cyrillic")
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/29
2022-12-20T11:28:57Z, unxed

https://wutils.com/encodings/ibm880

file with pseudographic chars in cp437 wrongly detected as windows 1252
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/28
2022-02-19T22:16:40Z, unxed

[statuses.pas](/uploads/5affac08edb8b73208d6da8f7e5543f9/statuses.pas)
The attached file has comments written using pseudo graphic chars, like the following one:
```
{ 1 2 3
╔════╤════╤════╗
CAppStatus ║ 2 │ 5 │ 4 ║
╚══╤═╧══╤═╧══╤═╝
Normal Text──────┘ │ │
Other─────────────────┘ │
Highlighted Text───────────┘ }
```
uchardet wrongly detects its code page as WINDOWS-1252.
Tested with the uchardet command-line tool 0.0.6 on Ubuntu 20.04.

Misuse of CMAKE_BINARY_DIR in CMake
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/27
2021-12-01T16:49:45Z, Andreas Stefl

I believe that `CMAKE_BINARY_DIR` should be `CMAKE_CURRENT_BINARY_DIR` here https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/CMakeLists.txt#L65
I created a PR on GitHub a while ago. https://github.com/freedesktop/uchardet/pull/1
I was not able to create a PR here, which is why I am creating an issue.

The following two tests both return UTF8?
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/26
2021-11-17T04:00:16Z, yangjiang0217

``
char szTestCase1[] = "\xD6\xB1\xC1\xAC"; // GB2312"直连"
``
``
char szTestCase2[] = "\xE7\x9B\xB4\xE8\xBF\x9E";// UTF-8 "直连"
``
Both of the above detections return UTF8.

Different results when using on Ubuntu 16.04 and Ubuntu 20.04
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/25
2022-12-17T22:30:38Z, Sjors Ottjes

I have an application using uchardet running on an Ubuntu 16.04 server. I'm trying to update the server to Ubuntu 20.04, but I'm running into the problem that the results from uchardet are sometimes different. In some cases, the result from 16.04 is correct, and the result from 20.04 is incorrect. I'm running uchardet version 0.0.6 on both machines.
Results on 16.04 and 18.04 seem to be the same. Results on 20.04 are different.
Is there anything I can do to make uchardet return the same results on Ubuntu 20.04 as it does on Ubuntu 16.04?

Support cmake exported targets
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/24
2021-11-09T09:52:16Z, Pedro López-Cabanillas

If cmake exported targets are implemented in uchardet, a downstream project using CMake can find and link the libuchardet library directly with cmake (without needing pkg-config at all) this way:
~~~
project(sample LANGUAGES C)
find_package ( uchardet )
if (uchardet_FOUND)
add_executable( sample sample.c )
target_link_libraries ( sample PRIVATE uchardet::libuchardet )
endif ()
~~~
The build system should create one exported target for each built target feature, for instance:
- The executable **uchardet::uchardet**
- The shared library **uchardet::libuchardet**
- The static library **uchardet::libuchardet_static**
After installing the project in a prefix like "$HOME/uchardet/", the downstream project can be configured with a command like:
~~~
cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..."
~~~
Instead of installing, the build directory can be used directly, for instance:
~~~
cmake -Duchardet_DIR="$HOME/build-uchardet-0.1.0/" ...
~~~~

UTF-8 text containing á (u00E1) detected as IBM852
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/23
2023-05-04T14:48:52Z, Lubosz Sarnecki

I noticed that several files containing Spanish characters with acute accents cause issues with uchardet.
The files are apparently encoded in `UTF-8`, but are detected as `IBM852`.
Converting with `iconv` and `uconv` from `IBM852` to `UTF-8` produces corrupt files.
Full example file:
https://github.com/lubosz/wiithon/blob/master/config.py
Creating a file with just the word in question can reproduce the problem:
```
aparecerán
```
Shortening the word in question produces a different wrong encoding, in this case `WINDOWS-1258`:
```
cerán
```
This incorrectly detected `WINDOWS-1258` encoding is still the same when the word in question has paragraphs of text around it:
```
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
cerán
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
```
Shortening the word even more produces a correct `UTF-8` detection, even with paragraphs around it:
```
rán
```
Using this online tool, I found a way to get the encoding detected correctly:
https://subtitletools.com/convert-text-files-to-utf8-online
It inserts 3 bytes `ef bb bf` at the start of the file. This fix is reproducible with all files that have the issue.
```
$ od -t x1 small-a-acute-word.txt
0000000 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000014
```
```
$ od -t x1 small-a-acute-word-fixed.txt
0000000 ef bb bf 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000017
```
The encoding is now correctly detected as `UTF-8`.
Other characters that produce the same issue: `¿`, `í`, `ó`, `ú`, `é`.
All detection issues can be resolved by inserting said bytes at the start of the file, which seems like a hack.
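For reference, the three bytes `ef bb bf` are the UTF-8 encoding of U+FEFF, the byte order mark (BOM). The Unicode standard permits it at the start of a UTF-8 stream but does not require it. The sketch below shows what the online tool appears to do, judging by the `od` output above. The helper names `HasUtf8Bom` and `PrependUtf8Bom` are hypothetical and are not part of uchardet.

```cpp
#include <string>

// The UTF-8 byte order mark: U+FEFF encoded as EF BB BF.
static const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Returns true if the data already starts with the UTF-8 BOM.
bool HasUtf8Bom(const std::string& data)
{
    return data.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns a copy of the data with the BOM prepended, unless it is already there.
std::string PrependUtf8Bom(const std::string& data)
{
    return HasUtf8Bom(data) ? data : kUtf8Bom + data;
}
```

Applied to the 12-byte sample file, `PrependUtf8Bom` yields the 15-byte variant shown in the second `od` dump. Calling it a second time leaves the data unchanged.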
Is this a required header for UTF-8 files, a uchardet bug or my fault?

Please add Hebrew CP862 support
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/22
2022-12-16T22:37:27Z, unxed

This sample file contains the string "מערכת להסעת המונים במטרופולין תל אביב" in the Hebrew CP862 charset. It is detected by uchardet as "unknown".
[cp862.txt](/uploads/7c85b7381c07156dd4298c7fc8d7016a/cp862.txt)
Milestone: 0.1.0

Please add Greek CP737 support
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/21
2022-12-18T23:03:18Z, unxed

Sample attached.
Its content is the phrase "Νέο έγγραφο κειμένου" in Greek, encoded in CP737.
uchardet detects this as CP1252.
[cp737.txt](/uploads/3480967bba2c9d0a331769a57bde035d/cp737.txt)
Thanks!
Milestone: 0.1.0

Can libuchardet-ios.a support iOS Simulator?
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/20
2021-11-09T13:11:45Z, YeaLink89

Could libuchardet-ios.a support the iOS Simulator? It currently crashes in the simulator.

Encoding MAC-CENTRALEUROPE don't match iconv name
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/19
2022-12-22T18:44:14Z, Dan Bowen

uchardet outputs MAC-CENTRALEUROPE, but iconv wants MACCENTRALEUROPE.

Testing 金木水火土 gives a lot of unknown and windows-1252/3
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/18
2020-06-03T19:01:52Z, John Siu

I created a simple test script to test uchardet:
```sh
#!/bin/sh
BASE_TEST_FILE=test_0base.txt
BASE_TEST_CONTENT='金木水火土'
BASE_TEST_CHARSET=''
# List from uchardet readme
SUPPORTED_CHARSET="
ASCII
BIG5
EUC-JP
EUC-KR / UHC
EUC-TW
GB18030
HZ-GB-2312
IBM852
IBM852
IBM852
IBM852
IBM852
IBM852
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-10
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-16
ISO-8859-16
ISO-8859-16
ISO-8859-16
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-3
ISO-8859-3
ISO-8859-3
ISO-8859-3
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-5
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
KOI8-R
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CYRILLIC
SHIFT_JIS
TIS-620
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
UTF-8
VISCII
WINDOWS-1250
WINDOWS-1251
WINDOWS-1251
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
WINDOWS-1256
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1252
Windows-1257
Windows-1258
X-ISO-10646-UCS-4-21431
X-ISO-10646-UCS-4-34121
"
# Version
uchardet -v
echo ===
iconv -V
echo ===
# Create file in UTF8 with following char
echo ${BASE_TEST_CONTENT} >${BASE_TEST_FILE}
echo Base test file: ${BASE_TEST_FILE}
echo Base test file content: $(cat ${BASE_TEST_FILE})
# charset should be utf8
BASE_TEST_CHARSET=$(uchardet ${BASE_TEST_FILE})
echo Base test file charset: ${BASE_TEST_CHARSET}
echo ===
#for CS in $(echo $(iconv -l)); do
for CS in ${SUPPORTED_CHARSET}; do
  TO_CHARSET=$(echo ${CS} | cut -d/ -f1)
  TEST_FILE=test_${TO_CHARSET}.txt
  # Create iconv file
  iconv -f ${BASE_TEST_CHARSET} -t ${TO_CHARSET} ${BASE_TEST_FILE} >${TEST_FILE} 2>/dev/null
  ICONV_RESULT=$?
  # Only do test if iconv successful
  if [ ${ICONV_RESULT} = 0 ]; then
    # uchardet
    TEST_RESULT=$(uchardet ${TEST_FILE})
    # output
    echo iconv to: ${TO_CHARSET}
    echo uchardet: ${TEST_RESULT}
    # make sure iconv backward is successful
    echo iconv back from \"to charset\": $(iconv -t ${BASE_TEST_CHARSET} -f ${TO_CHARSET} ${TEST_FILE} 2>/dev/null)
    echo iconv back from uchardet charset: $(iconv -t ${BASE_TEST_CHARSET} -f ${TEST_RESULT} ${TEST_FILE} 2>/dev/null)
    echo ---
  fi
done
```
The result is as follows:
```sh
uchardet Command Line Tool
Version 0.0.6
Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet
===
iconv (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
===
Base test file: test_0base.txt
Base test file content: 金木水火土
Base test file charset: UTF-8
===
iconv to: BIG5
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ª÷¤ì¤ô¤õ¤g
---
iconv to: EUC-JP
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-KR
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UHC
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-TW
uchardet: KOI8-R
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: оземеуеждх
---
iconv to: GB18030
uchardet: WINDOWS-1253
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ½πΔΎΛ®»πΝΑ
---
iconv to: ISO-2022-CN
uchardet: ASCII
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: =pD>K.;pMA
---
iconv to: ISO-2022-JP
uchardet: ISO-2022-JP
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: ISO-2022-KR
uchardet: ISO-2022-KR
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: SHIFT_JIS
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UTF-16BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-16LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-32BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-32LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-8
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
```
金木水火土 were chosen because they are the same in Simplified Chinese, Traditional Chinese, Korean and Japanese.
However, the results show a lot of misses. SHIFT_JIS, GB18030 and BIG5 are three notable ones, as they are common.
I added UTF-16 and UTF-32, though they are not mentioned in README.md, and they work correctly. However, their BE/LE versions failed.

Broken links in README
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/17
2020-04-29T14:21:35Z, Artem Klevtsov

List:
- http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
- http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
There is also my binding for the R language on CRAN: https://CRAN.R-project.org/package=uchardet
QtAV also uses uchardet.