uchardet issues
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
Updated: 2021-12-01T16:49:45Z

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/27
Misuse of CMAKE_BINARY_DIR in CMake
Updated: 2021-12-01T16:49:45Z
Author: Andreas Stefl

I believe that `CMAKE_BINARY_DIR` should be `CMAKE_CURRENT_BINARY_DIR` here: https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/CMakeLists.txt#L65
I created a PR on GitHub a while ago: https://github.com/freedesktop/uchardet/pull/1
I was not able to create a PR here, which is why I am creating an issue.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/26
The following two tests both return UTF8?
Updated: 2021-11-17T04:00:16Z
Author: yangjiang0217

```c
char szTestCase1[] = "\xD6\xB1\xC1\xAC"; // GB2312 "直连"
char szTestCase2[] = "\xE7\x9B\xB4\xE8\xBF\x9E"; // UTF-8 "直连"
```
The above two detections both return UTF-8.
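
For comparison, a strict well-formedness check does tell the two byte strings above apart: the byte 0xC1 can never occur in valid UTF-8, so the GB2312 bytes are rejected while the second sequence passes. The sketch below is not a description of how uchardet decides. It simply shells out to `iconv`, assumes `iconv` exits non-zero on an illegal input sequence (as glibc's does), and the helper names `is_valid_utf8` and `write_test_cases` are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the file is well-formed UTF-8.
# Assumes iconv exits non-zero on an illegal input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Write the two test cases from the report into the given files.
# Octal escapes are used because "\x" is not portable in sh printf.
write_test_cases() {
    printf '\326\261\301\254' > "$1"          # D6 B1 C1 AC (GB2312)
    printf '\347\233\264\350\277\236' > "$2"  # E7 9B B4 E8 BF 9E (UTF-8)
}
```

After `write_test_cases gb.bin utf8.bin`, the call `is_valid_utf8 utf8.bin` succeeds and `is_valid_utf8 gb.bin` fails, which is a cheap sanity check to run next to a statistical detector.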

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/20
Can libuchardet-ios.a support iOS Simulator?
Updated: 2021-11-09T13:11:45Z
Author: YeaLink89

Can libuchardet-ios.a support the iOS Simulator? It currently crashes when run in the simulator.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/24
Support cmake exported targets
Updated: 2021-11-09T09:52:16Z
Author: Pedro López-Cabanillas

If CMake exported targets are implemented in uchardet, a downstream project using CMake can find and link the libuchardet library directly with CMake (without needing pkg-config at all), like this:
~~~
project(sample LANGUAGES C)
find_package ( uchardet )
if (uchardet_FOUND)
add_executable( sample sample.c )
target_link_libraries ( sample PRIVATE uchardet::libuchardet )
endif ()
~~~
The build system should create one exported target for each built target feature, for instance:
- The executable **uchardet::uchardet**
- The shared library **uchardet::libuchardet**
- The static library **uchardet::libuchardet_static**
After installing the project in a prefix like "$HOME/uchardet/", the downstream project can be configured with a command like:
~~~
cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..."
~~~
Instead of installing, the build directory can be used directly, for instance:
~~~
cmake -Duchardet_DIR="$HOME/build-uchardet-0.1.0/" ...
~~~

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/15
Error with file called "-h" (request to support "--" option)
Updated: 2020-07-28T11:45:37Z
Author: Jamie Landeg-Jones

Before calling uchardet in a script, I need to first check whether the filename is literally '-h' or '-v', and if so prefix it with "./" before calling uchardet. (Well, I actually check for filenames beginning with '-', but you get the point.)
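
The workaround just described can be sketched as a small shell helper. The function name `safe_path` is hypothetical; only the "./" prefixing comes from the description above.

```sh
#!/bin/sh
# Hypothetical helper: make a filename safe to pass to a tool that would
# otherwise parse a leading '-' as an option.
safe_path() {
    case "$1" in
        -*) printf '%s\n' "./$1" ;;  # "-h" becomes "./-h"
        *)  printf '%s\n' "$1" ;;    # everything else is passed through
    esac
}
```

A call then looks like `uchardet "$(safe_path "$file")"`. Paths that merely contain a dash later on, such as `/tmp/-h`, are left untouched because only a leading '-' is ambiguous.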
Rather than this kludge, could you add the traditional "--" as an "end of options" marker?
And yes, this problem did crop up in "real life"!
Cheers, Jamie

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/18
Testing 金木水火土 gives a lot of unknown and windows-1252/3
Updated: 2020-06-03T19:01:52Z
Author: John Siu

I created a simple test script to test uchardet:
```sh
#!/bin/sh
BASE_TEST_FILE=test_0base.txt
BASE_TEST_CONTENT='金木水火土'
BASE_TEST_CHARSET=''
# List from uchardet readme
SUPPORTED_CHARSET="
ASCII
BIG5
EUC-JP
EUC-KR / UHC
EUC-TW
GB18030
HZ-GB-2312
IBM852
IBM852
IBM852
IBM852
IBM852
IBM852
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-10
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-16
ISO-8859-16
ISO-8859-16
ISO-8859-16
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-3
ISO-8859-3
ISO-8859-3
ISO-8859-3
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-5
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
KOI8-R
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CYRILLIC
SHIFT_JIS
TIS-620
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
UTF-8
VISCII
WINDOWS-1250
WINDOWS-1251
WINDOWS-1251
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
WINDOWS-1256
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1252
Windows-1257
Windows-1258
X-ISO-10646-UCS-4-21431
X-ISO-10646-UCS-4-34121
"
# Version
uchardet -v
echo ===
iconv -V
echo ===
# Create file in UTF8 with following char
echo ${BASE_TEST_CONTENT} >${BASE_TEST_FILE}
echo Base test file: ${BASE_TEST_FILE}
echo Base test file content: $(cat ${BASE_TEST_FILE})
# charset should be utf8
BASE_TEST_CHARSET=$(uchardet ${BASE_TEST_FILE})
echo Base test file charset: ${BASE_TEST_CHARSET}
echo ===
#for CS in $(echo $(iconv -l)); do
for CS in ${SUPPORTED_CHARSET}; do
    TO_CHARSET=$(echo ${CS} | cut -d/ -f1)
    TEST_FILE=test_${TO_CHARSET}.txt
    # Create iconv file
    iconv -f ${BASE_TEST_CHARSET} -t ${TO_CHARSET} ${BASE_TEST_FILE} >${TEST_FILE} 2>/dev/null
    ICONV_RESULT=$?
    # Only do test if iconv successful
    if [ ${ICONV_RESULT} = 0 ]; then
        # uchardet
        TEST_RESULT=$(uchardet ${TEST_FILE})
        # output
        echo iconv to: ${TO_CHARSET}
        echo uchardet: ${TEST_RESULT}
        # make sure iconv backward is successful
        echo iconv back from \"to charset\": $(iconv -t ${BASE_TEST_CHARSET} -f ${TO_CHARSET} ${TEST_FILE} 2>/dev/null)
        echo iconv back from uchardet charset: $(iconv -t ${BASE_TEST_CHARSET} -f ${TEST_RESULT} ${TEST_FILE} 2>/dev/null)
        echo ---
    fi
done
```
The result is as follows:
```sh
uchardet Command Line Tool
Version 0.0.6
Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet
===
iconv (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
===
Base test file: test_0base.txt
Base test file content: 金木水火土
Base test file charset: UTF-8
===
iconv to: BIG5
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ª÷¤ì¤ô¤õ¤g
---
iconv to: EUC-JP
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-KR
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UHC
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-TW
uchardet: KOI8-R
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: оземеуеждх
---
iconv to: GB18030
uchardet: WINDOWS-1253
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ½πΔΎΛ®»πΝΑ
---
iconv to: ISO-2022-CN
uchardet: ASCII
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: =pD>K.;pMA
---
iconv to: ISO-2022-JP
uchardet: ISO-2022-JP
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: ISO-2022-KR
uchardet: ISO-2022-KR
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: SHIFT_JIS
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UTF-16BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-16LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-32BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-32LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-8
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
```
金木水火土 were chosen because they are the same in Simplified Chinese, Traditional Chinese, Korean and Japanese.
However, the results show a lot of misses. SHIFT_JIS, GB18030 and BIG5 are three notable ones, as they are common.
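
One likely factor in the UTF-16BE/LE and UTF-32BE/LE rows of the output above is that `iconv` writes no byte order mark when the target names an explicit endianness, so those files carry no signature for a detector to key on. A minimal sketch for checking which BOM, if any, a test file starts with follows. `bom_charset` is a hypothetical helper, not part of uchardet, and it assumes a POSIX `od`.

```sh
#!/bin/sh
# Hypothetical helper: print the charset implied by a leading BOM, or "none".
bom_charset() {
    # First four bytes as lowercase hex without separators, e.g. "fffed191".
    head4=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$head4" in
        efbbbf*)   echo UTF-8 ;;
        0000feff)  echo UTF-32BE ;;
        fffe0000)  echo UTF-32LE ;;  # must be tested before UTF-16LE
        feff*)     echo UTF-16BE ;;
        fffe*)     echo UTF-16LE ;;
        *)         echo none ;;
    esac
}
```

Running `bom_charset test_UTF-16BE.txt` on a file produced by the script above should print `none`, which separates "no BOM to detect" from a genuine misdetection of BOM-marked input.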
I also added UTF-16 and UTF-32, though they are not mentioned in README.md, and they work correctly. However, their BE/LE versions failed.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/17
Broken links in README
Updated: 2020-04-29T14:21:35Z
Author: Artem Klevtsov

List:
- http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
- http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Also, there is my binding for the R language on CRAN: https://CRAN.R-project.org/package=uchardet
QtAV also uses uchardet.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/16
Tests failing on x86 Alpine Linux
Updated: 2020-04-28T18:49:09Z
Author: Rasmus Thomsen (oss@cogitri.dev)

Hello,
with uchardet 0.0.7 (and for that matter 0.0.6) some tests are failing on x86, namely:
```
The following tests FAILED:
29 - fi:iso-8859-1 (Failed)
37 - ga:iso-8859-1 (Failed)
106 - th:tis-620 (Failed)
```
Since the tests only return 1 and don't print a backtrace or any extra info I'm not really sure how to supply extra info.
OS: Alpine Linux Edge (so musl libc).
Arch: x86.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/11
Any plans to make new release?
Updated: 2020-04-23T15:18:15Z
Author: Tomasz Kłoczko

I think that it would be good to make a new release with a fresh code base out of the patches already accumulated in git :)

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/8
no newline at end of file
Updated: 2020-04-22T21:04:04Z
Author: zeng

Hello, when I compile uchardet as a dynamic library on Linux, some compilers report the warning: no newline at end of file.
I found that some .cpp files under uchardet/src/LangModels, such as LangEsperantoModel.cpp, indeed do not end with a newline, so I fixed this warning by adding a newline at the end of those files.
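
The fix described above can be scripted rather than done by hand. This is a minimal sketch assuming POSIX `tail` and `printf`, and both function names are made up for illustration.

```sh
#!/bin/sh
# Succeeds if the file is non-empty and its last byte is not a newline.
# Command substitution strips a trailing newline, so the captured text is
# empty exactly when the last byte is "\n" (or the file is empty).
missing_final_newline() {
    [ -n "$(tail -c 1 "$1")" ]
}

# Append a newline to every given file that lacks one.
fix_final_newline() {
    for f in "$@"; do
        if missing_final_newline "$f"; then
            printf '\n' >> "$f"
        fi
    done
}
```

For the files mentioned above this would be run as `fix_final_newline src/LangModels/*.cpp`. Files that already end with a newline are left unchanged, so the command is safe to repeat.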
This is why the compiler reports the warning, according to the C99 standard:
A backslash immediately before a newline has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the C89 Committee generalized this mechanism to permit any token to be continued by interposing a backslash/newline sequence.
Therefore, would adding a newline at the end of those files be a reasonable way to avoid the compiler warning?

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/13
Enhancements in other forks
Updated: 2020-04-22T20:19:20Z
Author: John Mark Vandenberg

There are a lot of enhancements in https://github.com/PyYoshi/uchardet, and probably in other forks which have arisen over the years.
https://github.com/search?o=desc&q=uchardet&s=updated&type=Repositories shows how many there are.
I created https://github.com/PyYoshi/uchardet/issues/5 .
Some scouting about in other forks should also be done to try to bring the patches under one roof.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/10
Crashing sequence with nsSJISProber
Updated: 2020-04-22T20:18:03Z
Author: JP Cimalando

Hi. By using charset detection on a set of MIDI file metadata, I have discovered and isolated a crashing sequence.
It happens when uchardet is fed the input as multiple strings, and a string of the set is of length 0.
The attached file reproduces the crash. Revision bdfd6116a965fd210ef563613763e724424728b7
[test-case1.cc](/uploads/fb12034b7990551d7488437fffd79dfa/test-case1.cc)
The above file also contains a backtrace.
As observed, a buffer access is attempted at `aLen-1` when `aLen=0`.

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/12
Add option to play safe detection
Updated: 2020-04-22T19:52:07Z
Author: Hans

I see many reports of wrong encoding detection.
I was about to report another one, and I don't think this will ever be solved.
But could you add a command-line option such as --safe-utf8 (or whatever name you prefer)?
If the option is set and the file can be decoded as UTF-8, uchardet returns UTF-8; if it cannot, uchardet returns the other encoding it detected.
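
Until such an option exists, the requested behaviour can be approximated outside uchardet. This is a minimal sketch, assuming `iconv` rejects malformed UTF-8 with a non-zero exit status. `detect_utf8_first` is a hypothetical wrapper, and the option itself remains only a proposal.

```sh
#!/bin/sh
# Hypothetical wrapper: report UTF-8 whenever the file is well-formed UTF-8,
# otherwise fall back to whatever uchardet detects.
detect_utf8_first() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo UTF-8
    else
        uchardet "$1"
    fi
}
```

Note that pure ASCII files are also well-formed UTF-8, so this wrapper reports them as UTF-8 rather than ASCII. That matches the "play safe" intent but is a deliberate simplification.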
For example, the attached file contains a single "á" character and is detected as TIS-620,
when it would be better assigned to ISO-8859-1 or UTF-8.
[test.txt](/uploads/0b67668028dde12b5bf9ba3ca68011c4/test.txt)

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/14
Make a portable executable
Updated: 2020-04-22T19:24:38Z
Author: FlorianPerez

Hi,
I like your project and would like to use your tool on a workstation without manually installing the library.
(I'm not root on the workstation, so I can't install the library in the /usr/lib/ folder.)
Can you explain how to create a portable executable of your project?
Thanks in advance!

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/6
UTF-8 Italian text recognized as ISO-8859-1 Portuguese
Updated: 2018-10-12T21:35:13Z
Author: Bugzilla Migration User

## Submitted by Jehan Pagès `@Jehan`
Assigned to **Jehan Pagès `@Jehan`**
**[Link to original bug (#102292)](https://bugs.freedesktop.org/show_bug.cgi?id=102292)**
## Description
Created attachment 133604
UTF-8 text.
See: https://github.com/BYVoid/uchardet/issues/36#issuecomment-323316171
The attached text is UTF-8 Italian, but since commit e138839f0753e223f7aa2733e8ed829b47a67cac (Portuguese support for ISO-8859-1), this text is recognized as ISO-8859-1.
I am not sure, though, whether there is a proper solution apart from removing Portuguese support in the short term and adding actual language detection to UTF-8 in the longer term (see [bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218)).
Also, the fact that the file holds just 2 words obviously makes it a difficult guess for a system based on statistics.
**Attachment 133604**, "UTF-8 text.":
[utf8.txt](/uploads/75d7003ce2b8db9de1e18d087848f6f8/utf8.txt)
### Depends on
* [Bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218)

https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/4
UTF-8 section symbol (0xC2A7) invokes TIS-620 decoding
Updated: 2018-10-12T21:35:07Z
Author: Bugzilla Migration User

## Submitted by pok..@..il.com
Assigned to **Jehan Pagès `@Jehan`**
**[Link to original bug (#101310)](https://bugs.freedesktop.org/show_bug.cgi?id=101310)**
## Description
Created attachment 131730
File containing a single section sign in the midst of other text
A single occurrence of the section sign (§) encoded in UTF-8 causes the file to be marked as TIS-620, even if the rest of the text is English. This can be seen with the attached file (which includes a single §); curiously adding more instances of § elsewhere usually causes the file to be correctly detected as UTF-8.
This may be a duplicate of [bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218), but it's a more specific case. This was first reported at https://github.com/notepad-plus-plus/notepad-plus-plus/issues/940, but I've narrowed it down to a bug in uchardet.
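
The ambiguity is easy to reproduce at the byte level. The UTF-8 encoding of § is C2 A7, and both bytes are also assigned to Thai letters in TIS-620, so a lone occurrence gives a detector little to separate the two readings. This is a minimal sketch, assuming an `iconv` that knows TIS-620 (glibc's does). `decode_as` and `section_sign` are hypothetical helpers.

```sh
#!/bin/sh
# Hypothetical helper: decode stdin from the given charset and print the
# result as UTF-8 hex bytes, so the outcome does not depend on the terminal.
decode_as() {
    iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# The section sign as encoded in UTF-8 (bytes C2 A7).
section_sign() {
    printf '\302\247'
}
```

`section_sign | decode_as UTF-8` yields `c2a7` (the character §), while `section_sign | decode_as TIS-620` yields `e0b8a2e0b887`, the UTF-8 encoding of the two Thai letters U+0E22 and U+0E07. Both decodings succeed, so validity alone cannot rule out TIS-620.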
**Attachment 131730**, "File containing a single section sign in the midst of other text":
[UTF-8_with_section_sign.txt](/uploads/632065643916974e7f7f902200925c0c/UTF-8_with_section_sign.txt)
### Depends on
* [Bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218)