uchardet issueshttps://gitlab.freedesktop.org/uchardet/uchardet/-/issues2020-04-22T19:24:38Zhttps://gitlab.freedesktop.org/uchardet/uchardet/-/issues/14Make a portable executable2020-04-22T19:24:38ZFlorianPerezMake a portable executableHi,
I like your project and I would like to use your tool on a workstation without manually installing the library.
(I'm not root on the workstation, so I can't install the library in the /usr/lib/ folder.)
Can you explain how to create a portable executable of your project?
Thanks in advance!https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/27Misuse of CMAKE_BINARY_DIR in CMake2021-12-01T16:49:45ZAndreas SteflMisuse of CMAKE_BINARY_DIR in CMakeI believe that `CMAKE_BINARY_DIR` should be `CMAKE_CURRENT_BINARY_DIR` here https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/CMakeLists.txt#L65
I created a PR on GitHub a while ago. https://github.com/freedesktop/uchardet/pull/1
I was not able to create a PR here, which is why I am creating an issue.https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/8no newline at end of file2020-04-22T21:04:04Zzengno newline at end of fileHello, when compiling uchardet as a dynamic library on Linux, some compilers report the warning: no newline at end of file.
I found that some .cpp files in uchardet/src/LangModels, such as LangEsperantoModel.cpp, indeed do not end with a newline, so I fixed the warning by adding a newline at the end of those files.
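The fix described above (appending a newline to every source file that lacks one) can be automated. Here is a minimal sketch in Python; it is not part of uchardet, and the helper names `ensure_trailing_newline` and `fix_tree` are made up for illustration.

```python
from pathlib import Path
from typing import List


def ensure_trailing_newline(path: Path) -> bool:
    """Append "\n" if the file is non-empty and does not end with one.

    Returns True when the file was modified.
    """
    data = path.read_bytes()
    if not data or data.endswith(b"\n"):
        return False
    path.write_bytes(data + b"\n")
    return True


def fix_tree(root: Path, pattern: str = "*.cpp") -> List[Path]:
    """Fix every matching file below root and return the modified paths."""
    return [p for p in sorted(root.rglob(pattern)) if ensure_trailing_newline(p)]
```

Calling `fix_tree(Path("src/LangModels"))` from a checkout would be one way to use it; running it a second time should report no changes.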
This is why the compiler reports the warning:
In the C99 standard:
A backslash immediately before a newline has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the C89 Committee generalized this mechanism to permit any token to be continued by interposing a backslash/newline sequence.
Therefore, is adding a newline at the end of those files a meaningful way to avoid the compiler warning?https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/38Performance issue with version 0.0.8 versus 0.0.72024-02-19T10:42:33ZPaul BreenPerformance issue with version 0.0.8 versus 0.0.7Version 0.0.8 (built from source) is orders of magnitude slower than version 0.0.7. I note that version 0.0.8 now does language detection in addition to the encoding; is this slowdown a consequence of that feature?
I was considering updating our version, but the slower performance makes that prohibitive. Is there anything that can be done to improve the performance and make it more comparable to that of version 0.0.7?
Here's some quantitative evidence.
OS version:
```bash
$ uchardet --version
uchardet Command Line Tool
Version 0.0.7
Authors: BYVoid, Jehan
Bug Report: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
```
Latest version built from source:
```bash
$ ./src/tools/uchardet --version
uchardet Command Line Tool
Version 0.0.8
Authors: BYVoid, Jehan
Bug Report: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
```
Reasonably large files to test:
```bash
$ ls -sh *.csv
76M data.csv
76M utf8.data.csv
```
Time to run for 0.0.7:
```bash
$ time uchardet *.csv
data.csv: ISO-8859-15
utf8.data.csv: UTF-8
real 0m0.427s
user 0m0.419s
sys 0m0.009s
```
Time to run for 0.0.8:
```bash
$ time ./src/tools/uchardet *.csv
data.csv: ISO-8859-15
utf8.data.csv: UTF-8
real 2m29.847s
user 2m29.466s
sys 0m0.093s
```
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/21Please add Greek CP737 support2022-12-18T23:03:18ZunxedPlease add Greek CP737 supportSample attached.
Its content is the phrase "Νέο έγγραφο κειμένου" in Greek, encoded in CP737.
uchardet detects this as CP1252.
[cp737.txt](/uploads/3480967bba2c9d0a331769a57bde035d/cp737.txt)
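For anyone who wants to reproduce the sample without the attachment, here is a small Python sketch. It relies only on the `cp737` and `cp1252` codecs shipped with Python and does not call uchardet; it merely shows that the same bytes are readable under both code pages, which is why a detector has to rely on statistics to tell them apart.

```python
TEXT = "Νέο έγγραφο κειμένου"

# CP737 is a single-byte code page: one byte per character.
raw = TEXT.encode("cp737")

# The same bytes read as windows-1252. errors="replace" guards against the
# few byte values that cp1252 leaves undefined.
as_cp1252 = raw.decode("cp1252", errors="replace")

print(raw.hex(" "))
print(as_cp1252)            # mojibake, not Greek
print(raw.decode("cp737"))  # the original phrase again
```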
Thanks!0.1.0https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/22Please add Hebrew CP862 support2022-12-16T22:37:27ZunxedPlease add Hebrew CP862 supportThis sample file contains string "מערכת להסעת המונים במטרופולין תל אביב" in Hebrew CP862 charset. It is detected by uchardet as "unknown".[cp862.txt](/uploads/7c85b7381c07156dd4298c7fc8d7016a/cp862.txt)0.1.0https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/24Support cmake exported targets2021-11-09T09:52:16ZPedro López-CabanillasSupport cmake exported targetsIf cmake exported targets are implemented in uchardet, a downstream project using CMake can find and link the libuchardet library directly with cmake (without needing pkg-config at all) this way:
~~~
project(sample LANGUAGES C)
find_package ( uchardet )
if (uchardet_FOUND)
    add_executable( sample sample.c )
    target_link_libraries ( sample PRIVATE uchardet::libuchardet )
endif ()
~~~
The build system should create one exported target for each built target feature, for instance:
- The executable **uchardet::uchardet**
- The shared library **uchardet::libuchardet**
- The static library **uchardet::libuchardet_static**
After installing the project in a prefix like "$HOME/uchardet/", the downstream project can be configured with a command like:
~~~
cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..."
~~~
Instead of installing, the build directory can be used directly, for instance:
~~~
cmake -Duchardet_DIR="$HOME/build-uchardet-0.1.0/" ...
~~~
https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/18Testing 金木水火土 gives a lot of unknown and windows-1252/32020-06-03T19:01:52ZJohn SiuTesting 金木水火土 gives a lot of unknown and windows-1252/3I created a simple test script to test uchardet:
```sh
#!/bin/sh
BASE_TEST_FILE=test_0base.txt
BASE_TEST_CONTENT='金木水火土'
BASE_TEST_CHARSET=''
# List from uchardet readme
SUPPORTED_CHARSET="
ASCII
BIG5
EUC-JP
EUC-KR / UHC
EUC-TW
GB18030
HZ-GB-2312
IBM852
IBM852
IBM852
IBM852
IBM852
IBM852
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-10
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-13
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-15
ISO-8859-16
ISO-8859-16
ISO-8859-16
ISO-8859-16
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-2
ISO-8859-3
ISO-8859-3
ISO-8859-3
ISO-8859-3
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-4
ISO-8859-5
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
ISO-8859-9
KOI8-R
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CENTRALEUROPE
MAC-CYRILLIC
SHIFT_JIS
TIS-620
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
UTF-8
VISCII
WINDOWS-1250
WINDOWS-1251
WINDOWS-1251
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
WINDOWS-1256
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1250
Windows-1252
Windows-1257
Windows-1258
X-ISO-10646-UCS-4-21431
X-ISO-10646-UCS-4-34121
"
# Version
uchardet -v
echo ===
iconv -V
echo ===
# Create file in UTF8 with following char
echo ${BASE_TEST_CONTENT} >${BASE_TEST_FILE}
echo Base test file: ${BASE_TEST_FILE}
echo Base test file content: $(cat ${BASE_TEST_FILE})
# charset should be utf8
BASE_TEST_CHARSET=$(uchardet ${BASE_TEST_FILE})
echo Base test file charset: ${BASE_TEST_CHARSET}
echo ===
#for CS in $(echo $(iconv -l)); do
for CS in ${SUPPORTED_CHARSET}; do
    TO_CHARSET=$(echo ${CS} | cut -d/ -f1)
    TEST_FILE=test_${TO_CHARSET}.txt
    # Create iconv file
    iconv -f ${BASE_TEST_CHARSET} -t ${TO_CHARSET} ${BASE_TEST_FILE} >${TEST_FILE} 2>/dev/null
    ICONV_RESULT=$?
    # Only do test if iconv successful
    if [ ${ICONV_RESULT} = 0 ]; then
        # uchardet
        TEST_RESULT=$(uchardet ${TEST_FILE})
        # output
        echo iconv to: ${TO_CHARSET}
        echo uchardet: ${TEST_RESULT}
        # make sure iconv backward is successful
        echo iconv back from \"to charset\": $(iconv -t ${BASE_TEST_CHARSET} -f ${TO_CHARSET} ${TEST_FILE} 2>/dev/null)
        echo iconv back from uchardet charset: $(iconv -t ${BASE_TEST_CHARSET} -f ${TEST_RESULT} ${TEST_FILE} 2>/dev/null)
        echo ---
    fi
done
```
The result is as follows:
```sh
uchardet Command Line Tool
Version 0.0.6
Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet
===
iconv (Ubuntu GLIBC 2.31-0ubuntu9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
===
Base test file: test_0base.txt
Base test file content: 金木水火土
Base test file charset: UTF-8
===
iconv to: BIG5
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ª÷¤ì¤ô¤õ¤g
---
iconv to: EUC-JP
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-KR
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UHC
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: EUC-TW
uchardet: KOI8-R
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: оземеуеждх
---
iconv to: GB18030
uchardet: WINDOWS-1253
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ½πΔΎΛ®»πΝΑ
---
iconv to: ISO-2022-CN
uchardet: ASCII
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: =pD>K.;pMA
---
iconv to: ISO-2022-JP
uchardet: ISO-2022-JP
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: ISO-2022-KR
uchardet: ISO-2022-KR
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
iconv to: SHIFT_JIS
uchardet: unknown
iconv back from "to charset": 金木水火土
iconv back from uchardet charset:
---
iconv to: UTF-16BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-16LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-32BE
uchardet: WINDOWS-1252
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ‘Ñg(l4pkW
---
iconv to: UTF-32LE
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: ё(g4lkpW
---
iconv to: UTF-8
uchardet: UTF-8
iconv back from "to charset": 金木水火土
iconv back from uchardet charset: 金木水火土
---
```
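One detail in the output above can be checked independently of uchardet: the UTF-16LE bytes of this particular string happen to be well-formed UTF-8, which may be part of why the detector answers `UTF-8` for that file. A small Python sketch (the real test file also ends with a newline, which is left out here):

```python
TEXT = "金木水火土"

utf16le = TEXT.encode("utf-16-le")
utf16be = TEXT.encode("utf-16-be")
print(utf16le.hex(" "))  # d1 91 28 67 34 6c 6b 70 1f 57

# Decoding the little-endian bytes as UTF-8 succeeds: d1 91 is a valid
# two-byte sequence and every other byte is plain ASCII.
misread = utf16le.decode("utf-8")
print(repr(misread))  # 'ё(g4lkp\x1fW'

# The big-endian bytes start with 0x91, which cannot begin a UTF-8 sequence.
try:
    utf16be.decode("utf-8")
    be_is_utf8 = True
except UnicodeDecodeError:
    be_is_utf8 = False
print(be_is_utf8)  # False
```

So for a sample this short, the bytes alone do not rule out UTF-8 for the little-endian file.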
金木水火土 were chosen because they are the same in Simplified Chinese, Traditional Chinese, Korean and Japanese.
However, the results show a lot of misses. SHIFT_JIS, GB18030 and BIG5 are three notable ones, as they are common.
I also added UTF-16 and UTF-32, although they are not mentioned in README.md, and they work correctly. However, their BE/LE versions failed.https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/16Tests failing on x86 Alpine Linux2020-04-28T18:49:09ZRasmus Thomsenoss@cogitri.devTests failing on x86 Alpine LinuxHello,
with uchardet 0.0.7 (and for that matter 0.0.6) some tests are failing on x86, namely:
```
The following tests FAILED:
29 - fi:iso-8859-1 (Failed)
37 - ga:iso-8859-1 (Failed)
106 - th:tis-620 (Failed)
```
Since the tests only return 1 and don't print a backtrace or any extra info I'm not really sure how to supply extra info.
OS: Alpine Linux Edge (so musl libc).
Arch: x86.https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/26The following two tests both return UTF8?2021-11-17T04:00:16Zyangjiang0217The following two tests both return UTF8?``
char szTestCase1[] = "\xD6\xB1\xC1\xAC"; // GB2312"直连"
``
``
char szTestCase2[] = "\xE7\x9B\xB4\xE8\xBF\x9E";// UTF-8 "直连"
``
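The two byte strings above can be compared outside of uchardet. This Python sketch uses only the standard `utf-8` and `gb2312` codecs; the helper name `is_valid_utf8` is made up for illustration. It shows that the first test case is not well-formed UTF-8 at all (0xC1 can never start a UTF-8 sequence), while both decode to the same text under their intended encodings.

```python
case1 = b"\xD6\xB1\xC1\xAC"          # GB2312 bytes from test case 1
case2 = b"\xE7\x9B\xB4\xE8\xBF\x9E"  # UTF-8 bytes from test case 2


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(case1))  # False
print(is_valid_utf8(case2))  # True
print(case1.decode("gb2312") == case2.decode("utf-8"))
```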
The above two detections both return UTF8https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/34Update the man page2023-11-12T15:12:28ZJehan PagèsUpdate the man pageThe man page is still highly incomplete and should be updated for uchardet 0.1.0.0.1.0https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/7UTF-16BE/UTF-16LE without BOM not supported (?)2022-12-30T12:29:21ZZhiming WangUTF-16BE/UTF-16LE without BOM not supported (?)I took some random UTF-8 encoded paragraph of Chinese text from https://zh.wikipedia.org and converted it to `UTF-16`, `UTF-16BE`, `UTF-16LE` with `iconv` (GNU libiconv 1.11 on macOS 10.14). The `UTF-16BE` and `UTF-16LE` version have no BOM, and in particular, the 2-byte BOM is the only difference between the `UTF-16` and the `UTF-16BE` version. Rather surprisingly, `uchardet` failed on both the `UTF-16BE` and `UTF-16LE` versions:
```console
$ ./src/tools/uchardet zh.utf-8.txt zh.utf-16.txt zh.utf-16be.txt zh.utf-16le.txt
zh.utf-8.txt: UTF-8
zh.utf-16.txt: UTF-16
zh.utf-16be.txt: unknown
zh.utf-16le.txt: WINDOWS-1252
```
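To make the BOM point concrete, here is a short Python sketch. It uses a two-character stand-in string rather than the attached files, and the helper `sniff_utf16_bom` is hypothetical; it is not how uchardet works internally. It only shows that a BOM check is trivial when the mark is present and says nothing when it is absent.

```python
import codecs
from typing import Optional

TEXT = "中文"

body_be = TEXT.encode("utf-16-be")       # what iconv -t UTF-16BE writes: no BOM
with_bom = codecs.BOM_UTF16_BE + body_be  # big-endian output with a leading BOM


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 byte order announced by a leading BOM, if any.

    Note: a UTF-32LE BOM (ff fe 00 00) also starts with the UTF-16LE BOM,
    so a real detector has to check for UTF-32 first.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    return None


print(sniff_utf16_bom(with_bom))  # UTF-16BE
print(sniff_utf16_bom(body_be))   # None
```

Without the BOM, the byte order has to be guessed from the content itself.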
I have attached the text files.
Is there any chance this could be improved?
[zh.utf-8.txt](/uploads/a8add06bc9930057245d3a060696ad7d/zh.utf-8.txt) [zh.utf-16.txt](/uploads/5254ea4d0fd95fe02c10295ac2c3c323/zh.utf-16.txt) [zh.utf-16be.txt](/uploads/3f96d49fdbbc323cfcd410aa1db5d2c4/zh.utf-16be.txt) [zh.utf-16le.txt](/uploads/9dcc76792b3300ec569d8559f048fc47/zh.utf-16le.txt)0.1.0https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/6UTF-8 Italian text recognized as ISO-8859-1 Portuguese2018-10-12T21:35:13ZBugzilla Migration UserUTF-8 Italian text recognized as ISO-8859-1 Portuguese## Submitted by Jehan Pagès `@Jehan`
Assigned to **Jehan Pagès `@Jehan`**
**[Link to original bug (#102292)](https://bugs.freedesktop.org/show_bug.cgi?id=102292)**
## Description
Created attachment 133604
UTF-8 text.
See: https://github.com/BYVoid/uchardet/issues/36#issuecomment-323316171
The attached text is UTF-8 Italian, but since commit e138839f0753e223f7aa2733e8ed829b47a67cac (Portuguese support for ISO-8859-1), this text is recognized as ISO-8859-1.
Not sure though if there is a proper solution apart from removing Portuguese support in the short term and adding actual language detection to UTF-8 in the longer term (see [bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218)).
Also, obviously, the fact that the file holds just 2 words makes it a difficult guess for a system based on statistics.
**Attachment 133604**, "UTF-8 text.":
[utf8.txt](/uploads/75d7003ce2b8db9de1e18d087848f6f8/utf8.txt)
### Depends on
* [Bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218)https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/4UTF-8 section symbol (0xC2A7) invokes TIS-620 decoding2018-10-12T21:35:07ZBugzilla Migration UserUTF-8 section symbol (0xC2A7) invokes TIS-620 decoding## Submitted by pok..@..il.com
Assigned to **Jehan Pagès `@Jehan`**
**[Link to original bug (#101310)](https://bugs.freedesktop.org/show_bug.cgi?id=101310)**
## Description
Created attachment 131730
File containing a single section sign in the midst of other text
A single occurrence of the section sign (§) encoded in UTF-8 causes the file to be marked as TIS-620, even if the rest of the text is English. This can be seen with the attached file (which includes a single §); curiously adding more instances of § elsewhere usually causes the file to be correctly detected as UTF-8.
This may be a duplicate of [bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218), but it's a more specific case. This was first reported at https://github.com/notepad-plus-plus/notepad-plus-plus/issues/940, but I've narrowed it down to a bug in uchardet.
**Attachment 131730**, "File containing a single section sign in the midst of other text":
[UTF-8_with_section_sign.txt](/uploads/632065643916974e7f7f902200925c0c/UTF-8_with_section_sign.txt)
### Depends on
* [Bug 101218](https://bugs.freedesktop.org/show_bug.cgi?id=101218)Jehan PagèsJehan Pagèshttps://gitlab.freedesktop.org/uchardet/uchardet/-/issues/23UTF-8 text containing á (u00E1) detected as IBM8522023-05-04T14:48:52ZLubosz SarneckiUTF-8 text containing á (u00E1) detected as IBM852I noticed that several files containing Spanish characters with acute accents cause issues with uchardet.
The files are apparently encoded in `UTF-8`, but are detected as `IBM852`.
Converting with `iconv` and `uconv` from `IBM852` to `UTF-8` produces corrupt files.
Full example file:
https://github.com/lubosz/wiithon/blob/master/config.py
Creating a file with just the word in question can reproduce the problem:
```
aparecerán
```
Shortening the word in question produces a different wrong encoding, in this case `WINDOWS-1258`:
```
cerán
```
This incorrectly detected `WINDOWS-1258` encoding is still the same when the word in question has paragraphs of text around it:
```
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
cerán
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
```
Shortening the word even more produces a correct `UTF-8` detection, even with paragraphs around it:
```
rán
```
Using this online tool I found a way to get the encoding detected correctly:
https://subtitletools.com/convert-text-files-to-utf8-online
It inserts 3 bytes `ef bb bf` at the start of the file. This fix is reproducible with all files that have the issue.
```
$ od -t x1 small-a-acute-word.txt
0000000 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000014
```
```
$ od -t x1 small-a-acute-word-fixed.txt
0000000 ef bb bf 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000017
```
The encoding is now correctly detected as `UTF-8`.
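For reference, the three inserted bytes are the UTF-8 byte order mark (BOM), which is optional in UTF-8. A small Python sketch using the two `od` dumps above confirms that the files differ only by that prefix and that the original is already valid UTF-8 (no uchardet involved, standard codecs only):

```python
import codecs

# Byte dumps copied from the od output above.
original = bytes.fromhex("61 70 61 72 65 63 65 72 c3 a1 6e 0a")
fixed = bytes.fromhex("ef bb bf 61 70 61 72 65 63 65 72 c3 a1 6e 0a")

# The "fix" is exactly the UTF-8 BOM in front of the unchanged content.
print(fixed == codecs.BOM_UTF8 + original)  # True

# The original already decodes as strict UTF-8.
print(repr(original.decode("utf-8")))       # 'aparecerán\n'

# "utf-8-sig" strips a leading BOM, so both files carry the same text.
print(fixed.decode("utf-8-sig") == original.decode("utf-8"))  # True

# The same bytes are also decodable as IBM852, just with different characters.
print(repr(original.decode("cp852")))
```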
Other characters that produce the same issue: `¿`, `í`, `ó`, `ú`, `é`.
All detection issues can be resolved by inserting said bytes at the start of the file, which seems like a hack.
Is this a required header for UTF-8 files, a uchardet bug or my fault?https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/9Wrong detected encoding utf-8 instead of cp8552022-12-18T23:14:13ZRustam SayfutdinovWrong detected encoding utf-8 instead of cp855Hello!
I found an [example](/uploads/ee0f778b278fd777457f25bb99078d3f/sample.txt) where _uchardet_ wrongly detected the encoding as utf-8 instead of cp855.
I debugged my usage (in this [for-loop](https://gitlab.freedesktop.org/uchardet/uchardet/blob/master/src/nsUniversalDetector.cpp#L317)):
- multi-byte prober: utf-8 with _confidence = 0.752499998_
- single-byte prober: cp855 with _confidence = 0.685687244_
Using a different implementation, [UTF Unknown](https://github.com/CharsetDetector/UTF-unknown/tree/v0.1), I got the expected result:
- multi-byte prober: only check and get _GB18030Prober_ object with _confidence = 0.01_
- single-byte prober: cp855 with _confidence = 0.8776797_0.1.0
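
To illustrate why the two implementations disagree, here is a rough Python sketch of a "highest confidence wins" selection. This is a simplification under stated assumptions, not the actual loop in `nsUniversalDetector.cpp`: the `Candidate` type, the `pick_best` helper and the `minimum` threshold of 0.2 are all made up for this example. Only the confidence values come from the debugging output above.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    charset: str
    confidence: float


def pick_best(candidates: Sequence[Candidate], minimum: float = 0.2) -> Optional[Candidate]:
    """Return the candidate with the highest confidence, or None if too low.

    The threshold value is an assumption for this sketch.
    """
    best = max(candidates, key=lambda c: c.confidence, default=None)
    if best is None or best.confidence < minimum:
        return None
    return best


uchardet_probers = [Candidate("UTF-8", 0.752499998), Candidate("CP855", 0.685687244)]
utf_unknown_probers = [Candidate("GB18030", 0.01), Candidate("CP855", 0.8776797)]

print(pick_best(uchardet_probers).charset)     # UTF-8
print(pick_best(utf_unknown_probers).charset)  # CP855
```

Under this model the single-byte scores are close in both cases; the outcome changes because the multi-byte UTF-8 prober reports 0.75 in one implementation and is not in the running in the other.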