UTF-8 text containing á (U+00E1) detected as IBM852
I noticed that several files containing Spanish characters with acute accents cause issues with uchardet.
The files are apparently encoded in UTF-8, but are detected as IBM852.
Converting with iconv and uconv from IBM852 to UTF-8 produces corrupt files.
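To illustrate why the conversion corrupts the text, here is a minimal Python sketch. It assumes Python's `cp852` codec is equivalent to what iconv calls IBM852, and it only approximates what iconv and uconv do. The variable names are made up for illustration.

```python
# UTF-8 bytes of the sample word; "á" becomes the two bytes c3 a1.
raw = "aparecerán".encode("utf-8")

# Strict UTF-8 decoding succeeds, so the bytes are valid UTF-8.
assert raw.decode("utf-8") == "aparecerán"

# Decoding as IBM852 (assumed to match Python's "cp852" codec) turns
# each byte of the two-byte sequence into its own character.
misread = raw.decode("cp852")

# Re-encoding the misread text as UTF-8 yields a longer byte string:
# this mimics the corrupt result of an IBM852 -> UTF-8 conversion.
corrupted = misread.encode("utf-8")

print(misread)
print(raw.hex(" "))
print(corrupted.hex(" "))
```

The misread string is one character longer than the original, because the single character á was read as two separate IBM852 characters.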
Full example file: https://github.com/lubosz/wiithon/blob/master/config.py
Creating a file with just the word in question can reproduce the problem:
aparecerán
Shortening the word in question produces a different wrong encoding, in this case WINDOWS-1258:
cerán
The encoding is still incorrectly detected as WINDOWS-1258 when the word in question has paragraphs of text around it:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
cerán
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Shortening the word even more produces a correct UTF-8 detection, even with paragraphs around it:
rán
Using this online tool, I found a way to get the encoding detected correctly: https://subtitletools.com/convert-text-files-to-utf8-online
It inserts 3 bytes ef bb bf at the start of the file. This fix is reproducible with all files that have the issue.
$ od -t x1 small-a-acute-word.txt
0000000 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000014
$ od -t x1 small-a-acute-word-fixed.txt
0000000 ef bb bf 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000017
The encoding is now correctly detected as UTF-8.
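For reference, the three bytes ef bb bf are the UTF-8 encoding of U+FEFF, the byte order mark (BOM), which is optional in UTF-8. The following Python sketch reproduces the workaround on the bytes from the od dump. The helper names `has_utf8_bom` and `add_utf8_bom` are hypothetical, and the sketch says nothing about why uchardet behaves differently once the bytes are present.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 BOM (ef bb bf)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless it is already there."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


# Bytes of small-a-acute-word.txt, copied from the od output.
original = bytes.fromhex("61 70 61 72 65 63 65 72 c3 a1 6e 0a")
fixed = add_utf8_bom(original)

print(fixed.hex(" "))
# ef bb bf 61 70 61 72 65 63 65 72 c3 a1 6e 0a

# The "utf-8-sig" codec drops a leading BOM while decoding,
# so the text itself is unchanged by the workaround.
assert fixed.decode("utf-8-sig") == original.decode("utf-8")
```

Readers that decode with plain `utf-8` keep U+FEFF as the first character of the text, which is one reason the workaround can feel like a hack.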
Other characters that produce the same issue: ¿, í, ó, ú, é.
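As a quick check of what these characters have in common, this Python sketch prints the UTF-8 bytes of each one. All of them encode to two bytes, with a lead byte of c2 or c3. Whether that is what confuses the detector on short inputs is only a guess on my part.

```python
# The characters reported to trigger the wrong detection.
chars = ["á", "¿", "í", "ó", "ú", "é"]

for ch in chars:
    encoded = ch.encode("utf-8")
    # Show the code point and the UTF-8 bytes, e.g. "á U+00E1 c3 a1".
    print(f"{ch} U+{ord(ch):04X} {encoded.hex(' ')}")
```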
All detection issues can be resolved by inserting said bytes at the start of the file, which seems like a hack.
Is this a required header for UTF-8 files, a uchardet bug or my fault?