uchardet issue #23 (Closed)
Issue created Feb 08, 2021 by Lubosz Sarnecki (@lubosz)

UTF-8 text containing á (U+00E1) detected as IBM852

I noticed that several files containing Spanish characters with acute accents cause issues with uchardet. The files appear to be encoded in UTF-8, but are detected as IBM852.

Converting them with iconv or uconv from IBM852 to UTF-8 produces corrupt files.
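
One way to check independently of uchardet that a file is well-formed UTF-8 is to round-trip it through iconv with UTF-8 as both source and target. This is only a sketch: `sample.txt` is a hypothetical file name, and the bytes written to it are those of the test word from this report.

```shell
# Hypothetical sample file holding the UTF-8 bytes of "aparecerán" plus a newline.
# \303\241 is the octal form of c3 a1, the UTF-8 encoding of á (U+00E1).
printf 'aparecer\303\241n\n' > sample.txt

# iconv exits non-zero when the input is not valid in the source encoding,
# so a successful UTF-8 to UTF-8 round trip indicates well-formed UTF-8.
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
    echo "sample.txt: valid UTF-8"
else
    echo "sample.txt: not valid UTF-8"
fi
```

A passing check only shows that the bytes are legal UTF-8. It does not rule out that the same bytes are also legal in a single-byte encoding such as IBM852, which is why a detector still has to guess.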

Full example file: https://github.com/lubosz/wiithon/blob/master/config.py

Creating a file with just the word in question can reproduce the problem:

aparecerán
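
A minimal way to recreate this file from a shell, using the byte values from the hex dump later in this report. It assumes the `uchardet` command-line tool is on the PATH and skips the detection step if it is not. The file name matches the one in the `od` output in this report.

```shell
# Write "aparecerán" plus a newline as raw UTF-8 bytes.
# \303\241 is the octal form of c3 a1, the UTF-8 encoding of á (U+00E1).
printf 'aparecer\303\241n\n' > small-a-acute-word.txt

# Show the bytes for comparison with the hex dump in this report.
od -An -t x1 small-a-acute-word.txt

# Ask uchardet for its guess, if it is installed.
if command -v uchardet > /dev/null 2>&1; then
    uchardet small-a-acute-word.txt
fi
```

Writing the bytes with octal escapes keeps the result independent of the terminal's or editor's own encoding.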

Shortening the word in question produces a different wrong encoding, in this case WINDOWS-1258:

cerán

The encoding is still incorrectly detected as WINDOWS-1258 when the word has paragraphs of text around it:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

cerán

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Shortening the word even more produces a correct UTF-8 detection, even with paragraphs around it:

rán

Using this online tool, I found a way to get the encoding detected correctly: https://subtitletools.com/convert-text-files-to-utf8-online

It inserts the 3 bytes ef bb bf at the start of the file. This fix is reproducible with all files that have the issue.

$ od -t x1 small-a-acute-word.txt 
0000000 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000014
$ od -t x1 small-a-acute-word-fixed.txt 
0000000 ef bb bf 61 70 61 72 65 63 65 72 c3 a1 6e 0a
0000017

The encoding is now correctly detected as UTF-8.
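
The same change can be made locally without the online tool. The three bytes ef bb bf are the UTF-8 encoding of U+FEFF, commonly called the byte order mark (BOM). The sketch below recreates the original file from the bytes in the hex dump and prepends those three bytes. The file names are the ones from the `od` output above.

```shell
# Recreate the original file from the bytes shown in the hex dump.
printf 'aparecer\303\241n\n' > small-a-acute-word.txt

# Prepend ef bb bf (octal \357\273\277) and copy the rest unchanged.
{ printf '\357\273\277'; cat small-a-acute-word.txt; } > small-a-acute-word-fixed.txt

# The dump should now start with ef bb bf, as in the output above.
od -An -t x1 small-a-acute-word-fixed.txt
```

Writing to a second file instead of editing in place keeps the original available for comparison.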

Other characters that produce the same issue: ¿, í, ó, ú, é.
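
To test these characters one at a time, the sketch below writes one small file per character by substituting it for á in the test word. The resulting words and the `sample-*.txt` file names are made up for illustration, and the report does not confirm that these exact synthetic words trigger the misdetection. The `uchardet` step is skipped if the tool is not installed.

```shell
# UTF-8 bytes as octal escapes, one entry per character named above:
# ¿ = c2 bf, í = c3 ad, ó = c3 b3, ú = c3 ba, é = c3 a9
for pair in 'inverted-question:\302\277' 'i-acute:\303\255' \
            'o-acute:\303\263' 'u-acute:\303\272' 'e-acute:\303\251'; do
    name=${pair%%:*}    # text before the colon, used in the file name
    bytes=${pair#*:}    # octal escapes after the colon
    file="sample-$name.txt"

    # Synthetic word: "aparecer" + character + "n", then a newline.
    printf "aparecer${bytes}n\n" > "$file"
    od -An -t x1 "$file"

    if command -v uchardet > /dev/null 2>&1; then
        printf '%s: ' "$file"
        uchardet "$file"
    fi
done
```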

All detection issues can be resolved by inserting said bytes at the start of the file, which seems like a hack.

Is this a required header for UTF-8 files, a uchardet bug or my fault?
