uchardet · Issues · #5
Issue created Dec 28, 2017 by Bugzilla Migration User@bugzilla-migration

[Feature request] - More info returned by library

Submitted by bk1..@..il.com

Assigned to Jehan Pagès @Jehan

Link to original bug (#104402)

Description

Hi! I would like to suggest that the library return more information about the analyzed file. For example, exposing the confidence rate would let callers decide whether a detection is good enough. I see that you quote the confidence rate in your responses, but I don't see it available in the functions the library exposes.

Maybe it could return a record type (that is the Pascal term; I don't know what it is called in C), like Charset Detector does (http://chsdet.sourceforge.net/api.php):

rCharsetInfo = record
  Name: pChar;        // charset name
  CodePage: integer;  // MS Windows CodePage id
  Language: pChar;    //
end;

...perhaps with a new field added to a structure like that:

Confidence: float
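
In C, the equivalent of a Pascal record is a struct. The following is only a sketch of what such a result type could look like; `charset_info` and `charset_is_reliable` are hypothetical names made up for illustration and are not part of uchardet's API.

```c
#include <stddef.h>

/* Hypothetical result record mirroring rCharsetInfo plus the requested
 * Confidence field. Not part of uchardet. */
typedef struct {
    const char *name;       /* charset name */
    int         code_page;  /* MS Windows code page id */
    const char *language;   /* language, if known */
    float       confidence; /* assumed range: 0.0 (no idea) to 1.0 (certain) */
} charset_info;

/* Returns 1 if a detection exists and meets the caller's threshold, else 0. */
int charset_is_reliable(const charset_info *info, float threshold)
{
    return info != NULL && info->name != NULL && info->confidence >= threshold;
}
```

A caller could then pick its own cut-off, for example accepting a result only when `charset_is_reliable(&info, 0.5f)` is true and otherwise falling back to a default encoding.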

Another good addition would be whether the file has a BOM or not, and if so, what kind of BOM:

eBOMKind = (
  BOM_Not_Found,
  BOM_UCS4_BE,    // 00 00 FE FF  UCS-4, big-endian machine (1234 order)
  BOM_UCS4_LE,    // FF FE 00 00  UCS-4, little-endian machine (4321 order)
  BOM_UCS4_2143,  // 00 00 FF FE  UCS-4, unusual octet order (2143)
  BOM_UCS4_3412,  // FE FF 00 00  UCS-4, unusual octet order (3412)
  BOM_UTF16_BE,   // FE FF ## ##  UTF-16, big-endian
  BOM_UTF16_LE,   // FF FE ## ##  UTF-16, little-endian
  BOM_UTF8        // EF BB BF     UTF-8
);
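
As a rough illustration of the BOM check requested here, the byte signatures listed above can be matched against the start of a buffer. This is a minimal sketch in C, not uchardet code; `bom_kind` and `detect_bom` are hypothetical names. The 4-byte signatures are tested first because FF FE 00 00 and FE FF 00 00 also begin with the UTF-16 signatures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical enum mirroring eBOMKind; not part of uchardet. */
typedef enum {
    BOM_NOT_FOUND,
    BOM_UCS4_BE, BOM_UCS4_LE, BOM_UCS4_2143, BOM_UCS4_3412,
    BOM_UTF16_BE, BOM_UTF16_LE,
    BOM_UTF8
} bom_kind;

/* Returns 1 if buf begins with the sig_len bytes of sig. */
static int starts_with(const unsigned char *buf, size_t len,
                       const unsigned char *sig, size_t sig_len)
{
    return len >= sig_len && memcmp(buf, sig, sig_len) == 0;
}

bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char ucs4_be[]   = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char ucs4_le[]   = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char ucs4_2143[] = { 0x00, 0x00, 0xFF, 0xFE };
    static const unsigned char ucs4_3412[] = { 0xFE, 0xFF, 0x00, 0x00 };
    static const unsigned char utf8[]      = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16_be[]  = { 0xFE, 0xFF };
    static const unsigned char utf16_le[]  = { 0xFF, 0xFE };

    /* Longest signatures first. Note the inherent ambiguity: FF FE 00 00
     * could also be a UTF-16 LE BOM followed by U+0000; this sketch
     * reports it as UCS-4 LE. */
    if (starts_with(buf, len, ucs4_be, 4))   return BOM_UCS4_BE;
    if (starts_with(buf, len, ucs4_le, 4))   return BOM_UCS4_LE;
    if (starts_with(buf, len, ucs4_2143, 4)) return BOM_UCS4_2143;
    if (starts_with(buf, len, ucs4_3412, 4)) return BOM_UCS4_3412;
    if (starts_with(buf, len, utf8, 3))      return BOM_UTF8;
    if (starts_with(buf, len, utf16_be, 2))  return BOM_UTF16_BE;
    if (starts_with(buf, len, utf16_le, 2))  return BOM_UTF16_LE;
    return BOM_NOT_FOUND;
}
```

Only the first four bytes of the file are needed, so a caller can run this on the first chunk it reads before handing the data to the detector.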

And, getting greedy, it would be nice to know what kind of newline the file uses:

Unix/Mac  // LF     $0A
Windows   // CR+LF  $0D $0A
Old Mac   // CR     $0D
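
A newline check of this kind could be sketched in C as below, using the standard byte values (LF is 0x0A, CR is 0x0D, and the Windows sequence is CR followed by LF). `newline_kind` and `detect_newline` are hypothetical names, not uchardet functions, and the sketch assumes an ASCII-compatible byte stream; UTF-16 or UCS-4 data would need to be decoded first. The extra "mixed" value is an assumption added for files that use more than one convention.

```c
#include <stddef.h>

/* Hypothetical result type; not part of uchardet. */
typedef enum {
    NEWLINE_NONE,   /* no line break found */
    NEWLINE_LF,     /* Unix / modern Mac: 0x0A */
    NEWLINE_CRLF,   /* Windows: 0x0D 0x0A */
    NEWLINE_CR,     /* old Mac: 0x0D */
    NEWLINE_MIXED   /* more than one convention in the same buffer */
} newline_kind;

newline_kind detect_newline(const unsigned char *buf, size_t len)
{
    size_t lf = 0, crlf = 0, cr = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x0D) {
            if (i + 1 < len && buf[i + 1] == 0x0A) {
                crlf++;
                i++;        /* consume the LF belonging to this CR */
            } else {
                cr++;
            }
        } else if (buf[i] == 0x0A) {
            lf++;
        }
    }

    int kinds = (lf > 0) + (crlf > 0) + (cr > 0);
    if (kinds == 0) return NEWLINE_NONE;
    if (kinds > 1)  return NEWLINE_MIXED;
    if (lf > 0)     return NEWLINE_LF;
    if (crlf > 0)   return NEWLINE_CRLF;
    return NEWLINE_CR;
}
```

When the data arrives in chunks, a CR at the very end of one chunk may belong to a CR+LF pair that continues in the next, so a streaming version would have to carry that state across calls.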

Sorry for asking for so much! Thanks in advance.
