src/nsUTF8Prober.cpp · bdd71d88f8347f73f87174028faa25894b1201d8 · uchardet / uchardet

src: drop less of UTF-8 confidence even with few non-multibyte chars. · bed459c6

Jehan authored May 23, 2021

Some languages are not expected to contain multibyte characters. For
instance, English would typically have none. Yet English text can still
be UTF-8 (with a few special characters, or foreign words…). So let's
make a low multibyte character count less of a deal breaker.

To be even fairer, the whole logic is biased, of course, and I believe
that eventually we should get rid of these lines of code dropping
confidence based on a number of characters. This is a ridiculous rule (we
base our whole logic on language statistics and suddenly we add some
weird rule with a completely random number). But for now, I'll keep this
as-is until we make the whole library even more robust.
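
For readers unfamiliar with this kind of rule, here is a minimal sketch of
a confidence value that depends only on how many multibyte characters were
seen. It is an illustration under stated assumptions, not the actual code
in nsUTF8Prober.cpp: the function name Utf8Confidence and all three
constants are hypothetical.

```cpp
// Hypothetical constants, chosen for illustration only.
constexpr float kBaseUnlikelihood = 0.99f;    // starting "not UTF-8" weight
constexpr float kPerCharFactor = 0.5f;        // each multibyte char halves it
constexpr unsigned kEnoughMultibyteChars = 6; // the "random number" threshold

// Returns a UTF-8 confidence in (0, 1) from the number of valid multibyte
// sequences seen so far. Below the threshold, confidence is driven purely
// by the count, regardless of any language statistics.
float Utf8Confidence(unsigned numMultibyteChars)
{
    if (numMultibyteChars >= kEnoughMultibyteChars)
        return 1.0f - (kBaseUnlikelihood - 0.98f); // capped at roughly 0.99

    float unlike = kBaseUnlikelihood;
    for (unsigned i = 0; i < numMultibyteChars; ++i)
        unlike *= kPerCharFactor;
    return 1.0f - unlike;
}
```

With these assumed values, mostly-ASCII text containing a single multibyte
character only reaches a confidence of about 0.5, which is the sort of
penalty this commit makes less severe.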
