src/nsLanguageDetector.cpp · bdd71d88f8347f73f87174028faa25894b1201d8 · uchardet / uchardet

src: improve algorithm for confidence computation. · 401eb55d

Jehan authored Dec 14, 2022

In addition to the "frequent characters" concept, we add 2
sub-categories: "very frequent characters" and "rare characters".
The former are usually just a few characters which are used most of
the time (like 3 or 4 characters used 40% of the time!), whereas the
latter are often a dozen or more characters which, all together, are
used only a few percent of the time.
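
As an illustration of the three categories, here is a minimal sketch
which assumes characters are ranked by a frequency "order" (0 being the
most common). The enum, the function and all parameter names are
hypothetical and are not taken from the uchardet sources.

```cpp
#include <cstddef>

// Hypothetical categories, for illustration only.
enum class FreqClass { VeryFrequent, Frequent, Rare, NotFrequent };

// order:         frequency rank of the character (0 = most common).
// veryFreqCount: how many top-ranked characters form the "very frequent" set.
// rareStart:     first order which belongs to the "rare" set.
// freqCount:     total number of frequent characters (all sub-categories).
FreqClass classify(std::size_t order, std::size_t veryFreqCount,
                   std::size_t rareStart, std::size_t freqCount) {
    if (order >= freqCount) return FreqClass::NotFrequent;
    if (order < veryFreqCount) return FreqClass::VeryFrequent;
    if (order >= rareStart) return FreqClass::Rare;
    return FreqClass::Frequent;
}
```

With veryFreqCount = 4, rareStart = 20 and freqCount = 30, orders 0 to 3
are "very frequent", orders 20 to 29 are "rare", and the ones in between
are plain "frequent" characters.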

We use this additional concept to help distinguish very similar
languages, or languages whose frequent characters are a subset of
the ones from another language (typically English, whose alphabet is a
subset of the alphabets of many other European languages).

mTypicalPositiveRatio is removed, as it was barely of any use anyway
(it was 0.99-something for nearly all languages!). Instead we get
2 new ratios, veryFreqRatio and lowFreqRatio, and of course the
associated order counts to know which characters are in these sets.
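
To give an idea of how such ratios could feed a confidence value, here
is a rough sketch. It is not the formula used in this commit: only the
names veryFreqRatio and lowFreqRatio come from the message above. The
structs, the counters and the scoring rule itself (1 minus the summed
absolute deviations, clamped at 0) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical subset of a language model.
struct LangModelSketch {
    double veryFreqRatio;  // expected share of "very frequent" characters
    double lowFreqRatio;   // expected share of "rare" characters
};

// Hypothetical counters gathered while scanning the input text.
struct SeenCounts {
    std::size_t total;     // all frequent characters seen
    std::size_t veryFreq;  // of which "very frequent"
    std::size_t lowFreq;   // of which "rare"
};

// Returns a value in [0, 1]: 1 when the observed shares match the model
// exactly, decreasing as they drift apart. Assumed scoring rule.
double ratioScore(const LangModelSketch& model, const SeenCounts& seen) {
    if (seen.total == 0) return 0.0;  // nothing to compare yet
    const double total = static_cast<double>(seen.total);
    const double observedVery = static_cast<double>(seen.veryFreq) / total;
    const double observedLow = static_cast<double>(seen.lowFreq) / total;
    const double deviation = std::fabs(observedVery - model.veryFreqRatio) +
                             std::fabs(observedLow - model.lowFreqRatio);
    return std::max(0.0, 1.0 - deviation);
}
```

With such a rule, a text whose characters all belong to a language's
frequent set can still score poorly for that language when its "very
frequent" and "rare" shares are far from the expected ones, which is the
kind of distinction the subset case described above needs.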
