    src: improve algorithm for confidence computation. · 401eb55d
    Jehan authored
    In addition to the "frequent characters" concept, we add 2
    sub-categories: "very frequent characters" and "rare characters".
    The former are usually just a few characters which are used most of
    the time (like 3 or 4 characters used 40% of the time!), whereas the
    latter are often a dozen or more characters which, all together, are
    used barely a few percent of the time.
    
    We use this additional concept to help distinguish very similar
    languages, or languages whose frequent characters are a subset of
    those of another language (typically English, whose alphabet is a
    subset of the alphabets of many other European languages).
    
    The mTypicalPositiveRatio is removed, as it was barely of any use
    anyway (it was 0.99-something for nearly all languages!). Instead we
    get 2 new ratios, veryFreqRatio and lowFreqRatio, and of course the
    associated order counts to know which characters are in these sets.
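
To illustrate how the two new ratios could feed into a confidence value, here is a minimal sketch. It is not the code from this commit. `LangModel`, `Observed`, `Tally` and `RatioMismatch` are hypothetical names, and so are all fields other than veryFreqRatio and lowFreqRatio. The sketch assumes each character has already been mapped to a frequency order (0 = most frequent) and that both ratios are expressed as shares of the frequent characters seen.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-language model. Only veryFreqRatio and lowFreqRatio are
// named in the commit message; the other fields are assumptions.
struct LangModel {
    std::size_t veryFreqCount; // orders [0, veryFreqCount) are "very frequent"
    std::size_t rareStart;     // orders [rareStart, freqCount) are "rare"
    std::size_t freqCount;     // orders [0, freqCount) are "frequent"
    double veryFreqRatio;      // expected share of very frequent characters
    double lowFreqRatio;       // expected share of rare characters
};

// Shares actually measured in a piece of text.
struct Observed {
    double veryFreqShare;
    double lowFreqShare;
};

// Count how many frequent characters fall into each sub-category.
// `orders` holds one frequency order per character (0 = most frequent).
Observed Tally(const LangModel& m, const std::vector<std::size_t>& orders)
{
    assert(m.veryFreqCount <= m.rareStart && m.rareStart <= m.freqCount);
    std::size_t frequent = 0, veryFrequent = 0, rare = 0;
    for (std::size_t order : orders) {
        if (order >= m.freqCount)
            continue;                 // outside the frequent set: ignored here
        ++frequent;
        if (order < m.veryFreqCount)
            ++veryFrequent;
        else if (order >= m.rareStart)
            ++rare;
    }
    if (frequent == 0)
        return {0.0, 0.0};
    return {static_cast<double>(veryFrequent) / frequent,
            static_cast<double>(rare) / frequent};
}

// Distance between observed and expected shares; 0 means a perfect match.
// A detector could subtract some function of this from its confidence.
double RatioMismatch(const LangModel& m, const Observed& o)
{
    return std::fabs(o.veryFreqShare - m.veryFreqRatio) +
           std::fabs(o.lowFreqShare - m.lowFreqRatio);
}
```

Under these assumptions, a text that uses only a subset of a language's frequent characters would show a rare-character share well below that language's lowFreqRatio, so its mismatch for that language would be larger than for the language whose alphabet it really uses.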