-
Jehan authored
Additionally to the "frequent characters" concept, we add 2 sub-categories, which are the "very frequent characters" and "rare characters". The former are usually just a few characters which are used most of the time (like 3 or 4 characters used 40% of the time!), whereas the later are often a dozen or more characters which are barely used a few percents of the time, all together. We use this additional concept to help distinguish very similar languages, or languages whose frequent characters are a subset of the ones from another language (typically English, whose alphabet is a subset of many other European languages). The mTypicalPositiveRatio is getting rid of, as it was anyway barely of any use (it was 0.99-something for nearly all languages!). Instead we get these 2 new ratios: veryFreqRatio and lowFreqRatio, and of course the associated order counts to know which character are in these sets.
401eb55dJehan authoredAdditionally to the "frequent characters" concept, we add 2 sub-categories, which are the "very frequent characters" and "rare characters". The former are usually just a few characters which are used most of the time (like 3 or 4 characters used 40% of the time!), whereas the later are often a dozen or more characters which are barely used a few percents of the time, all together. We use this additional concept to help distinguish very similar languages, or languages whose frequent characters are a subset of the ones from another language (typically English, whose alphabet is a subset of many other European languages). The mTypicalPositiveRatio is getting rid of, as it was anyway barely of any use (it was 0.99-something for nearly all languages!). Instead we get these 2 new ratios: veryFreqRatio and lowFreqRatio, and of course the associated order counts to know which character are in these sets.
Loading