Improve robustness for UTF-8 with language awareness

Submitted by Jehan Pagès `@Jehan`

Assigned to Jehan Pagès `@Jehan`

Description

Bug 101204 includes an example file that is detected as MAC-CENTRALEUROPE although it is actually UTF-8 (entirely ASCII except for a single non-ASCII character). The problem is that the file is technically valid in both encodings.
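
To make the ambiguity concrete, here is a minimal Python sketch. It assumes a made-up sample string (not the file attached to bug 101204) and uses Python's `mac_latin2` codec as a stand-in for MAC-CENTRALEUROPE. It only shows that the same bytes decode without error under both encodings; it does not reproduce uchardet's detection logic.

```python
# Hypothetical sample: ASCII source code with one non-ASCII character ("è").
# This is NOT the file attached to bug 101204.
sample = "/* Written by Jehan Pagès */\nint main(void) { return 0; }\n"
data = sample.encode("utf-8")

# Both decodes succeed, so byte-level validity alone cannot pick a winner.
as_utf8 = data.decode("utf-8")
as_mac_ce = data.decode("mac_latin2")  # Python's name for Mac Central European

# The single non-ASCII character takes two bytes in UTF-8; a single-byte
# encoding reads those two bytes as two separate characters.
non_ascii = [b for b in data if b >= 0x80]

print(len(non_ascii))                  # number of non-ASCII bytes
print(as_utf8 == sample)               # UTF-8 round-trips the original text
print(len(as_mac_ce) - len(as_utf8))   # extra characters in the Mac CE reading
```

With so few non-ASCII bytes, a detector has almost no byte-level evidence to separate the two readings, which is why the surrounding text has to carry the decision.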

With the current code, the confidence for UTF-8 (without language awareness) is 0.505, whereas it is 0.535104 for MAC-CENTRALEUROPE. Both confidences are quite low, and which of the two wins is mostly a matter of chance.

IMO the decision should be based on language detection, as is already the case for single-byte encodings. The attached file is code, but it is still close enough to natural English that I believe the confidence should be higher for the pair (UTF-8, English) than for a generic UTF-8 detection.

Blocking

Bug 101310
Bug 102292
