UTF-8 Italian text recognized as ISO-8859-1 Portuguese
Submitted by Jehan Pagès. Assigned to Jehan Pagès (@Jehan).
Link to original bug (#102292)
Description
Created attachment 133604 UTF-8 text.
See: https://github.com/BYVoid/uchardet/issues/36#issuecomment-323316171
The attached text is UTF-8 Italian, but since commit e138839f (Portuguese support for ISO-8859-1), it is detected as ISO-8859-1 Portuguese.
I am not sure there is a proper solution, apart from removing Portuguese support in the short term and adding actual language detection for UTF-8 in the longer term (see bug 101218).
Also, the file holds only 2 words, which obviously makes it a difficult guess for a system based on statistics.
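To illustrate why such a short input is ambiguous, here is a minimal Python sketch. It does not use uchardet, and the sample string is a hypothetical two-word Italian text, not the content of the attached file. It only shows that the same bytes decode without error both as UTF-8 and as ISO-8859-1, so a byte-validity check alone cannot rule out ISO-8859-1 and the decision falls entirely to the statistical models.

```python
# Hypothetical two-word Italian sample (NOT the attached utf8.txt).
sample = "perché città"
raw = sample.encode("utf-8")


def valid_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 assigns a character to every byte value, so any byte
# sequence, including valid UTF-8, is also "valid" ISO-8859-1.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("iso-8859-1")

print(valid_as(raw, "utf-8"), valid_as(raw, "iso-8859-1"))
print(repr(as_utf8))
print(repr(as_latin1))
```

With only a couple of non-ASCII characters in the input, a detector has very little evidence to weigh one candidate against the other, which is presumably why adding a new ISO-8859-1 language model could tip the result.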
Attachment 133604, "UTF-8 text.":
utf8.txt