Discussion:
Encoding of etc/HELLO
Eli Zaretskii
2018-04-20 13:25:28 UTC
Michael, your recent changes to encode HELLO in UTF-8 are problematic
and AFAIU should be reverted, because they lose the CJK charset
information. See

http://lists.gnu.org/archive/html/emacs-devel/2009-08/msg01409.html
http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00429.html
http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00475.html

and the surrounding discussions for more about that.
Michael Albinus
2018-04-20 15:34:45 UTC
Eli Zaretskii <***@gnu.org> writes:

Hi Eli,
Post by Eli Zaretskii
Michael, your recent changes to encode HELLO in UTF-8 are problematic
and AFAIU should be reverted, because they lose the CJK charset
information. See
http://lists.gnu.org/archive/html/emacs-devel/2009-08/msg01409.html
http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00429.html
http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00475.html
and the surrounding discussions for more about that.
I see. No problem to revert the patch, it isn't important.

However, quoting the last reference above

--8<---------------cut here---------------start------------->8---
When a file is in some legacy encoding such as iso-2022-7bit, Emacs
attached charset properties to proper ranges of text, which works as a
hint for selecting a proper font especially for CJK characters.
--8<---------------cut here---------------end--------------->8---

I'm wondering why it is possible to attach charset properties for
iso-2022-7bit, but not for utf-8. Note that I don't know too much about
this topic.

Best regards, Michael.
Eli Zaretskii
2018-04-20 16:00:30 UTC
Date: Fri, 20 Apr 2018 17:34:45 +0200
--8<---------------cut here---------------start------------->8---
When a file is in some legacy encoding such as iso-2022-7bit, Emacs
attached charset properties to proper ranges of text, which works as a
hint for selecting a proper font especially for CJK characters.
--8<---------------cut here---------------end--------------->8---
I'm wondering why it is possible to attach charset properties for
iso-2022-7bit, but not for utf-8. Note that I don't know too much about
this topic.
Because we don't have infrastructure for tagging sub-ranges of Unicode
with character sets (and in some sense, that would make little sense,
because Unicode is a unifying encoding).

ISO-2022 has built-in features to tag portions of text as belonging to
some specific charset.
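
To make that tagging concrete, here is a minimal Python sketch. It is only an illustration, not Emacs code: it assumes Python's built-in `iso2022_jp_2` codec as a stand-in for Emacs's iso-2022-7bit, and the helper name `designations` is made up for this example. It lists the escape sequences that switch charsets inside the encoded bytestream.

```python
import re

# An escape sequence here is ESC, one or more intermediate bytes
# (0x20-0x2F), then a final byte (0x30-0x7E). ISO-2022 charset
# designations such as ESC $ B have this shape.
ESCAPE = re.compile(rb"\x1b[\x20-\x2f]+[\x30-\x7e]")


def designations(data):
    """Return the escape sequences found in data, in order (hypothetical helper)."""
    return ESCAPE.findall(data)


# Japanese kana, then ASCII, then Korean hangul.
text = "こんにちは / 안녕하세요"

# Assumption: iso2022_jp_2 stands in for Emacs's iso-2022-7bit.
encoded = text.encode("iso2022_jp_2")

for esc in designations(encoded):
    print(esc)

# The charset switches are part of the bytes, so a decoder can tell
# which charset each run of text was written in.
roundtrip = encoded.decode("iso2022_jp_2")
print(roundtrip == text)
```

A UTF-8 encoding of the same string contains no such markers, which is why a decoder has nothing to attach a charset property to.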
Stefan Monnier
2018-04-20 16:16:05 UTC
Post by Eli Zaretskii
Because we don't have infrastructure for tagging sub-ranges of Unicode
with character sets (and in some sense, that would make little sense,
because Unicode is a unifying encoding).
Does Unicode offer a way to do that (i.e. is it a limitation on our
support of Unicode, or is it a limitation in the Unicode spec)?


Stefan
Eli Zaretskii
2018-04-20 17:22:32 UTC
Date: Fri, 20 Apr 2018 12:16:05 -0400
Post by Eli Zaretskii
Because we don't have infrastructure for tagging sub-ranges of Unicode
with character sets (and in some sense, that would make little sense,
because Unicode is a unifying encoding).
Does Unicode offer a way to do that (i.e. is it a limitation on our
support of Unicode, or is it a limitation in the Unicode spec)?
Unicode has language tag characters, but they are deprecated and their
use is discouraged.
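
For reference, a small Python sketch of what such a language tag looks like at the codepoint level. The helper `language_tag` is hypothetical and exists only for this illustration; the sketch relies on the fact that the tag characters mirror ASCII at an offset of 0xE0000, and it does not imply that any renderer honors these tags.

```python
import unicodedata

LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG (deprecated)


def language_tag(lang):
    """Spell an ASCII language code with Unicode tag characters (hypothetical helper)."""
    if not lang.isascii():
        raise ValueError("language code must be ASCII")
    # Each tag character is the ASCII character shifted up by 0xE0000.
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in lang)


tag = language_tag("ja")
for ch in tag:
    print(f"U+{ord(ch):05X}", unicodedata.name(ch))
```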

In any case, I don't think Unicode features are relevant here, because
we already have char-script-table, which is all you can do with a
unified codepoint space. The whole point of ISO-2022 is that the same
Unicode codepoints can come from different ISO-2022 charsets, and the
ISO-2022 encoding keeps that information in the bytestream.
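
A minimal Python sketch of that point, assuming Python's `iso2022_jp` and `iso2022_kr` codecs as stand-ins for the corresponding parts of Emacs's ISO-2022 support. The same Han character produces different bytes depending on the charset it is taken from, while its UTF-8 form is the same in every context.

```python
han = "\u4e2d"  # 中, present in both JIS X 0208 and KS X 1001

as_japanese = han.encode("iso2022_jp")
as_korean = han.encode("iso2022_kr")
as_utf8 = han.encode("utf-8")

print(as_japanese)
print(as_korean)
print(as_utf8)

# Different bytestreams, one and the same decoded codepoint.
same_codepoint = (
    as_japanese.decode("iso2022_jp") == as_korean.decode("iso2022_kr") == han
)
print(same_codepoint)
```

Once decoded into a unified codepoint space, the origin of the character is gone unless the decoder records it separately, which is what the charset text property does.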
Stefan Monnier
2018-04-20 20:42:02 UTC
Post by Eli Zaretskii
Unicode has language tag characters, but they are deprecated and their
use is discouraged.
In any case, I don't think Unicode features are relevant here, because
we already have char-script-table, which is all you can do with a
unified codepoint space.
Yes, I understand this part of the situation.
Post by Eli Zaretskii
The whole point of ISO-2022 is that the same Unicode codepoints can
come from different ISO-2022 charsets, and the ISO-2022 encoding keeps
that information in the bytestream.
My question was meant to see if there's a way to encode a similar kind
of charset info into the bytestream. From what you say above, there is
such a thing but its use is discouraged.

Clearly this problem is not specific to Emacs, so what do people do?
Hold on to iso-2022 for as long as they can (like we do in Emacs)?
Give up on these "details" of rendering for files using a mix of C, J, and K?
Rely on higher-level info (XML tags and friends) to carry the charset info?


Stefan
Clément Pit-Claudel
2018-04-20 21:02:16 UTC
Post by Stefan Monnier
Rely on higher-level info (XML tags and friends) to carry the charset info?
I think that's what people typically do, yes. The table at https://en.wikipedia.org/wiki/Variant_Chinese_character#Usage_in_computing is a good example of using the lang and xml:lang attributes.
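
A rough Python sketch of that approach, with a hypothetical helper `tag_runs` that is not taken from any existing tool: each run of text is wrapped in a span whose lang attribute tells the renderer which regional glyph style to prefer.

```python
from html import escape


def tag_runs(runs):
    """Wrap each (lang, text) pair in a span carrying a lang attribute."""
    return "".join(
        f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'
        for lang, text in runs
    )


# The same codepoint, U+76F4, marked up for three different languages.
markup = tag_runs([("zh-Hans", "\u76f4"), ("ja", "\u76f4"), ("ko", "\u76f4")])
print(markup)
```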
Paul Eggert
2018-04-20 21:26:10 UTC
Post by Stefan Monnier
Clearly this problem is not specific to Emacs, so what do people do?
Hold on to iso-2022 for as long as they can (like we do in Emacs)?
Give up on these "details" of rendering for files using a mix of C, J, and K?
Rely on higher-level info (XML tags and friends) to carry the charset info?
For most uses, people typically just use UTF-8 and give up on the
details, which tend to be in areas that many users don't care much about
anyway. In practice if (say) a Japanese reader sees a Chinese quotation
in a page of Japanese text, there's an excellent chance the reader won't
much mind that the Chinese characters are rendered in Japanese-style, as
this has long been common practice in Japanese printing anyway.

There are of course exceptions where it really matters which font you
use, such as the Wikipedia page on Chinese character variants that
Clément mentioned. But these are rare, and are typically handled by
means other than plain text. It's like the Wikipedia page on kerning,
which uses images rather than plain UTF-8 text to illustrate how to kern
characters properly.

I mildly prefer multilingual text to be rendered in a consistent style
for my language, as opposed to having it rendered separately for readers
of each of its component languages, as this makes the text a bit easier
for me to read (which is the point of text, isn't it?). But this of
course is merely a style preference.

For what it's worth, the April 2018 w3techs.com numbers say that UTF-8
is used by 91.3% of websites whose character encoding they know, and
that this number is steadily growing (it was 88.9% a year ago). In
contrast, ISO 2022 usage is declining steadily. Of course the web is not
the entire universe; still, it's pretty clear which way the world is going.
Eli Zaretskii
2018-04-21 07:07:38 UTC
Date: Fri, 20 Apr 2018 16:42:02 -0400
Post by Eli Zaretskii
The whole point of ISO-2022 is that the same Unicode codepoints can
come from different ISO-2022 charsets, and the ISO-2022 encoding keeps
that information in the bytestream.
My question was meant to see if there's a way to encode a similar kind
of charset info into the bytestream. From what you say above, there is
such a thing but its use is discouraged.
If you mean a Unicode-compatible bytestream, then yes, that's the
feature I know of. But if we want to use it in Emacs, we should
modify the UTF-x decoders to put the charset properties on the decoded
text, or invent a new property (since charset is currently 'unicode'),
and then augment the font selection code to consider that new
property.
Clearly this problem is not specific to Emacs, so what do people do?
Hold on to iso-2022 for as long as they can (like we do in Emacs)?
Give up on these "details" of rendering for files using a mix of C, J, and K?
Rely on higher-level info (XML tags and friends) to carry the charset info?
I don't know. Several years ago, I think each vendor used a private
extension of ISO-2022 to support the emoji, not sure if that is still
the case, especially since the number of standardized emoji continues
to grow all the time. We could perhaps follow one such extension in
our support of ISO-2022. Or we could decide that the Han unification
has conquered the world, and therefore the CJK charset distinction for
font selection is no longer important enough for us, in which case we
could recode HELLO in UTF-8.

I've added Handa-san to this discussion in the hope that he could
comment on what would be the best way forward.
Michael Welsh Duggan
2018-04-21 14:58:53 UTC
Post by Eli Zaretskii
Date: Fri, 20 Apr 2018 16:42:02 -0400
Post by Eli Zaretskii
The whole point of ISO-2022 is that the same Unicode codepoints can
come from different ISO-2022 charsets, and the ISO-2022 encoding keeps
that information in the bytestream.
My question was meant to see if there's a way to encode a similar kind
of charset info into the bytestream. From what you say above, there is
such a thing but its use is discouraged.
If you mean a Unicode-compatible bytestream, then yes, that's the
feature I know of. But if we want to use it in Emacs, we should
modify the UTF-x decoders to put the charset properties on the decoded
text, or invent a new property (since charset is currently 'unicode'),
and then augment the font selection code to consider that new
property.
Clearly this problem is not specific to Emacs, so what do people do?
Hold on to iso-2022 for as long as they can (like we do in Emacs)?
Give up on these "details" of rendering for files using a mix of C, J, and K?
Rely on higher-level info (XML tags and friends) to carry the charset info?
I don't know. Several years ago, I think each vendor used a private
extension of ISO-2022 to support the emoji, not sure if that is still
the case, especially since the number of standardized emoji continues
to grow all the time. We could perhaps follow one such extension in
our support of ISO-2022. Or we could decide that the Han unification
has conquered the world, and therefore the CJK charset distinction for
font selection is no longer important enough for us, in which case we
could recode HELLO in UTF-8.
I would suppose that the usual way to do this (encode glyph variants in
a Unicode-compatible bytestream) would be to use some form of document
markup. In Emacs's case, enriched-mode would seem an ideal candidate
for this. RFC-1896 specifically supports private extensions for
attributes using the "X-" syntax, and enriched.el is small and should be
simple to modify for this purpose.
--
Michael Welsh Duggan
(***@md5i.com)
Michael Albinus
2018-04-20 17:39:21 UTC
Post by Eli Zaretskii
Because we don't have infrastructure for tagging sub-ranges of Unicode
with character sets (and in some sense, that would make little sense,
because Unicode is a unifying encoding).
ISO-2022 has built-in features to tag portions of text as belonging to
some specific charset.
Thanks for the explanation. As said I have no knowledge about the topic,
but I'm still surprised that something like this isn't possible with utf-8.

Best regards, Michael.
Eli Zaretskii
2018-04-21 07:10:39 UTC
Date: Fri, 20 Apr 2018 19:39:21 +0200
I'm still surprised that something like this isn't possible with utf-8.
UTF-8 cannot encode language-specific differences of a given
character, that is something that is against the basic principle of
Unicode: that each character has one and only one encoding.

This is one reason why Emacs uses a superset of Unicode in its
internal representation, btw.
Clément Pit-Claudel
2018-04-21 14:40:41 UTC
Post by Eli Zaretskii
UTF-8 cannot encode language-specific differences of a given
character, that is something that is against the basic principle of
Unicode: that each character has one and only one encoding.
Aren't variation selectors used for a similar purpose, though? (Maybe I'm misunderstanding what they are for).
Eli Zaretskii
2018-04-21 15:43:23 UTC
Date: Sat, 21 Apr 2018 10:40:41 -0400
Post by Eli Zaretskii
UTF-8 cannot encode language-specific differences of a given
character, that is something that is against the basic principle of
Unicode: that each character has one and only one encoding.
Aren't variation selectors used for a similar purpose, though?
I'm not sure, but I don't think so. The variation selectors specify
glyphs, not font selection. But I admit I don't know enough about
this, so I might be mistaken.

(We use variation selectors only in macfont.m.)
Paul Eggert
2018-04-21 15:52:08 UTC
Post by Clément Pit-Claudel
Aren't variation selectors used for a similar purpose, though?
They can be used for that, yes, though it's safe to say this would be
bleeding-edge stuff. As I understand it, Adobe and others use them so that one
can round-trip from Adobe formats into UTF-8 and back without losing information
about ideograph variants. However, in practice variation selectors tend to be
proprietary, so they're a bit of a minefield.
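
For illustration, a short Python sketch of what an ideographic variation sequence looks like in UTF-8. The base character is chosen only as an example, and the sketch does not check whether this particular sequence is registered in any variation database.

```python
import unicodedata

BASE = "\u845b"      # 葛, a unified ideograph used here as an example
VS17 = "\U000E0100"  # first selector in the Variation Selectors Supplement

# A variation sequence is just the base character followed by a selector.
sequence = BASE + VS17

print(len(sequence))
print(unicodedata.name(VS17))
print(sequence.encode("utf-8").hex(" "))
```

The selector travels in the plain-text bytestream, but a renderer still needs a font that maps the sequence to the intended glyph.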

As far as etc/HELLO goes, a couple of years ago Ken Lunde proposed the PanCJKV
ideographic variation database collection for East Asian variation; see:

https://github.com/adobe-type-tools/pancjkv-ivd-collection

It ran into some roadblocks, though, briefly described here:

http://www.unicodeconference.org/presentations/S8T2-Lunde.pdf
Stefan Monnier
2018-04-23 02:53:58 UTC
Post by Eli Zaretskii
UTF-8 cannot encode language-specific differences of a given
character, that is something that is against the basic principle of
Unicode: that each character has one and only one encoding.
But along the way they discovered that it's sometimes difficult to
decide whether two "things" should be considered one and the same
character or not. They ended up with a set of "rules" to make those
decisions, but it's not nearly as simple as "each character has one and
only one encoding".


Stefan
Eli Zaretskii
2018-04-23 15:07:09 UTC
Date: Sun, 22 Apr 2018 22:53:58 -0400
Post by Eli Zaretskii
UTF-8 cannot encode language-specific differences of a given
character, that is something that is against the basic principle of
Unicode: that each character has one and only one encoding.
But along the way they discovered that it's sometimes difficult to
decide whether two "things" should be considered one and the same
character or not. They ended up with a set of "rules" to make those
decisions, but it's not nearly as simple as "each character has one and
only one encoding".
Not sure what you allude to here. Are you talking about the variation
selectors?
Stefan Monnier
2018-04-23 15:23:39 UTC
Post by Eli Zaretskii
Post by Stefan Monnier
But along the way they discovered that it's sometimes difficult to
decide whether two "things" should be considered one and the same
character or not. They ended up with a set of "rules" to make those
decisions, but it's not nearly as simple as "each character has one and
only one encoding".
Not sure what you allude to here.
For example the fact that some CJK characters should be displayed
differently depending on whether they're part of a C text, or a J text,
or a K text, so are they really "one and the same character"?

Of course, there are other related choices: which versions of β should
be one and the same and which shouldn't (e.g. I currently see in Unicode
a greek and a latin version plus some variants of a math version (tho
none in "roman" shape))?

There are murky areas, with no "one right answer", although Unicode has
had to choose somehow, i.e. doing the best it can with a messy situation.


Stefan
Eli Zaretskii
2018-04-23 16:12:14 UTC
Date: Mon, 23 Apr 2018 11:23:39 -0400
Post by Eli Zaretskii
Post by Stefan Monnier
But along the way they discovered that it's sometimes difficult to
decide whether two "things" should be considered one and the same
character or not. They ended up with a set of "rules" to make those
decisions, but it's not nearly as simple as "each character has one and
only one encoding".
Not sure what you allude to here.
For example the fact that some CJK characters should be displayed
differently depending on whether they're part of a C text, or a J text,
or a K text, so are they really "one and the same character"?
This situation existed before Unicode. Unicode tries to overcome it;
thus "Han unification".
Paul Eggert
2018-04-20 16:56:05 UTC
Post by Michael Albinus
No problem to revert the patch, it isn't important.
If you revert it, please also revert commit
0585bd643dae2592214e77998b875347e6e59bab, which I installed before
seeing this thread.

It's true that this isn't important. Still, I like the "hello"
emoji; it's friendly.
Michael Albinus
2018-04-20 17:37:30 UTC
Post by Paul Eggert
Post by Michael Albinus
No problem to revert the patch, it isn't important.
If you revert it, please also revert commit
0585bd643dae2592214e77998b875347e6e59bab, which I installed before
seeing this thread.
Done. I've reverted 0585bd643dae2592214e77998b875347e6e59bab and
c4cfb5d20487f9912f5896b3f1d291fe7ccc9804. I haven't reverted
e2ae724460e6d73d3ddcc6066427471799c4bd57, because Stefan did commit a
better patch on top of this.
Post by Paul Eggert
It's true that this isn't important. Still, I like the "hello"
emoji; it's friendly.
Yes, that was the idea. It's a pity that we cannot add valid utf-8
characters to etc/HELLO, when they are not iso-2022-7bit compatible.
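
A minimal Python sketch of the limitation, assuming Python's `iso2022_jp_2` codec as an approximation. Emacs's iso-2022-7bit coding system covers more charsets than this codec does, so this only illustrates the kind of failure, not the exact Emacs behavior.

```python
wave = "\U0001F44B"  # WAVING HAND SIGN, the "hello" emoji

# UTF-8 has no trouble with a codepoint beyond U+FFFF.
print(wave.encode("utf-8"))

# None of the charsets this ISO-2022 codec can designate contains it.
try:
    wave.encode("iso2022_jp_2")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```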

Best regards, Michael.
Juri Linkov
2018-04-21 20:31:22 UTC
Post by Michael Albinus
Post by Paul Eggert
It's true that this isn't important. Still, I like the "hello"
emoji; it's friendly.
Yes, that was the idea. It's a pity that we cannot add valid utf-8
characters to etc/HELLO, when they are not iso-2022-7bit compatible.
I don't understand why it's impossible to create a charset like the
existing mule-unicode-e000-ffff but for character range over U+FFFF
to include such characters as U+1F44B. Or is this an inherent limitation
of the iso-2022-7bit coding system?
Eli Zaretskii
2018-04-23 16:25:41 UTC
Date: Sat, 21 Apr 2018 23:31:22 +0300
I don't understand why it's impossible to create a charset like the
existing mule-unicode-e000-ffff but for character range over U+FFFF
to include such characters as U+1F44B. Or is this an inherent limitation
of the iso-2022-7bit coding system?
I'm not sure I understand your proposal. Are you suggesting to create
a Mule charset covering just the Emoji block? That could be possible
(assuming ISO-2022 still has vacant charset slots available, something
that I don't think I know how to determine reliably, and assuming we
decipher the black art of using define-charset). But is it worth
doing just for Emoji?

If you mean to add a larger range of characters, then I think a single
ISO-2022 compatible charset can support at most 8192 characters, so we
will need a lot of charsets to cover codepoints between U+10000 and
U+2FFFF, and I'm not sure we have that many vacant slots.
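
As a back-of-the-envelope check of "a lot of charsets", a small Python sketch. The 8192 figure is the one given above; the 94x94 and 96x96 sizes are the standard capacities of two-byte ISO-2022 graphic sets and are included only for comparison. The helper name `charsets_needed` is made up for this example.

```python
import math

FIRST, LAST = 0x10000, 0x2FFFF
CODEPOINTS = LAST - FIRST + 1  # 0x20000 = 131072 codepoints


def charsets_needed(per_charset):
    """Number of charsets of the given capacity needed to cover the range."""
    return math.ceil(CODEPOINTS / per_charset)


for capacity in (8192, 94 * 94, 96 * 96):
    print(capacity, charsets_needed(capacity))
```

Under any of these capacities the count comes out at around 15 or 16 additional charsets for this range alone.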

Or did you mean to suggest something else?
Juri Linkov
2018-04-23 20:05:08 UTC
Post by Eli Zaretskii
Post by Juri Linkov
I don't understand why it's impossible to create a charset like the
existing mule-unicode-e000-ffff but for character range over U+FFFF
to include such characters as U+1F44B. Or is this an inherent limitation
of the iso-2022-7bit coding system?
I'm not sure I understand your proposal. Are you suggesting to create
a Mule charset covering just the Emoji block? That could be possible
(assuming ISO-2022 still has vacant charset slots available, something
that I don't think I know how to determine reliably, and assuming we
decipher the black art of using define-charset). But is it worth
doing just for Emoji?
If you mean to add a larger range of characters, then I think a single
ISO-2022 compatible charset can support at most 8192 characters, so we
will need a lot of charsets to cover codepoints between U+10000 and
U+2FFFF, and I'm not sure we have that many vacant slots.
Or did you mean to suggest something else?
This is exactly what I meant. While using ISO-2022 encoding in HELLO
to represent Unicode characters is just an inconvenience, the inability
to encode all Unicode characters in ISO-2022 is a serious limitation.