'bytecompare' only applies the original Levenshtein distance algorihtm
to the ascii strings. It's not enabled by default.
As we agreed, it has some pleasant side effect for East Asian. It's not right
and wrong thing. Just side effect of UTF-8 encoding.
You can't increase character weight of certain race or religion in Emacs.
It's politically wrong.
So I suggest keep both unicode and ascii algorithm.
at the end of day.
Date: Tue, 17 Apr 2018 00:29:39 +1000
To be honest I'm 100% confident about my current solution. I'm
actually an expert in dealing unicode related C programming.
I appreciate your expertise, that's not the issue here. The main
issue is how to support all the range of characters that Emacs wants
It's standard practice to convert UTF-8 character to UTF-16
character before any further string operation.
To anwser your questions, I have to introduce East Asian culture at first.
China/Japan/Korea use same Hanzi (Chinese characters). China's
national standard GBK defines
21886 characters used Eas Asians.
UTF-16 is compatilbe with GBK.
Even most educated Chinese knows only 6000~8000 characters.
UTF-32 exists because there are another 50000 ancient Hanzi which no
Among 99089 characters of Unicode v5.0, 71226 characters are Hanzi.
In short, you got about 22000 non-hanzi characters and 21000 Hanzi
characters. UTF-16 is totally fine
to deal with it.
UTF-16 is fine for characters within the BMP, yes. But Emacs supports
much more, see below.
Say my name is 陈斌, my colleague's name is 张斌. Only one Hanzi difference, right?
While reading file named "陈斌's doucment0", I want to list other
documents with similar file names.
Using my byte compare algorithm, "陈斌's document1", "陈斌's document2" is
on the top of the sorted
document list. Make sense.
Without byte comparing, my family name "陈" is not more important than
an latin character
So files like "张斌's document1", "张斌's document2" is on the top of the list now.
You are saying that a difference of 1 character has more "weight" for
some characters than for others. That might be so, but using the
number of bytes in the UTF-8 representation as that weight makes very
little sense to me, because the length of the UTF-8 sequence for a
it depends on when was a particular character introduced into Unicode.
It has nothing to do with "importance" of a character.
Thus, your example might produce a sensible result when comparing
ASCII characters vs non-ASCII characters, but will make much less
sense when comparing one non-ASCII character with another non-ASCII
character. For example, emoji characters, which need 4 bytes in UTF-8
sequences, will be considered twice as "important" as the above Hanzi
characters of your example (and 4 times as important as ASCII). The
results will be totally unexpected and unpredictable (unless the user
knows the entire Unicode database by heart).
So I think features such as what you want to achieve with the above
example will have to use some other additional feature, like "weighted
difference", and using the byte length as a kind of surrogate "weight"
for this purpose is IMO not a good idea, even though it could look
like that at first glance.
2. As mentioned, China has national standard GBK which is compatible
Every character is a 16 bit integer. It's common practice to convert
UTF-8 string (or other encoded string)
to UTF-16 format before any further string operation. Or else, it's
impossbile to display Japanese, Korean, Chinese
in one document (considering tyical user operations, text
I've been doing this since day one of my career because I'm Chinese.
As I said, this is okay for characters in the BMP. But Emacs must
support much more.
First, we support the latest Unicode 10.0, which added a lot of
characters outside the BMP. You have the emoji there, you have many
new symbols there (the Alchemichal Symbols, Geometric Shapes, and
Supplemental Arrows-C blocks), Domino Tiles, Mathematical
Alphanumerics, Musical symbols, and many scripts that there's no
reason to tell people we don't support in Emacs.
Moreover, Emacs's internal representation of raw 8-bit bytes uses the
area #x3FFF00..#x3FFFFF, which need 5 bytes in its UTF-8 sequence. We
cannot exclude the possibility that strings we are comparing have
embedded raw bytes in them, this happens in practice and must work as
The integer is called wide character (wchar_t in standard C, or WCHAR
in Windows GUI SDK),
Windows indeed uses a 16-bit wchar_t type, which is why its support
for characters beyond the BMP is problematic and in some cases simply
non-existent, because standard C library functions that accept a
single wide character argument are unable to cope with UTF-16 encoding
that requires 2 16-bit words to represent a character. So, for
example, you don't have a way of looking for a character outside the
BMP in a wide character string by calling the standard function
wcschr, because its 2nd argument is a single wchar_t wide character,
which can only express characters inside the BMP.
Emacs cannot tolerate such limitations, not in the year 2018.
So using implementations that only support the BMP is really a
non-starter for Emacs. We cannot accept such code.
Now allow me explain why I can't use CHAR_STRING_ADVANCE. The first
parameter 'c' is ALREADY a wide character.
I apologize for my stupid typo. I meant STRING_CHAR_ADVANCE, not
CHAR_STRING_ADVANCE. At least I didn't make such a mistake in the 2nd
macro I mentioned, FETCH_STRING_CHAR_ADVANCE...
It's efficient to extract characters into one unsigned short array in
It is indeed quite efficient, but it requires one more pass of the
entire string, which is unnecessary. And anyway, the main reason not
to convert to UTF-16 is what I wrote above about supporting characters
beyond the BMP, not the extra loop.
Or else, I need convert character index to byte offset first. So I
need get byte offset by calling 'string_char_to_byte'
twice to get offset of two elements for comparing, The element coould
could be 1, 2, 3 bytes. So I still need convert bytes
into a 4 byte integer for comparing.
The macros I mentioned already do that for you, they manage both
character offsets and byte offsets efficiently (and they are much more
efficient than string_char_to_byte because they advance sequentially,
whereas string_char_to_byte is for random access to a string).
P.S. Please, let's return this discussion to the list, where it
belongs. I'm disappointed that this was again a private email, and
the list isn't CC'ed. This discussion should be public, not private.