Re: Escapes in string literals and locales



Eli Barzilay <eli@barzilay.org> writes:
>On May 16, Riku Saikkonen wrote:
>> Perhaps. Though, the parameter could have an intelligent
>> platform-specific default - even if it isn't always correct, it
>> could guess right in most cases.
>I would guess that finding a platform-specific default is a nightmare
>-- you have Unix, DOS boxes, telnets, and don't forget the GUI, which
>would change these defaults, which are platform _and_ configuration
>dependent...

At least on Unix, the situation usually isn't that bad. The user's
locale settings (if they're correctly set) tell programs directly
whether the user's current terminal supports ISO 8859-1. It's then up
to the user and the various terminal programs to set the locale (the
LANG environment variable, etc.).
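
A minimal sketch of how a program could read that setting by hand. The
helper names are made up for illustration. The lookup order (LC_ALL,
then LC_CTYPE, then LANG) is the usual POSIX precedence, and the
"8859-1" suffix test is only the rough heuristic attributed to Emacs
elsewhere in this thread, not a complete charset check:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: does a locale name such as "fi_FI.ISO8859-1"
   end in "8859-1"?  Names with a modifier (e.g. "@euro") or a
   different spelling of the charset will not match. */
static int name_ends_in_8859_1(const char *name)
{
    static const char suffix[] = "8859-1";
    size_t n = strlen(name);
    size_t m = sizeof suffix - 1;
    return n >= m && strcmp(name + n - m, suffix) == 0;
}

/* Hypothetical helper: return the locale name that governs character
   handling, using the first of LC_ALL, LC_CTYPE, LANG that is set and
   non-empty, and falling back to "C" when none is. */
static const char *effective_ctype_locale(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && *value != '\0')
            return value;
    }
    return "C";
}
```

A caller would combine the two as
name_ends_in_8859_1(effective_ctype_locale()).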

I don't know too much about other platforms, but I think both Windows
and Mac tend to use their own fixed character sets, so on those
platforms the default could be simply the platform's native charset.

>> For example, Emacs on Unix seems to get away with expecting a locale
>> ending in "8859-1" to enable display of 8-bit characters on a
>> terminal by default...
>Emacs is a good example -- it took *years* to get things right, and
>it's not done yet.  (But they went for Unicode which is a bigger
>pain).

Unicode and the multicharset handling of Emacs are, I think, a
separate problem. For these string literals, we only need to know
which characters are printable. Emacs's multibyte support also needs
to know how to display each character (what fonts to use, whether to
do charset conversion, etc.) and how to detect character sets from
files.

(Writing a string from MzScheme (using write) and then reading it on
another platform won't magically convert the character set, and doing
that conversion would be more complex to implement. But I think it's
a separate issue.)


Hmm. It just occurred to me that GNU libc supports locales. Yes, the
GNU libc documentation says that isprint() does actually know about
the character set specified by the locale, if the locale has been set
up. So, if you use isprint(), the only thing you'd need to do under
Linux is to call setlocale(LC_ALL, ""); at the start of MzScheme.
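
A minimal sketch of that idea in C, assuming a byte-oriented string
writer. Both function names are hypothetical and this is not how
MzScheme actually prints strings. Bytes that isprint() accepts under
the current locale are copied as they are, and everything else
(including, for simplicity, backslash and double quote) is written as
a three-digit octal escape:

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical startup hook: adopt the locale named by the user's
   environment.  Returns 1 on success, 0 if the locale is unavailable
   (the program then stays in the "C" locale). */
static int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: copy `in` to `out`, keeping bytes that are
   printable in the current locale and escaping the rest as \ooo.
   Returns 0 on success, -1 if `out` is too small. */
static int escape_for_write(const char *in, char *out, size_t outsz)
{
    size_t used = 0;
    if (outsz == 0)
        return -1;
    for (; *in != '\0'; in++) {
        /* isprint() must be given an unsigned char value, not a
           possibly negative plain char. */
        unsigned char c = (unsigned char)*in;
        if (isprint(c) && c != '\\' && c != '"') {
            if (used + 2 > outsz)   /* the byte plus the final NUL */
                return -1;
            out[used++] = (char)c;
        } else {
            if (used + 5 > outsz)   /* "\ooo" plus the final NUL */
                return -1;
            snprintf(out + used, 5, "\\%03o", (unsigned)c);
            used += 4;
        }
    }
    out[used] = '\0';
    return 0;
}
```

In the "C" locale the byte 0xE4 (ä in ISO 8859-1) comes out as \344.
After use_environment_locale() succeeds with an ISO 8859-1 locale, it
should be copied through unchanged, provided the system's locale data
marks it as printable.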

setlocale(3) appears to be standardised in POSIX.1 (it is also part of
ISO C), so the same call should work on most other Unix systems too,
though I'm not sure how good their locale support is (so it might not
do anything useful). Note that setlocale(3) also affects other
string-processing functions (for example, toupper(3) knows to convert
ä to Ä), but I don't know if this is a problem - usually it does just
the right thing (e.g., R5RS doesn't forbid converting ä to Ä in
char-upcase, and doing it is usually more useful than assuming
everything is in plain ASCII).
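
A small sketch of how that locale dependence shows up in toupper(),
assuming single-byte characters. The helper name is made up, and
whether a given ISO 8859-1 locale is installed depends on the system:

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: upcase one byte under the named locale, then
   restore the previous LC_CTYPE setting.  Returns the upcased value,
   or -1 if the locale could not be selected. */
static int upcase_in_locale(const char *locale_name, unsigned char c)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);

    /* Copy the current name: later setlocale() calls may overwrite
       the string that `current` points to. */
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    int up = toupper(c);
    setlocale(LC_CTYPE, saved);
    return up;
}
```

With "C" as the locale, byte 0xE4 (ä in ISO 8859-1) is returned
unchanged. With an installed ISO 8859-1 locale (the exact name varies
between systems) the expected result is 0xC4 (Ä).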

One possible problem in supporting locales this way is that it makes
the semantics of some primitive procedures more complex - you need to
know what a locale is, and which one is active, to predict whether ä
is converted to Ä by char-upcase. Though, in most cases you can just
assume that it does the Right Thing.

Perhaps the locale could be another parameter, like
  (support-internationalization #t) => setlocale(LC_ALL, "") and
  (support-internationalization #f) => setlocale(LC_ALL, "C").
(I think the latter setlocale turns the locale support off: "C" is the
minimal locale a C program starts in before any setlocale call.)
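
On the C side, such a parameter could reduce to something like the
sketch below. The function name just mirrors the proposed Scheme
parameter and is an assumption, not an existing MzScheme API:

```c
#include <locale.h>

/* Hypothetical C-level setter behind the proposed parameter.
   enabled != 0: use the locale named by the user's environment.
   enabled == 0: go back to the minimal "C" locale.
   Returns 1 on success, 0 if the requested locale is unavailable. */
static int support_internationalization(int enabled)
{
    return setlocale(LC_ALL, enabled ? "" : "C") != NULL;
}
```

The return value matters mostly for the #t case, since the
environment may name a locale that isn't installed.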

>> (By the way, your mail user agent seems to have problems with MIME:
>Heh...  My email is VM under FSF Emacs.  It is one of the better
>emailers, and as you can see - it still has problems due to this whole
>charset mess...

:) I use Gnus with GNU Emacs, and it seems to work well (nowadays - it
also had MIME problems about a year or two ago).

-- 
-=- Rjs -=- rjs@lloke.dna.fi