[patch] scp + UTF-8
Ingo Schwarze
schwarze at usta.de
Wed Jan 20 12:05:00 AEDT 2016
Hi Roland,
Roland Mainz wrote on Wed, Jan 20, 2016 at 01:13:10AM +0100:
> Some generic portability comments:
> 1. There are other modern encodings like GB18030
Yes, but there are no plans to support any other encodings except
UTF-8 in the OpenBSD base system, so supporting other encodings
would be a matter for the portable version, if at all. I will
consider whether it is possible to write multibyte character support
in a way that doesn't result in obfuscation (and hence loss of
security) on OpenBSD and yet supports other encodings elsewhere,
but i'm not yet sure that will be possible. In case of the slightest
doubt, i expect OpenSSH developers will prioritize security over
additional encoding support.
> (support is even mandatory for software sold to the government in
> PRC China)
I'm not aware of any plans to sell OpenSSH to the government of
China, but they are of course welcome to use it for free.
> 2. |wcwidth()| counts in terminal cells and not number of characters
> (where one character might occupy one or more bytes), e.g. there are
> characters which may occupy from zero to four terminal cells (actual
> number of cells is a bit (not much) OS specific).
I never heard about any characters occupying more than three cells.
As far as i know, the result of wcwidth(3) is not specified by the
Unicode standard, so i'm usually looking at the Perl implementation
as a reference. Last time i looked there, i didn't find any actual
characters occupying more than two cells, even though characters
of width three might in principle be possible.
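To illustrate the point about cells versus characters, here is a minimal sketch (not taken from the patch under discussion) that sums wcwidth(3) results without assuming any upper bound per character. The helper name display_width() is hypothetical, and the results depend on the current locale.

```c
#define _XOPEN_SOURCE 700	/* for wcwidth(3) */
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: number of terminal cells needed for the wide
 * string ws, or -1 if it contains a character that wcwidth(3)
 * reports as non-printable.  Nothing here assumes a width of at
 * most 2; whatever wcwidth(3) returns is simply added up.
 */
int
display_width(const wchar_t *ws)
{
	int total = 0;
	int w;

	assert(ws != NULL);
	for (; *ws != L'\0'; ws++) {
		w = wcwidth(*ws);
		if (w == -1)
			return -1;	/* non-printable character */
		total += w;		/* may be 0, 1, 2, or more */
	}
	return total;
}
```

A caller that truncates a file name to fit the terminal would compare this total against the number of columns rather than against the number of characters or bytes.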
> 3. I am not sure whether there is a specific byte limit for UTF-8
> in any of the standards,
Yes, current Unicode limits codepoints to U+0000 to U+10FFFF, which
limits UTF-8 to one to four bytes. But five and six byte UTF-8
sequences were considered in the past, so you are right that we
should make sure that nothing breaks if some system has bogus
support for those.
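As a sketch of what "nothing breaks" could mean at the byte level, the following classifies a UTF-8 lead byte and rejects the obsolete five and six byte forms. The function names are hypothetical and this is an illustration under the current U+10FFFF limit, not code from OpenSSH. It deliberately checks only the lead byte and the continuation byte pattern; a full validator would also have to reject overlong three and four byte forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the UTF-8 sequence
 * introduced by the given lead byte, or 0 if the byte cannot start
 * a sequence under the current U+0000 to U+10FFFF limit.
 */
size_t
utf8_seq_len(unsigned char lead)
{
	if (lead < 0x80)
		return 1;			/* ASCII */
	if (lead >= 0xC2 && lead <= 0xDF)
		return 2;			/* 0xC0, 0xC1 are overlong */
	if (lead >= 0xE0 && lead <= 0xEF)
		return 3;
	if (lead >= 0xF0 && lead <= 0xF4)
		return 4;			/* 0xF5 and up exceed U+10FFFF */
	return 0;	/* continuation byte, or old 5/6 byte lead 0xF8..0xFD */
}

/*
 * Hypothetical helper: length of the first sequence in s, of which
 * n bytes are available, or 0 if it is invalid or truncated.
 */
size_t
utf8_first_len(const unsigned char *s, size_t n)
{
	size_t len, i;

	assert(s != NULL);
	if (n == 0)
		return 0;
	len = utf8_seq_len(s[0]);
	if (len == 0 || len > n)
		return 0;
	for (i = 1; i < len; i++)
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
	return len;
}
```

Treating a five or six byte lead as invalid, rather than trying to skip over it, means such input falls into the same error path as any other bad byte.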
> e.g. "- To support terminals larger than MAX_WINSIZE and still be
> properly indented I increased the buf size to 4x the size
> of MAX_WINSIZE, since the maximum size of an UTF-8 char <should>
> be 4 bytes." might not be a portable assumption and I would
> at least safeguard it.
Yes, thank you for your comments; i have taken notes in my TODO file
so that they will not be forgotten when reviewing future patches.
In particular the last one is quite important:
 * scp(1) comments by Roland Mainz:
   - try to make things work even with non-UTF-8 outside OpenBSD, if easy
   - make sure nothing breaks for wcwidth(...) > 2
   - make sure nothing breaks for MB_CUR_MAX > 4
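One possible way to safeguard the buffer size, sketched here as an assumption rather than as what any patch actually does, is to derive it from MB_CUR_MAX at run time instead of hard-coding a factor of 4. The helper name line_buf_size() and its cols parameter are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: bytes needed to hold cols multibyte
 * characters plus a terminating NUL in the current locale, or 0 if
 * the multiplication would overflow.  MB_CUR_MAX is a run-time
 * value that depends on the locale; MB_LEN_MAX is its compile-time
 * upper bound.
 */
size_t
line_buf_size(size_t cols)
{
	size_t per_char = MB_CUR_MAX;

	assert(per_char >= 1 && per_char <= MB_LEN_MAX);
	if (cols > (SIZE_MAX - 1) / per_char)
		return 0;		/* would overflow */
	return cols * per_char + 1;	/* + 1 for the NUL */
}
```

Note that this is only a bound if every column holds exactly one character. Zero-width characters such as combining accents occupy bytes without occupying cells, so the code filling the buffer still has to check the remaining space before each write.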
Yours,
Ingo