[patch] scp + UTF-8

Wed Jan 20 10:54:41 AEDT 2016

Hi Darren,

Darren Tucker wrote on Wed, Jan 20, 2016 at 09:39:33AM +1100:
> On Wed, Jan 20, 2016 at 8:48 AM, Ingo Schwarze <schwarze at usta.de> wrote:

>> Martijn sent the following patch to me in private and agreed that i
>> post it here.
>>
>> In any other program in OpenBSD base, i'd probably agree with the
>> basic approach.  Regarding OpenSSH, however, i worry whether wcwidth(3)
>> can be used.  While wcwidth(3) is POSIX, it is not ISO C.  Does
>> OpenSSH target platforms that don't provide wcwidth(3)?

> OpenSSH nominally targets POSIX, but it builds on a wide enough range
> of platforms that it's likely at least some don't have it.
> 
> Our general approach is to target POSIX then implement any needed
> missing bits either by stealing the implementation from OpenBSD, some
> other BSD licensed source or writing from scratch.  If we have to
> we'll ifdef stuff but prefer not to.

Sure, that's what i expected.

>> If so, do you think the problem can be solved by simply providing
>> US-ASCII support only on such platforms, but no UTF-8 support at all?

> Yes.  That's what I did with mblen when we picked up a need for that
> via libedit for platforms with no wide character support.
> 
> $ grep -i mblen openbsd-compat/*.h
> openbsd-compat/openbsd-compat.h:#ifndef HAVE_MBLEN
> openbsd-compat/openbsd-compat.h:# define mblen(x, y)  (1)

Uh oh.  I'm not quite sure what consequences that might entail in
libedit for sftp(1), which does use setlocale(LC_CTYPE, "")?
Did you audit those consequences?

> Is there any reason the same approach would not work with wcwidth?

  #define wcwidth(x) (1)   /* NO!! */

would be a security risk.  One purpose of wcwidth(3) is to weed out
non-printable characters.  Whatever replacement we use, we have to
make sure it returns -1 for every non-printable character on every
platform.  We MUST NOT let the scp(1) progressmeter spew random
Unicode characters taken from the network at the user's terminal.
They might be control codes.

  #define wcwidth(x) (-1)   /* not really */

is not a security issue, but it would completely break filename
display even with the C/POSIX locale on those platforms.  I briefly
considered

  int
  wcwidth(wchar_t wc)   /* might break? */
  {
  	if (wc < 0x20 || wc > 0x7e)
		return -1;
	return isprint((unsigned char)wc) ? 1 : -1;
  }

But that isn't ideal either because as far as i know, ISO C
doesn't require that wchar_t is internally represented in a way
that puts ASCII in the range 0x00 to 0x7f.  Using iswprint(3)
is not a very good idea either because that is C99, not C89,
and may not be available either.

So if we can't get a real implementation of wcwidth(3) on some
platform, it's better to completely disable UTF-8 and only allow
US-ASCII.

A real replacement implementation of wcwidth(3) is MUCH harder
than a real replacement implementation of mbtowc(3) and mblen(3).
It needs a big table of character ranges (see tmux(1)), while
the latter can be done in 50 lines (see mandoc(1)).

That's why i said: If we want full UTF-8 support on all platforms
no matter what and must have a complete replacement wcwidth(3), we
should complete djm@'s character filtering first.

>> P.S.
>> This patch also uses mbtowc(3), but i assume that's no problem
>> because that's ANSI C.

> I would not assume that its existence in the standard is equal to its
> existence in all deployments :-)  That said it looks like we can
> implement it in libcompat if needed.

Yes, mbtowc(3) and mblen(3), certainly, but even those only for UTF-8,
not for any other locale.

Yours,
  Ingo