Parallel transfers with sftp (call for testing / advice)

Wed May 6 00:42:30 AEST 2020

On May 5, 2020, at 3:55 AM, Cyril Servant <cyril.servant at gmail.com> wrote:
> Peter Stuge wrote:
>> 
>> Matthieu Hautreux wrote:
>>> The change proposed by Cyril in sftp is a very pragmatic approach to 
>>> deal with parallelism at the file transfer level. It leverages the 
>>> already existing sftp protocol and its capability to write/read file 
>>> content at specified offsets. This enables to speed up sftp transfers 
>>> significantly by parallelizing the SSH channels used for large 
>>> transfers. This improvement is performed only by modifying the sftp 
>>> client, which is a very small modification compared to the openssh 
>>> codebase. The modification is not too complicated to review and validate 
>>> (I did it) and does not change the default behavior of the cli.
>> 
>> I think you make a compelling argument. I admit that I haven't
>> reviewed the patch, even though that is what matters the most.
> 
> If you want to review the code, here is a direct link to the patch:
> https://github.com/openssh/openssh-portable/compare/V_8_2_P1...cea-hpc:V_8_2_P1_PSFTP1 <https://github.com/openssh/openssh-portable/compare/V_8_2_P1...cea-hpc:V_8_2_P1_PSFTP1>

I haven’t reviewed the patch in detail, but one thing that jumped out at me when looking at this was your choice to use multiple channels to implement the parallelism. The SFTP protocol already allows multiple outstanding reads or writes on a single channel, without waiting for the server’s response. I don’t know if the current OpenSSH client takes advantage of that or not, but it seems to me that you’d get the same benefit from that as you would from opening multiple channels with less setup cost (not having to open the additional channels, and not having to do separate file open calls on each channel). It might also make some other things like the progress bar support easier, as all of the responses would be coming back on the same channel.

You can find an example of this form of parallelism (multiple read/write requests outstanding on a single channel) in AsyncSSH’s SFTP implementation. By default, it will allow up to 128 parallel reads or writes of 16 KB each, and you can adjust both the block size and max outstanding requests via “block_size” and “max_requests” arguments in the calls to methods such as get(), put(), and copy() in the asyncssh.SFTPClient class (https://asyncssh.readthedocs.io/en/latest/api.html#sftpclient <https://asyncssh.readthedocs.io/en/latest/api.html#sftpclient>). Here’s an example client:

import asyncio, asyncssh, sys 

async def run_client():
    async with asyncssh.connect('localhost') as conn:
        async with conn.start_sftp_client() as sftp:
            await sftp.get('example.txt')

try:
    asyncio.get_event_loop().run_until_complete(run_client())
except (OSError, asyncssh.Error) as exc:
    sys.exit('SFTP operation failed: ' + str(exc))

The block_size and max_requests can be added as optional arguments to the sftp.get() in this example, but parallelism will automatically be used for any file larger than 16 KB, allowing for up to 2 MB of data to be outstanding at once (128 requests of 16 KB each), without any threading and all over a single SFTP channel.

This parallel I/O support in AsyncSSH also extends to clients making large read() or write() calls on a remote file. If you call read() or write() with a length of more than the block_size set when a file is opened, these reads or write will automatically be broken into multiple parallel requests. This behavior can be disabled by setting block_size to 0 if you really need to control the exact requests sent on the channel, but by default callers will get the benefit of the extra speed.
-- 
Ron Frederick
ronf at timeheart.net