Parallel transfers with sftp (call for testing / advice)

Sat Apr 11 08:41:38 AEST 2020

On Thu, Apr 9, 2020 at 11:01 AM Cyril Servant <cyril.servant at gmail.com> wrote:
>
> > On 9 Apr 2020, at 00:34, Nico Kadel-Garcia <nkadel at gmail.com> wrote:
> >
> > On Wed, Apr 8, 2020 at 11:31 AM Cyril Servant <cyril.servant at gmail.com> wrote:
> >>
> >> Hello, I'd like to share with you an evolution I made on sftp.
> >
> > It *sounds* like you should be using parallelized rsync over xargs.
> > Partial sftp or scp transfers are almost inevitable in bulk transfers
> > over a crowded network, and sftp does not have good support for
> > "mirroring", only for copying content.
> >
> > See https://stackoverflow.com/questions/24058544/speed-up-rsync-with-simultaneous-concurrent-file-transfers
>
> This solution is perfect for sending a lot of files in parallel. But in the
> case of sending one really big file, it does not improve transfer speed.

rsync is still helpful there, because it allows you to retry where the
last transmission failed, and it does not leave a partial upload
sitting there tempting people. It uploads to .filename-hash, and moves
the upload into place when the individual file upload is completed.
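
To illustrate the upload-then-move pattern described above, here is a
minimal Python sketch. It is not rsync's actual code: the function name
upload_atomically and the temporary naming scheme are my own
assumptions, and it only shows the local-filesystem side of the idea.

```python
import os
import tempfile


def upload_atomically(data: bytes, dest_path: str) -> None:
    """Write data to a hidden temp file beside dest_path, then rename it.

    Hypothetical helper for illustration; not taken from rsync.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    base = os.path.basename(dest_path)
    # Hidden temporary name in the same directory as the final file,
    # so the later rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix="." + base + ".", dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # On failure, remove the temp file so no partial upload lingers.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach a reader of the directory sees either no file or the
complete file under the final name, never a half-written one.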

> >> I'm working at CEA (Commissariat à l'énergie atomique et aux énergies
> >> alternatives) in France. We have a compute cluster complex, and our customers
> >> regularly need to transfer big files from and to the cluster. Each of our front
> >> nodes has an outgoing bandwidth limit (let's say 1Gb/s each, generally more
> >> limited by the CPU than by the network bandwidth), but the total interconnection
> >> to the customer is higher (let's say 10Gb/s). Each front node shares a
> >> distributed file system on an internal high bandwidth network. So the contention
> >> point is the 1Gb/s limit of a connection. If the customer wants to use more than
> >> 1Gb/s, he currently uses GridFTP. We want to provide a solution based on ssh to
> >> our customers.
> >>
> >> 2. The solution
> >>
> >> I made some changes in the sftp client. The new option "-n" (defaults to 0) sets
> >> the number of extra channels. There is one main ssh channel, and n extra
> >> channels. The main ssh channel does everything, except the put and get commands.
> >> Put and get commands are parallelized on the n extra channels. Thanks to this,
> >> when the customer uses "-n 5", he can transfer his files up to 5Gb/s. There is
> >> no server side change. Everything is made on the client side.
> >
> > While the option sounds useful for niche cases, I'd be leery of
> > partial transfers and being compelled to replicate content to handle
> > partial transfers. rsync has been very good, for years, in completing
> > partial transfers.
>
> I can fully understand this. In our case, the network is not really crowded, as
> customers are generally using research / educational links. Indeed, this is
> totally a niche case, but still a need for us. The main use case is putting data
> you want to process into the cluster, and when the job is finished, getting the
> output of the process. There is rarely the need for synchronising files, except
> for the code you want to execute on the cluster, which is considered small
> compared to the data. rsync is the obvious choice for synchronising the code,
> but not for putting / getting huge amounts of data.
>
> The only other ssh based tool that can speed up the transfer of one big file is
> lftp, and it only works for get commands, not for put commands.

Yeah, lftp can also support ftps. ftps is supported by the vsftpd
FTP server, and I use it in places where I do not want OpenSSH
server's tendency to let people with access look around the rest of
the filesystem.
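
For anyone following the "-n" proposal quoted above: SFTP read and
write requests carry an explicit file offset, which is what makes it
possible to move different parts of one big file over several channels
with no server-side change. The quoted description does not say how the
patch actually divides a file, so the Python sketch below is only an
assumption of one simple way to do it. The name plan_chunks is
hypothetical.

```python
def plan_chunks(file_size: int, n_channels: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that split file_size bytes across channels.

    Hypothetical helper: an even contiguous split, not the patch's real logic.
    """
    if file_size < 0 or n_channels < 1:
        raise ValueError("file_size must be >= 0 and n_channels >= 1")
    base, extra = divmod(file_size, n_channels)
    chunks = []
    offset = 0
    for i in range(n_channels):
        # The first `extra` channels each take one additional byte.
        length = base + (1 if i < extra else 0)
        if length:
            chunks.append((offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    # Example: a 10-byte file over 3 channels.
    for offset, length in plan_chunks(10, 3):
        print(offset, length)
```

Each worker would then issue its reads or writes starting at its own
offset. A failed chunk still leaves a partial file under the final
name, which is the concern I raised about partial transfers.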