Support for transferring sparse files via scp/sftp correctly?
Lionel Cons
lionelcons1972 at gmail.com
Sat Apr 5 08:59:44 AEDT 2025
On Fri, 4 Apr 2025 at 07:07, Ron Frederick <ronf at timeheart.net> wrote:
>
> On Apr 3, 2025, at 6:02 PM, Darren Tucker <dtucker at dtucker.net> wrote:
> > On Sat, 29 Mar 2025 at 16:14, Ron Frederick <ronf at timeheart.net> wrote:
> >> [...]
> >> If you don’t get all of the requested ranges in a single request, additional requests can be sent starting at just past the end of the last range previously returned.
> >>
> >> What do you think?
> >
> > That seems like it'd work well for things with SEEK_HOLE or equivalent, although there's always the chance of the underlying file changing between mapping it out and doing the transfer.
>
> Since my last message, I’ve also implemented support for this in Windows, which has a DeviceIOControl called FSCTL_QUERY_ALLOCATED_RANGES that returns an array of offset and length values, within a given range in a file (also specified by offset and length). So, it’s almost a direct mapping to the extension I proposed. I basically have three different versions of a request_ranges() function (Windows, systems with SEEK_DATA/SEEK_HOLE, and a dummy implementation for all other platforms which just returns the full range passed in).
>
> The risk of missing data due to file changes is no different than what could happen if you were reading data sequentially and something did a write to the source file after you had already copied that part of the file.
>
>
> > Damien pointed out that it's possible to do a reasonable but not perfect sparse file support by memcmp'ing your existing file buffer with a block of zeros and skipping the write if it matches. OpenBSD's cp(1) does this (look for "skipholes"): https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/bin/cp/utils.c?annotate=HEAD.

This should not be done. Either use SEEK_DATA/SEEK_HOLE where the system
has it, use FSCTL_QUERY_ALLOCATED_RANGES on Win32 (Windows and ReactOS),
or just copy all bytes.
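
A minimal Python sketch of that rule, for illustration only: it is not
OpenSSH code and not the request_ranges() implementation mentioned above,
and the name data_ranges() is made up here. It covers the
SEEK_DATA/SEEK_HOLE case and the copy-everything fallback; the Win32
FSCTL_QUERY_ALLOCATED_RANGES case is left out.

```python
import errno
import os


def data_ranges(fd, start, length):
    """Return [(offset, length), ...] for data in [start, start + length).

    Hypothetical helper. Uses SEEK_DATA/SEEK_HOLE where Python exposes
    them; otherwise reports the whole requested range as data, so every
    byte gets copied and no hole is ever guessed.
    """
    size = os.fstat(fd).st_size
    end = min(start + length, size)
    if start >= end:
        return []
    if not (hasattr(os, "SEEK_DATA") and hasattr(os, "SEEK_HOLE")):
        return [(start, end - start)]  # no hole information available

    ranges = []
    pos = start
    while pos < end:
        try:
            data = os.lseek(fd, pos, os.SEEK_DATA)
            if data >= end:
                break
            hole = os.lseek(fd, data, os.SEEK_HOLE)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break  # nothing but a hole from pos to end of file
            if exc.errno == errno.EINVAL:
                return [(start, end - start)]  # whence not supported here
            raise
        stop = min(hole, end)
        ranges.append((data, stop - data))
        pos = stop
    return ranges
```

Note that lseek() moves the file offset, so a caller would have to seek
again (or read with os.pread()) before copying the returned ranges.
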
The misunderstanding is that sequences of 0x00 bytes are automatically
holes. That is not true. Holes represent ranges of "no data", and they
read back as 0x00 bytes only for backwards compatibility. Valid data
ranges can contain long sequences of 0x00 bytes, so PLEASE don't invent
extra holes in sparse files just because the data is a sequence of 0x00
bytes.
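
To illustrate why a zero comparison cannot tell the two cases apart,
here is a small Python sketch (helper names are made up; this is not
taken from cp(1) or OpenSSH). It writes one block of real 0x00 bytes,
then extends the file without writing, which leaves a hole on
filesystems that support them. Both blocks read back identically, so
the heuristic flags both.

```python
import tempfile

BLOCK = 64 * 1024  # arbitrary block size for the demonstration


def looks_like_hole(buf):
    """The heuristic under discussion: an all-zero buffer is assumed to be a hole."""
    return not any(buf)


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(bytes(BLOCK))   # valid data that happens to be all 0x00
        f.truncate(2 * BLOCK)   # extend without writing (a hole where supported)
        f.seek(0)
        written = f.read(BLOCK)
        unwritten = f.read(BLOCK)
    # (reads are identical, heuristic flags written data, heuristic flags hole)
    return written == unwritten, looks_like_hole(written), looks_like_hole(unwritten)
```

A copy based on this check would skip the first block too, and the
destination would end up with a hole where the source had data.
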
Lionel