Support for transferring sparse files via scp/sftp correctly?
Roland Mainz
roland.mainz at nrubsig.org
Wed Mar 5 22:03:23 AEDT 2025
On Wed, Mar 5, 2025 at 11:14 AM Darren Tucker <dtucker at dtucker.net> wrote:
> On Tue, Mar 04, 2025 at 02:43:10PM +0100, Lionel Cons wrote:
> [...]
> > Really: Built In sparse file support, which is on by default, makes
> > more sense, as we do not have to maintain&update&administer lots of
> > tools just to get the job done. It's also less error-prone.
> >
> > FYI Sparse files are nothing new or magic, they have been around since
> > the dawn of filesystems, and even WinXP&WinServer2000 have sparse file
> > support.
>
> I wasn't aware that the SEEK_HOLE and SEEK_DATA had even been
> standardised, although it looks like that was only some time last year.
> As others have noted it's still not universally available.
>
> Having looked at it:
[snip]
> - I don't see sufficient information available in the sftp protocol
> from the server to the client to support it for client "get".
> Certainly the secsh-filexfer-02[0] (ie v3) version that OpenSSH
> implements doesn't, but even the most recent -13 drafts seem
> to support only a per-file boolean that indicates if it's on or off.
> I don't see a way for a client to determine the location and/or size of
> any holes in a remote file in order to replicate them on a downloaded
> file. The only way I can see it could be supported is by adding a
> vendor extension (which would need to be supported by both client
> and server) that could supply the information about holes/extents,
> which would be a larger undertaking.
FYI NFSv4.2 added a READ_PLUS operation to implement reading sparse
files efficiently:
Basically each time the NFSv4.2 client does a READ_PLUS (see
https://datatracker.ietf.org/doc/html/rfc7862#page-86) the server
returns an array of elements. Each element is either a "data"
element with { data_offset, len, data } or a "hole" element with {
hole_offset, hole_len }, never both.
IMHO the sftp protocol could do the same.
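To make the idea concrete, here is a rough, untested-in-sftp sketch of how
a client could apply such a list of elements to a local file. The names
(struct segment, apply_segment, demo_apply) and the /tmp path are made up
for illustration; nothing like this exists in the sftp protocol or in
OpenSSH today. It assumes the destination file is freshly created, so any
range that is never written reads back as zeros.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical reply element, loosely modelled on a READ_PLUS result. */
enum seg_type { SEG_DATA, SEG_HOLE };

struct segment {
	enum seg_type type;
	off_t offset;		/* where the segment starts in the file */
	off_t len;		/* length of the data or of the hole */
	const char *data;	/* payload; only used for SEG_DATA */
};

/*
 * Apply one segment to a freshly created (empty) destination file.
 * Data is written at its offset; a hole is produced by extending the
 * file without writing anything. Returns 0 on success, -1 on error.
 */
static int
apply_segment(int fd, const struct segment *seg)
{
	struct stat st;
	off_t done = 0, end;
	ssize_t n;

	assert(seg->offset >= 0 && seg->len >= 0);
	if (seg->type == SEG_DATA) {
		while (done < seg->len) {
			n = pwrite(fd, seg->data + done,
			    (size_t)(seg->len - done), seg->offset + done);
			if (n <= 0)
				return -1;
			done += n;
		}
		return 0;
	}
	/* SEG_HOLE: grow the file up to the end of the hole if needed. */
	end = seg->offset + seg->len;
	if (fstat(fd, &st) == -1)
		return -1;
	if (st.st_size < end && ftruncate(fd, end) == -1)
		return -1;
	return 0;
}

/* Demo: data, 1 MiB hole, data. Returns the final file size, or -1. */
static off_t
demo_apply(void)
{
	char path[] = "/tmp/sparse-demo-XXXXXX";	/* hypothetical path */
	struct segment segs[] = {
		{ SEG_DATA, 0, 3, "abc" },
		{ SEG_HOLE, 3, 1 << 20, NULL },
		{ SEG_DATA, 3 + (1 << 20), 3, "xyz" },
	};
	struct stat st;
	off_t size = -1;
	size_t i;
	int fd;

	if ((fd = mkstemp(path)) == -1)
		return -1;
	unlink(path);
	for (i = 0; i < sizeof(segs) / sizeof(segs[0]); i++)
		if (apply_segment(fd, &segs[i]) == -1)
			break;
	if (i == sizeof(segs) / sizeof(segs[0]) && fstat(fd, &st) == 0)
		size = st.st_size;
	close(fd);
	return size;
}
```

Whether the skipped range really ends up unallocated on disk depends on
the filesystem; the sketch only guarantees that the file has the right
size and content.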
There is even one optimisation:
The standards people also debated whether to support a 3rd return
type, an "application data block" (see
https://datatracker.ietf.org/doc/html/rfc7862#section-15.12), which
consists of a { data_offset, data_total_length, data_pattern_bytes,
data_pattern_bytes_len, number_of_data_patterns }.
The idea is that if an application stores repeated patterns in a file
the server can return this as "ADB". Use case examples include filling
files with 0xdeadbeef to invalidate data, or filling everything with
'\0'-bytes (which are valid data, whereas holes in sparse files mean
"no data here").
But that is just an optimization, and IMHO the current sftp/scp work
on sparse files should only handle holes+data - but maybe keep the
protocol flexible enough to (later!) add ADB support.
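For illustration only, expanding such a pattern element on the receiving
side could look roughly like the sketch below. The struct reuses the
field names from my description above, not the real RFC 7862 XDR
definition, and expand_pattern is a made-up helper. I left out
number_of_data_patterns because it follows from the total length and
the pattern length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pattern element; field names follow the text above. */
struct pattern_segment {
	unsigned long long data_offset;		/* position in the file */
	size_t data_total_length;		/* bytes covered in total */
	const unsigned char *data_pattern_bytes;
	size_t data_pattern_bytes_len;
};

/*
 * Fill dst (at least data_total_length bytes) by repeating the pattern.
 * The last copy is truncated if the total length is not a multiple of
 * the pattern length.
 */
static void
expand_pattern(const struct pattern_segment *seg, unsigned char *dst)
{
	size_t done = 0, n;

	assert(seg->data_pattern_bytes_len > 0);
	while (done < seg->data_total_length) {
		n = seg->data_total_length - done;
		if (n > seg->data_pattern_bytes_len)
			n = seg->data_pattern_bytes_len;
		memcpy(dst + done, seg->data_pattern_bytes, n);
		done += n;
	}
}
```

A real implementation would expand into a bounded buffer and write it
out repeatedly instead of materialising the whole range in memory.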
> +/*
> + * Check a potentially-sparse file for location of holes and data, starting
> + * from "offset". If the next hole points to EOF, there are no remaining holes.
> + */
> +static void
> +sftp_check_sparse_file(int fd, off_t offset, off_t *data_offset,
> + off_t *hole_offset)
> +{
> +#if defined(SEEK_HOLE) && defined(SEEK_DATA)
> + if ((*hole_offset = lseek(fd, offset, SEEK_HOLE)) == -1)
> + fatal_f("lseek(SEEK_HOLE): %s", strerror(errno));
> + if ((*data_offset = lseek(fd, offset, SEEK_DATA)) == -1)
> + fatal_f("lseek(SEEK_DATA): %s", strerror(errno));
> +#else
> + /* No sparse file support, assume data spans start to end. */
> + *data_offset = offset;
> + if ((*hole_offset = lseek(fd, offset, SEEK_END)) == -1)
> + fatal_f("lseek(SEEK_END): %s", strerror(errno));
> +#endif
> + if (lseek(fd, offset, SEEK_SET) == -1) /* restore cursor */
> + fatal_f("lseek(SEEK_SET): %s", strerror(errno));
> + debug3_f("offset %llu data_offset %llu hole_offset %llu",
> + (unsigned long long)offset, (unsigned long long)*data_offset,
> + (unsigned long long)*hole_offset);
[snip]
Notes (I didn't test the code yet):
* Code for this must cover:
- Normal, non-sparse files
- Sparse files which consist only of a single hole, no data
- Sparse files which begin with a hole
- Sparse files which begin with data
- Sparse files which end with a hole
- Sparse files which end with data
- Sparse files with 60000 holes (no joke, at SUN we had customers who
had many more holes in files)
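One way to exercise the cases in the list above is to walk a file extent
by extent. The following is a rough sketch, not part of the patch: the
names (struct extent, next_extent, count_extents, demo_walk) and the /tmp
path are invented for illustration. The important detail is that
lseek(SEEK_DATA) fails with ENXIO when there is no data after the given
offset, which is exactly the "ends with a hole" and "single hole, no
data" cases, so that error must not be treated as fatal. Note that
lseek() moves the file offset, so a caller relying on the cursor has to
restore it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* SEEK_DATA/SEEK_HOLE on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct extent {
	off_t offset;
	off_t len;
	int is_hole;
};

/*
 * Describe the extent starting at "offset" in a file of "size" bytes.
 * The caller guarantees offset < size. Returns 0 on success, -1 on error.
 */
static int
next_extent(int fd, off_t offset, off_t size, struct extent *ext)
{
#if defined(SEEK_HOLE) && defined(SEEK_DATA)
	off_t data, hole;
#endif
	ext->offset = offset;
	ext->is_hole = 0;
	ext->len = size - offset;	/* default: data up to EOF */
#if defined(SEEK_HOLE) && defined(SEEK_DATA)
	if ((data = lseek(fd, offset, SEEK_DATA)) == -1) {
		if (errno != ENXIO)
			return -1;
		ext->is_hole = 1;	/* no more data: trailing hole */
		return 0;
	}
	if (data > offset) {
		ext->is_hole = 1;	/* hole ends where data begins */
		ext->len = data - offset;
		return 0;
	}
	/* offset is inside data, which ends at the next hole or EOF */
	if ((hole = lseek(fd, offset, SEEK_HOLE)) == -1)
		return -1;
	ext->len = hole - offset;
#endif
	return 0;
}

/* Walk the whole file. Returns the number of extents, -1 on error. */
static long
count_extents(int fd, long *nholes)
{
	struct stat st;
	struct extent ext;
	off_t off = 0;
	long n = 0;

	*nholes = 0;
	if (fstat(fd, &st) == -1)
		return -1;
	while (off < st.st_size) {
		if (next_extent(fd, off, st.st_size, &ext) == -1 ||
		    ext.len <= 0)
			return -1;
		assert(ext.offset == off);	/* extents are contiguous */
		*nholes += ext.is_hole;
		off += ext.len;
		n++;
	}
	return n;
}

/* Demo: 3 MiB file that begins and ends with a hole. */
static long
demo_walk(long *nholes)
{
	char path[] = "/tmp/sparse-walk-XXXXXX";	/* hypothetical path */
	long n = -1;
	int fd;

	if ((fd = mkstemp(path)) == -1)
		return -1;
	unlink(path);
	if (ftruncate(fd, 3 << 20) == 0 && pwrite(fd, "x", 1, 1 << 20) == 1)
		n = count_extents(fd, nholes);
	close(fd);
	return n;
}
```

The number of extents reported depends on the filesystem: one that
tracks holes should report hole, data, hole for the demo file, while one
that does not will report a single data extent. A test for the
60000-hole case could build its file the same way in a loop.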
----
Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)