[portawiki-discuss] fsync() and OS X
Peter Gutmann
pgut001 at cs.auckland.ac.nz
Tue Nov 1 01:41:46 EST 2005
Just found the following interesting snippet about the OS X fsync():
-- Snip --
From: Dominic Giampaolo <email at hidden>
Subject: Re: bad fsync? (A.M.)
Date: Sat, 19 Feb 2005 17:59:21 -0800
>MySQL makes the following claim at:
>http://dev.mysql.com/doc/mysql/en/news-4-1-9.html
>"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
>and up. Apple had disabled fsync() in Mac OS X for internal disk
>drives, which caused corruption at power outages."
>First of all, is this accurate? A pointer to some docs or a tech note
>on this would be helpful.
The comments about fsync() are wrong...
On MacOS X, fsync() always has and always will flush all file data from host
memory to the drive on which the file resides. The behavior of fsync() on
MacOS X is the same as it is on every other version of Unix since the dawn of
time (well, since the introduction of fsync anyway :-).
I believe that what the above comment refers to is the fact that fsync() is
not sufficient to guarantee that your data is on stable storage and on MacOS X
we provide a fcntl(), called F_FULLFSYNC, to ask the drive to flush all
buffered data to stable storage. Let me explain in more detail. With fsync()
even though the OS writes the data through to the disk and the disk says "yes
I wrote the data", the data is not actually on permanent storage. Unless you
explicitly disable it, all disks have a write buffer which holds data you've
written. The disk buffers the data you wrote until it decides to flush it to
the platters (and the writes may not be in the order you wrote them). If you
lose power or the system crashes before the data is written, you can wind up
in a situation where only some of your data is actually on disk. What is
worse is that even if you write blocks A, B and C, call fsync() and then write
block D you may find after rebooting that blocks A and D are on disk but B and
C are not (in fact any ordering of A, B, C, and D is possible). While this may
seem like a rare case it is not. In fact if you sit down and pull the plug on
a system you can make it happen in one or two plug pulls. I have even gone so
far as to watch this behavior with a logic analyzer on the ATA bus: I saw the
data for two writes come across the ATA cable, the drive replied and said the
writes were successful and then when we rebooted the data from the second
write was correct on disk but the data from the first write was not. To deal
with this we introduced the F_FULLFSYNC fcntl which will ask the drive to
flush all of its buffered data to disk. When an app needs to guarantee that
data is on disk it should use F_FULLFSYNC. In most cases you do not need such
a heavy handed operation and fsync() is good enough. But in an app like a
database, it is essential if you want transactional integrity.
Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC with the
FLUSH_TRACK_CACHE command. All drives sold by Apple will honor this command.
Unfortunately quite a few firewire drive vendors disable this command and do
not pass it to the drive. This means that most external firewire drives are
not reliable if you lose power or the system crashes. We can't work around
that unless we ask the drive to disable the write cache completely (which
hurts performance quite badly -- and even that may not be enough as some
drives will ignore that request too).
So in summary, I believe that the comments in the MySQL news posting are
slightly confused. On MacOS X fsync() behaves the same as it does on all
Unices. That's not good enough if you really care about data integrity and so
we also provide the F_FULLFSYNC fcntl. As far as I know, MacOS X is the only
OS to provide this feature for apps that need to truly guarantee their data is
on disk.
Hope this clears things up.
--dominic
-- Snip --
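As an illustration of the advice in the message above (this is not code from either poster), here is a minimal C sketch of a "sync as hard as the platform allows" helper. The names write_all and durable_sync are hypothetical. It assumes F_FULLFSYNC is only defined on Mac OS X; everywhere else, and when the fcntl is rejected, it falls back to plain fsync(), which per the explanation above may only get the data as far as the drive's write cache.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying on short
 * writes and on EINTR. Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: push fd's data as far toward stable storage
 * as this platform lets us. Returns 0 on success, -1 on error. */
int durable_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: ask the drive itself to flush its write buffer. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Assumption: if the fcntl is rejected (for example by a
     * filesystem or device that does not support it), plain
     * fsync() is the best remaining option. */
#endif
    /* Host memory to drive only; the drive may still buffer. */
    return fsync(fd);
}
```

A database-style writer would call durable_sync() after writing its data blocks and again after writing the commit record, since the message above notes that fsync() alone does not stop the drive from reordering writes issued on either side of it. Even then, the result depends on the drive and any bridge in front of it honoring the flush command.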
and:
-- Snip --
Subject: Re: Question about fsync and fcntl with F_FULLSYNC
From: Chris Sarcone <email at hidden>
Date: Tue, 21 Jun 2005 11:09:31 -0700
Andrew --
> According to the man page F_FULLFSYNC doesn't work with some firewire drives,
> is that the only limitation?
No. USB devices seem to have the same issues. In addition, some RAID
controllers do not honor the SCSI SYNCHRONIZE_CACHE or ATA FLUSH CACHE / FLUSH
CACHE EXT either. Why some bridges/controllers don't do this is not known. It
could be because they never did an initial implementation. It could be because
some website/magazine reviewed their bridge against another (which didn't do
the flush either) and mentioned the performance was worse, so in the next
revision they removed the support... Optimizing throughput at the cost of data
integrity seems to be a foolish tradeoff. We have been working with many
manufacturers to make sure their devices do this properly since journaling and
other transaction oriented things (databases) require it.
The good news is, there is a workgroup part of T10 which is defining SCSI->ATA
Translation (mostly for SAS controllers that will be compatible with Serial-
ATA drives). However, we'd like to see the bridge manufacturers follow the
guidelines established by this workgroup once the proposal makes it to T10.
That will at least make things consistent across the board...
> I know Dominic has said that it will work with all Apple provided drives but
> will it work with third party ATA drives?
Most, if not all, of the Serial-ATA / Parallel ATA drive manufacturers do the
right thing today. For the most part, it's the non-bare drives that you have
to worry about (those behind bridges, those in a RAID box, etc.).
-- Chris
-- Snip --
If this is right then all Unixen should be affected, not just OS X.
Peter.