 | Level: Introductory Daniel Robbins (drobbins@gentoo.org), President/CEO, Gentoo Technologies, Inc.
01 Dec 2001 With the 2.4 release of Linux come many new filesystem possibilities, including Reiserfs, XFS, GFS, and others. These filesystems sound cool, but what exactly can they do, what are they good at, and exactly how do you go about safely using them in a production Linux environment? Daniel Robbins answers these questions by showing you how to set up these new advanced filesystems under Linux 2.4. In this installment, Daniel continues his look at ext3, a new improved version of ext2 with journaling capabilities. He reveals all the inside information on ext3, and demonstrates some shockingly good ext3 data=journal interactive performance numbers.
I'm going to be honest. For this article, I was planning to show you how to
get ext3 up and running on your system. Although that's what I said I'd do,
I'm not going to do it. Andrew Morton's excellent "Using the ext3
filesystem in 2.4 kernels" page (see Resources later in this article)
already does a great job of explaining
how to ext3-enable your system, so there's no need for me to repeat all the basics
here. Instead, I'm going to delve into some meatier ext3 topics, ones that I
think you'll find very useful. After you read this article, when you're ready to get ext3 up and running, head over to Andrew's page.
2.4 kernel update
First, let's start with a 2.4 kernel update. I last discussed 2.4 kernel
stability when I was covering ReiserFS. Way back then, finding a stable 2.4
kernel was a challenge, and I recommended sticking with the known
and at that time bleeding-edge 2.4.4-ac9 kernel -- especially for anyone
planning to use the ReiserFS filesystem in a production environment. As you
might guess, a lot has happened since 2.4.4-ac9, and it's definitely time to
start looking at newer kernels.
With kernel 2.4.10, the 2.4 series reached a new level of performance and scalability (something
that we've been anticipating for a long time). So, what happened to allow Linux 2.4 to finally grow up? In an acronym, VM. Linus, recognizing that the 2.4 series wasn't performing
spectacularly, ripped out Linux's problematic VM code and replaced it with a
lean and mean VM implementation from Andrea Archangeli. Andrea's new VM
implementation (which first appeared in 2.4.10) was really great; it really sped up
the kernel and made the entire system more responsive. 2.4.10 was definitely
a major turning point in 2.4 Linux kernel development; up until then, things
weren't looking very good, and many of us were wondering why we weren't FreeBSD
developers. We all should thank Linus for his heroism in making such a major
(but sorely needed) change in the 2.4 stable kernel series.
Since Andrea's new VM code needed a bit of time to be integrated seamlessly
with the rest of the kernel, use 2.4.13+. Even better, use 2.4.16+, since
the rock-solid ext3 filesystem code was finally integrated into the official
Linus kernel starting with the 2.4.15-pre2 release. There's no reason to avoid using
2.4.16+ kernel, and it'll make your job of getting ext3 up and running that
much easier. If you do use a 2.4.16+ kernel, just remember that it's no
longer necessary to apply the ext3 patch as described on Andrew's
page (see Resources). Linus already added it for you :)
You'll notice that I recommend using 2.4.16+ rather than 2.4.15+, and with
good reason. With the release of kernel 2.4.15-pre9, a really ugly filesystem
corruption bug was introduced to the kernel. It took until 2.4.16-pre1 for the
problem to be identified and fixed, resulting in a span of kernels (including
2.4.15) that should be avoided at all costs. Choosing a 2.4.16+ kernel allows you
to avoid this bad batch entirely.
Laptops...beware?
Ext3 has a stellar reputation for being a rock-solid filesystem, so I was
surprised to learn that quite a few laptop users were having filesystem
corruption problems when they switched to ext3. In general, it's tempting to
react to these kinds of reports by avoiding ext3 entirely; however, after
asking around, I discovered that the disk corruption problems that people were
experiencing had nothing to do with ext3 itself, but were being caused by
certain laptop hard drives.
The write cache
You may not know this, but most modern hard drives have something called a
"write cache", used by the hard drive to collect pending write operations. By
putting pending writes into a cache, the hard drive firmware can then reorder
and group them so that they're written to disk in the fastest
possible way. The write cache is generally considered to be a very good thing (read Linus'
explanation and opinion of write caching in Resources).
Unfortunately, certain laptop hard drives now on the market have the dubious
feature of ignoring any official ATA request to flush their write cache to
disk. This isn't a wonderful design feature, although it has been
allowed by the ATA spec up until recently. With these types of drives, there's
no way for the kernel to guarantee that a particular block has actually been
recorded to the disk platters. Although this sounds like a thorny problem, this
particular issue by itself is probably not the cause of the
data corruption problems that people have been experiencing.
However, it gets worse. Some modern laptop hard drives have an even nastier
habit of throwing away their write cache whenever the system is rebooted
or suspended. Obviously, if a hard drive has both of these problems, it's going
to regularly corrupt data, and there's nothing that Linux can do to prevent
it from doing so.
So, what's the solution? If you have a laptop, tread carefully. Back up all
your important files before making any major change to your filesystems. If
you experience data corruption problems that seem to fit the pattern of what I
described above, particularly with ext3, then remember that it may be your
laptop hard drive that's at fault. In that case, you may want to contact your
laptop manufacturer and inquire about getting a replacement drive. Hopefully,
in a few months time, these flaky hard drives will be pulled from the market
and we'll never need to worry about this issue again.
Now that I've scared you out of your minds, let's take a look at ext3's various
data journaling options.
Journaling options and write
latency
Ext3 allows you to choose from one of three data journaling
modes at filesystem mount time: data=writeback,
data=ordered, and data=journal.
To specify a journal mode, you can add the appropriate string (data=journal, for example) to the options section of your
/etc/fstab, or specify the -o
data=journal command-line option when calling mount directly. If you'd like to specify the data
journaling method used for your root filesystem (data=ordered is the default), you can to use a special
kernel boot option called rootflags. So, if you'd
like to put your root filesystem into full data journaling mode, add rootflags=data=journal to your kernel boot options.
data=writeback
mode
In data=writeback mode, ext3 doesn't
do any form of data journaling at all, providing you with similar journaling
found in the XFS, JFS, and ReiserFS filesystems (metadata only). As I
explained in my previous article,
this could allow recently modified files to
become corrupted in the event of an unexpected reboot. Despite this drawback,
data=writeback mode should give you the best ext3
performance under most conditions.
data=ordered
mode
In data=ordered mode, ext3 only
officially journals metadata, but it logically groups metadata and data blocks
into a single unit called a transaction. When it's time to write the new
metadata out to disk, the associated data blocks are written first.
data=ordered mode effectively solves the corruption
problem found in data=writeback mode and most other journaled filesystems, and
it does so without requiring full data journaling. In general, data=ordered ext3 filesystems perform slightly slower than
data=writeback filesystems, but significantly faster
than their full data journaling counterparts.
When appending data to files, data=ordered
mode provides all of the integrity guarantees offered by ext3's full data
journaling mode. However, if part of a file is being overwritten and
the system crashes, it's possible that the region being written will contain a
combination of original blocks interspersed with updated blocks. This is
because data=ordered provides no guarantees as to
which blocks are overwritten first, so you can't assume that just because
overwritten block x was updated, that overwritten block x-1 was updated as
well. Instead, data=ordered leaves the write
ordering up to the hard drive's write cache. In general, this limitation
doesn't end up negatively impacting people very often, since file appends are
generally much more common than file overwrites. For this reason, data=ordered mode is a good higher-performance replacement
for full data journaling.
data=journal
mode
data=journal mode provides full data
and metadata journaling. All new data is written to the journal first, and
then to its final location. In the event of a crash, the journal can be
replayed, bringing both data and metadata into a consistent state.
Theoretically, data=journal mode is the slowest
journaling mode of all, since data gets written to disk twice rather than once.
However, it turns out that in certain situations, data=journal mode can be blazingly fast. Andrew Morton,
after hearing reports on LKML that ext3 data=journal
filesystems were giving people unbelievably great interactive filesystem
performance, decided to put together a little test. First, he created simple
shell script designed to write data to a test filesystem as quickly as
possible:
Rapid writing
while true
do
dd if=/dev/zero of=largefile bs=16384 count=131072
done
|
While data was being written to the test filesystem, he attempted to read 16Mb
of data from another ext2 filesystem on the same disk, timing the results:
Reading a 16Mb file
time cat 16-meg-file > /dev/null
|
The results were astounding. data=journal mode allowed the 16-meg-file to
be read from 9 to over 13 times faster than other ext3 modes, ReiserFS,
and even ext2 (which has no journaling overhead):
|
Written-to-filesystem
|
16-meg-read-time (seconds)
| | ext2 | 78 | | ReiserFS | 67 | ext3 data=ordered
| 93 | ext3 data=writeback
| 74 | ext3 data=journal
|
7
|
Andrew repeated this test, but tried to read a 16Mb file from the test
filesystem (rather than a different filesystem), and he got identical results.
So, what does this mean? Somehow, ext3's data=journal mode is incredibly well-suited to situations
where data needs to be read from and written to disk at the same time.
Therefore, ext3's data=journal mode, which was
assumed to be the slowest of all ext3 modes in nearly all conditions, actually
turns out to have a major performance advantage in busy environments where
interactive IO performance needs to be maximized. Maybe data=journal mode isn't so sluggish after all!
Andrew is still trying to figure out exactly why data=journal mode is doing so much better than everything
else. When he does, he may be able to add the necessary tweaks to the rest of
ext3 so that data=writeback and data=ordered modes see some benefit as well.
data=journal tweaks
Some people have had a particular performance problem
when using ext3's data=journal mode on busy servers -- busy NFS servers, in
particular. Every thirty seconds, the server experiences a huge storm of disk-writing activity, causing the system to nearly grind to a halt. If you
experience this problem, it's easy to fix. Simply type the following command
as root to tweak Linux's dirty buffer-flushing algorithm:
Tweaking bdflush
echo 40 0 0 0 60 300 60 0 0 > /proc/sys/vm/bdflush
|
These new bdflush settings will cause kupdate to run every 0.6 seconds rather
than every 5 seconds. In addition, they tell the kernel to flush a dirty buffer
after 3 seconds rather than 30, the default. By flushing recently-modified
data to disk more regularly, these write storms can be avoided. It's slightly
less efficient to do things this way, since the kernel will have fewer
opportunities to combine writes. But for a busy server, writes will happen
more consistently, and interactive performance will be greatly improved.
Conclusion
We've now concluded our coverage of ext3. Join me in my next article as we
explore the many wonders of...XFS!
Resources
About the author  | |  |
Residing in Albuquerque, New Mexico, Daniel Robbins is the President/CEO of
Gentoo Technologies, Inc., the creator of Gentoo Linux, an advanced Linux for the PC,
and the Portage system, a next-generation ports system for Linux. He
has also served as a contributing author for the Macmillan books
Caldera OpenLinux Unleashed, SuSE Linux
Unleashed, and Samba Unleashed. Daniel has been involved with computers in some fashion since the second grade, when he was first exposed to the Logo programming language as well as a
potentially dangerous dose of Pac Man. This probably explains why he has since
served as a Lead Graphic Artist at SONY Electronic Publishing/Psygnosis.
Daniel enjoys spending time with his wife, Mary, and their daughter, Hadassah. You can contact Daniel at drobbins@gentoo.org. |
Rate this page
|  |