We had more filesystem problems with the email cluster today, but are
hopeful this is the last round of this type of problem.  The vendor of our
filesystem, Sistina, has provided us with a fix for a nasty bug they found
about a month ago (the delay was due to development of the fix and QA
testing).

The bug in Sistina's GFS filesystem has caused us repeated problems over
the last two months, and has been a major contributor to most, if not all,
of our downtime during that period.  This bug was introduced into their
codebase in December, about a month before we went into production with
the new email cluster.

Today, email was down for everyone for about 90 minutes, after which it
was available for half of the users.  Email was restored for everyone by
around 12:30.

The filesystem that had corruption issues this morning has been completely
fixed, and we've installed the GFS fixes on all the email cluster nodes,
so we shouldn't have this problem again.

For the filesystems that didn't have problems today, we will proactively
migrate the data onto a new filesystem, in case this bug has left
lingering problems behind.  We'll migrate users at night over the next
week.  The migration should be transparent for most users.

Hopefully this patch will fix the repeated problems we've had with
filesystem reliability.

Thanks,
mga.