We found the problem.
This summer, when the new file server cluster went into production, we found a bug in the krb5 code that caused part of the file server's security process to leak open file handles. Every time an NFS client user authenticated, the process (called rpc.svcgssd) would open the Kerberos replay cache and never close it. Eventually, the process would run up against the system limit on the number of open files and fail to create new security contexts.
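For anyone who wants to watch for this kind of leak, here is a rough sketch of a check that compares a process's open file descriptors against its soft limit. It is not something we run in production; it assumes a Linux /proc filesystem, and the function names and the 80% threshold are made up for illustration.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in <proc_root>/<pid>/fd, one per open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_nofile_limit(pid, proc_root="/proc"):
    """Return the soft 'Max open files' limit for pid, or None if unlimited or absent."""
    with open(os.path.join(proc_root, str(pid), "limits")) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields: Max, open, files, <soft>, <hard>, <units>
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def near_limit(pid, threshold=0.8, proc_root="/proc"):
    """True when the process uses at least `threshold` of its soft limit.

    The 0.8 default is an arbitrary illustrative value, not a recommendation.
    """
    limit = soft_nofile_limit(pid, proc_root)
    if not limit:
        return False
    return count_open_fds(pid, proc_root) / limit >= threshold


if __name__ == "__main__":
    # Inspect this script's own process; pass the pid of rpc.svcgssd instead
    # to watch the daemon (reading another user's /proc/<pid>/fd needs root).
    pid = os.getpid()
    print(pid, count_open_fds(pid), soft_nofile_limit(pid), near_limit(pid))
```

Run periodically against the rpc.svcgssd pid, a check like this should show the descriptor count climbing steadily well before new security contexts start to fail.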
We fixed the problem at the time by patching Red Hat's krb5 packages, but never excluded those fixed packages from being updated. Last night, security updates overwrote those packages with versions that contained this bug. Then it was only a matter of time before the number of open handles exceeded the limits and caused a failure. All three servers had this problem at roughly the same time this morning, which produced the outage for zoo, zoofiles, and the web farm.
So, this morning we've re-patched the newer krb5 packages and installed them. We've notified Red Hat of the problem through our support channel and by filing a bug report (https://bugzilla.redhat.com/show_bug.cgi?id=761006). Finally, we've excluded those packages from being updated unexpectedly.
Well... that was verbose. I thought someone might be interested in the root cause. This sort of thing happens over here in SAA -- the combination of an oversight (I forgot to exclude those packages) plus an unexpected update (unexpected, but automated by design because it was a security update) caused this outage.
On Dec 7, 2011, at 6:58 AM, Benjamin Coddington wrote:
> SAA had a problem with its file servers this morning for 20 minutes at 6:20 AM. The problem caused outages for zoofiles, the web farm housing www.uvm.edu and other sites, and zoo.
> We suspect that software updates installed during the night caused our NFS servers to be unable to renew their credentials, so NFS filesystems failed on clients. We are investigating further to find the root of the problem and avoid it in the future.
> We apologize for the disruption in service.
> Benjamin Coddington
> Systems Architecture and Administration
> Enterprise Technology Services
> University of Vermont