from the it-always-happens-when-you-are-on-holidays dept.
Today from about 10:15am to 11am and from about 11:55am to 5:20pm there were connectivity problems with most of our servers because of a malfunctioning gateway. From 5:20pm to 5:30pm there was no network connection to the servers at all while that gateway was being rebooted and repaired. Since 5:35pm all server connections should be okay again, and since 8:15pm mail should also be delivered as promptly as usual.
Today around 10:15am we discovered miscellaneous problems in our network after our Big Brother monitoring system informed us by SMS. But whenever we tried to reproduce a reported problem, it seemed to have vanished, or at least vanished soon after, while other problems popped up elsewhere almost immediately. Since the Informatikdienste also seemed to have some connectivity problems, we assumed an ETH-wide network problem. Around 11am the problems seemed to have gone away, and we went back to what we had been doing before we got the SMS.
But around noon the problems came back and grew worse the longer they lasted. Again we were informed by SMS shortly after the first longer-lasting problems.
We had lags and packet loss to most servers, especially the virus scanner and spam filter servers. So mails started accumulating in the incoming queue of our mail server instead of being scanned and delivered. In the end, over 10'000 mails were waiting in the incoming queue. File servers were slow but still usable if you (or your computer) were patient enough. But we could find no additional network traffic that would explain those lags.
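A queue buildup like this can be watched with the MTA's queue-listing tool. A minimal sketch, assuming a Postfix-style `postqueue -p` listing (the post does not name the actual MTA); the helper name, queue IDs, and addresses below are made up for illustration:

```shell
# Count messages waiting in the queue, given `postqueue -p`-style
# output on stdin: one line per message starting with the hex queue
# ID, plus a trailing summary line starting with "--".
count_queued() {
  grep -c '^[0-9A-F]'
}

# Canned example (queue IDs, sizes, and addresses are made up):
printf '%s\n' \
  'A1B2C3D4E5 1024 Sat Dec 30 12:01 sender@example.org' \
  'F6E5D4C3B2 2048 Sat Dec 30 12:02 other@example.org' \
  '-- 3 Kbytes in 2 Requests.' | count_queued   # prints 2
```

In practice one would run `postqueue -p | count_queued` periodically to see whether the backlog is growing or draining.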
After checking which connections were okay and which were not, we found that all the bad connections went over our server gateway "schilt", while all the good ones did not. So the gateway to our server subnet seemed to be at the core of the problem. We then also found entries in its logfiles that explained the packet loss, and therefore probably the other problems as well. Since the problem affected quite a vital function, after a short discussion we decided to reboot schilt, thereby cutting off all our servers for the duration of the reboot. And since all of us were at home and a reboot of our central gateway can't easily be monitored remotely, one of us drove to the server room at Hönggerberg to conduct and supervise the reboot.
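The pattern we looked for, every bad connection crossing schilt and every good one avoiding it, can be sketched as a small shell helper. This is only an illustration of the grouping step; the hop names below are invented, and a real check would feed it live `traceroute` output for each server:

```shell
# Report whether a path (traceroute-style output on stdin) crosses a
# given gateway. Prints "via" or "not-via".
path_via() {
  gw="$1"
  if grep -qi "$gw"; then echo via; else echo not-via; fi
}

# Canned examples (hop names are illustrative):
printf '1 router-a\n2 schilt\n3 mailserver\n' | path_via schilt   # via
printf '1 router-a\n2 other-gw\n3 webserver\n' | path_via schilt  # not-via
```

Running something like `traceroute -n server | path_via schilt` for each affected and unaffected server lets the correlation stand out quickly.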
At first it looked as if the reboot worked fine. Since the box had an uptime of about 275 days, a filesystem check was necessary. That went fine, but after the check the system started complaining about memory problems and no longer reacted to keystrokes. It had to be shut down the hard way by switching off the power. After the box had cooled down a bit, we powered it on again, and it has been working fine since then; only two RAIDs had to be resynced after the hard poweroff.
As a result of this outage we will have to check the hardware of the server gateway, especially the memory, sometime next month, so there will probably be some announced server downtimes during January for those hardware checks. We'll try to keep them as short as possible.