pfew – that was *close*

August 24th, 2005

Our webserver at work went unresponsive a few minutes ago, and I thought I was going to need to reboot the thing, which would have been a shame:

erika@lpdweb erika $ uptime
 14:46:55 up 299 days, 14:21,  1 user,  load average: 0.06, 0.09, 0.07

I know that’s not the longest uptime ever, but it’s not too shabby. You just hate to reboot once you’ve gotten up that high.

I ended up just having to plug into the console and restart the network stack. Things came back up fine after that. I’ll have to grep through the logs now to see what really happened.

My boss likes to (sarcastically) joke with me about how I’m always rebooting our Linux servers to fix things… I’m glad that I didn’t have to reboot this time. It would have given him more ammo to use against me :-)

Categories: general
  1. August 24th, 2005 at 21:06 | #1

    Yea, the ones that really suck are the Linux kernel versions where the uptime counter rolls over at around 500 days. A few days before, everyone gets excited, and then “wait a second, why did it reboot? Wait, stuff’s been running since 2 years ago, WTF?” Anyways, I’ve found that between 400 and 600 days, a “major event” usually occurs. Last year it was a 13.8 kV 1200A breaker that blew up. Took out pretty much everything inside the cabinet, including the PTCs. No metering means no automatic load shedding, and the generators collapsed under the load. Once they did get running, we couldn’t transfer off them without interruption either. Of course, since nobody was around, the power was off for like 4 hours, and the UPSes don’t last that long. And it required 3 planned 6-hour outages to shut down the electrical grid and get it completely rebuilt. So yea, I start worrying if too many servers get high uptimes now. :)

  2. erik
    August 24th, 2005 at 22:19 | #2

    Heh – I wouldn’t like to be around when a 1200A breaker goes. The last place I worked had a huge room-sized UPS and a three-way transfer switch to go along with it…you know, the ones where it transfers to the UPS for a few minutes while the diesel warms up and then switches to the generator. Anyway, I was in the room a few times when that sucker switched from utility to UPS, and man, that’s about as close to wetting myself as I’ve ever been. If you’re within 20 feet of the thing you swear that you’ve been shot.

    It’ll be interesting to see how long this uptime lasts. I’d be well over 400 days now if it wasn’t for a random 10-hour outage last November that took out a quarter of downtown. But yes, I know what you mean, something always seems to happen when uptimes start pushing a year. Well, I look at it this way – forced reboots are just an opportunity to put an upgraded kernel image in place. I’m still running a 2.4 kernel on this box… yes, I know – it’s ancient :-) When I originally built this box, the 2.6 tree didn’t have good support for the Broadcom chipset Dell decided to put in their PowerEdge servers, so I stuck with 2.4. The driver’s stable now in the latest 2.6 releases, so I’ll gladly upgrade to that the next time I need to reboot.

  3. Bossman
    August 25th, 2005 at 09:48 | #3

    I don’t know what the deal is, but Erik is constantly rebooting those Linux servers.

  4. erik
    August 25th, 2005 at 09:55 | #4

    Blah blah blah… :-)

    To be fair, a few of our Windows file servers and our Exchange ’03 server have been up for about the same time… which is quite incredible if you ask me.

  5. Tim
    August 25th, 2005 at 10:30 | #5

    Are you not patching them? :-) It seems that at least half of the patches require a reboot..

  6. August 25th, 2005 at 22:51 | #6

    A transfer switch to the UPS? That sounds risky – I think I prefer an online system. You’re making me curious how loud our switch gear is now though.

    Your Windows servers stay up that long? I guess for non-redundant stuff I have some that have been up a decent length of time, it just seems weird. For stuff like Domain Controllers and departmental servers, it’s not worth the bother though. I set up a staged install schedule from a SUS server for just the servers, and when I approve the patches they just follow the system recommendations on reboots. I never really looked at how often, just that everything gets in there. It looks like only every other month requires reboots now, and the apparently less-critical ones (the ones they release randomly rather than on the second Tuesday) appear more likely to require a reboot than the Tuesday ones. Weird.

  7. erik
    August 26th, 2005 at 01:30 | #7

    Yah – I was always curious why they didn’t have an online UPS…

    I *should* really set up a SUS server one of these days. Can’t be that difficult. I’m sure MS has some stellar documentation for the process. Ahem.