Wednesday, May 19, 2010

Trac and forum down

It all started on Sunday around 6 AM GMT: our 4-year-old dedicated server stopped responding, and even a hardware reboot didn't bring it back up. I opened a ticket, and the technicians found that the power supply had died and quickly replaced it.

Everything worked fine until Monday morning around 9 AM GMT, when the server became unstable. I first suspected Apache because, during my tests, its process hit 100% CPU several times when it crashed.
Then I tried stopping Apache and MySQL, the two most resource-hungry processes (CPU usage averaged 6% without them), and even then the server crashed after 15 minutes.

I thought our kernel might have been corrupted by the server crash, so I tried one of their netboot kernels (as well as the hardware testing mode), but it kept crashing.

So, I just opened another ticket for this issue. I really wonder what's going on.

Also, the migration of the forum and Trac to the new server was planned for the end of this month, but it might happen sooner than expected (I'll try to do it this weekend).

I'll keep you updated.

4 comments:

  1. Maybe the power supply or RAM, then?

  2. It's a possibility.

    In the last report, they thought it might be the CPU fan or the RAM. A CPU fan failure could explain it, as could another defective power supply, but I don't think it's a RAM failure (it worked fine for 4 years).
    They replaced the entire server (except for the hard disk) and it worked.

  3. They got their money back on it many times over, so they did the easiest and most responsible thing for a good customer. I have a VPS server and they won't answer any questions at all because they (VPS) are unmanaged. =(
    I have to figure everything out 4 myself as long as I can log into my Cp-WHM.
    Good 2 hear all is well now! Nice work.

  4. tom, I have the same problem when I ask customer service questions. Last year, for example, when we had that traffic issue, they kept avoiding my questions and gave the default answer.

    For this issue, I opened an incident and I guess they take it much more seriously as it doesn't happen as often.

    Fortunately, it works fine most of the time ;)
