Fiber damage causes 6-hour netBlazr outage

I suppose it had to happen: our first major outage.  Mostly I worry about wireless issues, but to the extent they cause problems at all, the problems tend to be local.  What happened last night was caused by damage to one of the fibers in a Cogent cable between the basement and the 54th floor of the John Hancock Tower, and the result was an outage affecting everyone on our Back Bay network.  It started at 3:45pm on Friday and, with the help of a Cogent technician who was dispatched to the Hancock Tower, we were back up and running on a replacement fiber at 9:54pm.  For those interested in the gory details, read on.

At around 3:30pm on Friday I was investigating some strange behavior. Throughput in the upstream direction of our fiber connection from the John Hancock Tower to the Internet had fallen off dramatically.  While I was attempting to diagnose the problem remotely, we lost all connectivity (at 3:45pm).  My first assumption was that something I’d done had crashed our router at the Hancock.  Luckily, the MikroTik routers have a “safe” mode, which I was using.  This restores the original settings if remote control is lost, so I expected everything to recover in 2-5 minutes.  When that didn’t happen we got really worried.  Jason manned the Internet side of the link while I drove to the John Hancock Tower and connected directly to our Hancock router.

Well actually, before I went into the Hancock, I wasted more than 25 minutes parked on Berkeley Street, logging into our router wirelessly.  This seemed like a good idea as, what with checking in and getting keys, the process of getting from the street to the 61st floor of the Hancock Tower typically takes more than 15 minutes.  Unfortunately, the problem couldn’t be diagnosed with just remote access.  So I ended up having to go up to the 61st floor anyway.  Here’s our rack on the 61st floor:

[Photo: netBlazr rack]

Once there, I quickly established that we had a working wired connection from our router to Cogent’s media converter but no data was getting through.  Of course I tried resetting the media converter and the router.  I also power cycled them and replaced the Cat5e jumper cable.  All to no avail, so we called Cogent support.  Actually Jason had to make the call, as Verizon cell phone coverage in the mechanical rooms on the 61st floor of the Hancock Tower is very marginal.  So, while Jason had a good connection with Cogent’s support team, I was intermittently partied in.  Cogent support led me through the same resets and power cycles.  They also had me connect my Mac directly to their media converter.  When they still couldn’t see anything connected to the media converter from their end, they dispatched their field tech.  By now it was past 6:30pm.

The field tech was scheduled for 8pm but called around 7:50pm to say he was stuck in traffic coming up the Southeast Expressway.  Meanwhile I was occupied.  By 6:30 it had gotten quite dark and there is no lighting in the corner of the 61st floor where the netBlazr rack is located.  So I went off to get some kind of work light.  In the end I picked up a nice table lamp at Marshalls on Boylston Street and a light bulb at CVS.  I’ll try and remember to take a picture the next time I’m there.  It’s a very nice table lamp that now sits on top of our rack. 🙂

By 8:30pm I was back on the 61st floor with the Cogent tech. He replaced the media converter to no avail.  He then checked signal levels and realized we had light levels below the allowable threshold.  We adjourned to the basement telephone room where he replaced the line card that feeds netBlazr’s fibers. Back to the 61st floor, but signal levels are still below spec.  So it’s a cable problem.

The fibers that feed netBlazr go from a Cogent electronics cabinet in the basement phone room to a junction box in a phone closet on the Concourse level, then to a junction box on the 54th floor and then to the wall near netBlazr’s rack on the 61st floor.  I’ve now seen more of the guts of the Hancock Tower than I really wanted to, especially on a Friday night when I’d had other plans.

By process of elimination, the problem turned out to be in the 12-strand riser cable between the Concourse level and the 54th floor.  Luckily, there were two spare fiber strands in that cable, so we got swapped over.  Back on the 61st floor, the signal levels checked out, so we plugged in and, sure enough, data both ways!  The routing tables appear to have settled in less than 30 seconds, i.e. before I had time to check more than one or two things.  In any event, I quickly got a flood of messages from netBlazr’s monitoring system as 68 monitored points all came back on-line at 9:54pm.
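
For the curious, here is a rough Python sketch of that kind of process of elimination. It assumes you can take a light-level reading at each point along the fiber path and compare it to a minimum acceptable level. The readings, the threshold and the function name are all made up for illustration; they are not the numbers the Cogent tech measured, and this is not necessarily how he went about it.

```python
# Points along the fiber path, in order from Cogent's cabinet to our rack.
PATH = [
    "basement phone room",
    "Concourse junction box",
    "54th floor junction box",
    "61st floor wall",
]


def find_bad_segment(readings_dbm, threshold_dbm):
    """Return (upstream, downstream) for the first segment where the light
    level goes from acceptable to below threshold, or None if none does."""
    for upstream, downstream in zip(PATH, PATH[1:]):
        good_in = readings_dbm[upstream] >= threshold_dbm
        good_out = readings_dbm[downstream] >= threshold_dbm
        if good_in and not good_out:
            return (upstream, downstream)
    return None


# Hypothetical readings in dBm (illustrative only, not measured values).
readings = {
    "basement phone room": -5.0,
    "Concourse junction box": -6.0,
    "54th floor junction box": -31.0,
    "61st floor wall": -32.5,
}
THRESHOLD_DBM = -20.0  # hypothetical minimum acceptable light level

print(find_bad_segment(readings, THRESHOLD_DBM))
```

With these made-up numbers the light is fine at the Concourse junction box and too weak at the 54th floor, which points at the riser cable between the two.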

I’ll postpone reflections, lessons learned and things to improve for a later post.