Friday, March 18, 2016

Nexus 7k ISSU code update issues

So as part of that datacenter migration I'm working on, I need to get all my Nexus 7000s on the same code revision. First because they are a few years out of date, but more importantly because newer code has the ability to have dissimilar VLANs on each side. I fly out to the first DC, and since it's a DR site with a change window for the whole time I'm there, I start preparing for the first code upgrade. As always, I start by getting a backup of the running config (both on and off chassis), as well as an overview of routing protocol neighbors, routing tables, etc. to validate against in case of issues.
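For the validation part, here is a minimal sketch of how a saved "before" capture of neighbor output could be compared against an "after" capture offline. It is not tied to any particular show command: the function names are made up for illustration, and it assumes each neighbor line in the saved text begins with an IPv4 address (a neighbor or router ID), which may not hold for every protocol or output format.

```python
import re

# Assumption: neighbor lines in the saved show output begin with an IPv4
# address (for example a neighbor or router ID); header lines do not.
_LEADING_IPV4 = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\b")


def neighbor_ids(snapshot_text):
    """Return the set of leading IPv4 addresses found in a saved show output."""
    ids = set()
    for line in snapshot_text.splitlines():
        match = _LEADING_IPV4.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def compare_snapshots(before_text, after_text):
    """Return (missing, new) neighbor IDs between two saved outputs, sorted."""
    before = neighbor_ids(before_text)
    after = neighbor_ids(after_text)
    return sorted(before - after), sorted(after - before)
```

Comparing only the leading address, rather than whole lines, keeps fields that always change across a reload (uptime, dead timers) from showing up as differences. Fed the text of two captures, `compare_snapshots` returns the neighbors that disappeared and the ones that are new; two empty lists mean the same set of neighbors came back.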

NTG7010-CORE#copy run start
[########################################] 100%
Configuration update aborted: request was aborted

What? No it wasn't. Try again, same result. Not how you want to start a "simple" upgrade. I start looking into it and it begins to look like a bug. We find a few similar situations, like Geeky Nick's issue. While not 100% identical, there were a lot of similarities, and essentially it amounted to systems with a long uptime filling some files to the point that no space remains (I can't seem to find the bug ID). The "workaround" is to reboot the device to clear it.

I reboot the standby supervisor, thinking that if I can reboot that one, do a switchover, and then reload the previously active supervisor, I can work around the issue. Unfortunately, the standby then went into a boot loop. This was something Geeky Nick's blog post had as well. We couldn't get the boot image onto the standby (it wouldn't recognize USB, or TFTP on my laptop), so I grabbed the show run and saved it in a few different places (TFTP worked, as did writing to a USB stick), issued the reload command, and hoped it worked.

Thankfully it did. No config was lost and I could save to bootflash. Yay! From there the upgrade went as planned, with a minor exception that I'll talk about in a later post. When doing an ISSU, bugs from the previous version can carry over through the upgrade, even if the bug is fixed in the release you end up on. To resolve it we had to reload the modules.