I’ve now been running ZeroShell 3.5.0 for about a month without crashes (uptime is currently about 28 days). I’m still not sure why ZS 3.2.1 was stable on my PC while later versions crashed frequently.
My problem while debugging the crashes was that they occurred randomly. Sitting in front of a monitor and waiting for crash-related output was not an option, because there were never any crashes while I was watching…
Since I assumed from the beginning that the reboots were caused by crashes (in the Linux kernel these are called ‘kernel panics’), I read a lot about debugging kernel panics.
Even so, I found no way to catch the panic output, and I also found no way to activate a crash kernel as on an Ubuntu system, where it handles the kernel’s panic output and writes it to the log files.
So I had another idea. I installed ZS to the HDD and switched the console output from VGA to the serial interface. Then I connected the router’s serial port with a null modem cable to a second PC running a terminal session. That way I could write all of the ZS console output to a file, and I was finally able to catch the kernel panic output.
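For anyone who wants to do the capture without a terminal program, here is a minimal Python sketch of the logging side on the second PC. It is not what I actually used. The function names `log_console` and `capture`, the device path `/dev/ttyS0` and the log file name are only assumptions for illustration, and it assumes the serial port has already been set to the same speed as the router's console by some other means.

```python
import time


def log_console(stream, log_file, clock=time.time):
    """Copy every line from a binary stream to log_file with a timestamp.

    Returns the number of lines written. The timestamp shows when the
    line arrived, which helps to see how long the box ran before a panic.
    """
    count = 0
    for raw in stream:
        # Console output may contain stray bytes; never fail on decoding.
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        log_file.write(f"{stamp} {line}\n")
        log_file.flush()  # a crash can come at any time, so flush each line
        count += 1
    return count


def capture(device="/dev/ttyS0", log_path="zs-console.log"):
    """Hypothetical entry point: device and log_path are assumed names."""
    with open(device, "rb", buffering=0) as port, open(log_path, "a") as log:
        return log_console(port, log)
```

Calling `capture()` would then append everything the router prints to `zs-console.log` until the port is closed. Flushing after every line is deliberate, because the interesting output is by definition the last thing written before the reboot.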
The kernel panics were caused by MCEs (machine check exceptions). I googled a lot and learned that MCEs are usually caused by hardware malfunctions, so I ran a lot of tests on the router’s hardware, but none of them showed any errors.
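Once the console output is in a file, finding the relevant lines can be automated. The following is only a sketch: the function name `find_panic_lines` and the search pattern are my own assumptions, and the exact wording of panic and MCE messages differs between kernel versions, so the pattern may need adjusting.

```python
import re

# Assumed markers for panic / machine check output; matched case-insensitively
# because the capitalisation varies between kernel versions.
PANIC_PATTERN = re.compile(
    r"kernel panic|machine check|\bmce\b|hardware error",
    re.IGNORECASE,
)


def find_panic_lines(lines):
    """Return (line_number, text) for every line that looks panic-related.

    Line numbers are 1-based so they match what a text editor shows.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if PANIC_PATTERN.search(line):
            hits.append((number, line.rstrip("\r\n")))
    return hits
```

With a captured log this could be used as `find_panic_lines(open("zs-console.log", errors="replace"))`, where the file name is again just an example.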
Then (after even more googling) I found a hint that MCEs can also occur because of bugs in the Intel processor’s firmware, called ‘microcode’. One way to update the Intel microcode is to upgrade the BIOS, and I did in fact find a newer BIOS version for the router’s mainboard that contains a newer version of the microcode.
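To check whether a BIOS update really changed the microcode, the revision can be compared before and after. On x86 Linux systems `/proc/cpuinfo` normally contains a `microcode` line per CPU; I am assuming here that the ZS kernel exposes it too. The function names below are made up for this sketch.

```python
import re

# Matches lines such as "microcode\t: 0x1c" in /proc/cpuinfo.
MICROCODE_LINE = re.compile(r"^microcode\s*:\s*(\S+)", re.MULTILINE)


def microcode_revisions(cpuinfo_text):
    """Return the sorted, distinct microcode revisions found in the text.

    Normally all cores report the same revision, so the list has one entry.
    """
    revisions = {int(value, 16) for value in MICROCODE_LINE.findall(cpuinfo_text)}
    return sorted(revisions)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; the default path is the usual Linux location."""
    with open(path, "r") as handle:
        return handle.read()
```

Running `microcode_revisions(read_cpuinfo())` once before and once after the BIOS update and comparing the numbers shows whether the new BIOS actually loaded a newer microcode.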
Since updating the BIOS, the router has been running stably with ZS 3.5.0 and the frequent crashes are gone.