Detecting Trouble

This article explains how to recognise problems reported on your server’s Recovery Console.

What should be on my console?

Your server’s console shows the output you’d normally see on the screen, and provides a “terminal of last resort” in case of network problems.

For a Linux server, you would expect to see a login prompt along these lines:

Debian GNU/Linux 8.0 stoneboat.vm.bytemark.co.uk ttyS0
stoneboat.vm.bytemark.co.uk login:

If you don’t see that, press Enter a couple of times and the login prompt should appear.

What shouldn’t be on your console

There are several common, critical errors which you may see on your console. Our systems try to catch these errors and let you know when they spot them, as they usually indicate a problem which requires action. Here is a non-exhaustive list:

Out of memory

In short: Your system has run out of memory and has started killing processes to free some up. As a result, your server may be running very slowly, or may be giving “connection refused” to web requests.

Common cause: You’ve installed an application that is gobbling memory, either slowly through normal use, or because your site has just been slashdotted and it is trying to service many more requests than usual.

The kernel usually reports memory exhaustion many times over, like this:

[4430031.262899] apache2 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
[4430031.270801] Pid: 12027, comm: apache2 Not tainted 2.6.27.2 #1
[4430031.276953] 
[4430031.276954] Call Trace:
[4430031.281528] [<ffffffff8027ad3b>] oom_kill_process+0x57/0x1ec
[4430031.287690] [<ffffffff80250b96>] getnstimeofday+0x39/0x98
[4430031.293598] [<ffffffff8027b2d1>] out_of_memory+0x202/0x29e
[4430031.299595] [<ffffffff8027e65b>] __alloc_pages_internal+0x31d/0x3c2
[4430031.306419] [<ffffffff8027fe46>] __do_page_cache_readahead+0x79/0x183
... blah blah blah ...
[4430031.509759] 258032 pages RAM
[4430031.512970] 4944 pages reserved
[4430031.516423] 41878 pages shared
[4430031.519817] 246905 pages non-shared
[4430031.523713] Out of memory: kill process 10857 (apache2) score 63210 or a child
[4430031.531461] Killed process 11499 (apache2)
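
If you have saved a copy of the console output, a short script can pull out which processes triggered the OOM killer and which were killed. The following is a minimal Python sketch, not a Bytemark tool: the function name summarise_oom is made up for illustration, and the patterns are based only on the message wording in the sample above, which differs between kernel versions.

```python
import re

# Patterns based on the message wording in the sample above (assumption:
# your kernel uses the same wording; newer kernels phrase these differently).
INVOKED = re.compile(r"(\S+) invoked oom-killer")
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def summarise_oom(log_text):
    """Return (invokers, kills); kills is a list of (pid, process name)."""
    invokers = INVOKED.findall(log_text)
    kills = [(int(pid), name) for pid, name in KILLED.findall(log_text)]
    return invokers, kills


sample = """\
[4430031.262899] apache2 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
[4430031.523713] Out of memory: kill process 10857 (apache2) score 63210 or a child
[4430031.531461] Killed process 11499 (apache2)
"""

if __name__ == "__main__":
    invokers, kills = summarise_oom(sample)
    print("oom-killer invoked by:", invokers)
    print("processes killed:", kills)
```

If the same process name keeps appearing in the list of kills, that is a reasonable place to start looking for the application that is gobbling memory.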

Disc failure

In short: One of your Dedicated Server’s discs is failing, or has failed, and will need to be replaced. Because our Dedicated Servers have at least two discs set up as mirrors, this is not catastrophic: once the failing disc is replaced, the RAID array will rebuild.

Resolution: Bytemark will usually spot this and get in touch with you to arrange a disc swap, but our monitoring should be treated as a backstop: it is your responsibility to inform us of possible hardware failure. Proper monitoring tools on your server will help you with this, and the Check RAID Status documentation explains how to detect possible disc failure.

A disc failure can cause various messages, depending on the type of failure and the type of controller in your server. Here are some:

[20.610934] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[20.611002] ata3.00: BMDMA stat 0x4
[20.611069] ata3.00: cmd ca/00:08:4f:10:88/00:00:00:00:00/ee tag 0 dma 4096 out
[20.611072] res 51/84:00:4f:10:88/00:00:00:00:00/ee Emask 0x10 (ATA bus error)
[20.611221] ata3.00: status: { DRDY ERR }
[20.611278] ata3.00: error: { ICRC ABRT }
[20.611355] ata3: soft resetting link

[37.841026] 3w-xxxx: AEN: ERROR: Unit degraded: Unit #1.
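
As a rough illustration of the kind of check a monitoring script might make, here is a Python sketch that flags lines resembling the two samples above. It is an assumption-laden example rather than a complete detector: the labels and the function name find_disc_warnings are invented here, and the patterns only cover the ATA and 3w-xxxx messages shown in this article. Other controllers log different messages, and a match is a prompt to investigate, not proof that a disc has failed.

```python
import re

# Patterns taken from the sample messages in this article only.
# The label names are hypothetical and have no meaning outside this sketch.
DISC_PATTERNS = {
    "ata_exception": re.compile(r"\bata\d+\.\d+: exception\b"),
    "ata_error": re.compile(r"\bata\d+\.\d+: error: \{[^}]*\}"),
    "raid_degraded": re.compile(r"Unit degraded: Unit #\d+"),
}


def find_disc_warnings(lines):
    """Return a list of (label, line) for lines matching a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in DISC_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
                break  # one label per line is enough for a warning
    return hits


sample_lines = [
    "[20.610934] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6",
    "[20.611278] ata3.00: error: { ICRC ABRT }",
    "[20.611355] ata3: soft resetting link",
    "[37.841026] 3w-xxxx: AEN: ERROR: Unit degraded: Unit #1.",
]

if __name__ == "__main__":
    for label, line in find_disc_warnings(sample_lines):
        print(label, "->", line)
```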

Other kernel backtrace

In short: The kernel reports some other message with the word BUG and a Call Trace, and possibly a panic message. There is no single cause for the kernel to report a bug or to generate a backtrace. This can indicate a serious error condition that should not occur in a production server. It can be triggered either by genuine kernel bugs (not common these days as our hardware is well supported by modern distro kernels) or faulty memory.

Resolution: Bytemark are generally happy to swap out memory in the face of mysterious kernel backtraces or panics, and will help you track down which exotic kernel features you might be using that could trigger a panic.

Here is an example of a non-fatal kernel bug:

BUG: soft lockup detected on CPU#1!

Call Trace:
 <IRQ> [<ffffffff802add77>] softlockup_tick+0xdb/0xed
 [<ffffffff8028abe1>] update_process_times+0x42/0x68
 [<ffffffff8026f372>] smp_local_timer_interrupt+0x23/0x47
 ...blah blah blah ...

A full-on panic looks like this, and will be the last message the kernel prints before it stops working altogether:

[7402303.093024] Kernel panic - not syncing: Aiee, killing interrupt handler!
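
To tell the two cases apart in saved console output, you could look for the marker strings described above. This Python sketch assumes the markers are exactly “BUG:”, “Call Trace:” and “Kernel panic”, as in the samples; the function names and the severity ordering are this example’s own choices. Note that out of memory reports also include a Call Trace, so a trace on its own does not mean a kernel bug.

```python
# Severity ranking is this sketch's own choice, not a kernel convention.
SEVERITY = {"trace": 1, "bug": 2, "panic": 3}


def classify_kernel_line(line):
    """Return 'panic', 'bug', 'trace', or None for a single console line."""
    if "Kernel panic" in line:
        return "panic"
    if "BUG:" in line:
        return "bug"
    if "Call Trace:" in line:
        return "trace"
    return None


def worst_event(lines):
    """Return the most severe kind of event seen in lines, or None."""
    worst = None
    for line in lines:
        kind = classify_kernel_line(line)
        if kind and (worst is None or SEVERITY[kind] > SEVERITY[worst]):
            worst = kind
    return worst


if __name__ == "__main__":
    console = [
        "BUG: soft lockup detected on CPU#1!",
        "Call Trace:",
        "[7402303.093024] Kernel panic - not syncing: Aiee, killing interrupt handler!",
    ]
    print(worst_event(console))
```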

Common noise

This one is caused by interactions with broken servers elsewhere on the internet:

TCP: Treason uncloaked! Peer 84.160.113.134:49835/80 shrinks window 136462172:136518752. Repaired.

Some firewalls end their rule lists with an instruction that tells the kernel to log and drop any traffic that hasn’t been accepted; the log goes to the console and looks like this:

IPT INPUT packet died: IN=eth0 OUT= MAC=00:1a:4d:f7:27:50:00:1c:b0:c8:7b:80:08:00 SRC=122.193.239.14 DST=89.16.184.172 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=60431 DF PROTO=TCP SPT=56093 DPT=4899 WINDOW=65535 RES=0x00 SYN URGP=0
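
These lines are mostly KEY=value pairs, so they are easy to pick apart if you want to see who is knocking on which port. The Python sketch below assumes the layout shown in the sample: a free-text prefix (here “IPT INPUT packet died:”, which is set by the firewall configuration and will differ between servers), followed by KEY=value fields and a few bare flags such as DF and SYN. The function name parse_firewall_line is made up for this example.

```python
def parse_firewall_line(line):
    """Split a firewall log line into (prefix, fields, flags).

    prefix: the words before the first KEY=value token
    fields: dict of KEY -> value (value may be empty, e.g. OUT=)
    flags:  bare words after the first field, e.g. DF or SYN
    """
    prefix, fields, flags = [], {}, []
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
        elif fields:
            flags.append(token)
        else:
            prefix.append(token)
    return " ".join(prefix), fields, flags


sample = (
    "IPT INPUT packet died: IN=eth0 OUT= "
    "MAC=00:1a:4d:f7:27:50:00:1c:b0:c8:7b:80:08:00 "
    "SRC=122.193.239.14 DST=89.16.184.172 LEN=48 TOS=0x00 PREC=0x00 "
    "TTL=110 ID=60431 DF PROTO=TCP SPT=56093 DPT=4899 WINDOW=65535 "
    "RES=0x00 SYN URGP=0"
)

if __name__ == "__main__":
    prefix, fields, flags = parse_firewall_line(sample)
    print(prefix)
    print(fields["SRC"], "->", fields["DST"], "port", fields["DPT"], flags)
```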

Updated on February 20, 2019
