Does the following situation sound familiar? From one minute to the next,
your production servers grind to a halt, terse emails are followed by
frantic phone calls, and the first order of business is to get back up
and running. After the dust settles, you're usually left with a pile of log
files and the assignment of figuring out what happened, why it happened, and
what to do to keep it from happening again.
A common first step is trying to reproduce what has gone wrong. More often
than not, this consumes a considerable amount of time that would be better
spent on actually fixing the problem. In this first blog post of a series, I
will present a step-by-step guide to diagnosing stuck transactions within
minutes and show how a modern APM solution helps pinpoint common
production problems without first spending hours reproducing them.
The Problem: Re...
The killer in any IT operation is unplanned work. Unplanned work may go by
many names: firefighting, war rooms, Sev 1 incidents. The bottom line is that
Operations must stop whatever planned work it was doing to manage this drill.
This means little or no normal work is being accomplished. It is a scenario
most of you will be familiar with: your application servers are humming along
happily until suddenly, for no obvious reason, memory usage starts to
climb, soon followed by ever longer garbage collection suspensions that finally
force you to restart the application. The operati...