Building a redundant Syslog Server
Article created 2006-02-01 by
Rainer Gerhards.
For many organizations, syslog data is of vital importance.
Some may even be required by law or other regulations to make sure that
no log data is lost. The question is: how can this be accomplished?
Let's first look at the easy part: once the data has been received by
the syslog server and stored to files (or a database), you can use "just the
normal tools" to ensure that you have a good backup of them. This is pretty
straightforward technology. If you need to archive the data unaltered, you will
probably write it to write-once media such as CD or DVD and archive that. If you
need to keep your archive for a long time, you will need to make sure that the
media lasts long enough. If in doubt, I recommend copying the
data to new media every now and then (e.g. every five years). Of course, it
is always a good idea to keep vital data in at least two different locations and
on two different media sets. But again, all of this is fairly easy to manage.
We get onto somewhat slippery ground when it comes to short-term failures.
Your backup will not protect you from a hard disk failure. OK, we can use RAID 5
(or any other fault-tolerant RAID level) to guard against that. You can possibly
even write an audit trail (this comes for free with data written to a database,
but needs to be configured and consumes resources).
But what about a server failure? By the nature of syslog, any data
that is not received is probably lost. Especially with UDP based (standard)
syslog, the sender does not even know the receiver has died. Even with TCP based
syslog, many senders prefer to drop messages rather than stall processing (the
only other option left - think about it).
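To make the UDP point concrete, here is a minimal Python sketch. It is only an illustration under a few assumptions: the helper names send_udp_syslog and unused_local_udp_port are made up for this example, the message format is reduced to a bare priority prefix, and the "dead receiver" is simulated by a local port with no listener. On a typical system the send call returns normally although nobody receives the datagram.

```python
import socket


def send_udp_syslog(message: str, host: str, port: int) -> int:
    """Send one syslog-style datagram; return the bytes handed to the OS."""
    # <14> is the priority value: facility "user" (1) * 8 + severity "info" (6).
    payload = f"<14>{message}".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Unconnected UDP socket: the datagram is sent without any handshake
        # and without any acknowledgement from the receiver.
        return sock.sendto(payload, (host, port))


def unused_local_udp_port() -> int:
    """Return a local UDP port that has no listener (hypothetical helper)."""
    # Let the OS pick a free port; the probe socket is closed when the
    # with-block ends, so nothing listens on that port afterwards.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


if __name__ == "__main__":
    dead_port = unused_local_udp_port()
    sent = send_udp_syslog("disk almost full", "127.0.0.1", dead_port)
    print(f"sent {sent} bytes, no error reported, nobody received them")
```

From the sender's point of view, this call looks exactly the same whether the syslog server is alive or not, which is why the loss goes unnoticed.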
There are several ways to guard against such failures. The common
denominator is that they all have some pros and cons and none is absolutely
trouble-free. So plan ahead.
A very straightforward option is to have two separate syslog servers and
make every device send messages to both of them. It is quite unlikely that
both of the servers will go down at the same instant. Of course, you should make
sure they are connected via separate network links, different switches and use
differently fused power connections. If your organization is large, placing them
at physically different places (different buildings) can also be beneficial, if
the network bandwidth allows. This configuration is probably the safest to use.
It can even guard you against the notorious UDP packet losses that standard
syslog is prone to (and which go unnoticed most of the time). The
drawback of this configuration is that you have almost all messages at both
locations. Only if a server fails (or a message is discarded by the network)
do you have just a single copy. So if you combine both event repositories into a
central one, you will have lots of duplicates. The art of handling this is to
find a good merge process, which correctly (correctly is a key word!) identifies
duplicate lines and drops them. Identifying duplicates can be much harder than
it initially sounds, but in most cases there is a good solution. Sometimes a bit
of sender tweaking is required, but after all, that's what makes an admin happy...
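As a rough illustration of such a merge process, here is a small Python sketch. It rests on assumptions that will not hold everywhere: both servers store the line exactly as the sender emitted it (including the sender's own timestamp), and that timestamp is in an ISO 8601 style format with a common time zone, so that sorting the lines as text also sorts them by time. The function name merge_logs is made up for this example. The one subtle point it shows is that a sender may legitimately emit the same line twice within one second, so simply throwing all lines into a set would drop real messages. Instead, the sketch keeps, for each distinct line, the larger of the two per-server counts.

```python
from collections import Counter
from typing import Iterable, List


def merge_logs(server_a: Iterable[str], server_b: Iterable[str]) -> List[str]:
    """Merge the lines recorded by two syslog servers and drop duplicates.

    Assumption: identical text means "same message seen by both servers".
    """
    count_a = Counter(line.rstrip("\n") for line in server_a)
    count_b = Counter(line.rstrip("\n") for line in server_b)
    # Counter union keeps the maximum count per line. A message both servers
    # received once is kept once; a message one server lost is still kept;
    # a line the sender really emitted twice is kept twice.
    merged = count_a | count_b
    # Assumption: lines start with a sortable timestamp, so text order is
    # also chronological order.
    return sorted(merged.elements())


if __name__ == "__main__":
    a = [
        "2006-02-01T10:00:00Z host1 app: start",
        "2006-02-01T10:00:01Z host1 app: retry",
        "2006-02-01T10:00:01Z host1 app: retry",
    ]
    b = [
        "2006-02-01T10:00:00Z host1 app: start",
        "2006-02-01T10:00:01Z host1 app: retry",  # one "retry" lost on the way
        "2006-02-01T10:00:02Z host1 app: done",   # server A missed this one
    ]
    for line in merge_logs(a, b):
        print(line)
```

If the servers stamp messages with their own reception time instead, identical messages no longer have identical text, and the key used for comparison has to be built from other fields. That is the kind of sender (or receiver) tweaking mentioned above.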
The next solution is to use some clustering technology. For example,
you can use the Windows cluster service to define two machines which act as a
virtual single syslog server. The OS (Windows in this sample case) itself keeps
track of which machine is up and which one is not. For syslog, an active-passive
clustering scheme is probably best, that is, one where one machine is always in
hot standby (aka: not used ;)). This machine takes over processing only when the
primary one (the one usually active) fails. The OS handles the task of
virtualizing the IP address and the storage system. It also controls the
takeover of control from one syslog server software to the next. So this is very
little hassle from the application point of view. Senders also send messages
only once, resulting in half the network traffic. You also do not have to think
about how to consolidate messages into a single repository. Of course, this
luxury comes at a price: for one, you will not be guarded against
dropped UDP packets (because there is only one receiver at any one time). Probably
more importantly, every "failover" logic has a small delay. So there will be a
few seconds (up to maybe a minute or two) until the syslog server
functionality has been carried over to the hot standby machine. During this
period, messages will be lost. Finally, clustering is typically relatively
expensive and hard to set up.
The third possible solution is to look at the syslog server application
itself. My company offers WinSyslog
and MonitorWare Agent which can be
configured to work in a failover-like configuration. There, the syslog server
detects failures and transfers control, just like in the clustering scenario.
However, the operating system does not handle the failover itself, so the OS
obviously does not need to be anything special. This approach offers basically the
same pros and cons as the OS clustering approach described above. However, it is
somewhat less expensive and probably easier to configure. If the two syslog
server machines need not be dedicated, it can be far less expensive than
clustering - because no additional hardware for the backup machine would be
required. One drawback, however, is that the senders again need to be configured
to send messages to both machines, thus doubling the network traffic compared to
"real" clustering. However, the bandwidth used by syslog traffic is typically not
a problem, so that should not be too much of a disadvantage.
The question now is: how does it work? It's quite easy! First of all, all senders
are configured to send to both receivers simultaneously. The solution then
depends on the receiver's ability to see if its peer is still operational. If
so, you define one active and one passive peer. The passive peer checks if the
other one is alive (at short intervals). If the passive peer detects that the
primary one has failed, it enables message recording. Once it detects that the
primary is up again, it disables message recording. With this approach, both syslog
servers receive the messages, but only one actually records them. The message files
can then be merged for a nearly complete picture. Why nearly? Well, as with OS
clustering, there is a time frame where the backup does not yet take over
control, thus some messages may be lost. Furthermore, when the primary node
comes up again, there is another small window where both of the machines record
messages, thus resulting in duplicates (this latter problem is not seen with
typical OS clustering). So this is not a perfect world, but pretty close to it -
depending on your needs. If you are interested in how this is actually done, you
can follow our step-by-step instructions for our product
line. Similar methodologies might apply to other products, but for obvious
reasons I have not researched that. Have a look yourself after you are inspired
by the Adiscon sample.
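The logic of the passive peer can be sketched in a few lines of Python. To be clear, this is not how WinSyslog or MonitorWare Agent are implemented or configured; it is only a toy model of the behavior described above. The class name PassivePeer and the primary_is_alive callback are invented for this example, and the alive check is passed in as a function so that the sketch stays independent of any real network probe.

```python
from typing import Callable, List


class PassivePeer:
    """Toy backup receiver: gets every message, records only while the
    primary is believed to be down."""

    def __init__(self, primary_is_alive: Callable[[], bool]) -> None:
        self._primary_is_alive = primary_is_alive  # hypothetical health probe
        self.recording = False
        self.recorded: List[str] = []

    def check_primary(self) -> None:
        """Called at short intervals: record exactly while the primary is down."""
        self.recording = not self._primary_is_alive()

    def on_message(self, message: str) -> None:
        """Called for every received message; most are silently discarded."""
        if self.recording:
            self.recorded.append(message)


if __name__ == "__main__":
    status = {"alive": True}
    peer = PassivePeer(lambda: status["alive"])

    peer.check_primary()
    peer.on_message("m1")    # primary is up: backup discards it
    status["alive"] = False  # primary dies ...
    peer.on_message("m2")    # ... but this is not detected yet: message lost
    peer.check_primary()     # failure detected, recording enabled
    peer.on_message("m3")    # recorded by the backup
    status["alive"] = True   # primary comes back and records again ...
    peer.on_message("m4")    # ... backup still records too: duplicate
    peer.check_primary()     # recovery detected, recording disabled
    peer.on_message("m5")    # primary is up: backup discards it

    print(peer.recorded)     # ['m3', 'm4']
```

The walk-through shows both imperfections mentioned above: "m2" falls into the gap before the failure is detected, and "m4" ends up in both message files. A shorter check interval narrows both windows but cannot close them.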
What is the conclusion? There is no perfect way to handle syslog
server failure. Probably the best solution is to run two syslog servers in
parallel, the first solution I described. But depending on your needs, one of
the others might be a better choice for what you are trying to accomplish. I have
given pros and cons for each of them; hopefully this will help you judge what works
best for you.