Web services: failover strategies for stateful servers
In our project we have a stateful server. The server runs a rule engine (Drools) and exposes its functionality through a REST service. It is a monitoring system, and it is critical for us to have an uptime of more or less 100%. We therefore need strategies for shutting down a server for maintenance, and strategies for continuing to monitor an agent when one server is offline.
The first step might be to put a message queue or service bus in front of the Drools servers to hold messages that have not been processed yet, and to have mechanisms for backing up the state of a server to a database or other storage. That makes it possible to shut down a server for a few minutes to deploy a new version. The question is what to do when one server goes offline unexpectedly. Are there failover strategies for stateful servers? Any experience and advice is welcome.
There's no 'correct' way that I can think of. It rather depends on things like:
- How sensitive the rules are to changes in their time windows.
- How the application needs to be brought back up.
- The impact if events are missed.
- The impact if event monitoring is not up to the second.
- How the application raises events to the outside world.
Some ideas for enabling failover:
- Start from a clean slate. Examine how serious the impact of that would be before spending time developing anything else.
- Load a list of facts (today's transactions, perhaps) from a database, and potentially replay them in order, possibly whilst using a pseudo clock. I'm aware of this being used for pricing applications in the financial sector, although at the same time I'm aware of systems that can take a long time to catch up because of the number of events that need to be replayed.
- Persist the stateful session periodically. The interval can be determined based on how far behind the DR application is permitted to be, and how long it takes to persist the session. That way, the DR application can retrieve the same session from the database. However, there will be a gap in the events received, based on the interval between persists. Of course, if the reason for the failure is corruption of the session, this doesn't work so well.
- Configure the middleware to forward events to two queues, and subscribe the primary and DR applications to those queues. That way, both monitors should be in sync and able to make decisions based on the last minute of activity. Note that if one leg is taken out for a period it will need to catch up, and the middleware needs the capacity to store multiple hours (however long an outage might be) worth of events on the queue. Also, the rules need to work off the timestamp put on the event when it was queued, rather than the current time. Otherwise, when bringing a leg back up after an outage, it could raise alerts based on events in the wrong time window.
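To illustrate the point in the last idea about working off the event's own timestamp, here is a minimal sketch in plain Java rather than Drools. The class names `MonitorEvent` and `EventTimeWindow` are made up for this example, and it assumes events arrive in timestamp order. The window is measured relative to the timestamp of the event being processed, so a leg that is catching up after an outage sees the same window contents it would have seen live.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical event type; the name and fields are illustrative only. */
final class MonitorEvent {
    final String source;
    final Instant queuedAt; // timestamp assigned when the event was put on the queue

    MonitorEvent(String source, Instant queuedAt) {
        this.source = source;
        this.queuedAt = queuedAt;
    }
}

/** Counts events in a sliding window measured in event time, not wall-clock time. */
final class EventTimeWindow {
    private final Duration window;
    private final Deque<Instant> timestamps = new ArrayDeque<>();

    EventTimeWindow(Duration window) {
        this.window = window;
    }

    /**
     * Adds an event and returns how many events lie in the window that ends
     * at this event's own timestamp. Assumes events arrive in timestamp order.
     */
    int add(MonitorEvent event) {
        timestamps.addLast(event.queuedAt);
        Instant cutoff = event.queuedAt.minus(window);
        // Evict entries at or before the cutoff; the current time is never consulted.
        while (!timestamps.isEmpty() && !timestamps.peekFirst().isAfter(cutoff)) {
            timestamps.removeFirst();
        }
        return timestamps.size();
    }
}
```

Because `add` never reads the system clock, replaying an hour-old backlog gives the same counts as processing it live would have. If I remember correctly, Drools can do something comparable with its pseudo clock, but the sketch above deliberately avoids the Drools API.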
An additional point to consider when replaying events is that you don't want alerts raised to the outside world until you have completed the replay. For instance, you don't want 50 alert emails sent saying that applicationX is down, up, down, up, down, up, ...
I'll assume that the monitoring application might be pushing alerts to the outside world in some form. If you have a hot-hot configuration as in the fourth idea above, you need to control the alerts. I would be tempted to deal with this by configuring each application to push alerts to its own queue. The middleware would then forward alerts from the secondary monitor to a dead letter queue. On failover, reconfigure the middleware so that primary alerts go to the dead letter queue and secondary alerts go to the alert channel. The same mechanism could be used to discard events raised during replay recovery.
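The routing described above could look roughly like the following. This is only a sketch: `AlertRouter` and its methods are hypothetical stand-ins for whatever your middleware configuration provides, and the two in-memory lists stand in for the real alert channel and dead letter queue.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the middleware routing rules; all names are illustrative. */
final class AlertRouter {
    enum Leg { PRIMARY, SECONDARY }

    private Leg active = Leg.PRIMARY;
    private final Set<Leg> replaying = EnumSet.noneOf(Leg.class);

    final List<String> alertChannel = new ArrayList<>(); // alerts that reach the outside world
    final List<String> deadLetters = new ArrayList<>();  // alerts that are discarded

    /** Delivers an alert only if it comes from the active leg and that leg is not replaying. */
    synchronized void route(Leg from, String alert) {
        if (from == active && !replaying.contains(from)) {
            alertChannel.add(alert);
        } else {
            deadLetters.add(alert);
        }
    }

    /** Failover: from now on, only alerts from the given leg are delivered. */
    synchronized void failOverTo(Leg leg) {
        active = leg;
    }

    /** Marks a leg as replaying (or done replaying) so its alerts are discarded meanwhile. */
    synchronized void setReplaying(Leg leg, boolean on) {
        if (on) {
            replaying.add(leg);
        } else {
            replaying.remove(leg);
        }
    }
}
```

Both legs keep raising alerts as normal and know nothing about which of them is live. The decision sits in one place, so failover is a single switch (`failOverTo`) and neither monitor needs to be redeployed.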
Given the complexity and the potential mess that can arise from replaying events, for a monitoring application I would prefer starting from a clean slate, or going with persisted sessions. It may depend on what you are monitoring, though.