A colleague of mine, Dottie Shaw, blogged recently about why Durable Messaging matters. I agree with everything she says. Even more so, I’d add that of the system quality attributes, the one that is most endangered by the SOA approach, and therefore the one that we need to be the most aware of, is reliability.
Reliability takes many forms, but the definition that I work from comes from the IEEE. IEEE 610.12-1990 defines Reliability as “The ability of a system or component to perform its required functions under stated conditions for a specified period of time.”
The reason this becomes a problem in SOA is that the basic strength of SOA is the message, and the weakest link is the mechanism used to move it. If we create a message but cannot be certain that it gets delivered, then we have created a point of failure that is difficult to work around.
One friend of mine, Harry Pierson, likes to point out that the normal notion of ‘Reliable Messaging’ is not sufficient to provide system reliability. You need more. You need durable messaging. Durable messaging is more than reliable messaging, in his lexicon, because durable messages are stored and forwarded. Therefore, if a system goes down, you can always rely on the storage mechanism to keep the message from being lost. Reliable messages are kept in memory and simply retried until acknowledged, but lost if the sending system goes down during the process.
Of course, Harry and Dottie are not alone in this. In fact, when discussing reliability these days, web authors have started lumping the terms together for clarity. Just search on “reliable durable messages” to get a feel for how pervasive this linguistic gymnastics has become. Clearly, messages have to be durable in order to improve system reliability. Discussing one without the other has become passé.
Note that I view durability as an attribute of the message. I view reliability as a measurable condition of a system, usually measured in Mean Time Between Failures (MTBF). What becomes clear from this thread is this: in order to increase system reliability, especially in a system based on messages, we need to ensure message delivery, and the best way to do this is through message durability.
So, we need message durability to get system reliability. Cool.
Where do we get it from?
Well, durability requires that a message be stored and that a mechanism exist to forward it. (You heard me right… I just equated ‘durability’ with store-and-forward. Prove me wrong: find a single durable system that doesn’t, essentially, store the message and then forward it.)
By separating storage from forwarding, we get durability. The message is saved, and the time and place at which it is forwarded are decoupled from the system that sends it. Of course, the most demanding folks will ask for more than simple durability. They will ask that messages be delivered exactly once and in order. Not always needed, but nice when you can get it.
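To make that concrete, here is a minimal store-and-forward sketch in Python, using SQLite as the durable store. The schema, file name, and `send` callback are all illustrative assumptions, not any particular product’s API: the point is simply that the message is committed to disk first, and forwarding happens later, decoupled from the sender.

```python
import os
import sqlite3
import tempfile

# Hypothetical on-disk outbox: once a row is committed, the message
# survives a process crash and can be forwarded later.
db_path = os.path.join(tempfile.mkdtemp(), "outbox.db")
conn = sqlite3.connect(db_path)
conn.execute("""CREATE TABLE IF NOT EXISTS outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    body TEXT NOT NULL,
    sent INTEGER NOT NULL DEFAULT 0)""")

def enqueue(body):
    # Step 1: store. The sender's job ends at the commit.
    conn.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
    conn.commit()

def forward(send):
    # Step 2: forward. Runs independently of the sender; `send` is
    # whatever transport you have. Ascending ids preserve send order.
    rows = conn.execute(
        "SELECT id, body FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for msg_id, body in rows:
        send(body)
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (msg_id,))
        conn.commit()

enqueue("order-123")
delivered = []
forward(delivered.append)
```

The decoupling is the whole trick: the forwarder can run seconds or hours later, on a schedule, or after a restart, and the message is still there.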
So, in your SOA architecture, consider this: if you are sending messages from one point to another, and you wish to increase the reliability of your system, you need to find a way to store your message first, and then forward it.
To build a quality system, however, you want to consider more than one System Quality Attribute. Sure reliability is important, but if I build a system that is reliable yet brittle, I’d be a poor architect indeed.
We need to consider reliability… and… Agility, Flexibility, Scalability, and Maintainability and all the rest. Just as SOA reliability requires durability, SOA flexibility and SOA agility both require the use of standard transport mechanisms. SOA scalability and maintainability both require intermediability. So we need a solution that doesn’t sacrifice one for another.
Unfortunately, our platform is lacking here. To solve this problem, we need a mix of WCF, SSB, Biztalk, and good old-fashioned code. MSMQ should be able to do this, and it gets kinda close, but it sacrifices ease of operations, so no easy answer there.
On the project I’m on, we are using Biztalk for transactional messages, and for data syndication, we wrote our own mechanism based on SQL Agent and a durable protocol that gives us reliability without sacrificing intermediability and standard protocols.
Now if I could only get that out of the box…
8 thoughts on “Reliability in SOA is HUGE”
Thanks for that, now you’ve got my interest piqued.
I was wondering, what would be a good book to get started on SOA (primarily in terms of .NET & C#) ?
Sorry, Chris. I didn’t learn SOA from a book, so I can’t really recommend one. The good folks at skyscrapr.net may be able to help.
A few comments I have:
1. You don’t need the store and the forward to be sequential. You just need to acknowledge after you have stored; the forwarding can occur in parallel to that. Otherwise you are working at the speed of disks and not at the speed of the network.
2. If you really want reliability you need more than durable messages; you also need the endpoint of the message transport to be transactional, so that you can send or read the message within a transaction. Otherwise, for instance, you have a reliability problem when you read the message off of the transport: your service can fail after the message has already been deleted. (Note that what I am talking about here is not a transaction from the sender to the reader, but rather two separate transactions, one at each end.)
3. You can achieve durability with just SSB or MSMQ (but then you don’t use standard protocols); you should be able to implement it with WCF over MSMQ as well.
4. You can increase reliability without a reliable transport by taking care of the S&F on the sending and receiving sides while taking into account the possibility of duplicate messages (e.g. by making messages idempotent).
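Point 4 above, tolerating duplicate deliveries through idempotency, can be sketched in a few lines. The names are illustrative; a real system would keep the processed-id set in durable storage rather than in memory.

```python
# Idempotent consumer sketch (illustrative, not any specific product):
# every message carries an id, the receiver records which ids it has
# already handled, and a redelivered duplicate becomes a no-op.

processed = set()   # in a real system this set lives in durable storage
results = []

def handle(msg_id, body):
    if msg_id in processed:
        return            # duplicate delivery: safe to ignore
    results.append(body)  # stand-in for the actual work
    processed.add(msg_id)

handle(1, "credit $10")
handle(1, "credit $10")   # retried by an at-least-once transport
handle(2, "credit $5")
```

With this in place, the transport only has to guarantee at-least-once delivery; the receiver turns that into effectively-once processing.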
Thanks man, will definitely take a look over there.
I agree with every point.
I agree ‘store and forward’ doesn’t have to be strictly sequential.
WCF over MSMQ? Kinda gnarly…
Or were you thinking about the Biztalk adapter?
Either way, you still get the issue with non-standard protocols.
Transactional is important, but not for reliability. It is important for isolation. (The "I" in "ACID"). I’d view that as more important for the ability to integrate than the ability to operate.
Although, I suppose you could trace a failure to this root cause, and therefore affect MTBF.
To be honest, I’m willing to forgo transactions in a loosely coupled world if the application handshake handles compensation.
Lastly, I agree that you can achieve durability by building S&F into the endpoints. I would say, however, that you only really need it to be at one endpoint, not both.
Thanks for the reply,
Transactions solve the problem of message loss in the case of a server crash, even when you have durability.
Say I’m using a durable transport. My server gets the message off the "queue" and starts processing it, possibly writing some stuff to the DB. The server crashes for some reason, say a hardware problem. The DB transaction rolls back, so it stays consistent. If, however, the "queue" wasn’t enrolled in the transaction, the message will have been lost.
Message loss is a big deal.
So, the solution (as Arnon mentioned) is to use an endpoint that supports transactions as well as being durable. The server would open a (not-necessarily distributed) transaction, receive the message, process it, write results to the DB, and commit everything together.
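A rough sketch of that receive-process-commit pattern, using a single SQLite transaction as a stand-in for a transactional queue endpoint (an assumption for illustration, not MSMQ’s or SSB’s actual API): if processing fails mid-way, the rollback puts the message back on the queue.

```python
import sqlite3

# The "queue" and "results" tables share one database so that the
# dequeue and the business write commit (or roll back) atomically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE results (body TEXT)")
conn.execute("INSERT INTO queue (body) VALUES ('msg-1')")
conn.commit()

def receive_and_process(fail=False):
    try:
        row = conn.execute(
            "SELECT id, body FROM queue ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return
        msg_id, body = row
        conn.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
        if fail:
            raise RuntimeError("simulated server crash")
        conn.execute("INSERT INTO results (body) VALUES (?)", (body,))
        conn.commit()    # dequeue and result land together
    except Exception:
        conn.rollback()  # message is back on the queue, nothing lost

receive_and_process(fail=True)   # crash: the message survives
receive_and_process()            # retry succeeds
```

The crash degrades throughput for one retry cycle, but the message is never lost, which is exactly the distinction drawn above between a performance hit and a system-level failure.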
The result is that a server crash may affect overall system performance, but would not necessarily result in a system-level failure – whereas message loss is definitely a failure.
Am I making sense here?
I see your point. If we have a durable receipt mechanism, we want to pull the message off of the queue inside the transaction.
I will come back to system quality attributes. I view performance as a pretty important attribute. Therefore, I will only use these mechanisms, with their multiple stores, forwards, and transactions, as a mechanism for reliable async updates of persistent storage, not for read-only operations and only occasionally for synchronous write operations (I try to avoid those, if I can).
I have a problem with any mechanism that insists on placing reliability so high on the priority list that defending against a server crash (which happens once in a blue moon) is sufficient cause to slow everything down.
I take it that by “performance” you mean latency, and I agree, no surprise there.
However, the choice of a durable, disk-based, store and forward mechanism isn’t necessarily the only option. For instance, durability is important in terms of fault tolerance. However, we can also get fault tolerance by replicating, thus allowing ourselves to keep data in memory. This is what the data grid/space technologies do.
Replication has the added benefit of higher availability too.
The hard bit that these technologies have to deal with is transactions. The way they get around that is by using a partition-centric architecture, something that’s also called a Space-Based Architecture (SBA).
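A toy sketch of that replication idea (my reading of it, not any particular data-grid product’s API): the message stays in memory, but it is written to several replicas and acknowledged only once a majority hold it, so losing a single node loses nothing.

```python
# Three in-memory "nodes" standing in for a replicated data grid.
replicas = [set(), set(), set()]

def publish(msg):
    # Write to every reachable replica; acknowledge the sender only
    # once a majority hold the message (a simple quorum rule).
    acks = 0
    for store in replicas:
        store.add(msg)
        acks += 1
    return acks > len(replicas) // 2

ok = publish("order-123")
replicas[0].clear()   # simulate one node failing and losing its memory
survivors = sum("order-123" in s for s in replicas)
```

No disk in the write path, yet a single-node failure is survivable, which is the latency argument for the grid/space approach.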
Interesting times indeed 🙂