SOA and BLOBs — using SOA principles for block-oriented data transfer (Updated)

Abstract: What happens when a business transaction, in a Service Oriented Architecture, is too big to fit into a simple SOAP message?  This (updated) article describes a problem of this nature and the solution that allows block-oriented data transfer to work in an SOA-based application.

Introduction

Some things should simply not be done. 

If I have a large batch file, with ten thousand records in it, and I want to transfer it from point A to point B, using an SOA model to transfer each record, one at a time, is really dumb.  The folks who believe that “all things must be done in XML” will not gain any points with me on this.

On the other hand, sometimes a single record (a single business transaction) is big… big enough to consider block-oriented data transfer.  This article is about one such situation and how I am proposing to address it.

The big business transaction

I deal with documents.  Some of them are in “source” format (Word documents, PowerPoint presentations, InfoPath forms, even PDF documents), while others are simply scanned images of a document (in TIFF mostly).  These documents, the metadata that describes them, and the relationships that bind them, can form the basis for a “set of documents” that business can understand.  A good example would be the papers you have to sign when you buy a house.  There are deeds and warranty trusts and loan papers and all kinds of stuff.  I don’t know what they all are, but I do remember that it took hours for my wife and me to sign them all.

Together, these documents make a package.

And now for the problem: we want someone to be able to submit all or part of a package of documents, from one party to another, over the web. 

Doesn’t sound so hard, does it?  Surely, we aren’t the only folks to deal with something like this, but I haven’t seen many examples of how this is done in other XML-based solutions.  Not even legal e-filing, where this seems a natural requirement.  Perhaps I just missed it. 

A “business document package” contains header information and many documents.  The list of documents changes over time.  In other words, I can create a set with four documents, add a fifth, then replace the third.  Each document can be a TIFF or another large-format document file (too big to fit in a SOAP message on HTTP). 

The SOA Mismatch

Service oriented architectures usually present the notion of a “business document” or “business transaction.”  For the sake of clarity, I will use “business transaction” since my transactions themselves contain binary objects that just happen to contain documents… it would be too confusing to describe any other way.

So we have a business transaction.  This can be implemented in many ways.  SOA says that a document is self-contained and self-defining.  Therefore, the document set must be self-contained and self-defining.

Normally, in the SOA world, if a business transaction is updated, we could simply replace the entire transaction with entirely new values.  So, if a transaction is an invoice, we would find the existing invoice header, delete all the rows associated with it, replace the values in the header, and add in the rows from the document.  All this is done as “data on the inside.” 

The problem is that the entire contents of the business transaction are huge.  Our self-contained transaction contains the header information and all of the scanned documents.  If each document is 2 MB, and we have 14 of them, then the resulting 28 MB SOAP message starts to seriously stretch the capabilities of the protocol.  It is simply too big to send as one SOAP message without serious risk of HTTP timeouts.

So, we need the concept of an “incomplete” sub-transaction… and that’s where the solution lies. 

(Note from Nick: we decided to go a different direction.  I’ve added details at the end of this posting.)

The SOA solution

In our interaction, we have two computers: the sending side, where the transaction originates, and the receiving side, which needs to end up with all the data.  Both sides are computer applications with database support underneath.

The new transaction is created by the sending side.  It will send the document header and enough information for the receiver to know which documents survive into the final form.  Any existing documents that changed will be deleted from the receiving side.  When this process is done, all documents that don’t yet exist on the receiving side are represented as “incomplete” records in the receiving end’s database, along with some size data.

Now, the sending side asks the receiving side for the id of a document that is marked as “incomplete”.  The receiving side responds with a message stating that “SubDocument 14332 in document set AB44F is incomplete.  We have block 9 of 12”.

The sending side will then go to the database and extract enough data to send just one block… in this case block 10.  That could be simply 100K in size.  Wrap that up in a SOAP message and send it.  The receiving side will get the message, which contains a complete header document and the contents of this block.  The interaction is done, and will start over with the sending side asking for the id of a document that is marked as incomplete.
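As a rough illustration of the status exchange described above, the receiver's bookkeeping could be modelled as follows. This is only a sketch: the `IncompleteRecord` class, its field names, and `first_incomplete` are hypothetical stand-ins for whatever the receiving database actually stores.

```python
from dataclasses import dataclass


@dataclass
class IncompleteRecord:
    """Hypothetical receiver-side row for a document that is still arriving."""
    document_set_id: str
    document_id: str
    total_blocks: int
    blocks_received: int = 0

    @property
    def is_complete(self):
        return self.blocks_received >= self.total_blocks

    @property
    def next_block(self):
        # Blocks are numbered from 1, so the next one wanted is received + 1.
        return self.blocks_received + 1


def first_incomplete(records):
    """Return the first record that is still missing blocks, or None."""
    for record in records:
        if not record.is_complete:
            return record
    return None


# The example from the text: we have block 9 of 12, so block 10 is next.
record = IncompleteRecord("AB44F", "14332", total_blocks=12, blocks_received=9)
status = first_incomplete([record])
print(f"SubDocument {status.document_id} in document set "
      f"{status.document_set_id} is incomplete. "
      f"We have block {status.blocks_received} of {status.total_blocks}.")
print("Next block to send:", status.next_block)
```

The sender never tracks progress itself; it only reacts to whatever the receiver reports.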

The conversation

So, it looks like this:

Sender sends:

<MyDocumentSet id="849C751C-FF5C-4438-A3F0-055B9EE786E3">
   <Metadata Filer="Nick Malik" CaseNumber="ABC123" ...other stuff... />
   <Contents>
      <Document id="EBDE445D-5C26-43da-A142-E12A350EC1B6" name="MyDocument1.pdf" ...other header info... />
      <Document id="9E4F8C83-B2D1-4aee-8C53-B235D026CD1E" name="Document2.doc" ...other header info... />
      <Document id="05B10DAA-2A01-406b-AAB0-6BAEEF98F7A8" name="MyDocument3.ppt" ...other header info... />
      <Document id="7135612A-CE48-4371-ABFC-F8EF70DF76CF" name="MyDocument4.pdf" ...other header info... />
   </Contents>
</MyDocumentSet>

The receiver gets the message and checks to see if that document set already exists.  If it does not, the receiver simply creates the document set with four incomplete documents.  A much more interesting case happens if the document set already exists on the receiver side… so let’s look at that.

The receiver looks up document set 849C751C-FF5C-4438-A3F0-055B9EE786E3 and sees that it currently contains five documents.  The first three documents in the existing document set are named in the list above.  The fourth document above doesn’t exist in the existing document set, so it is an addition.  The other two documents in the destination document set must be deletions.

So we delete the two extra documents on the receiver side and add a document for MyDocument4.pdf, and flag it as incomplete.
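The comparison the receiver performs here amounts to two set differences. A minimal sketch, assuming documents are matched purely by id; the `reconcile` function and the two `OLD-DOC-…` ids are made up for illustration, since the article does not name the documents being deleted.

```python
def reconcile(existing_ids, incoming_ids):
    """Compare stored document ids with the ids named in a new header.

    Returns (to_delete, to_add): ids to remove from the receiver, and ids
    to create on the receiver as incomplete documents.
    """
    existing = set(existing_ids)
    incoming = set(incoming_ids)
    to_delete = [doc_id for doc_id in existing_ids if doc_id not in incoming]
    to_add = [doc_id for doc_id in incoming_ids if doc_id not in existing]
    return to_delete, to_add


# Five documents already stored on the receiver (last two ids are hypothetical).
existing = [
    "EBDE445D-5C26-43da-A142-E12A350EC1B6",
    "9E4F8C83-B2D1-4aee-8C53-B235D026CD1E",
    "05B10DAA-2A01-406b-AAB0-6BAEEF98F7A8",
    "OLD-DOC-A",
    "OLD-DOC-B",
]
# The four documents named in the sender's header.
incoming = [
    "EBDE445D-5C26-43da-A142-E12A350EC1B6",
    "9E4F8C83-B2D1-4aee-8C53-B235D026CD1E",
    "05B10DAA-2A01-406b-AAB0-6BAEEF98F7A8",
    "7135612A-CE48-4371-ABFC-F8EF70DF76CF",
]

to_delete, to_add = reconcile(existing, incoming)
print("delete:", to_delete)
print("add as incomplete:", to_add)
```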

Now, the sender asks the receiver for the id of any incomplete documents.  The receiver replies with the id of the fourth row above: 7135612A-CE48-4371-ABFC-F8EF70DF76CF and the fact that no blocks of data have been successfully stored.

The sender side gets this response and decides to send block one of that document.  It goes to the database, gets the first 50,000 bytes of data, encodes it with Base64 encoding, and sends it to the receiver as the following:

<MyDocumentSet id="849C751C-FF5C-4438-A3F0-055B9EE786E3">
   <Metadata Filer="Nick Malik" CaseNumber="ABC123" ...other stuff... />
   <DocumentBlock id="7135612A-CE48-4371-ABFC-F8EF70DF76CF" name="MyDocument4.pdf" ...other header info... >
      <Block totalblocks="12" thisblock="1" size="50000">
FZGl0OzI0NTk2MDs+Pjs+Ozs+O3…a really long string of base64 characters …
      </Block>
   </DocumentBlock>
</MyDocumentSet>

The receiver now appends this data to the current document on the receiving end.  Note that the receiver “knows” that, even though this message is complete, the document is not complete, because this is block 1 of 12 (see the <Block> tag above).
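To make the block message concrete, here is one way it might be built and read using Python's standard library. Treat it as a sketch under assumptions: the function names are hypothetical, the Metadata element and the other header attributes are left out for brevity, and the `size` attribute is taken to mean the number of raw bytes in the block.

```python
import base64
import xml.etree.ElementTree as ET

BLOCK_SIZE = 50_000  # raw bytes per block, as in the example above


def build_block_message(set_id, doc_id, name, data, block_number):
    """Wrap one block (numbered from 1) of `data` in a DocumentBlock message."""
    total_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    start = (block_number - 1) * BLOCK_SIZE
    chunk = data[start:start + BLOCK_SIZE]

    root = ET.Element("MyDocumentSet", id=set_id)
    doc = ET.SubElement(root, "DocumentBlock", id=doc_id, name=name)
    block = ET.SubElement(doc, "Block", totalblocks=str(total_blocks),
                          thisblock=str(block_number), size=str(len(chunk)))
    block.text = base64.b64encode(chunk).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def read_block_message(xml_text):
    """Return (thisblock, totalblocks, raw bytes) from a DocumentBlock message."""
    block = ET.fromstring(xml_text).find("./DocumentBlock/Block")
    payload = base64.b64decode(block.text)
    return int(block.get("thisblock")), int(block.get("totalblocks")), payload


# A 575,000-byte stand-in for MyDocument4.pdf, which splits into 12 blocks.
data = bytes(i % 251 for i in range(575_000))
message = build_block_message("849C751C-FF5C-4438-A3F0-055B9EE786E3",
                              "7135612A-CE48-4371-ABFC-F8EF70DF76CF",
                              "MyDocument4.pdf", data, block_number=1)
number, total, payload = read_block_message(message)
print(f"block {number} of {total}, {len(payload)} bytes")
```

Because every message carries its own block number and total, the receiver can tell that the document is unfinished without any other context.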

The sender then asks again: what documents are not complete.

The receiver responds again: 
Document 7135612A-CE48-4371-ABFC-F8EF70DF76CF is not complete… we only have one block of 12. 

The sender sends block 2… and on it goes until the last block is sent.  At this point, the receiver gets the final block, appends the last set of data to the database, and marks the document as complete.  The next time the sender asks “what is not complete?” the receiver responds “everything is complete.”

The loop terminates.
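The whole conversation, including picking up again after an interruption, can be sketched with an in-memory stand-in for the receiver. Everything here is illustrative: the `Receiver` class, `send_until`, and the `max_messages` cut-off (which simulates a lost connection) are assumptions, and the real exchange would go over SOAP rather than direct method calls.

```python
BLOCK_SIZE = 50_000


class Receiver:
    """In-memory stand-in for the receiving application and its database."""

    def __init__(self):
        self.docs = {}  # doc_id -> {"total": int, "blocks": [bytes, ...]}

    def register(self, doc_id, total_blocks):
        """Create an incomplete document unless it is already known."""
        self.docs.setdefault(doc_id, {"total": total_blocks, "blocks": []})

    def next_incomplete(self):
        """Answer 'what is not complete?' with (doc_id, blocks_received) or None."""
        for doc_id, doc in self.docs.items():
            if len(doc["blocks"]) < doc["total"]:
                return doc_id, len(doc["blocks"])
        return None

    def accept_block(self, doc_id, block_number, payload):
        doc = self.docs[doc_id]
        # Only store the block we are waiting for; anything else is ignored.
        if block_number == len(doc["blocks"]) + 1:
            doc["blocks"].append(payload)

    def content(self, doc_id):
        return b"".join(self.docs[doc_id]["blocks"])


def send_until(receiver, files, max_messages=None):
    """Ask, send one block, repeat. Returns the number of blocks sent."""
    sent = 0
    while max_messages is None or sent < max_messages:
        status = receiver.next_incomplete()
        if status is None:  # "everything is complete"
            break
        doc_id, have = status
        start = have * BLOCK_SIZE
        receiver.accept_block(doc_id, have + 1,
                              files[doc_id][start:start + BLOCK_SIZE])
        sent += 1
    return sent


doc_id = "7135612A-CE48-4371-ABFC-F8EF70DF76CF"
files = {doc_id: bytes(i % 251 for i in range(575_000))}  # 12 blocks

receiver = Receiver()
receiver.register(doc_id, total_blocks=12)

first_run = send_until(receiver, files, max_messages=5)  # connection "lost"
second_run = send_until(receiver, files)                 # sender comes back
print(first_run, second_run, receiver.content(doc_id) == files[doc_id])
```

The sender keeps no record of its own progress between the two runs; the second run simply continues from whatever the receiver says it has.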

The motivation for doing block-oriented data transfer this way

Certainly, we could use FTP or some other mechanism for file transfer.  This method, though, has some characteristics that are interesting.  First off, this protocol is stateless.  That means that, at any time, the sender could stop asking about the status of documents on the receiver side, and nothing is lost.  The sender can go offline, or go to sleep, or lose connectivity, and nothing bad happens.

Secondly, because the block sizes are relatively small, SOAP doesn’t time out.  We can handle extraordinarily large files this way (theoretically in the terabyte range).

Thirdly, the sender doesn’t have to know much about the receiver.  It doesn’t have to know if the document set already exists in the database on the receiver side, because the header data is sent with every block.  Therefore, no Commands are being sent.  (See my previous blog on “commandless” documents).

Pros and Cons (updated)

At the time of my first posting, this idea was being floated to our development team.  There are pros and cons to this solution that I can discuss in more detail now.

The advantage of this model is that the receiving side is not getting any data that it doesn’t want or know what to do with.  The sending side asks “what do you need,” and the receiving side responds with “file X Block 10”.  However, this is still a communication protocol.  If the sending side decides not to ask, the receiving side has no option but to leave the content of its database incomplete. 

This is a problem for two reasons: (a) it is counter-intuitive, and therefore hard to explain to business users and the development team alike (as I have discovered), and (b) we have mixed the details of data transmission with the details of data representation.  I hadn’t thought carefully about this when I first wrote it but, in hindsight, it’s a bad idea.

An SOA transaction should be complete, self-describing, and self-contained.  The process above saves us from sending the same bits more than once over a wire.  That’s its biggest advantage.  But that’s not our biggest cost.  All of the wires that I care about, in my application, are owned by my company, and they are utilized at a fairly low rate.  Therefore, we don’t save any measurable dollars by making the data transfer process efficient.

On the other hand, if we separate the data transmission from the data representation, then we can test each separately.  I can test data transmission of a 20 GB file by transmitting any 20 GB file and comparing the results with the original.  I can test data representation by creating the business document on one end and copying it to the other using sneaker-net (walking it over) from one dev machine to another.  This test isolation is important for reducing complexity, and that will save measurable dollars… real money from my bottom line.

The forces that led us to SOA still exist: we want to decouple the sides from each other and we must transfer these large transactions over HTTP or HTTPS connections.

The new interim solution

We decided to separate the data transmission from the data representation.  Therefore, we will create an envelope schema that simply provides a transaction id, the current block number, the total number of blocks, and a data field.

So a transmission could look like this:

<Transmission id="39B2A4DD-AD68-4ae9-AA68-FCC6A48A0FFA">
   <Block totalblocks="12" thisblock="1" size="50000">
FZGl0OzI0NTk2MDs+Pjs+Ozs+O3…a really long string of base64 characters …
   </Block>
</Transmission>

What goes in the Block?  A base-64 encoded form of the entire business transaction itself (possibly compressed).

The receiving side will collect all the blocks, assemble the actual stream, decode it, and load it into an XML object.  From that, we can extract embedded documents.
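A sketch of that envelope round trip, under a few assumptions that the article does not spell out: the whole transaction is Base64-encoded first and the encoded text is then cut into blocks, `size` counts encoded characters, and the embedded document is stored as Base64 text inside its `Document` element. The function names are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def to_transmissions(transaction_xml, transmission_id, block_size=50_000):
    """Base64-encode the whole business transaction and cut it into envelopes."""
    encoded = base64.b64encode(transaction_xml.encode("utf-8")).decode("ascii")
    pieces = [encoded[i:i + block_size]
              for i in range(0, len(encoded), block_size)]
    messages = []
    for number, piece in enumerate(pieces, start=1):
        root = ET.Element("Transmission", id=transmission_id)
        block = ET.SubElement(root, "Block", totalblocks=str(len(pieces)),
                              thisblock=str(number), size=str(len(piece)))
        block.text = piece
        messages.append(ET.tostring(root, encoding="unicode"))
    return messages


def from_transmissions(messages):
    """Collect the blocks, assemble the stream, decode it, and parse the XML."""
    blocks, total = {}, None
    for message in messages:
        block = ET.fromstring(message).find("Block")
        total = int(block.get("totalblocks"))
        blocks[int(block.get("thisblock"))] = block.text or ""
    if total is None or sorted(blocks) != list(range(1, total + 1)):
        raise ValueError("transmission is incomplete")
    encoded = "".join(blocks[n] for n in range(1, total + 1))
    return ET.fromstring(base64.b64decode(encoded).decode("utf-8"))


document = b"%PDF-1.4 pretend this is a scanned page " * 20
transaction = (
    '<MyDocumentSet id="849C751C-FF5C-4438-A3F0-055B9EE786E3"><Contents>'
    '<Document id="7135612A-CE48-4371-ABFC-F8EF70DF76CF" name="MyDocument4.pdf">'
    + base64.b64encode(document).decode("ascii")
    + "</Document></Contents></MyDocumentSet>"
)

# A tiny block size so that this small example still needs several blocks.
messages = to_transmissions(transaction, "39B2A4DD-AD68-4ae9-AA68-FCC6A48A0FFA",
                            block_size=200)
root = from_transmissions(reversed(messages))  # arrival order does not matter
recovered = base64.b64decode(root.find("./Contents/Document").text)
print(len(messages), "blocks; document recovered:", recovered == document)
```

Note that the envelope code never looks inside the transaction, which is what lets transmission be tested with any file at all.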

This data is not optimized for transmission

We get a lot of data inefficiency in the data format here.  If we haven’t thought seriously about compression before, it’s starting to become important now.  Here’s why:

Say the uploaded document, in PDF form, is a page of text.  Notepad would represent it as about 1K.  In PDF, it would be about 5K because PDF includes things like fonts and formatting.  That’s fine.

In our business document, that 5K becomes 6.7K, because, in our business document, we are embedding it in Base64 text.  Base64 is a format that represents three bytes (24 bits) as four characters of six bits each (24 bits).  Add about 2K of header information (to make our document complete) and the business transaction size hits 8.7K.  At this point, we take that 8.7K transaction and encode it, again, as Base64 for the sake of block transfer.  We now get 11.6K.

Our PDF went from 5K to 11.6K.  That’s more than double its original size, and that’s assuming UTF-8 encoding in the XML.  If we go with UTF-16 encoding, the XML files can hit 20K.
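The arithmetic is easy to check. The sketch below assumes 1K means 1,000 bytes and one byte per character (UTF-8 with ASCII content); `base64_length` is a helper introduced here for illustration.

```python
import base64


def base64_length(n_bytes):
    """Characters needed to Base64-encode n_bytes bytes, including padding."""
    return 4 * ((n_bytes + 2) // 3)


pdf_size = 5_000      # the 5K PDF
header_size = 2_000   # about 2K of header information

embedded = base64_length(pdf_size)         # PDF embedded in the transaction
transaction = embedded + header_size       # the complete business transaction
transmitted = base64_length(transaction)   # encoded again for block transfer

# Cross-check the formula against the real encoder.
assert embedded == len(base64.b64encode(bytes(pdf_size)))

print(embedded, transaction, transmitted)        # 6668 8668 11560
print(round(transmitted / pdf_size, 2))          # 2.31
```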

On the other hand, if we compress just before we pack the data into blocks for transmission, we can take that 8.7K document and compress it down to just over 5K.  (Even though it is character data, it is not going to compress further, because the underlying content is effectively randomized, which removes the advantage of compression.)  We then take that 5K document and encode it in Base64, and we go back up to 6.7K.  Now, that is efficient for data transmission.

The receiving side has to decompress, of course, but this may be worth it.
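A quick experiment along these lines can be run with `zlib`. It is a sketch, not a measurement of the real system: the 5K “PDF” is random bytes standing in for already-dense content, and the roughly 2K header is a made-up run of repetitive `Field` elements, which compresses better than a real header probably would.

```python
import base64
import random
import zlib

rng = random.Random(2004)
# Stand-in for a 5K PDF whose content is effectively incompressible.
pdf = bytes(rng.getrandbits(8) for _ in range(5_000))

# Hypothetical header of roughly 2K.
header = "".join(f'<Field name="field{i}" value="value{i}"/>' for i in range(50))
transaction = (
    "<MyDocumentSet>" + header + "<Document>"
    + base64.b64encode(pdf).decode("ascii")
    + "</Document></MyDocumentSet>"
).encode("utf-8")

# Option 1: Base64 the transaction as-is for block transfer.
double_encoded = base64.b64encode(transaction)

# Option 2: compress first, then Base64 the compressed bytes.
compressed = zlib.compress(transaction, 9)
compressed_encoded = base64.b64encode(compressed)

# The receiving side reverses the steps: decode, then decompress.
restored = zlib.decompress(base64.b64decode(compressed_encoded))

print("transaction:          ", len(transaction))
print("Base64 only:          ", len(double_encoded))
print("compressed:           ", len(compressed))
print("compressed then Base64:", len(compressed_encoded))
print("round trip ok:        ", restored == transaction)
```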

Conclusion

After reviewing the initial proposal to embed the data transmission mechanism directly into the data representation structure, we rejected the idea in favor of a mechanism that wraps the data representation structure with a data transmission structure.  This allows us to test data transmission separately from data representation.  It also allows us to stick to the original idea of keeping all of the business data together in a single business transaction, regardless of how large it grows to be.

November 1st, 2004 | Enterprise Architecture | 4 Comments

About the Author:

President of Vanguard EA, an Enterprise Architecture consulting firm in Seattle focused on the Pacific coast of the US. Nick has over 30 years of professional experience in management, systems, and technology. He is the co-author of the influential paper "Perspectives on Enterprise Architecture" with Dr. Brian Cameron that effectively defined modern Enterprise Architecture practices, and he is a frequent speaker at public gatherings on Enterprise Architecture and related topics. He coauthored a book on Visual Storytelling with Martin Sykes and Mark West titled "Stories That Move Mountains".

4 Comments

  1. Mark November 2, 2004 at 6:51 am

    Actually, a solution that addresses these kinds of problems has already been implemented and put into production for exactly the problem domain you mention, e-filing. Go to http://www.irs.gov and search for "MeF" which stands for "Modernized e-Filing". A starting place to look is http://www.irs.gov/pub/irs-schema/4164_r2.zip and http://www.irs.gov/efile/article/0,,id=118575,00.html

    In short what they do is use a SOAP envelope as a data structure to create a manifest that describes what is in the rest of the return. The remainder of the data is either Xml (schemas are available for the IRS Forms) or binary (some tax returns require binary attachments), where the Xml or binary data is transmitted as a Multi-Part Mime document.

    This allows the transmission of some VERY large documents. This system was designed with the transmission of VERY large corporate tax returns in mind. Imagine how large the tax return for a corporation like General Motors or even Microsoft is.

    I would be interested in hearing your thoughts on the architecture they designed. The PDF file in the zip document reference above has quite a bit of the detail, including addressing things like virus scanning. You do scan those large binary attachments for viruses right? 🙂

  2. Nick November 3, 2004 at 1:26 pm

    First off, I find the following statement to be borderline foolish:

    "The tax return may also include non-XML documents, known as “binary attachments”, submitted in PDF format. These attachments are included in the tax return as separate MIME parts rather than inside the ReturnData element." (section 2.1.6 Binary Attachments, first paragraph).

    This means that the return has to be sent as a single large document (like an e-mail with MIME parts) rather than as a resumable conversation.

    All of the "large file transfer" happens outside of service oriented architecture. What we have here is a FORMAT that utilizes XML and MIME but not an ARCHITECTURE that utilizes SOA.

    To wit: Section 2.2 of the same document states: "It is important to note that a MeF tax return instance includes all XML documents (forms, attachments, and binary attachments) that make up the tax return."

    This is NOT what I said in my blog entry. My algorithm allows for documents 1 through 10 to be submitted on Monday, document 7 to be replaced on Wednesday, and documents 11 through 14 to be submitted on Friday, with document 13 superseded with a blank form (auditable delete) on Saturday.

    The format defined in the named IRS document does not define the conversation, and does not allow parts of the submission to replace one another. There are no questions that the sender can ask the receiver (in my algorithm, there are). In addition, if the data transmission is interrupted at any point along the way, even 10 bytes from the end, the entire transmission is trash. In my mechanism, if you have 100 documents to submit, and you submit 30 before the line is lost, you pick back up with the next 70, not the full 100 documents that the IRS format requires.

  3. Dennis November 5, 2004 at 11:31 am

    Your article does not address "updates". Don’t you want to send version or last-change-timestamp info so that docs that already exist on the receiver end get updated (i.e. flagged as incomplete)?

  4. Nick November 6, 2004 at 9:30 am

    Dennis,

    I’m not sure what you mean by "does not address updates". The article above states:

    "Normally, in the SOA world, if a business transaction is updated, we could simply replace the entire transaction with entirely new values."

    You may be interested in my prior article on using SOA for database replication. This article is tangential to the previous one, which describes the use of "dirty flags" in the database to support the replication of changes from one place to another.
