Not a web control. I’m talking about the notion of applying grid computing to large-scale distributed data provisioning. I’d like to suggest a pattern and see if anyone can tell me whether a product provides this, or whether it is described elsewhere. I’d like to buy it.
The data grid is not a new concept (see http://www.gemstone.com/solutions/gridcomputing.php and http://www.gigaspaces.com/pr_ce.html ). The idea is to deliver data very quickly across a distributed infrastructure. This is useful for grid computing applications that allow massively parallel execution in a simplified environment, where data bottlenecks can starve your virtual supercomputer and completely screw up your ability to deliver.
One assumption of the data grid is that memory is large enough to store and retrieve the data. What if there is a LOT of data… Gigabytes. What if it is not feasible to keep it in memory? For example, if someone were to query the Microsoft customer database and ask for all customers in Kansas, they’d get millions of records. Simply creating an infrastructure that can respond to such a request prevents us from relying on an in-memory structure… but only for requests for very large amounts of data.
Requests for fairly small amounts of data can easily be served from memory.
Therefore, small data domains should be served from memory. They can be preloaded and made ready by distributing the domains to many servers “in the cloud.” However, the in-memory data grid is not enough.
Let’s assume that I have customers all over the world, and I need to deliver gigabytes of fresh, real-time data to all of them. The source systems can live anywhere. The consuming systems should not need to know where. Data grids are good, but they don’t cover the need for large data stores.
I need to combine the data replication of RDBMS systems with the speed and distributed nature of the data grid. Add to that: I’d prefer for it to be event driven (although I can write an event adapter for a source system that cannot, in and of itself, generate events).
So the notion works like this: I distribute, around the world, a set of database servers, highly redundant and reliable. On top of them, I place data grid servers (one, two, twenty, whatever). That creates a data grid cluster. I put in a directory service that allows an app to start up anywhere and find the nearest data grid cluster.
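To make the directory lookup concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names GridCluster, Directory and nearest are hypothetical, and "nearest" is approximated by great-circle distance, where a real directory service would more likely measure network latency.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class GridCluster:
    """A data grid cluster and its rough location (hypothetical model)."""
    name: str
    lat: float
    lon: float


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


class Directory:
    """Registry that an app queries at startup to find a cluster."""

    def __init__(self):
        self._clusters = []

    def register(self, cluster):
        self._clusters.append(cluster)

    def nearest(self, lat, lon):
        """Return the registered cluster closest to the caller."""
        if not self._clusters:
            raise LookupError("no data grid clusters registered")
        return min(
            self._clusters,
            key=lambda c: distance_km(lat, lon, c.lat, c.lon),
        )


directory = Directory()
directory.register(GridCluster("seattle", 47.61, -122.33))
directory.register(GridCluster("london", 51.51, -0.13))
directory.register(GridCluster("singapore", 1.35, 103.82))

# An app starting up in Paris asks the directory where to connect.
paris_cluster = directory.nearest(48.86, 2.35)
```

The point of the sketch is only that the app supplies where it is and never names a source system or a server.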
When a source application creates a new data element, it sends an event to the nearest data grid cluster informing it of the primary keys and some base data. That element is replicated around the world, first to memory and then to persistent storage. Depending on policies, the grid clusters can request full data for the data item from the source system, or they can wait until full data is requested by an app.
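The event flow can be sketched roughly as follows. This is a toy model under stated assumptions, not a description of any product: the class and method names are hypothetical, plain dictionaries stand in for both the memory cache and the database underneath it, fetch_full is an assumed callback into the source system, and the eager flag stands in for the policy choice between requesting full data immediately and waiting until an app asks for it.

```python
class GridCluster:
    """Toy data grid cluster: an in-memory cache over a persistent store."""

    def __init__(self, name, fetch_full, eager=False):
        self.name = name
        self.fetch_full = fetch_full  # assumed callback to the source system
        self.eager = eager            # policy: pull full data on every event?
        self.memory = {}              # in-memory grid
        self.store = {}               # stands in for the database underneath
        self.complete = set()         # keys for which we hold full data
        self.peers = []               # other clusters around the world

    def _save(self, key, data, is_full):
        # Write to memory first, then to persistent storage.
        self.memory[key] = data
        self.store[key] = data
        if is_full:
            self.complete.add(key)
        else:
            self.complete.discard(key)

    def on_event(self, key, base, forward=True):
        """Handle a 'new data element' event carrying keys and base data."""
        if self.eager:
            self._save(key, self.fetch_full(key), True)
        else:
            self._save(key, dict(base), False)
        if forward:
            # Replicate the event itself; each peer applies its own policy.
            for peer in self.peers:
                peer.on_event(key, base, forward=False)

    def get(self, key):
        """Serve an app request, fetching full data lazily if needed."""
        if key not in self.store:
            raise KeyError(key)
        if key not in self.complete:
            self._save(key, self.fetch_full(key), True)
        return self.memory[key]
```

A source application would call on_event on its nearest cluster only. The peers receive the same event, and each one decides for itself when to go back to the source for the full record.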
The local grid cluster is highly redundant and persistent. Members all contribute memory to storing different data elements, but all of the data is stored in persistent storage as well. That way, if a data request needs a large amount of data, the data grid can force an ETL process between its persistent data store and the requestor’s database system, potentially moving millions of rows of data without having to package each row in an XML transaction, send it across the wire, and interpret it into a local database. This Database-Refresh style request is what really differentiates this pattern from a ‘standard’ Data Grid.
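Here is a rough sketch of how a cluster might choose between the two styles of request. It is built on assumptions throughout: sqlite3 stands in for both the grid's persistent store and the requestor's database, the customers table and the ROW_LIMIT threshold are invented for the example, and the small path reads from the store rather than from grid memory to keep the code short. A real deployment would presumably use the database vendor's bulk-copy or replication tooling for the large path.

```python
import sqlite3

ROW_LIMIT = 1000  # hypothetical cut-off between "small" and "bulk" requests
BATCH_SIZE = 500  # rows copied per round trip on the bulk path


def make_db(rows=()):
    """Create an in-memory database with a customers table (illustrative)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
    )
    db.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def serve(grid_db, requester_db, state, row_limit=ROW_LIMIT):
    """Return rows directly when few; otherwise refresh the requester's DB."""
    count = grid_db.execute(
        "SELECT COUNT(*) FROM customers WHERE state = ?", (state,)
    ).fetchone()[0]

    query = "SELECT id, name, state FROM customers WHERE state = ? ORDER BY id"
    if count <= row_limit:
        # Small result: hand the rows straight back to the caller.
        return "rows", grid_db.execute(query, (state,)).fetchall()

    # Large result: copy database to database in batches, with no per-row
    # message packaging on the way.
    cursor = grid_db.execute(query, (state,))
    while True:
        batch = cursor.fetchmany(BATCH_SIZE)
        if not batch:
            break
        requester_db.executemany(
            "INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", batch
        )
    requester_db.commit()
    return "bulk", count
```

The caller gets either the rows themselves or a note that its own database has been refreshed with the stated number of rows, which it can then query locally.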
OK. I want one. I’m working to understand and define this, and figure out how much of this is out of the box with SQL Server 2005.
ROI: every app gets rapid access to millions of rows of data, worldwide, without needing to know the source for the data, or the parameters of the source system’s ability to feed that data to the data cache infrastructure. Basically, every app loses its data access layer “into the cloud.”
Do you know of a product that provides a persistent, distributed, in-memory data cache based on event-driven data propagation models, preferably one that uses canonical schemas and makes its data available over web services?