Andrzej Parkitny is an Enterprise Integration Architect. Andrzej has fourteen years of software development and software architecture experience. He has designed and developed solutions in the health insurance, finance, engineering, retail, telecommunications, and healthcare industries. He holds an Honours Bachelor of Science in Software Engineering (Computer Science) from the University of Toronto. He has extensive experience in Service Oriented Architecture, Application Architecture, Solution Architecture, Software Development and Data Architecture. His expertise in SOA is supported by a SOACP certificate (Certified SOA Architect).
Modernizing Data Access in the Enterprise
Published: November 21, 2013 • Service Technology Magazine Issue LXXVIII
Abstract: Within a large organization, we face the challenge of managing a large amount of data and providing read/write access to that data to multiple applications in the enterprise. In addition, some data sources may be exposed through service interfaces while other data sources are accessed directly by applications. As good IT integrators, we should avoid tight coupling whenever possible, and when we identify tight coupling, we should recommend alternatives to the stakeholders who own and access the data. We should consider a number of architectural patterns and approaches, which include, but are not limited to, SOAP Patterns (Legacy Wrapper, Service Façade, etc.) [REF-2], RESTful Service Patterns (Reusable Contract, Lightweight Endpoint, Entity Linking, etc.) [REF-3], and, most importantly from an implementation perspective, data grid technologies (such as ORACLE Coherence [REF-4], IBM WebSphere eXtreme Scale [REF-8], and JBoss Infinispan [REF-9]). This article explores how data grid technologies and Service Oriented Architecture patterns can help IT integrators provide data resilience and availability in the enterprise.
Introduction: What are your organization's data access needs?
As software integrators, we face many challenges in fulfilling the needs of our systems' stakeholders. A common challenge in today's enterprise is the management and provisioning of data en masse to front-end applications. To ensure that our systems are future-proof and intrinsically interoperable, we should strongly consider using Service Oriented Architecture and adhering to the principles that are central to SOA.
Depending on the requirements from our business stakeholders, we may choose to expose SOAP-based services or RESTful services. However, we should also consider which systems and data stores we wish to expose and how we choose to expose them. Traditionally, organizations have persisted their operational data in database technologies that implement highly normalized entity relationship (ER) schemas. Examples of database technologies are ORACLE Database, IBM DB2, Sybase Database, and MySQL. For organizations that have a relatively small set of data to manage, a relational database is sufficient. However, as the data set that an organization maintains grows, scalability becomes a concern. Architectural decisions regarding database scalability may include scaling up (adding more resources, such as disk space, to existing database servers) and/or scaling out (adding more database nodes to a database cluster). In either case, we may still be faced with performance and reliability issues in the database cluster. From a logical/physical design perspective, performance may be affected by the way in which the data entities are designed and related to one another, especially in the case of highly normalized schemas.
Consider doing a join between four logical entities across Customer, Product Instance, Product, and Product Feature to get the full view of features a customer has (see Figure 1 below for the logical structure).
Figure 1 – A simple entity relationship representation of a customer that has one to many addresses (current and old) and one to many products (represented as product instances); each product has one to many features.
If your data set is relatively small, the latency of the query against a relational database is negligible. However, as the data set grows, the latency of such a query increases. One reason is that database technologies are implemented on servers that persist the data on physical hard disks. The speed at which a data record is accessed is constrained by the read speed of the hard disk head on the hard disk platter(s). If the tables reside on different parts of the disk, or even on different disks altogether, that also impacts the read speed.
How can we solve this problem? One way is to use data grid technology. What is data grid technology? It is a technology that uses in-memory (ephemeral) data storage and processing as a strategy to mitigate the risks and challenges found in relational data stores. There are a number of data grid technologies available. These include ORACLE Coherence, IBM eXtreme Scale and the open source project Infinispan, amongst others. Marshall et al. provide good definitions of a grid and a data grid, respectively:
"A grid in general is a form of loosely coupled and heterogeneous computers that act together to run large tasks. To accomplish this task, a grid must be highly scalable. There are several different forms of grid, depending on the task at hand." [REF-8]
"A data grid focuses on the provisioning and access of information in a grid manner, that is, using a large number of loosely-coupled cooperative caches to store data." [REF-8]
In the context of the problem we are discussing, that is, data read reliability and performance, data grid technology can help us increase the performance and reliability of queries for data. Why? Firstly, data is stored in RAM (Random-Access Memory) in a grid cluster, which eliminates the latency of accessing data on a physical hard disk. Secondly, the data is logically organized in caches (which may be related to one another logically as a hierarchy) rather than in relational structures, as in the case of relational databases. Regardless of where the data resides in a data grid cluster, the index of a given record points directly to memory addresses in RAM (managed by the data grid technology framework) and not to an address on a physical disk. There may be latency issues if a given node resides in a separate geographic location from the query management node, but this would be an issue in the case of relational databases as well, by virtue of the network topology. This is where we can employ specific network topology strategies for placing master and replicated data (as it becomes necessary in the solutions we implement).
Data Grid Cache Concept – Maps, Named Cache and Serialization
Depending on which technology implementation you choose for your data grid solution, there are a number of concepts that are important to review. Objects in a grid cache are managed as map instances of key-value pairs. Java developers who are familiar with java.util.Map [REF-6] will be comfortable with the concept of a map object in a data grid, since the object management technology within a data grid follows the same paradigm. Essentially, the data grid interfaces are a Map-like API to get data from the cached objects and put data into the cached objects. These objects are serializable; from a simplistic point of view (if implementing/configuring the data grid instance using Java), these objects have to implement the Serializable interface in Java (public interface Serializable [REF-7]). Note that data grid technologies also provide APIs for alternatives to "Serializable" that have been optimized for performance.
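As a minimal sketch of the Map-like access described above, the following plain-Java example uses a ConcurrentHashMap as a stand-in for a named cache and round-trips a value through standard Java serialization. The class NamedCacheSketch, the Customer fields, and the cache variable are hypothetical names chosen for illustration; a real data grid product supplies its own cache type and, typically, its own optimized serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NamedCacheSketch {
    // Hypothetical cached value; the field names are illustrative only.
    static class Customer implements Serializable {
        private static final long serialVersionUID = 1L;
        final String customerId;
        final String primaryEmail;

        Customer(String customerId, String primaryEmail) {
            this.customerId = customerId;
            this.primaryEmail = primaryEmail;
        }
    }

    // Stand-in for a named cache; a real grid hands back its own Map-like type.
    static final Map<String, Customer> CUSTOMER_CACHE = new ConcurrentHashMap<>();

    // Serializes and deserializes a value, roughly what has to happen before
    // an entry can be shipped to another node.
    static Customer roundTrip(Customer original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Customer) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        CUSTOMER_CACHE.put("C-1", new Customer("C-1", "pat@example.com")); // Map.put
        Customer fromCache = CUSTOMER_CACHE.get("C-1");                    // Map.get
        Customer copy = roundTrip(fromCache);
        System.out.println(copy.customerId + " -> " + copy.primaryEmail);
    }
}
```

The application code only ever sees put and get; where the entry physically lives is left to the grid.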
The Backing Map [REF-4] is the mechanism by which data is stored and maintained within the data grid. For optimized performance it is best to store data on the heap (in RAM). Data can also be stored off the heap, but doing so gives up some of the reduced latency that you gain by storing data on the heap. This is managed by the data grid technology and is abstracted from the application utilizing the data.
If we go back to the data model we discussed in Figure 1, we can see that the entities are represented in 3rd Normal Form (3NF). A Customer entity can have one or more Address entities associated with it, a Customer entity can have one or more Product Instance entities associated with it, and so on. Data is represented slightly differently in a data grid. Figure 2 shows one way the data from Figure 1 could be configured in a data grid. Notice that the foreign keys from the 3NF model can be represented as explicit lists of keys to other named caches / map objects. For instance, Customer has AddressIDList, which contains every key that maps to an Address named cache instance related to a specific Customer instance. Also notice that the Product and Product Instance entities from the 3NF model are flattened out into one Product Instance named cache / map object in the data grid model.
Some things to consider when designing a logical data model for a data grid implementation:
Figure 2 – A simple named cache / map object representation of a customer that has one to many addresses (current and old) and one to many products (represented as product instances); each product has one to many features. Notice that the product instance and product entities are flattened out into one named cache / map object. Also notice that foreign key relationships are realized through id lists and that relationships back to parent objects (from their respective child objects) are realized through backpointers.
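The data grid model described above can be sketched in plain Java as one map per named cache, with id lists standing in for foreign keys and a parent id field as the backpointer. Apart from AddressIDList, which the text names, every class, field, and sample value here is an assumption made for illustration, and HashMap is only a stand-in for a grid cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GridModelSketch {
    static class Customer {
        final String customerId;
        final List<String> addressIdList;         // keys into the Address cache
        final List<String> productInstanceIdList; // keys into the ProductInstance cache
        Customer(String customerId, List<String> addressIds, List<String> productInstanceIds) {
            this.customerId = customerId;
            this.addressIdList = addressIds;
            this.productInstanceIdList = productInstanceIds;
        }
    }

    static class Address {
        final String addressId;
        final String customerId; // backpointer to the parent Customer
        Address(String addressId, String customerId) {
            this.addressId = addressId;
            this.customerId = customerId;
        }
    }

    // Product and Product Instance from the 3NF model, flattened into one object.
    static class ProductInstance {
        final String productInstanceId;
        final String customerId;          // backpointer to the parent Customer
        final String productName;         // attribute carried over from Product
        final List<String> featureIdList; // keys into the ProductFeature cache
        ProductInstance(String id, String customerId, String productName, List<String> featureIds) {
            this.productInstanceId = id;
            this.customerId = customerId;
            this.productName = productName;
            this.featureIdList = featureIds;
        }
    }

    static class ProductFeature {
        final String featureId;
        final String productInstanceId; // backpointer to the parent ProductInstance
        final String name;
        ProductFeature(String featureId, String productInstanceId, String name) {
            this.featureId = featureId;
            this.productInstanceId = productInstanceId;
            this.name = name;
        }
    }

    // One map per named cache.
    final Map<String, Customer> customers = new HashMap<>();
    final Map<String, Address> addresses = new HashMap<>();
    final Map<String, ProductInstance> productInstances = new HashMap<>();
    final Map<String, ProductFeature> productFeatures = new HashMap<>();

    // Key lookups replace the four-table join from the relational model.
    List<String> featureNamesFor(String customerId) {
        List<String> names = new ArrayList<>();
        Customer customer = customers.get(customerId);
        if (customer == null) {
            return names;
        }
        for (String instanceId : customer.productInstanceIdList) {
            ProductInstance instance = productInstances.get(instanceId);
            if (instance == null) {
                continue;
            }
            for (String featureId : instance.featureIdList) {
                ProductFeature feature = productFeatures.get(featureId);
                if (feature != null) {
                    names.add(instance.productName + ":" + feature.name);
                }
            }
        }
        return names;
    }

    static GridModelSketch sample() {
        GridModelSketch grid = new GridModelSketch();
        grid.customers.put("C-1", new Customer("C-1", Arrays.asList("A-1"), Arrays.asList("PI-1")));
        grid.addresses.put("A-1", new Address("A-1", "C-1"));
        grid.productInstances.put("PI-1",
            new ProductInstance("PI-1", "C-1", "Mobile Plan", Arrays.asList("F-1", "F-2")));
        grid.productFeatures.put("F-1", new ProductFeature("F-1", "PI-1", "Voicemail"));
        grid.productFeatures.put("F-2", new ProductFeature("F-2", "PI-1", "Roaming"));
        return grid;
    }

    public static void main(String[] args) {
        System.out.println(sample().featureNamesFor("C-1"));
    }
}
```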
Data Grid High Level Configuration Overview
Data grids are designed in such a way that the component that provides the capability to access data in the cache abstracts the internal implementation of managing the data grid cluster. The data grid cluster can consist of two or more nodes. Figure 3 represents a conceptual overview of data grid technology. A data grid is configured to have multiple nodes where data can be stored (in memory). This data is highly available and reliable due to the characteristics of data grid technology. The example in Figure 3 shows that customer and address named cache / map objects reside on nodes one and two. In addition, product instance and product feature reside on nodes three and four. Data in the data grid can be accessed through a data grid catalog / query agent. Data redundancy is managed via the backing map agent in the data grid. Data is loaded into the data grid via a persistence agent. You will see in the data grid literature that loading data into the data grid can be referred to as "warming the cache". Data can be sourced from a relational database or another data source [REF-4].
The loading of data and the bi-directional update of data (from the source into the data grid, and from the data grid back to the source via write-through) are done via a persistence agent that is built upon Java Persistence API (JPA) technology. One can configure a data grid solution as a read-only implementation. That is, the consuming application(s) will read from the data grid any time they need data originating from the enterprise data stores, without being constrained by the latency limitations those data stores inherently have. Writes back to the persistence data stores can go directly from service interfaces to those data stores, or those service interfaces can be implemented to write through the data grid. The data grid would be configured to write back to the source data stores as per the design you choose to implement.
Reading from the data grid has several advantages. You are no longer constrained by the latency of the persistence data store. You do not have to code the aggregation of data from multiple persistence data stores in each application that uses that data; for example, if you have customer data in database A and other customer data in database B, where each stores customer data for mutually exclusive lines of business, the aggregation is done through the configuration of the data grid. And you do not have to implement complicated SQL queries in your data access code; instead, you use data grid specific filters that simplify data access through the data grid APIs, which go through the data grid catalog / query agent. Similarly, you may choose to write data through the data grid, which will update named cache / map objects in the data grid and also update source entities in the persistence data stores (based upon how you configure the write-through/write-back).
The backing map agent technology within the data grid provides the capability of synchronizing the named cache / map objects within the data grid, as well as with the persistence data stores, through the persistence agent (e.g. ORACLE Coherence JPA and TopLink [REF-5]). The persistence agent recognizes changes by examining the database trail files, and these changes are subsequently reflected in the data grid [REF-11].
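The loading and write-through behaviour described above can be illustrated with a small plain-Java sketch. It does not use JPA or any vendor API: the DataStore interface, the InMemoryStore class, and the readOnly flag are hypothetical stand-ins for the persistence agent, the source database, and a read-only grid configuration.

```java
import java.util.HashMap;
import java.util.Map;

class WriteThroughSketch {
    // Hypothetical view of the source data store as seen by a persistence agent.
    interface DataStore {
        String load(String key);
        void store(String key, String value);
    }

    // In-memory stand-in for a relational database; counts loads for illustration.
    static class InMemoryStore implements DataStore {
        final Map<String, String> rows = new HashMap<>();
        int loads = 0;

        public String load(String key) {
            loads++;
            return rows.get(key);
        }

        public void store(String key, String value) {
            rows.put(key, value);
        }
    }

    private final Map<String, String> cache = new HashMap<>();
    private final DataStore store;
    private final boolean readOnly;

    WriteThroughSketch(DataStore store, boolean readOnly) {
        this.store = store;
        this.readOnly = readOnly;
    }

    // "Warming the cache": copy selected keys from the source into memory up front.
    void warm(Iterable<String> keys) {
        for (String key : keys) {
            String value = store.load(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
    }

    // Reads are served from memory; the source is consulted only on a miss.
    String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = store.load(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Write-through: the source is updated first, then the cached copy.
    void put(String key, String value) {
        if (readOnly) {
            throw new UnsupportedOperationException("grid is configured as read-only");
        }
        store.store(key, value);
        cache.put(key, value);
    }

    public static void main(String[] args) {
        InMemoryStore db = new InMemoryStore();
        db.rows.put("C-1", "Pat");
        WriteThroughSketch grid = new WriteThroughSketch(db, false);
        grid.warm(java.util.Arrays.asList("C-1"));
        grid.put("C-2", "Sam");
        System.out.println(grid.get("C-1") + ", " + grid.get("C-2") + ", loads=" + db.loads);
    }
}
```

Writing to the source before the cache is one possible ordering, chosen here so that the cached copy never gets ahead of the system of record.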
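To illustrate the idea of filter-based access in place of SQL, the sketch below queries an in-memory map with a predicate. It uses java.util.function.Predicate purely as a stand-in; actual data grid products ship their own filter classes and query APIs, and the ProductInstance fields and sample data here are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class FilterQuerySketch {
    // Hypothetical cached value; field names are illustrative only.
    static class ProductInstance {
        final String id;
        final String customerId;
        final String productName;
        final boolean active;

        ProductInstance(String id, String customerId, String productName, boolean active) {
            this.id = id;
            this.customerId = customerId;
            this.productName = productName;
            this.active = active;
        }
    }

    // Returns every cached value that satisfies the filter.
    static <K, V> List<V> query(Map<K, V> cache, Predicate<V> filter) {
        List<V> hits = new ArrayList<>();
        for (V value : cache.values()) {
            if (filter.test(value)) {
                hits.add(value);
            }
        }
        return hits;
    }

    static Map<String, ProductInstance> sampleCache() {
        Map<String, ProductInstance> cache = new LinkedHashMap<>();
        cache.put("PI-1", new ProductInstance("PI-1", "C-1", "Mobile Plan", true));
        cache.put("PI-2", new ProductInstance("PI-2", "C-1", "Home Internet", false));
        cache.put("PI-3", new ProductInstance("PI-3", "C-2", "Mobile Plan", true));
        return cache;
    }

    public static void main(String[] args) {
        // The filter plays the role of a WHERE clause.
        Predicate<ProductInstance> activeForC1 = p -> p.customerId.equals("C-1") && p.active;
        for (ProductInstance p : query(sampleCache(), activeForC1)) {
            System.out.println(p.id + " " + p.productName);
        }
    }
}
```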
Figure 3 – Data Grid Conceptual Overview
We have discussed the fact that data is held in memory within the data grid. In our example outlined in Figure 3, we see that mutually exclusive data (named cache / map objects) can be distributed over a number of nodes. This is further outlined in Figure 4, where two strategies for distributing the data in the data grid are shown. We can fully replicate the data on each node or we can partition it as was shown in Figure 3. Full replication can be expensive in terms of memory utilization. Partitioning data and collocating related named caches on the same nodes is a strategy that will provide better performance, especially if you implement write-through. If you implement full replication, then any update will have to be reflected on each node. If you partition the data, then you only have to update the data on the nodes where it resides (in our example, Customer resides on only two nodes instead of four).
Figure 4 – Data distribution in the data grid – full replication and simple partitioning.
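The difference in update cost between the two strategies can be sketched as follows. The hash-based placement of a key on a primary node plus the next node(s) as backups is an assumption made for this example; the article's figures assign whole named caches to nodes, and real products use their own partitioning schemes. Nodes are numbered from 0 here.

```java
import java.util.ArrayList;
import java.util.List;

class DistributionSketch {
    // Full replication: every node holds a copy, so every node sees each update.
    static int writesUnderFullReplication(int nodeCount) {
        return nodeCount;
    }

    // Partitioning: a key lives on one primary node plus backupCount backups.
    // Placement by hash code is an illustrative assumption.
    static List<Integer> nodesHolding(Object key, int nodeCount, int backupCount) {
        if (nodeCount <= 0 || backupCount < 0 || backupCount >= nodeCount) {
            throw new IllegalArgumentException("need 0 <= backupCount < nodeCount");
        }
        int primary = Math.floorMod(key.hashCode(), nodeCount);
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i <= backupCount; i++) {
            nodes.add((primary + i) % nodeCount); // backups go on the following nodes
        }
        return nodes;
    }

    public static void main(String[] args) {
        int nodeCount = 4;
        System.out.println("replicated update touches "
            + writesUnderFullReplication(nodeCount) + " nodes");
        System.out.println("partitioned update touches "
            + nodesHolding("C-1", nodeCount, 1).size() + " nodes");
    }
}
```

With four nodes and one backup copy, a partitioned update touches two nodes rather than four, which matches the Customer example in the text.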
Reliability and Availability in the Data Grid through Clustering
Regardless of which data distribution strategy you choose to implement, you need to design the data grid cluster so that it can hold the maximum amount of data even when a node in the cluster fails. For instance, if we have four nodes in our cluster and we want to design for the risk scenario in which one node fails, we need to ensure that three nodes have the capacity to hold the data. So if we wish to hold 5 million customers and their associated product instances, product features and addresses, we should design our cluster so that three nodes are sufficient to hold that amount of data (in memory). As shown in Figure 5, when one node fails (node 2) at time T, the data grid will react by identifying which named cache / map objects no longer have redundancy (the lack of which lowers reliability) and populate a redundant copy on another node in the data grid at time T'. In Figure 5 this is node 4. This capability is referred to as Reliability in data grid technology literature.
Figure 5 – Data grid reliability.
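The capacity rule and the recovery step described above can be worked through in a short sketch. The per-entry size of 2,048 bytes, the single backup copy, and the rule of restoring the lost copy on the highest-numbered surviving node are all assumptions made for illustration; a real grid decides placement itself and memory sizing depends on the actual object graph.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ClusterReliabilitySketch {
    // Heap each node needs so that the data still fits after toleratedFailures nodes are lost.
    static long requiredBytesPerNode(long entries, long bytesPerEntry,
                                     int backupCopies, int nodes, int toleratedFailures) {
        int surviving = nodes - toleratedFailures;
        if (surviving <= 0) {
            throw new IllegalArgumentException("at least one node must survive");
        }
        long total = Math.multiplyExact(
            Math.multiplyExact(entries, bytesPerEntry), 1L + backupCopies);
        return (total + surviving - 1) / surviving; // round up
    }

    final Map<String, Set<Integer>> holders = new TreeMap<>(); // cache name -> nodes with a copy
    final TreeSet<Integer> liveNodes = new TreeSet<>();
    final int copies; // desired number of copies per named cache

    ClusterReliabilitySketch(int nodeCount, int copies) {
        for (int n = 1; n <= nodeCount; n++) {
            liveNodes.add(n);
        }
        this.copies = copies;
    }

    void place(String cacheName, Integer... nodes) {
        holders.put(cacheName, new TreeSet<>(Arrays.asList(nodes)));
    }

    // Drops the failed node, then restores redundancy for any cache that lost a copy.
    void failNode(int node) {
        liveNodes.remove(node);
        for (Set<Integer> copySet : holders.values()) {
            copySet.remove(node);
            // Target choice (highest-numbered survivor first) is arbitrary.
            for (Integer candidate : liveNodes.descendingSet()) {
                if (copySet.size() >= copies) {
                    break;
                }
                copySet.add(candidate);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(requiredBytesPerNode(5_000_000L, 2_048L, 1, 4, 1) + " bytes per node");
        ClusterReliabilitySketch cluster = new ClusterReliabilitySketch(4, 2);
        cluster.place("Customer", 1, 2);
        cluster.place("ProductInstance", 3, 4);
        cluster.failNode(2);
        System.out.println(cluster.holders);
    }
}
```

Under these assumptions, 5 million entries with one backup copy come to roughly 20.5 GB in total, so each of the three surviving nodes must be able to hold about 6.8 GB.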
Another capability that data grid technology provides is high Availability of the data. This means that (when the data grid is configured properly with respect to capacity planning) data is always available in the data grid, which follows from the data reliability characteristic, and that even if a data source (e.g. a database) fails, consumers of the data in a front-end application are insulated from that failure. Figure 6 shows a scenario that exemplifies this. Application consumers one and two access data from the data grid, specifically Customer data (from data grid nodes one and two, although which node the data comes from is abstracted from the consumers via the data grid catalog / query agent). Note that Customer data in the data grid is sourced from the Customer DB. If and when the Customer DB goes down, application consumers one and two can still fulfill the business process needs of the end users. Depending on how we design our overall solution, we can also implement queues behind the data grid to capture any transactions that are permitted while a backend database is down (asynchronous activity).
Figure 6 – Data availability through the data grid.
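The availability scenario above, including the optional queue behind the data grid, can be sketched as follows. The CustomerDb class with its up flag, the pending queue, and the replayPending method are hypothetical constructs for this example; a production design would use a durable queue and handle ordering and conflicts.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class AvailabilitySketch {
    // In-memory stand-in for the Customer DB; "up" simulates an outage.
    static class CustomerDb {
        final Map<String, String> rows = new HashMap<>();
        boolean up = true;

        void write(String key, String value) {
            if (!up) {
                throw new IllegalStateException("Customer DB is down");
            }
            rows.put(key, value);
        }
    }

    final Map<String, String> cache = new HashMap<>();
    final Queue<String[]> pending = new ArrayDeque<>(); // writes captured during an outage
    final CustomerDb db;

    AvailabilitySketch(CustomerDb db) {
        this.db = db;
    }

    // Reads never touch the database, so they keep working while it is down.
    String read(String key) {
        return cache.get(key);
    }

    // Writes update the grid immediately; the database write is queued if the DB is down.
    void write(String key, String value) {
        cache.put(key, value);
        if (db.up) {
            db.write(key, value);
        } else {
            pending.add(new String[] {key, value});
        }
    }

    // Replays queued writes in arrival order once the database is back.
    int replayPending() {
        int replayed = 0;
        while (db.up && !pending.isEmpty()) {
            String[] entry = pending.remove();
            db.write(entry[0], entry[1]);
            replayed++;
        }
        return replayed;
    }

    public static void main(String[] args) {
        CustomerDb db = new CustomerDb();
        AvailabilitySketch grid = new AvailabilitySketch(db);
        grid.write("C-1", "Pat");
        db.up = false;                      // outage begins
        System.out.println(grid.read("C-1"));
        grid.write("C-2", "Sam");           // captured for later
        db.up = true;                       // outage ends
        System.out.println("replayed " + grid.replayPending());
    }
}
```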
Another capability that you may want to consider, and that is available in some data grid technologies, is evolvability. That is, following from the characteristic of being highly available, we may want our data grid solution to be modifiable even while it is up and running. Let us consider the logical data model in Figure 2. The Customer named cache only has a PrimaryEmail. As is often the case, new business requirements may dictate changes to our data model. If an unforeseen business requirement arises, such as adding a secondary email address in the next phase of a portal implementation, we can implement our solution so that the named cache object is evolvable. This capability is available through ORACLE Coherence [REF-5].
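One general way to think about an evolvable cached object is that an older reader carries along any fields it does not understand and writes them back unchanged. The sketch below illustrates only that idea in plain Java; it is not the ORACLE Coherence API, and the CustomerV1 class, the futureData map, and the map-based wire format are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EvolvableCustomerSketch {
    // Version 1 of the object: it only knows about primaryEmail.
    static class CustomerV1 {
        String primaryEmail;
        // Fields written by newer versions that this version does not understand.
        final Map<String, String> futureData = new LinkedHashMap<>();

        // Reads a wire representation, keeping unknown fields instead of dropping them.
        static CustomerV1 read(Map<String, String> wire) {
            CustomerV1 customer = new CustomerV1();
            for (Map.Entry<String, String> field : wire.entrySet()) {
                if (field.getKey().equals("primaryEmail")) {
                    customer.primaryEmail = field.getValue();
                } else {
                    customer.futureData.put(field.getKey(), field.getValue());
                }
            }
            return customer;
        }

        // Writes known fields plus whatever unknown fields were carried along.
        Map<String, String> write() {
            Map<String, String> wire = new LinkedHashMap<>();
            wire.put("primaryEmail", primaryEmail);
            wire.putAll(futureData);
            return wire;
        }
    }

    public static void main(String[] args) {
        // An entry written by a newer application version that added secondaryEmail.
        Map<String, String> wire = new LinkedHashMap<>();
        wire.put("primaryEmail", "pat@example.com");
        wire.put("secondaryEmail", "pat@work.example.com");

        CustomerV1 seenByOldClient = CustomerV1.read(wire);
        seenByOldClient.primaryEmail = "pat@new.example.com";
        System.out.println(seenByOldClient.write()); // secondaryEmail survives the update
    }
}
```

With this approach, clients on the old and new versions of the object can share the same cache while a rollout is in progress.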
Exposing Data Through Services
When designing a data grid solution, consider exposing data from the data grid through SOAP-based services using the Service Façade and Legacy Wrapper patterns. Also look at other patterns that may be relevant to your specific solution; you can refer to Thomas Erl's SOA Design Patterns [REF-2] for further material.
Table 1 - Service Oriented Architecture Service Patterns, sourced from Thomas Erl's SOA Design Patterns [REF-2].
Figure 7 – Exposing the data grid through Service Oriented Architecture (SOA).
Although a data grid vendor solution may provide a RESTful service interface to data in the data grid, it may not necessarily meet your SOA governance needs and/or security needs. By applying SOA governance practices you may consider designing and implementing your own RESTful services that have managed service contracts within an Enterprise Service Registry.
When designing RESTful services for exposing data in the data grid, please consider RESTful service design patterns, such as Reusable Contract, Lightweight Endpoint and Entity Linking that may be relevant to your specific solution – you can refer to Thomas Erl's SOA with REST: Principles, Patterns & Constraints for Building Enterprise Solutions with REST [REF-3] for further material.
Data Transport Security
Now that we have reviewed some of the background for data grid technology and service oriented technology patterns (relevant for this article), let us consider how we can implement secure access to data in the enterprise. Data transport security can be achieved at the transport layer and also at the message layer. The design patterns that should be considered for service interaction in the enterprise are the Data Confidentiality pattern, the Data Origin Authentication pattern, Direct Authentication pattern and the Brokered Authentication pattern. These are built upon industry standards – XML-Encryption, XML-Signature, Canonical XML, WS-Security and the Security Assertion Markup Language (SAML). The following is an overview of the patterns that should be considered (as outlined by Thomas Erl in SOA Design Patterns [REF-2]):
Table 2 - Service Oriented Architecture Service Interaction Security Patterns, sourced from Thomas Erl's SOA Design Patterns [REF-2].
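The core idea behind the Data Origin Authentication pattern, namely that a consumer can verify who produced a message and that it was not altered, can be sketched with the standard Java security API. This example signs raw bytes with SHA256withRSA from the JDK; it is not an XML-Signature or WS-Security implementation, and the class name, method names, and sample message are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

class DataOriginAuthSketch {
    // Generates a throwaway RSA key pair; real services would load managed keys.
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("key generation failed", e);
        }
    }

    // The sender signs the message with its private key.
    static byte[] sign(String message, PrivateKey privateKey) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(privateKey);
            signer.update(message.getBytes(StandardCharsets.UTF_8));
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("signing failed", e);
        }
    }

    // The receiver checks the signature with the sender's public key.
    static boolean verify(String message, byte[] signature, PublicKey publicKey) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(publicKey);
            verifier.update(message.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("verification failed", e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        String message = "<customer id=\"C-1\"/>";
        byte[] signature = sign(message, keys.getPrivate());
        System.out.println("untampered: " + verify(message, signature, keys.getPublic()));
        System.out.println("tampered:   "
            + verify("<customer id=\"C-2\"/>", signature, keys.getPublic()));
    }
}
```

Data Confidentiality would additionally encrypt the payload; that step is left out here to keep the sketch focused on origin and integrity.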
Now that we have reviewed what data grid technology is at a conceptual level, I encourage you to download a developer version of a data grid technology in order for you to better understand how it can be used with respect to your specific interests. A logical approach is to:
As system integrators, we can help solve the challenge of providing data to front-end applications so that access to the data is reliable and highly available; the solution is to use data grid technology. We have identified that data grid technology provides reliable and highly available data access because data is maintained in memory (specifically in RAM) on nodes that are organized in a data grid cluster. We have also identified that data in the data grid is sourced from persistence data stores (databases) and that data changes in the persistence data stores can be reflected in the data grid through configuration and through agents that manage updates from the data stores.
We have also recognized that in order to ensure that our systems are future-proof and intrinsically interoperable, we should use Service Oriented Architecture and adhere to the principles that are central to SOA, including patterns that ensure that the exchange of data through message exchange patterns is secure and trusted. Specifically, this can be achieved by using the SOA Service Interaction Security Patterns: the Data Confidentiality pattern, the Data Origin Authentication pattern, the Direct Authentication pattern and the Brokered Authentication pattern.
For further information, take a look at the various data grid vendor solutions such as ORACLE Coherence, IBM eXtreme Scale, and Infinispan or others that fit your needs. Good luck in your data access modernization efforts!
Overview of Data Grid Technologies
Table 3 - This overview of data grid technologies is here to give you an idea of what is available in the data grid technologies discussed in this article – this is by no means an exhaustive analysis of the technology – for that, please go to the source material ([REF-4], [REF-8], [REF-9], and [REF-10]) and other supporting material for the technology you may be interested in for your organization and your specific solution(s).
[REF-1] Erl, Thomas. SOA Principles of Service Design. Toronto: Prentice Hall, 2008.
[REF-2] Erl, Thomas. SOA Design Patterns. Crawfordsville: Prentice Hall, 2009.
[REF-3] Erl, Thomas et al. SOA with REST: Principles, Patterns & Constraints for Building Enterprise Solutions with REST. Toronto: Prentice Hall, 2013.
[REF-4] Seovic, Aleksandar et al. Oracle Coherence 3.5. Birmingham: Packt Publishing, 2010.
[REF-5] ORACLE. Coherence, JPA, and TopLink Grid: Hands On Lab. (2013) http://www.oracle.com/technetwork/testcontent/coherence-jpa-lab-instructions-linu-133199.pdf
[REF-6] ORACLE. Javadoc for Map Interface. (2013) http://docs.oracle.com/javase/7/docs/api/java/util/Map.html
[REF-7] ORACLE. Javadoc for Serializable Interface. (2013) http://docs.oracle.com/javase/7/docs/api/java/io/Serializable.html
[REF-8] Marshall, Jonathan et al. WebSphere eXtreme Scale V8.6 Key Concepts and Usage Scenarios, sg247683. IBM Redbook (2013) http://www.redbooks.ibm.com/abstracts/SG247683.html?Open
[REF-9] Marchioni, Francesco and Surtani, Manik. Infinispan Data Grid Platform. Birmingham: Packt Publishing, 2012.
[REF-10] Infinispan Tutorial (2013) http://www.mastertheboss.com/infinispan/infinispan-tutorial-part-1 & http://www.mastertheboss.com/infinispan/infinispan-tutorial-part-2
[REF-11] Smith, Shaun. Oracle Coherence GoldenGate Adapter. (2013) http://coherence.oracle.com/download/attachments/13173077/Coherence+GoldenGate+Adapter.pdf?version=1&modificationDate=1331118682338
[REF-12] Red Hat. JGroups Website (2013) http://www.jgroups.org/