John deVadoss

Biography

John deVadoss leads the Patterns & Practices team at Microsoft. His responsibilities include platform and architecture strategy for the developer tools and the application platform. He has over 15 years of experience in the software industry; he has been at Microsoft for over 10 years, all of it in the enterprise space – as a consultant, as a program manager in the distributed applications platform division, as an architect working with some of Microsoft’s key partners, as director of architecture strategy, and most recently leading technical strategy for the application platform.

Prior to Microsoft he spent many years as a technology consultant in the financial services industry in Silicon Valley. His areas of interest lie broadly in distributed application architectures, data and metadata, systems management and, currently, edge architectures (both services and access), but most of all in creating business value from technology investments.

John has a BE in Computer Engineering, and an MS in Computer Science from the University of Massachusetts at Amherst, where he also did graduate work towards a PhD in Computer Science - which he hopes to complete at some point in the future.

Understanding Service Composition - Part I: Dealing With Workflow Across Services

Published: April 8, 2010 • SOA Magazine Issue XXXVIII

Introduction

Whenever a service composes another service, meaning that it uses the capabilities of one or more other services to perform its own tasks (usually by means of workflow technologies), autonomy is affected. In other words, when your service directly depends on other services, the freedom and control you have in developing the service are limited. The level of control over various runtime characteristics may also be affected.

If you make synchronous calls to another service, you give up some control over runtime resources, such as threads. If you need to wait for another service's response before you can return a response from your own service, you depend on that other service in terms of response times. That translates into reduced predictability of how your service performs. To make matters more challenging, response times are often regulated by SLAs.

The reliability of your service will also be affected. If something goes amiss in one of the composed services, your service will not be able to do its job as expected. Within a composite service this problem is exacerbated: the reliability of the composition is never higher than the reliability of the least reliable composed service, and when failures are independent, the reliabilities of all the composed services multiply to produce the reliability of the composition.
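
To see what this multiplication means in practice, consider a hypothetical composition that depends on five services, each available 99% of the time. If failures are independent, the composition succeeds only when all five services succeed, so its reliability is 0.99^5 ≈ 0.951, or roughly 95.1%, noticeably worse than any individual service.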

Workflow across services provides the underpinning for successful service composition, and in a way, for service-orientation at large; we will examine in more detail what this entails for the real world in the next few sections.


Key Concepts


Distributed ACID Transactions

Using service composition to solve a task is just another way of saying that the task requires a joint effort to be pulled off. Whenever you have any kind of joint effort you introduce the possibility of some of the collaborating parties succeeding and others failing. Dealing with partial failures (some services in a composition succeed while others fail) is a problem that we do not need to worry about in self-contained applications. In traditional applications, transactions can help us guarantee that state is either changed the way we expected it to be changed (committed) or not changed at all (rolled back). From a consistency point of view, either of these outcomes is good. Partial failures, though, are bad, as they can introduce inconsistencies into the persistent or durable state.
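
To make the commit-or-roll-back guarantee concrete, here is a minimal sketch of a local ACID transaction using plain JDBC; the transfer scenario and the table and column names are illustrative, not taken from the article.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class LocalTransactionExample {

        // Moves money between two accounts; both updates become durable
        // together, or neither is applied, so no partial failure can be
        // observed by other parties.
        public static void transfer(Connection con, int fromId, int toId,
                                    long amount) throws SQLException {
            boolean previousAutoCommit = con.getAutoCommit();
            con.setAutoCommit(false); // begin the transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, amount);
                debit.setInt(2, fromId);
                debit.executeUpdate();
                credit.setLong(1, amount);
                credit.setInt(2, toId);
                credit.executeUpdate();
                con.commit();    // both changes committed together
            } catch (SQLException e) {
                con.rollback();  // neither change is applied
                throw e;
            } finally {
                con.setAutoCommit(previousAutoCommit);
            }
        }
    }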

Considerable effort has been invested in combining transactions with services, particularly in the Web services realm. Several standards have been developed and implemented by multiple vendors. Sadly, the different vendor implementations are still marred by interoperability issues when you mix and match technologies. In practice, the same technology is often used to build all the services participating in a transaction, because making transactions between services work is that much easier if you are not using WCF in one service and some kind of Java technology in another. In other words, programming transactions between Web services will limit your technological options and, in practice, force you to choose and commit to one vendor platform.

However, even if you go to great lengths to standardize the technology of your collaborating services, you may still fail utterly in holding an ACID transaction together. The reason for this failure is legacy. Making legacy systems play along nicely in your ACID transactions is in many cases simply not appropriate: the investment and complicated custom code required will often be beyond justification.

If, however, you get lucky and are actually able to either commit or roll back all the work done by your collaborating services in one ACID transaction, you would still do well to think about it one more time. Transactions between services introduce serious runtime dependencies. Calling a service from another service restricts autonomy, especially if you use synchronous calls. If an ACID transaction is present, you are also accessing some kind of persistent or durable state storage, which most of the time will be a database. If you are accessing such data stores in several services as part of one transaction, you are in effect locking important resources while waiting for all the parties in the transaction to vote on the outcome and then commit or roll back. This can lead to severe performance problems, and even timeouts, when several parties attempt to access the same data at the same time.

Regardless of the problems with distributed transactions (that is, transactions that cross service or capability boundaries), you would do well to use transactions within the borders of your service capabilities. Using transactions within a service capability is a fine approach, although you may run into difficulties if your service has to deal with multiple data sources and some of them cannot participate in your capability-level ACID transaction.


Using Retries to Avoid ACID Transactions

In some cases a retry is a very good option. If one of the services used as part of a service composition fails to do its job, you may simply try calling the service again. In some cases a retry is obviously not a good idea, such as when the data in the message fails validation. In other cases a retry may work out well, such as in the case of a timeout. If a service is not up and running when it is first called, a later retry may resolve the problem.
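
The following is a minimal sketch of this distinction: transient failures (timeouts) are retried with exponential backoff, while validation errors are rethrown at once, since resending the same invalid message can never succeed. The ServiceClient interface is a hypothetical stand-in for whatever invocation mechanism is in use.

    import java.util.concurrent.TimeoutException;

    // Hypothetical abstraction over a call to a composed service.
    interface ServiceClient {
        String call(String request) throws TimeoutException;
    }

    public final class RetryingCaller {

        // Retries timeouts with exponential backoff; a validation error
        // (IllegalArgumentException) is not retried, because the message
        // itself is bad and a resend is futile.
        public static String callWithRetry(ServiceClient client, String request,
                                           int maxAttempts)
                throws TimeoutException, InterruptedException {
            long backoffMillis = 500;
            for (int attempt = 1; ; attempt++) {
                try {
                    return client.call(request);
                } catch (IllegalArgumentException e) {
                    throw e; // invalid data: retrying cannot fix this
                } catch (TimeoutException e) {
                    if (attempt >= maxAttempts) {
                        throw e; // give up; see the poison-message discussion
                    }
                    Thread.sleep(backoffMillis);
                    backoffMillis *= 2;
                }
            }
        }
    }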

Some very robust systems have been built using retries conducted over a long period (days or weeks). This works even better in combination with monitoring: if the composition controller can monitor the services that are part of the composition, it will know when those services are up and running, and can schedule message resends accordingly.

Retries have their own problems though. When a message is resent it could potentially be delivered multiple times to its destination. It is sometimes critical that no unwanted side effects are produced as a result of such duplicate delivery.

Another kind of problem that may need to be handled is that of poison messages. Even though we attempt to differentiate scenarios where a retry is a good idea from the ones where a retry is futile, we may not be able to know enough up-front to cover all possible or even plausible cases. To handle such cases, a maximum number of retries can be defined after which a message will be considered a poison message.
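
Here is a minimal sketch of such a policy, using in-memory queues as a stand-in for a real messaging system; after maxAttempts failed deliveries a message is parked on a dead-letter queue for administrative review. All names are illustrative.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Hypothetical processing hook supplied by the destination service.
    interface MessageProcessor {
        void process(String message) throws Exception;
    }

    public final class PoisonMessageQueue {
        private final Queue<String> inbox = new ArrayDeque<>();
        private final Queue<String> deadLetter = new ArrayDeque<>();
        private final Map<String, Integer> attempts = new HashMap<>();
        private final int maxAttempts;

        public PoisonMessageQueue(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        public void enqueue(String message) {
            inbox.add(message);
        }

        // Drains the inbox; failed messages are retried up to maxAttempts
        // times and then moved to the dead-letter queue.
        public void pump(MessageProcessor processor) {
            String message;
            while ((message = inbox.poll()) != null) {
                try {
                    processor.process(message);
                    attempts.remove(message);
                } catch (Exception e) {
                    int count = attempts.merge(message, 1, Integer::sum);
                    if (count >= maxAttempts) {
                        attempts.remove(message);
                        deadLetter.add(message); // poison: stop retrying
                    } else {
                        inbox.add(message);      // requeue for another try
                    }
                }
            }
        }

        public Queue<String> deadLetters() {
            return deadLetter; // for administrative inspection
        }
    }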

Since retries may be in progress over an extended period of time, we should make our services, and even the end user, aware of the state of our retries.

Eventually you may have to give up, meaning you stop retrying without having reached your objective. One way to handle this is to allow humans to look at the messages and intervene administratively. This requires you to make the messages, as well as an audit log of events, available to an administrator, along with a way for the administrator to manually correct the problems.


Compensating Transactions

Another way to solve the problem is to use compensation handlers. If something goes wrong, you compensate for it and thereby undo the previous state changes. The programming model for signaling that you want to compensate for an action is quite straightforward. You could have symmetric operations that perform opposite actions, like Add and Remove. You could also store unique IDs related to operations and let consumers of your service compensate for a call using that same value and a special capability.
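
As an illustration of the second style, here is a minimal sketch of a hypothetical reservation capability: each reserve call returns a unique operation ID, and the compensate capability uses that ID to undo exactly that call. The scenario and all names are illustrative.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public final class ReservationCapability {
        private final Map<String, Integer> reservations = new ConcurrentHashMap<>();
        private int available = 100; // illustrative stock level

        // Performs the forward action and returns an ID the consumer
        // keeps in case it later needs to compensate.
        public synchronized String reserve(int quantity) {
            available -= quantity;
            String operationId = UUID.randomUUID().toString();
            reservations.put(operationId, quantity);
            return operationId;
        }

        // The symmetric undo: re-applies the reserved quantity. Calling it
        // twice with the same ID is harmless, since the entry is removed
        // on the first call.
        public synchronized void compensate(String operationId) {
            Integer quantity = reservations.remove(operationId);
            if (quantity != null) {
                available += quantity;
            }
        }
    }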

Determining the interface used to signal that you want a compensation to occur is most of the time a simple task. Not so carrying out the actual compensation. Depending on what your capability actually does, the compensation may be really difficult to implement. Being able to create logic that changes state in a predetermined and clearly defined way does not imply that the logic that reverses those changes will be clearly defined, or even possible to write. One of the simplest kinds of irreversible capabilities is deleting information; in this case, one solution is not to delete anything but rather to mark the data as modified/deleted. Working on complex problems can be challenging enough without having to figure out a way to reverse the whole process, which makes this strategy non-trivial.

Another challenging aspect to consider is that other consumers may call the service concurrently and introduce overlapping state changes. Some of the state changes introduced by other consumers may only be valid in the presence of the previously committed change, and are therefore invalid if that change is compensated. One possible solution to this problem is to introduce a kind of temporary flag in the persistent or durable state.


Aspects of Implementation


Modeling Services

Creating services that have the right kind of complexity, in particular reducing the complexity of integration efforts and collaboration, can be thought of as achieving the right level of cohesion within your services.

Note: Cohesion is a qualitative measure of how strongly related the capabilities of your service are; or how strongly related the different pieces of logic in your capabilities are. There are several different kinds of cohesion ranging from coincidental cohesion, which basically means that the capabilities just happened to be placed in the same service, to perfect functional cohesion, which means that all the capabilities contribute to solving a single well-defined task.

When modeling services, most developers tend to think in terms of logical cohesion. Two capabilities can be said to be logically cohesive when they are categorized as doing work on the same kind of thing, such as an invoice. Drawing borders around services based solely on this kind of cohesiveness is a mistake, as it will lead to an unnecessarily high amount of service collaboration.

If you instead create services that group capabilities so that they work on the same data, or so that the output of one capability is used as input to another, you will need less inter-service communication. These kinds of cohesion are commonly referred to as communicational and sequential cohesion, respectively.

If you model all of your services to have perfect functional cohesion (which, by the way, is not a trivial task), you would probably see something like Figure 1, part A. You have created services with the kind of cohesion that is considered the best. The advantages of functional cohesion include that it is easy to understand what your services do and easier to know where to find a specific piece of functionality.

However, when you start to analyze how these capabilities are used, and specifically which of them have to be used together, you may see something like Figure 1, part B. This is not good: as you can see, there is a lot of interaction between some of the services. If you investigate further, you may find that the services with the heavy dashed border frequently use each other's output as input and are often used together, and that the services with a solid border work on the same data, as illustrated in Figure 1, part C. It may then be a good idea to remodel the services so that they become communicationally and sequentially cohesive instead, as illustrated in Figure 1, part D.


Figure 1

Attaining these kinds of cohesion will reduce the need for service collaboration. As a bonus, it will also be easier to confine changes and refactoring to one service at a time. Some changes may still require you to modify several services at once, but this will not happen as often as when you model your services for perfect functional cohesion.

In conclusion, even though the internal complexity of your services will increase, the collaboration issues will be much reduced. In most cases you will not be able to create a suitable solution without weighing internal service complexity against service collaboration issues.


Service Agents

When a service needs to leverage functionality provided by another service, it is often expedient to implement code that manages the semantics of the communication with that service. Service agents isolate the idiosyncrasies of invoking diverse services and provide benefits such as mapping data formats (as we shall examine in a subsequent section), as well as caching service requests and responses in order to deal with events such as loss of connectivity.

A service agent is variously termed an agent, an emissary, or a proxy. A service agent often also enables offline operation via caching, queuing of service requests, and resolution of service locations.

An instance of a service agent typically needs to live for the duration of a single long-running transaction or dialogue across services. A side effect of this is that a service agent instance will usually need to manage what is termed 'activity-oriented data', for example the state that manages an instance of a dialogue between services.
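
Here is a minimal service agent sketch under these assumptions: the agent proxies a remote catalog service behind the same interface, caches responses so the consumer can keep working through a short loss of connectivity, and hides where the service actually lives. The CatalogService interface and all names are illustrative.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical contract shared by the real service and the agent.
    interface CatalogService {
        String lookup(String productId) throws Exception;
    }

    public final class CatalogServiceAgent implements CatalogService {
        private final CatalogService remote; // resolved service endpoint
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        public CatalogServiceAgent(CatalogService remote) {
            this.remote = remote;
        }

        @Override
        public String lookup(String productId) throws Exception {
            try {
                String response = remote.lookup(productId);
                cache.put(productId, response); // remember for offline use
                return response;
            } catch (Exception connectivityProblem) {
                String cached = cache.get(productId);
                if (cached != null) {
                    return cached; // serve a possibly stale answer
                }
                throw connectivityProblem;
            }
        }
    }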


Idempotency

In the absence of a reliable messaging infrastructure, service requests may get lost or arrive more than once. Idempotency is the property of a service request that makes it safe for the request to arrive multiple times: as long as the request is processed at least once, the correct result ensues, regardless of how many duplicates arrive.

One common real-world approach to making service requests idempotent (providing exactly-once semantics) is to associate a unique request ID with every service request. Sometimes this idempotence-specific information is added to the soap:Header section. With this additional information, the client simply retries the service request if it does not receive a response, and the destination service uses the request ID to detect duplicates and avoid repeating the work.
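
A minimal sketch of the receiving side, assuming the unique request ID has already been extracted from the message (for example, from the soap:Header); the stored response is returned for any duplicate, so the state change runs at most once per request ID. All names are illustrative.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class IdempotentReceiver {
        // Maps request IDs to the responses produced the first time around.
        private final Map<String, String> processed = new ConcurrentHashMap<>();

        public String handle(String requestId, String payload) {
            // computeIfAbsent runs doWork at most once per request ID;
            // a duplicate delivery just gets the remembered response.
            return processed.computeIfAbsent(requestId, id -> doWork(payload));
        }

        private String doWork(String payload) {
            // The actual, possibly non-idempotent, state change goes here.
            return "processed:" + payload;
        }
    }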

Queuing systems offer single-message reliability: each message is either delivered, or the sender is notified of a problem. For some applications this allows for a 'fire-and-forget' implementation. If the work being done encapsulates neatly into a single message, single-message reliability works just fine. The message will be delivered at most once, which obviates the need to make the message processing idempotent.

Single-message reliability only tells us that the message reached the destination service; it does not tell us that the destination service processed the message and the associated service request, nor does it inform us whether the destination service had any problems processing the request. For synchronous request-response interaction, single-message reliability doesn't quite help: the source service will still require some type of retry model in order to resend the service request if the destination service does not respond, which in turn leads to the need for the service request to be idempotent.


Conclusion

Service-oriented architectures need to deal with the space between systems and services. This is a side effect of service autonomy: business services are often independent entities with their own computing functions and data, usually managed independently. Workflow logic serves as the foundational capability, enabling service-oriented architectures to compose and harmonize multiple services to realize the goals of the business.