Masykur Marhendra

Biography

Masykur Marhendra Sukmanegara graduated summa cum laude in Informatics Engineering from the Bandung Institute of Technology and placed first in the Global Warming Solution Technology competition held by the Environmental Ministerial Department. He started his career as a junior telecommunication developer at Switchlab, implementing MSC encoding/decoding modules based on 3GPP standard specifications. After that project was done, he joined the Javan IT Services consultancy as a J2EE Engineer, working on Web and mobile applications across various industries. His career as a J2EE Engineer then continued at XL Axiata, the second-largest telecommunication provider in Indonesia, where he worked on southbound and northbound applications, integrating various telecommunication subsystems with Java technologies such as IBM Netcool, SMSC, and SMS Gateway. In 2010, he was appointed to lead XL Axiata's first SOA implementation, a pilot project in the billing domain: designing and architecting the SOA platform, delivering the service-oriented porting of satellite applications, ensuring that service-oriented principles were followed, maintaining SOA capacity, setting the baseline for subsequent architecture and development, and monitoring the process.


SOA in the Telco Domain – Part II: Capacity Planning of SOA-Based Systems

Published: September 13, 2011 • Service Technology Magazine Issue LIV

Abstract - Service-oriented architecture in the telecommunication industry is a first but significant step toward answering many challenges, from management needs to fulfilling product timelines driven by marketing requests. To implement a service-oriented architecture, we need to define at least the technology we will use, the design of the system architecture, the implementation strategy, and the roadmap itself. Last of all is how we manage the established service-oriented system: monitoring service performance, the lifecycle of services, risk management, and so on. To keep services performing at their best, we need good service capacity planning in terms of high availability, service throughput, and resource consumption. Along the way, services will also evolve and expand, so we likewise need good capacity planning for the platform, including the Enterprise Service Bus, the Messaging Bus, and supporting platforms such as the database.


Introduction

Nowadays, the telecommunication industry faces very tight competition in delivering the best quality of service for short messages, data, and the subscriptions to those services. Service-oriented architecture is a first but significant step toward answering this challenge and fulfilling the high demand from subscribers. To implement a service-oriented architecture, there are a few things we need to do: define the technology we will use, determine the design of the system architecture, and plan the implementation strategy. Choosing the best-fit technology is the first critical point. We need a set of criteria for evaluating whether the capabilities of the technology can satisfy our requirements; in the telecommunication domain, five-nines (99.999%) high availability is mandatory. Afterwards, we can design the system architecture to fit our needs.

The last step is managing the well-established SOA-based system. Managing a service-oriented system includes managing service availability, service performance, the service lifecycle, risk management, and so on. This step is important for keeping the system stable, and it must mitigate the risks that may occur in the future. As the number of subscribers keeps growing every day, more transactions are loaded onto the system, and telecommunication providers must keep delivering services at their best performance with high availability. To maintain service performance at its best, we need to keep evaluating service capacity in terms of high availability, throughput, and resource consumption. Moreover, as services evolve and expand, we also need to define the capacity of the platform itself, including the Enterprise Service Bus, the Messaging Bus, and supporting platforms such as the database.

Capacity planning of an SOA-based system is a mandatory step to keep the system running at its best. It involves two main activities: capacity planning of the services and capacity planning of the platform. Service capacity planning is mainly about sizing services horizontally so they can handle increasing incoming transaction requests with the allocated system resources. Platform capacity planning is about sizing the platform's capability to provide system resources to all running services, including the Enterprise Service Bus, the Messaging Bus, and supporting platforms such as the database. This article discusses these two activities.


Capacity Planning on Services Level


Services Instance Sizing

Like other applications, SOA-based services run as several instances. Each instance has the same capability and capacity. To handle the required incoming requests, we need to define how many service instances to run. First, we need to know how much traffic can be handled by one service instance. This can be measured through benchmarking and performance tests in an environment that most closely resembles production.

For example, let us measure a service for purchasing BlackBerry package registration products from the UMB channel; we will refer to it as service-X. The forecasted load is 300 transactions per second (tps) with a service level agreement of not more than 25 seconds. To benchmark, we apply load tests step by step, from 100 tps up to the load at which service performance starts to degrade. As a sample, we have the performance test results below:


NO | Load (tps) | Response Time (ms) | % Increase from Row 1 | Row 2   | Row 3  | Row 4
1  | 100        | 1,200              |                       |         |        |
2  | 150        | 1,250              | 4.17%                 |         |        |
3  | 200        | 2,300              | 91.67%                | 84.00%  |        |
4  | 250        | 3,210              | 167.50%               | 156.80% | 39.57% |
5  | 300        | 4,000              | 233.33%               | 220.00% | 73.91% | 24.61%

Table 1 – Service-X Performance Test Result

We can see that the best response time is obtained at 100 tps, which is our baseline. Performance starts to degrade noticeably when we apply a 200 tps load to the service. Running two instances at 150 tps each to handle a 300 tps load processes transactions faster than running one instance at 300 tps: on row no. 2, one transaction takes 8.33 ms to finish (1,250 ms / 150 tps), so 300 transactions need only around 2,500 ms, whereas a single instance at 300 tps needs 4,000 ms to complete. Within those 4,000 ms, two instances running at 150 tps can each complete around 480 transactions.

From this analysis, we can conclude that service-X can handle at most 150 transactions per second per instance, and that running two service instances at 150 tps each gives better results than running one instance at 300 tps.
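The instance-sizing reasoning above can be sketched as a small calculation. This is a minimal sketch using the Table 1 figures; the 10% degradation threshold and all names are illustrative assumptions, not from the article:

```python
# Benchmark results from Table 1: (load in tps, response time in ms)
benchmark = [(100, 1200), (150, 1250), (200, 2300), (250, 3210), (300, 4000)]

# Baseline is the response time at the lowest load (100 tps).
baseline_ms = benchmark[0][1]

# Highest load whose response time stays within 10% of the baseline
# (an assumed acceptability threshold).
max_safe_tps = max(tps for tps, rt in benchmark
                   if (rt - baseline_ms) / baseline_ms < 0.10)

# Instances needed for the forecasted 300 tps load (ceiling division).
forecast_tps = 300
instances = -(-forecast_tps // max_safe_tps)

print(max_safe_tps, instances)  # 150 2
```

The same calculation can be rerun after each new benchmark round, so the safe per-instance load and the instance count stay tied to measured data rather than guesses.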


Services Resources Sizing

Another sizing we need to do at the service level concerns system resources. System resource sizing is about the processing unit and memory needed by a single service instance to run at a certain transaction load level. The planned service resource size then correlates with the platform capacity sizing. There are two main aspects we need to plan:

  1. Processing Unit - defined by how many cores a single instance needs to run a certain transaction load. This is one of the most important aspects to calculate, because service availability depends heavily on the availability of the processing unit: we cannot run more services if no processing unit is available. We also need to account for it when designing a Disaster Recovery site for the SOA.

    To know how many instances we need, processing unit usage can be found through benchmarks and performance tests in the environment that most closely resembles production. For example, we have the performance test results for service-X's processing unit usage in Table 2 below, using 150 tps with 2 instances as concluded in the previous example.


    Table 2 – Service-X Processing Unit Usage

    In the production environment, when the service runs and reaches a 150 tps load, each instance adds about 2.75% to the platform's processing unit utilization. For example, if the current condition of our platform uses only 40% of the processing unit (at peak periods), it is still safe to run two instances of service-X on the platform.

    This baseline data is also useful when we need to do projection planning. Projection planning is important for management decisions regarding expansion of the platform, both horizontally and vertically, so that future events (like Idul Fitri, Christmas Eve, New Year, etc.) can be handled.

  2. Memory - used by the services to store data while transactions run, and released when each transaction finishes. In some cases memory leakage can happen: a service fails to release all of its memory back to the platform, usually because of the quality of the service implementation code's object management. So before we put a service into production, we need to make sure all services are free from memory leaks, so that they do not disturb the production runtime environment.

    Unlike the processing unit, determining how much memory the services need requires an estimation based on the service's own activity process. For example, figure 1 describes the five main activities of service-X.


    Figure 1 – Service-X Activity

    In the first activity, the service translates the msisdn and keyword from the incoming request parameters into an internal data structure called the Request Payload. The Request Payload consists of two main parts:

    1. Header - a payload that defines the properties of every single message request. This part contains elements such as RequestID, EndSystem, TimestampIn, TimestampOut, Channel, UUID, ESBUUID, and more. The header part is carried until the end of the activity chain.
    2. Body - the main payload of the request. The body part varies for each service and depends on the specific internal data structure implementation. For example, in service-X the body payload consists of the msisdn, keyword, subscriberNo, and soccd elements.

      Figure 2 – Service-X Body Part

    For example, the header part will contain at most 2,048 bytes and the body payload 327 bytes, so the base payload is 2,375 bytes.

    In the next activity, service-X loads the subscriber profile (subscriber number and soccd) from the database based on the msisdn. The subscriber number and soccd are then mapped into the request payload body. For example, the subscriber number takes at most 32 bytes and the soccd at most 16 bytes, so the second activity adds 48 bytes of memory usage.

    In the third activity, service-X registers the subscriber number to the corresponding package based on the soccd and the keyword given as input. The return value of this activity is only a boolean, which takes 1 byte of data, so this activity adds just 1 byte of memory usage.

    Unlike the previous activities, the value for the fourth activity varies with the package the subscriber has registered to, but we can take its maximum value into account. For example, a well-formed response message (as defined by the marketing team) needs 256 bytes, and we can use this number as a guideline. In the last activity, the response message is simply sent out through the defined channels (UMB or SMS).

    If we sum all of the activities above, we will have the following estimation:


    # Activity | Header | Body | Extra Payload | Total
    1          | 2,048  | 327  | 24            | 2,399
    2          | 2,048  | 327  | 48            | 2,423
    3          | 2,048  | 327  | 1             | 2,376
    4          | 2,048  | 327  | 256           | 2,631
    5          | 2,048  | 327  | 0             | 2,375
    Total      |        |      | 329           | 12,204

    Table 3 – Memory Usage Estimation for Service-X

    For a single transaction, service-X will need at least 12,204 bytes of memory. As mentioned in the previous example, if service-X runs at 150 tps per service instance, it will need at least 1,830,600 bytes of memory (around 1.74 MB).
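The activity-by-activity estimate can be checked with a short calculation. This sketch takes the header, body, and extra-payload sizes from the worked example; the list ordering mirrors Figure 1, and the variable names are assumptions:

```python
HEADER_BYTES = 2048  # maximum header payload, carried through every activity
BODY_BYTES = 327     # msisdn, keyword, subscriberNo, soccd

# Extra payload added in each of the five activities (bytes):
# translate request, load profile, register package, build response, send.
extra_per_activity = [24, 48, 1, 256, 0]

per_transaction = sum(HEADER_BYTES + BODY_BYTES + extra
                      for extra in extra_per_activity)
print(per_transaction)  # 12204 bytes per transaction

# Memory needed by one instance handling 150 concurrent transactions per second.
per_instance = per_transaction * 150
print(per_instance)  # 1830600 bytes, i.e. around 1.74 MB
```

Keeping the per-activity sizes in one place like this makes it easy to re-estimate memory whenever a payload element is added or resized.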


Capacity Planning on Infrastructure Level

The platform is the system where applications run; however, not all applications can run on multiple platforms. Because the platform is the base foundation on which applications run, we need to make sure it has the availability and scalability to keep up with application growth. Important aspects of platform capacity are the processing unit, memory, and storage. In an SOA-based system, the Enterprise Service Bus, the Messaging Bus, and the database (optional) all need good capacity planning along these platform capacity aspects.


Enterprise Services Bus Capacity Planning

The Enterprise Service Bus is the system where collections of services run to perform mediation, routing, transformation, and orchestration, processing incoming requests into the desired results. From the previous chapter, we already know the requirements under which a service can run. Summing those requirements together gives us the requirements for Enterprise Service Bus capacity planning.

In the telecommunication domain, five-nines high availability is a mandatory attribute: the Enterprise Service Bus (ESB) should be able to serve all requests 24 hours a day. To keep the ESB available, we should have good capacity planning and a high availability strategy for it:

  1. The processing unit on one ESB should not exceed a threshold that depends on the policy we use.
  2. The memory unit of the ESB should have adequate free paging space to serve services that need more memory allocation.
  3. The network bandwidth should be big enough to distribute the packets of a given transaction load across the SOA system (ESB, Messaging Bus, database, service providers, etc.).
  4. The high availability strategy must be able to sustain the ESB system so that it can keep serving requests.

For example, our system consists of four ESBs. The threshold of the processing unit on each ESB depends on the policy we use:

  1. One-to-one pair - one ESB acts as the fault-tolerant system for one primary ESB. In this policy, the processing unit of a primary ESB can run at 100% capacity, since its secondary only ever has to hold a single primary's load. But this approach is very expensive, because we must have a backup for every single primary ESB.

    Figure 3 – One-to-One Pair
  2. N+1 - one ESB becomes the secondary for all primary ESBs. If one primary ESB fails, it fails over to the secondary. In this policy, the processing unit of each primary ESB should not exceed 100/N % capacity, since the single secondary may have to hold the capacity of the N primaries.


    Figure 4 – N+1 Policy

Unlike the processing unit, the memory of the ESB is much more straightforward: we just need available paging space for services to allocate memory. For example, if we have 64 GB of memory and, on average, each service on one primary ESB needs 128 MB to run at 150 tps, that primary ESB can serve up to 512 services running at 150 tps. We should also put a threshold on memory rather than utilizing 100% of it; for example, once 60% of memory is utilized, we should add another memory unit to the ESB.
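The two ESB headroom rules above can be sketched as simple helper functions. This is an illustrative sketch: the function names are assumptions, and the split of four ESBs into three primaries plus one secondary is one possible reading of the example:

```python
def nplus1_cpu_threshold(n_primaries: int) -> float:
    """Maximum CPU % per primary ESB under an N+1 policy, so that the
    single secondary can absorb a failed primary's full load."""
    return 100.0 / n_primaries

def max_service_instances(esb_memory_gb: int, per_service_mb: int) -> int:
    """How many service instances fit in one ESB's memory."""
    return (esb_memory_gb * 1024) // per_service_mb

# Four ESBs under N+1: three primaries plus one secondary.
print(round(nplus1_cpu_threshold(3), 1))  # 33.3

# 64 GB of memory, ~128 MB per service instance running at 150 tps.
print(max_service_instances(64, 128))     # 512
```

Under a one-to-one pair policy the threshold function is not needed, since each primary may run at 100%; the memory calculation applies either way.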


Messaging Bus Capacity Planning

Just like for the Enterprise Service Bus, capacity planning for the Messaging Bus covers the processing unit, memory, and high availability features, and its high availability features are more or less the same as the Enterprise Service Bus's. However, the Messaging Bus's memory unit is not as straightforward as the ESB's: there are additional aspects to consider, such as the storage size for persisting messages.

Capacity planning of memory on the Messaging Bus correlates with the message throughput (input and output): memory is used to retrieve messages from the service producer and send them to the service consumer while keeping up performance. For example, from the previous chapter a message is 12,204 bytes, so if a service producer creates messages at up to 150 tps, we need at least around 1.74 MB of memory per second to hold messages for sending/receiving. Sometimes the service consumer is not available (e.g. periodic maintenance, restarting instances). In that case, the messages produced by the service producer cannot be sent immediately, so they are kept in persistent storage until the consumer becomes available again; with this mechanism we lose no messages or transactions. The persistent unit size depends on how long the service consumer is usually unavailable, and on the message rate itself. For example, if in the worst case the service consumer is unavailable for 3 hours while the transaction load is 150 tps, we will need about 18,792 MB of persistent storage.


Aggregate message rate: 1.74 MB per second (150 messages/s × 12,204 bytes)

# Hours | # Load (tps) | # Persistent Size (MB)
2       | 150          | 12,528
3       | 150          | 18,792
4       | 150          | 25,056
5       | 150          | 31,320
6       | 150          | 37,584

Table 4 – Persistent Unit Size Needed
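The persistence sizing can be reproduced with a short calculation. This sketch uses the message size and rate from the worked example; the function and variable names are assumptions:

```python
MESSAGE_BYTES = 12204  # per-message size, from the memory estimate
LOAD_TPS = 150

def persistent_size_mb(outage_hours: float) -> float:
    """Storage needed to buffer unsent messages while the consumer is down."""
    total_bytes = MESSAGE_BYTES * LOAD_TPS * outage_hours * 3600
    return total_bytes / (1024 * 1024)

for hours in (2, 3, 4, 5, 6):
    print(hours, round(persistent_size_mb(hours)))
```

Note that a 3-hour outage comes out to about 18,855 MB here, slightly above the 18,792 MB figure in the text, which rounds the per-second rate to 1.74 MB before multiplying.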

Database Capacity Planning

The database is usually used to keep business logs, application configuration, and transaction checkpoints. Business logs are kept for tracing transactions or troubleshooting the production runtime environment. Application configuration refers to all the configuration the applications use when starting and setting up their runtime environment. The transaction checkpoint is used by specific applications to maintain transaction state whenever a transaction fails.

Capacity planning for the database differs slightly from the other two components, beyond the processing unit and high availability features. At the enterprise level, the database usually keeps its data on persistent storage installed separately from it. Key aspects when defining storage capacity are:

  1. Traffic distribution - the traffic pattern observed in the production runtime. We can divide the day into two segments:
    a. Non-busy hours: a 12-hour period at 5% of peak traffic and a 4-hour period at 10%.
    b. Busy hours: 4 hours at 50% in the morning, 2 hours at 80% at noon, and 2 hours at 100% at night.
  2. Record size of a transaction - the space needed to store one business log in the database system, including:
    a. Database overhead, covering the overhead of CLOB, LOB, or other byte data
    b. Redo log
    c. Archive log
    d. Index
  3. Data retention - how long we need to keep our business logs in persistent storage.

Figure 5 – Database Storage Architecture Using SAN

For example, we have a requirement to keep business logs for 45 days at a 150 tps load, and one log record contains at most 15.5 KB of data. We can estimate the required storage capacity using the following formulation:


Number of TPS: 150
Data retention: 45 days

Estimated daily records by traffic segment:

12 hours at 5% traffic:    0.05 × 150 × 3,600 × 12 =   324,000
4 hours at 10% traffic:    0.10 × 150 × 3,600 × 4  =   216,000
4 hours at 50% traffic:    0.50 × 150 × 3,600 × 4  = 1,080,000
2 hours at 80% traffic:    0.80 × 150 × 3,600 × 2  =   864,000
2 hours at peak traffic:   1.00 × 150 × 3,600 × 2  = 1,080,000

Estimated total daily records:               3,564,000
Log size per record (KB):                    15.5
DB overhead (multiplier):                    3
Total daily size (KB):                       165,726,000
Total daily size (GB, ÷1,048,576):           158

Data retention (days):                       45
Grand total (GB):                            7,112

Table 5 – Storage Capacity Formulation
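The Table 5 formulation can be reproduced with a short calculation. The segment fractions, overhead multiplier, and retention come from the example above; the variable names are assumptions:

```python
TPS = 150
LOG_KB = 15.5
DB_OVERHEAD = 3      # multiplier covering redo/archive logs and indexes
RETENTION_DAYS = 45

# (hours, fraction of peak traffic) segments covering the 24-hour day.
segments = [(12, 0.05), (4, 0.10), (4, 0.50), (2, 0.80), (2, 1.00)]

daily_records = sum(hours * 3600 * frac * TPS for hours, frac in segments)
print(int(daily_records))  # 3564000 records per day

daily_gb = daily_records * LOG_KB * DB_OVERHEAD / (1024 * 1024)
print(round(daily_gb))     # 158 GB per day

print(round(daily_gb * RETENTION_DAYS))  # 7112 GB over the retention period
```

Keeping the traffic segments as data makes it easy to re-run the estimate for a different retention period or a revised traffic distribution.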

Conclusion

A database system/storage is not a mandatory component of an SOA-based system, but it is very helpful for seeing what is happening in the production environment. Most of all, it provides data to management on how much revenue is being produced by the SOA-based system; knowing this, they can decide whether it is worthwhile to put a service into an SOA-based system, considering all costs and benefits.