Philip Wik is a Senior Data Architect for American Express. Philip has worked for JP Morgan/Chase, Wells Fargo, Honeywell, Boeing, Intel, and other companies in a variety of applications development, integration, and architectural roles. He has published three books through Prentice-Hall:
How to Do Business With the People's Republic of China, How to Buy and Manage Income Property, and (along with other co-authors) Next Generation SOA.
Big Data as a Service
Published: February 20, 2013 • Service Technology Magazine Issue LXX
Abstract: The various ways in which SOA design principles can be synergized with Big Data are explored. Complex event processing, Apache Hadoop metadata management, scalable Infrastructure-as-a-Service (IaaS), and front-end analytics are among the methods that can render Big Data-as-a-Service (BDaaS).
What is Big Data?
The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents.
H.P. Lovecraft, "The Call of Cthulhu," 1928
Big Data is a domain of information fueled not only by enterprise data such as general ledger accounting, but also by sensor data, call data records, radio-frequency identification tracking, digital exhaust, and social media.
Big Data is stored in NoSQL databases. NoSQL, and by implication Big Data, is typically non-relational, distributed, open source, horizontally scalable, schema-free, eventually consistent, and massive in terms of data volume, velocity, and variety.
Big Data is realizing its promise. Load processing time can now be reduced by loading raw data into Hadoop. Sears, for example, shrank its ETL time from 20 hours to 17 minutes [REF-1]. Insurance companies are tailoring premiums using data collected by sensors that record driving patterns, braking abruptness, and mileage. Retailers are embracing dynamic pricing models. Pricing is no longer tied to the manufacturer’s suggested retail price (MSRP), but is now more sensitive to supply and demand based on inventory levels, item sales velocity, advertising channels, and competitors’ pricing. The price displayed on a light-emitting diode for each item will fluctuate with the variability of gasoline prices or wheat futures, increasing both the margin for the seller and competitive opportunities for the buyer.
This realtime price jiggling can be seen in the sale of sports and airline tickets. That Target's IT figured out a teen girl was pregnant before her father did is a mildly unsettling reminder that Big Data is providing Big Brother actionable insight to drive commerce [REF-2].
Data volume is growing exponentially. In 2010, the world’s data volume passed one zettabyte—one followed by twenty-one zeroes, or one billion terabytes. By 2015, the estimated global volume will be about eight zettabytes. One billion new Facebook items and 200 million tweets each day suggest both the sheer volume and the variety of Big Data. These statistics pale in the face of the demand required by basic science investigation and national security data collection, where a result set must be rendered from exabytes in under a tenth of a second.
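The unit arithmetic above can be checked in a few lines of Python. This is only an illustrative sketch: it assumes decimal (SI) units and a constant compound growth rate between the two quoted figures, which the text itself does not claim, and the function name is hypothetical.

```python
# Decimal (SI) byte units, matching "one followed by twenty-one zeroes".
TERABYTE = 10**12
ZETTABYTE = 10**21

# One zettabyte expressed in terabytes: 10**21 / 10**12 = 10**9 (one billion).
terabytes_per_zettabyte = ZETTABYTE // TERABYTE


def implied_annual_growth(start_volume, end_volume, years):
    """Return the constant yearly growth rate that turns start_volume into end_volume."""
    return (end_volume / start_volume) ** (1 / years) - 1


# Figures quoted in the text: about 1 ZB in 2010, about 8 ZB estimated for 2015.
# Assumption: growth compounds evenly across the five years.
growth = implied_annual_growth(1, 8, 2015 - 2010)

print(f"1 ZB = {terabytes_per_zettabyte:,} TB")
print(f"Implied growth: {growth:.1%} per year")
```

Under that assumption, going from one to eight zettabytes in five years corresponds to data volume growing by roughly half again each year.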
Architecting Big Data-as-a-Service (BDaaS)
Let us consider what it means to architect SOA-based Big Data. We must first remind ourselves that SOA is a set of design principles and constraints, rather than just a cluster of Web services. JACOWS with Big Data only has meaning in the context of good service-oriented principles, as well as process integrity, extending open source technologies, and integration with enterprise capabilities. Componentization, separation of concerns, simplicity, and layering are other principles that support SOA and Big Data. As Big Data software continues to improve, changes can be made in the user interface, communications, data storage, and task processing layers without having to rebuild the entire architecture.
Big Data shares with business intelligence (BI) many of the solutions that have allowed BI to become service-oriented [REF-3]. Technically, these include the use of adapters, metadata, data marts, and RESTfulness. Conceptually, this means pursuing agility and employing the same kinds of presentation, business process, and data service layers. The challenge of integrating Big Data into a data warehousing organization is more a matter of organizational change management than a technical problem.
Complex event processing (CEP) involves technologies that have the capacity to analyze data in motion in response to event triggers or changes in data patterns. These events are sometimes aggregated and coordinated through a business activity monitoring (BAM) rules engine. BAM generates alerts and reports in realtime as data points change. Events are similar to traditional messages exchanged by message-oriented middleware (MOM), requiring an event ontology. An event ontology is a formal specification within a shared domain that defines the process for sharing realtime events between components.
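To make the idea of analyzing data in motion concrete, the following is a minimal sketch of one CEP-style rule: raise an alert when a given number of events of one kind arrive within a sliding time window. It is not the API of any CEP or BAM product. The `Event` fields, the `ThresholdRule` class, and the sample event kinds are all hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since an arbitrary start
    kind: str         # hypothetical event type, e.g. "login_failed"
    source: str       # hypothetical origin of the event


class ThresholdRule:
    """Fire an alert when `limit` events of one kind arrive within `window` seconds."""

    def __init__(self, kind, limit, window):
        self.kind = kind
        self.limit = limit
        self.window = window
        self._recent = deque()  # timestamps of matching events still inside the window

    def on_event(self, event):
        if event.kind != self.kind:
            return None
        self._recent.append(event.timestamp)
        # Drop timestamps that have slid out of the time window.
        while self._recent and event.timestamp - self._recent[0] > self.window:
            self._recent.popleft()
        if len(self._recent) >= self.limit:
            return f"ALERT: {len(self._recent)} '{self.kind}' events within {self.window}s"
        return None


def process(stream, rules):
    """Feed each event to every rule as it arrives and collect the alerts."""
    alerts = []
    for event in stream:
        for rule in rules:
            alert = rule.on_event(event)
            if alert:
                alerts.append((event.timestamp, alert))
    return alerts


stream = [
    Event(0.0, "login_failed", "web-1"),
    Event(2.0, "page_view", "web-2"),
    Event(4.0, "login_failed", "web-1"),
    Event(5.0, "login_failed", "web-3"),
    Event(60.0, "login_failed", "web-1"),
]
alerts = process(stream, [ThresholdRule("login_failed", limit=3, window=10)])
for timestamp, alert in alerts:
    print(timestamp, alert)
```

The rule reacts to each event as it arrives rather than querying stored data afterward, which is the distinction between processing data in motion and data at rest. A production rules engine would add persistence, out-of-order handling, and many concurrent rules.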
SOA principles include loose coupling of services, one-to-one correlation between providers and consumers, and usually synchronous (request-response) behavior. By contrast, event-driven architecture (EDA) is characterized by decoupled interactions, many-to-many communication, event-controlled actions, and asynchronous operations. BDaaS embraces principles that are tied to both SOA and EDA. BDaaS can be regarded as an asynchronous SOA, with a complementary relationship between services and events.
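The contrast between the two interaction styles can be sketched in a few lines. The example below is an in-memory illustration only, not a real messaging product. The `EventBus` class, the `payment.made` topic, and the account data are hypothetical, and asynchrony is simulated by queuing events and delivering them in a later `dispatch` step.

```python
from collections import defaultdict


# Service-style interaction: one consumer calls one provider and waits for the reply.
def get_account_balance(account_id):
    """Hypothetical provider operation; the data is hard-coded for illustration."""
    balances = {"A-100": 250.0}
    return balances.get(account_id, 0.0)


class EventBus:
    """Minimal in-memory publish/subscribe channel (an illustrative stand-in)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = []

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # The publisher returns immediately and does not know who will consume the event.
        self._queue.append((topic, payload))

    def dispatch(self):
        # Deliver queued events later, to every subscriber of each topic.
        while self._queue:
            topic, payload = self._queue.pop(0)
            for handler in self._subscribers[topic]:
                handler(payload)


audit_log = []
fraud_checks = []

bus = EventBus()
bus.subscribe("payment.made", audit_log.append)
bus.subscribe("payment.made", lambda event: fraud_checks.append(event["amount"] > 1000))

# Request-response: the caller blocks until the provider answers.
balance = get_account_balance("A-100")

# Event-driven: publish and move on; delivery happens afterward.
bus.publish("payment.made", {"account": "A-100", "amount": 1500})
pending = len(audit_log)  # still empty, because nothing has been delivered yet
bus.dispatch()

print(balance, pending, audit_log, fraud_checks)
```

The single published event reaches two independent subscribers, neither of which the publisher knows about. That is the many-to-many, decoupled behavior that EDA adds to the one-to-one service call.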
Figure 1 – Big Data as a service data flow.
We can visualize BDaaS as a process in which data flows from the top down through a SOA. In Figure 1, Big Data flows into MapReduce collectors and is processed by CEP for rendering as a service on a scalable platform of a private virtual local area network.
One of the more widely used NoSQL solutions is Apache Cassandra [REF-4]. Facebook originated Cassandra, which is now in production at Twitter, Reddit, Digg, and other large sites. A Cassandra cluster can be built with hundreds of nodes that process petabytes of data and can be integrated with Hadoop, an open source MapReduce implementation. Hadoop allows large data volumes to be processed while keeping the data in its original cluster. For example, the Hadoop Distributed File System (HDFS) stores Web logs. Other Hadoop platforms are Hive, for data warehousing; Pig, for data set analysis; Chukwa, for monitoring distributed systems that use HDFS; Flume, to aggregate streaming data; and Sqoop, to allow the bulk transfer of data between Hadoop and relational databases.
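The MapReduce model that Hadoop implements can be illustrated without a cluster. The sketch below runs the map, shuffle, and reduce steps in a single Python process over a handful of Web log lines. It does not use the Hadoop API, and the log layout and function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical Web log lines in a simplified "ip method path status" layout.
LOG_LINES = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 POST /login 200",
    "10.0.0.3 GET /index.html 200",
    "10.0.0.2 GET /admin 403",
]


def map_phase(line):
    """Emit a (status_code, 1) pair for one log line."""
    status = line.split()[-1]
    yield status, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one status code."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


status_counts = run_job(LOG_LINES)
print(status_counts)
```

Because each map call touches only one line and each reduce call only one key, both phases can in principle be spread across many nodes that each hold part of the data. This is what lets the processing move to the data rather than the reverse.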
Elastic scaling through cloud virtualization also facilitates Big Data. A Big Data platform includes massively parallel-processing software that is running on multiple servers, petabyte-scaled platforms that collect, manage, and analyze enterprise data, distributed data retention hardware, in-memory data, data-mining grids, neural networks, and NoSQL-data scalable storage. Big Data also relies on direct-attached storage in its various forms, from solid state disk to high capacity Serial ATA disks inside parallel processing nodes and system on a chip (SoC) innovations. Hadoop was largely a technical response to the limitations of older analytic and storage architectures, such as storage area networks, sharding, parallel databases, and business intelligence tools.
Appliances and connectors can seamlessly enable the transactional integration of data and services. Web services adapters provide Web services interfaces to application, database, or platform systems. Adapters can also transform non-XML Big Data formats into XML formats.
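As a minimal sketch of the format transformation an adapter might perform, the following converts a flat JSON record into XML using only the Python standard library. The record, its field names, and the `json_record_to_xml` helper are hypothetical. The sketch also assumes that every field name is a valid XML element name and that the record has no nested structure.

```python
import json
import xml.etree.ElementTree as ET


def json_record_to_xml(json_text, root_tag="record"):
    """Convert a flat JSON object into an XML element with one child per field."""
    record = json.loads(json_text)
    root = ET.Element(root_tag)
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sensor reading in a non-XML (JSON) format.
reading = '{"sensor_id": "S-17", "speed_kph": 88, "hard_braking": true}'
xml_text = json_record_to_xml(reading, root_tag="reading")
print(xml_text)
```

A real adapter would also need to handle nested records, type information, namespaces, and schema validation before the result could be exchanged through a Web services interface.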
Figure 2 – Big Data and SOA.
An enterprise service bus orchestrates the processing of data from MapReduce collectors, complex event processing, business process management, and other Web services.
Companies are also developing niche tools to help understand Big Data, such as Splunk and Sumo Logic. Splunk analyzes machine-generated data in a searchable repository, from which graphs, reports, alerts, dashboards, and visualizations can be generated. Sumo Logic leverages Big Data logging to deliver realtime analytical insights of operational intelligence.
The Dark Side of Big Data
The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the light into the peace and safety of a new dark age.
H.P. Lovecraft, "The Call of Cthulhu," 1928
As architects, we seek to develop holistic solutions and model a design at the business, application, and technology levels using The Open Group Architecture Framework (TOGAF) [REF-5]. However, missing from this is another dimension that is increasing in importance as Big Data becomes ubiquitous. Rarely in the technical literature do the words ethics and Big Data appear in the same sentence [REF-6]. We can define what Big Data is, but do we understand what Big Data means? As a matter of commercial self-interest in the context of our universal rights, we must address the ethics of Big Data.
The purpose of this section is not to give answers, but to raise questions. The following premises inform these questions:
The first presupposition gives us pause about an optimistic view of technology. Scientism is a pernicious dogma because its epistemology is rooted in rationality, while its ethics are relativistic and power-based. The mantra of this age is that if it can be done, it must be done, despite whatever values those actions may contradict. We see this in the medical profession, with experimentation in such areas as cloning, eugenics, and end-of-life choices. We also see it in the military, with its creation of chemicals, viruses, and fission that could one day allow roaches to inherit the earth. The binding of the worst impulses with the best technology of mankind is inexorable.
Vast databases that talk to other vast databases could erode our sphere of privacy to the point that privacy will cease to exist, even for those who believe that they are off the grid. Because of the ubiquity of sensors and cameras, the grid is our existence itself. The fact is that Big Data can concentrate and channel power in ways that we do not yet fully understand. It is one thing to track guns, but what about thoughts? Totalitarianism is perhaps less likely than the Brave New World consensus that safety, solidarity, serenity, and distraction are paramount and that privacy and individuality are superfluous.
H.P. Lovecraft, the erudite writer of horror, spoke of the safety of a new dark age as a response to the insight of hidden knowledge. The forbidden fruit of new knowledge is dangerous, not so much in its consumption as in its nurturing. The reality that is Big Data means that those shadowy institutions that fund Big Data may be acting in ways that subvert our values.
Intelligence agencies are integrating pipelines of raw data so as to connect the dots tactically, operationally, and strategically upon ingestion and upon consumption at any level. Human intelligence (HUMINT), signals intelligence (SIGINT), imagery intelligence (IMINT), measurement and signature intelligence (MASINT), and open source intelligence (OSINT) are baked in a way that is only hinted at by Hollywood. The private sector, such as Facebook and Google, is also harvesting personally identifiable information with equal alacrity.
Perhaps technical and moral ends can both be attained by building structures that promote accountability, transparency, and communication to customers, clients, stakeholders, and the public. Institutions that cannot address these concerns could pay a price in brand erosion.
Big Data offers the human mind the ability to correlate all of its contents, and therein lie great rewards and risks. The following questions require searching answers, and perhaps better questions. This effort of looking at Big Data as a normative question will allow us to architect a sound solution that meets both universal and technical goals.
Implications for the Future
Big Data, in combination with clouds, EDA, and SOA, is defining the future of information technology [REF-7]. Although operational data stores in third normal form will retain their value, Big Data will not only supplement, but will eventually replace, data warehousing star schemas, as infrastructure costs decline and presentation-level analytical tools increase in both number and sophistication.
[REF-1] "Big Data at Sears", Jnan Dash’s Weblog, November 11, 2012: http://swtrends.wordpress.com/2012/11/06/big-data-at-sears/
[REF-2] Kashmir Hill, "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did", Forbes Magazine, February 16, 2012: http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
[REF-3] Philip Wik, "Service-Oriented Architecture and Business Intelligence", Service Technology Magazine, August 11, 2011: http://servicetechmag.com/I53/0811-2
[REF-4] Philip Wik, "Cassandra and Big Data", DataCenterAcceleration, September 18, 2012: http://www.datacenteracceleration.com/author.asp?section_id=2432&doc_id=250926&
[REF-5] "Using TOGAF to Define and Govern Service-Oriented Architectures", May, 2011: http://pubs.opengroup.org/architecture/togaf9-doc/arch/chap22.html
[REF-6] I first started to speculate about this angle of Big Data in a blog post "NoSQL" from October 2, 2012, DataCenterAcceleration: http://www.datacenteracceleration.com/author.asp?section_id=2432&doc_id=251664&
[REF-7] "Oracle Information Architecture: An Architect’s Guide to Big Data", Oracle, August, 2012: http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf
I wish to thank Markus Zern, Vice President of Product and Solutions Management at Splunk, who reviewed and critiqued my paper.