22 September 2013

The Value of Streaming Processing and Spring XD

I really enjoyed GridGain CEO Nikita Ivanov's post on Four Myths of In-Memory Computing. He nicely explains some of the applications of in-memory computing. While all his points were good, I was really happy to see him touch on stream processing as a use case.

I also think that in-memory databases are important use case… for today. They solve a specific use case that everyone readily understands, i.e. faster system of records. It’s sort of a low hanging fruit of in-memory computing and it gets in-memory computing popularized. I do, however, think that the long term growth for in-memory computing will come from streaming use cases. Let me explain. Streaming processing is typically characterized by a massive rate at which events are coming into a system. Number of potential customers we’ve talked to indicated to us that they need to process a sustained stream of up to 100,000 events per second with out a single event loss. For a typical 30 seconds sliding processing window we are dealing with 3,000,000 events shifting by 100,000 every second which have to be individually indexed, continuously processed in real-time and eventually stored. This downpour will choke any disk I/O (spinning or flash). The only feasible way to sustain this load and corresponding business processing is to use in-memory computing technology. There’s simply no other storage technology today that support that level of requirements. So we strongly believe that in-memory computing will reign supreme in streaming processing.

I've also heard this field roughly called data integration.

It's a very vibrant field. I like the way Michael E. Driscoll (@medriscoll), CEO at Metamarkets, put it:

Spring Integration makes it dead simple to integrate with various messaging systems (JMS, AMQP, STOMP, MQTT, websockets, Twitter, Kafka, etc.) to build pipe-and-filter architectures. Its API elements mirror the patterns of the same name in Gregor Hohpe and Bobby Woolf's canonical tome on the subject, Enterprise Integration Patterns. In Spring Integration, Messages flow from one component - a splitter, or router, an aggregator, a transformer, etc. - along channels. Components are decoupled in that they communicate only through Messages. Messages in turn have headers - a map of metadata about the payload - and a payload.

Spring Batch has great support for managing the state and orchestration of long-running, data-centric jobs. It has the ability to work with a variety of systems where input and output is most efficiently done in batch. Spring Batch supports the notion of jobs, which are composed of a sequence of steps. Each step can optionally read data, optionally process, and optionally write data. So, for example: one step might read data (lots of data! Millions of records! Batch will scale..) from, for example, a SQL database or a large tab delimited file. Once the data's read, a natural next step is to process it and then - once finished - write the changes somewhere.

Data integration, or stream processing, or ingestion, etc. is all about managing the integration with, and accquisition of, data from varied systems and supporting its ingestion, analysis, processing and ultimate storage.

As Nikita points out, with data storage options so cheap these days, we can record as much data as we want. The real question is: how do we process it? How do we extract value out of it? Sure, it's easy to put hundreds of terabytes of Hadoop data onto an HDFS data lake, but how do you transform that data into business value? How do you integrate with other systems - online systems, warehouses, reporting? How do you accommodate the ingest of new data even in the face of a tidal data deluge?

To do this right, you need data processing support (extraction, transformation, and loading) and a event- and messaging-centric programming model to stitch together otherwise decoupled components in distributed, messaging-centric workflows. This is where Spring XD comes in. Spring XD is a new project in the stream processing space. It builds on the strengths of Spring Batch, Spring Integration, and Spring Data and Spring for Hadoop to meet this new generation of challenges.

That picture's fairly marketitecture-ish, but it does a nice job of sort of visualizing the place of Spring XD in your architecture and in understanding it, you already understand the programming model of Spring XD: streams represent the flow of data from a source to a sink. A source is some point of entry, like a database, syslog, HDFS, etc. A sink represents the place where the data ultimately gets written to. You can put in processors along the stream to process, transform, and audit the data. A tap is a component that intercepts the data, but doesn't terminate the stream. In the integration world, the closest analog is a wire tap.

I'll let you read this introductory blog post for more.