The Problem: Why Stream Processing Needs Proper Patterns
Modern applications generate events at unprecedented scale. A typical e-commerce platform might process millions of events per hour: user clicks, inventory updates, payment transactions, shipment notifications. The challenge is not just handling the volume; it is extracting meaning from these events in real time.
- Data Correlation - Events from different sources need to be joined to create complete pictures (orders with payments, users with their actions)
- Time-Based Analysis - Business questions often involve time windows: "How many orders in the last 5 minutes?" or "What is the rolling average transaction value?"
- Stateful Computations - Aggregations require maintaining state across events, which becomes complex in distributed systems
- Decoupled Processing - Teams want to deploy business logic independently without rewriting infrastructure code
Raw Kafka consumer APIs provide the building blocks, but implementing these patterns correctly requires understanding distributed state management, exactly-once semantics, and fault tolerance. This is where Kafka Streams and Spring Cloud Stream shine.
The Solution: Four Essential Stream Processing Patterns
We built four focused implementations that address the core patterns every stream processing system needs. Each project demonstrates a specific pattern while sharing a common foundation: Spring Cloud Stream with the Kafka binder.
Joins
Correlate events from multiple streams in real-time
Windows
Analyze events within time boundaries
Aggregations
Compute running totals and statistics
Routing
Dynamic dispatch to serverless functions
Pattern 1: Stream Joins - Correlating Related Events
The Use Case
Consider an order processing system. Order events arrive on one topic, payment confirmations on another. To build a complete order view, you need to join these streams based on a common key (order ID). This is not a database join - both events are in motion, arriving at unpredictable times.
How Spring Cloud Stream Handles It
The kafka-stream-join project demonstrates stateful stream processing where Kafka Streams maintains a local state store to hold events from one stream while waiting for matching events from another.
@Bean
public BiFunction<KStream<String, Order>, KStream<String, Payment>,
        KStream<String, EnrichedOrder>> processOrderPayment() {
    return (orders, payments) -> orders
        .join(payments,
            (order, payment) -> new EnrichedOrder(order, payment),
            JoinWindows.of(Duration.ofMinutes(5)),
            StreamJoined.with(Serdes.String(), orderSerde, paymentSerde))
        .peek((key, value) -> log.info("Joined: {}", value));
}
Join Types Explained
| Join Type | Behavior | Use Case |
|---|---|---|
| Inner Join | Emits only when both streams have matching keys | Orders that have been paid |
| Left Join | Emits for every left event, null if no match | All orders, with payment if available |
| Outer Join | Emits for both streams, nulls where no match | Complete event audit trail |
The critical parameter is the join window - how long to wait for a matching event. Set it too short, and you miss legitimate matches. Set it too long, and state stores grow unbounded. In production, we typically use 5-15 minute windows based on business SLAs.
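The effect of the join window can be illustrated in plain Java, independent of Kafka. The sketch below (all names hypothetical) applies the same rule a windowed inner join uses: two events pair up only when they share a key and their timestamps differ by no more than the window size.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical, Kafka-free sketch of inner-join-within-window semantics.
public class JoinWindowSketch {
    record Event(String orderId, long timestampMs) {}

    // Emit a pair for every order/payment with the same key whose
    // timestamps differ by at most the window size.
    static List<Map.Entry<Event, Event>> innerJoin(
            List<Event> orders, List<Event> payments, Duration window) {
        return orders.stream()
            .flatMap(o -> payments.stream()
                .filter(p -> p.orderId().equals(o.orderId()))
                .filter(p -> Math.abs(p.timestampMs() - o.timestampMs())
                             <= window.toMillis())
                .map(p -> Map.entry(o, p)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> orders = List.of(new Event("o1", 0), new Event("o2", 0));
        // o1's payment arrives 2 minutes later (inside the 5-minute window),
        // o2's arrives 10 minutes later (outside it), so only o1 joins.
        List<Event> payments = List.of(
            new Event("o1", 120_000), new Event("o2", 600_000));
        System.out.println(
            innerJoin(orders, payments, Duration.ofMinutes(5)).size()); // prints 1
    }
}
```

With a 10-minute window the second pair would also match, which is exactly the trade-off described above: wider windows catch more late matches at the cost of more retained state.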
Pattern 2: Windowing - Time-Bounded Analysis
The Use Case
A shipment tracking system needs to count how many packages are being processed per minute. But "per minute" can mean different things: strict minute boundaries (tumbling windows), overlapping intervals (hopping windows), or activity-based grouping (session windows).
Tumbling vs. Hopping Windows
The kafka-stream-window project implements both patterns with configurable parameters:
# Tumbling Window (non-overlapping)
# Window closes every 60 seconds, counts reset
--spring.cloud.stream.kafka.streams.timeWindow.length=60000
# Hopping Window (overlapping)
# Window length: 60s, advances every 40s
# Creates overlapping windows for smoother aggregations
--spring.cloud.stream.kafka.streams.timeWindow.length=60000
--spring.cloud.stream.kafka.streams.timeWindow.advanceBy=40000
Tumbling Window
|--W1--|--W2--|--W3--|
Fixed, non-overlapping intervals
Hopping Window
|----W1----|
    |----W2----|
        |----W3----|
Overlapping for smoother trends
@Bean
public Function<KStream<String, Shipment>, KStream<Windowed<String>, Long>>
        trackShipments() {
    return shipments -> shipments
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofMinutes(1))
            .advanceBy(Duration.ofSeconds(40)))
        .count(Materialized.as("shipment-counts"))
        .toStream()
        .peek((window, count) ->
            log.info("Window {} - {} shipments",
                window.window().startTime(), count));
}
Hopping windows are particularly useful for dashboards where you want smooth trend lines rather than jagged minute-by-minute counts. The overlap means each event contributes to multiple windows, providing a moving average effect.
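The window-assignment rule behind this overlap can be sketched in plain Java (class and method names are hypothetical): with the 60s/40s configuration above, windows start at every multiple of the advance, so a single event typically falls into two of them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Kafka-free sketch of hopping-window assignment:
// which windows of [start, start + size) contain a given timestamp?
public class HoppingWindowSketch {
    // Returns the start timestamps (ms) of every window containing ts,
    // where windows start at multiples of advanceMs.
    static List<Long> windowStartsFor(long ts, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest candidate: the aligned start at or below ts - size, clamped to 0.
        long start = Math.max(0L, ((ts - sizeMs) / advanceMs) * advanceMs);
        for (; start <= ts; start += advanceMs) {
            if (ts < start + sizeMs) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // An event at t=50s with 60s windows advancing every 40s lands in
        // the windows starting at 0s and 40s.
        System.out.println(windowStartsFor(50_000, 60_000, 40_000)); // [0, 40000]
    }
}
```

Because each event contributes to every window it falls into, the per-window counts overlap, which is what produces the smoother trend lines described above.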
Pattern 3: Aggregations - Computing Running Statistics
The Use Case
Real-time analytics often requires computing aggregates: total revenue, average order value, count of events by category. Unlike batch processing, stream aggregations must update incrementally as each event arrives.
Stateful Stream Processing
The kafka-stream-aggregation project demonstrates how Kafka Streams maintains local state stores (backed by RocksDB) to enable fault-tolerant aggregations:
@Bean
public Function<KStream<String, Transaction>, KTable<String, TransactionStats>>
        aggregateTransactions() {
    return transactions -> transactions
        .groupBy((key, txn) -> txn.getCategory())
        .aggregate(
            TransactionStats::new,        // Initializer
            (key, txn, stats) -> {        // Aggregator
                stats.incrementCount();
                stats.addAmount(txn.getAmount());
                stats.updateAverage();
                return stats;
            },
            Materialized.<String, TransactionStats, KeyValueStore<Bytes, byte[]>>
                    as("transaction-stats-store")
                .withKeySerde(Serdes.String())
                .withValueSerde(transactionStatsSerde));
}
State Store Architecture
| Component | Purpose | Fault Tolerance |
|---|---|---|
| RocksDB | Local state storage | Persistent across restarts |
| Changelog Topic | State replication | Enables state recovery |
| Standby Replicas | Warm standby instances | Fast failover |
The beauty of this approach is that aggregations are automatically distributed across partitions. With 10 partitions and 5 application instances, each instance handles two partitions' worth of aggregation state, enabling horizontal scaling.
Pattern 4: Serverless Routing - Dynamic Function Dispatch
The Use Case
Modern architectures often separate event ingestion from processing logic. Teams want to deploy business functions independently without modifying the Kafka consumer infrastructure. This is the "bring your function to the cluster" paradigm.
The Router Architecture
The kafka-streams-router project bridges Kafka topics with OpenFaaS serverless functions. Functions register their interest in topics via annotations, and the router handles message delivery:
functions:
  order-processor:
    lang: java11
    handler: ./order-processor
    image: registry/order-processor:latest
    annotations:
      topic: orders-topic
  payment-handler:
    lang: java11
    handler: ./payment-handler
    image: registry/payment-handler:latest
    annotations:
      topic: payments-topic
Kafka Topic
Events arrive on topic
Stream Router
Reads @topic annotations
OpenFaaS Gateway
Invokes matching functions
Business Logic
Function processes event
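At its core, the dispatch step reduces to a lookup table built from the topic annotations above. The plain-Java sketch below (names hypothetical, hard-coded here for illustration; the real router reads them from the function metadata) shows that shape:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the router's dispatch table: topic names
// (taken from the functions' annotations) map to subscribed functions.
public class RouterSketch {
    static final Map<String, List<String>> SUBSCRIPTIONS = Map.of(
        "orders-topic", List.of("order-processor"),
        "payments-topic", List.of("payment-handler"));

    // Returns the functions the gateway should invoke for an event's topic.
    static List<String> route(String topic) {
        return SUBSCRIPTIONS.getOrDefault(topic, List.of());
    }

    public static void main(String[] args) {
        System.out.println(route("orders-topic")); // prints [order-processor]
    }
}
```

Because the table is derived from annotations rather than code, a team can subscribe a new function to a topic by redeploying only that function, never the router.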
Zero Kubernetes YAML
The project uses jkube for Kubernetes deployment, eliminating manual YAML configuration:
# Build, push, and deploy in one command
mvn k8s:build k8s:push k8s:resource k8s:apply -Pkubernetes
# Test the router
curl "http://localhost:8081/router?message=test-event"
# Response shows routing result
{
  "message": "test-event",
  "topic": "orders-topic",
  "functions": ["order-processor"]
}
Technology Foundation
| Layer | Technology | Purpose |
|---|---|---|
| Framework | Spring Cloud Stream | Messaging abstraction with binder architecture |
| Stream Processing | Kafka Streams | Stateful transformations with exactly-once semantics |
| Message Broker | Apache Kafka | Distributed event log with partitioning |
| Serverless | OpenFaaS | Function deployment and scaling |
| Orchestration | Kubernetes (k3d) | Container orchestration and service discovery |
| Kafka on K8s | Strimzi | Kubernetes operator for Kafka clusters |
| Build & Deploy | jkube | Zero-YAML Kubernetes deployment from Maven |
Why Spring Cloud Stream?
Spring Cloud Stream provides a programming model that decouples business logic from messaging infrastructure. The same code can run against Kafka, RabbitMQ, or other binders by changing configuration - no code changes required.
# Switch from Kafka to RabbitMQ by changing the binder
spring:
  cloud:
    stream:
      bindings:
        processInput-in-0:
          destination: input-topic
          binder: kafka   # Change to 'rabbit' for RabbitMQ
        processOutput-out-0:
          destination: output-topic
          binder: kafka
Getting Started
All four projects follow a consistent setup pattern:
# 1. Start Kafka infrastructure
docker-compose up -d
# 2. Build the application
./mvnw clean package
# 3. Run with Spring Boot
java -jar target/kafka-stream-*.jar
# 4. For windowing, configure window parameters
java -jar target/kafka-stream-window-*.jar \
--spring.cloud.stream.kafka.streams.timeWindow.length=60000 \
--spring.cloud.stream.kafka.streams.timeWindow.advanceBy=40000
# 5. Verify output with Kafka console consumer
kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic output-topic \
  --property print.key=true \
  --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
  --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
Pattern Selection Guide
Use Joins When
You need to correlate events from different sources - orders with payments, users with their actions, sensors with their metadata. Critical for building complete entity views from partial events.
Use Windows When
Business questions involve time boundaries - "events per minute," "rolling averages," "peak hours." Essential for dashboards, alerting thresholds, and trend analysis.
Use Aggregations When
You need running totals, counts, or statistics that update in real-time. Powers analytics dashboards, fraud detection counters, and inventory tracking.
Use Routing When
Multiple teams need to process the same event stream with different logic, or when you want to deploy processing functions independently of the ingestion layer.
Production Considerations
Key Production Settings
spring:
  cloud:
    stream:
      kafka:
        streams:
          binder:
            configuration:
              # Exactly-once semantics
              processing.guarantee: exactly_once_v2
              # Commit interval for state stores
              commit.interval.ms: 100
              # Number of standby replicas for HA
              num.standby.replicas: 1
          bindings:
            process-in-0:
              consumer:
                # Consumer group for load balancing
                application-id: my-stream-app
Explore the Code
Each project is available on GitHub with complete source code and Docker Compose configurations for local development.