Streaming Data Book Cover

Streaming Data: Understanding the real-time pipeline by Andrew G. Psaltis is a book that discusses different components in a data-streaming service. This post is a reading note for the book.

Chapter 1: Introducing streaming data

  • Hard-, soft-, near-real-time systems are classified by tolerance for delay.
  • A streaming data system is a non-hard real-time system that makes its data available at the moment a client application needs it. Or think about it as an in the moment system that delivers the data at the point in time when it is needed.
  • Architecturally, a streaming data system has the following components:
    • Collection tier
    • Message queuing tier
    • Analysis tier
    • Long-term storage tier
    • In-memory data store tier
    • Data access
  • Vertical scale: increase capacity of existing server; horizontal scale: add servers

Chapter 2: Getting data from clients: data ingestion

  • Paradigms of getting data
    • Request/response pattern. Can be synchronous. The client or server can be implemented asynchronously to make it “half-async”, or both the client and the server can be implemented asynchronously to make it “full-async”.
    • Request/acknowledge pattern. Instead of returning a response to the client, the server returns an acknowledgment with a unique identifier that can be used later by the client to determine status of request or get results.
    • Publish/subscribe pattern. Using topics and subscriptions to segregate publishers and consumers.
    • One-way pattern. “Fire and forget.” Useful for example for metrics data gathering, or for high frequency but loss-tolerant data, etc.
    • Streaming pattern. The client becomes the “stream source”, the server sends the “client” a request so as to start/stop consuming streams of data from the “client.
  • Example: http://stream.meetup.com/2/rsvps
  • Scaling the interaction patterns
    • Request/response optional pattern: Use horizontal scaling & make the service stateless.
    • Stream pattern: Vertically scale up the collection node. Use a buffer in the collection layer. Horizontally scale to process nodes in the buffer.
  • Fault tolerance
    • Check-pointing & logging.
    • Receiver-based message logging (RBML): once a message reaches the collection layer, log it first.
    • Sender-based message logging (SBML): log after messages are processed by the collection layer and before sending out to downstream processing (in message-queuing layer).
    • Hybrid message logging: RBML + SBML.

Chapter 3: Transporting the data from collection tier: decoupling the data pipeline

  • For the decoupling of different components in the system.
  • Producer -> Broker -> Consumer.
  • Broker is backed by a message queue.
  • Issues for consideration
    • Durability of messages in the broker
    • Message delivery semantics: at most once, at least once, exactly once, etc.
  • Points of failures include: producer, connection to broker, broker itself, message queue, connection to consumer, and consumer.
  • What if parts of the layer (i.e., brokers) crash?
  • Examples: fraud detection, internet of things, recommendations.

Chapter 4: Analyzing streaming data

  • Data in-flight: from input source to the output to a client in next layer.
  • Data at-rest: persisted data.
  • In non-streaming systems, data are stored in DB, and being queried at intervals.
  • In streaming systems, queries are evaluated continuously as new data arrive; the result of the query is pushed to the client.
  • A generalized stream-processing architecture: a stream manager registers stream processors, where each stream processor is connected to multiple input/output streams. Examples (each is a slight variant): Apache Spark, Apache Storm, Apache Flink, Apache Samza.
  • State management: from as simple as in-memory to complex situations such as persistent storage. Also dependent on use case; e.g.,g “ad impression” stream comes before “ad clicks” stream, so should be saved in state for later queries.

Chapter 5: Algorithms for data analysis

  • Constraints
    • one-pass: only see data once
    • concept-drift: stats of stream data drift over time
    • resource constraints: can’t process all data
    • domain constraints: e.g., 100MM users, squared (relationship between any two users)
  • Time
    • stream-time: time at which an event enters the streaming system
    • event-time: when the event occurs
    • time skew: variance of above two timestamps
    • window: a certain amount of data that computations can be performed
      • trigger policy: when a window of data needs to trigger computation
      • eviction policy: when a data point should be evicted from the window
      • sliding window: one with trigger policy and eviction policy based on time, decided by window length (determines eviction) and sliding interval (determines trigger)
      • tumbling window: window full (eviction) and count-based (triggers whenever X element in window) or temporal-based (triggers when X seconds elapse) trigger policy
  • Summarization techniques
    • Random sampling: keep a reservoir of \(k\) elements, when the \((k+1)^\text{th}\) element comes, draw a random number between 0 and 1 and compare it with \(\frac{k}{k+1}\), if the random number is greater, then keep the new element in the reservoir and evict a random one, otherwise discard the element.
    • Counting distinct elements: HyperLogLog algorithm.
    • Frequency of elements: CountMin Sketch algorithm.
    • Membership: Bloom filter.

Chapter 6: Storing the analyzed or collected data

  • When to write data
    • As data come
    • Held in analysis tier before written in batches.
    • Held in queuing layer and async written in batches.
  • Direct writing: write data as they come / are done analyzing
  • Indirect writing: held up and written to storage in separate processes
  • Keeping data in-memory
    • In-memory of stream processors
    • Caching
      • Read-through: read from a persistent storage at a cache miss
      • Refresh-ahead: cache refreshes recently accessed data before it’s expired and evicted
      • Write-through: cache directly writes updated data to backing store
      • Write-around: cache writes to persistent store; other process responsible for updating other copies of caches
      • Write-back: cache acknowledges an update, and update the persistent store in the background
    • In-memory databases: much faster than traditional databases (memory access vs OS file access / physical medium access)
  • Use cases: in-session personalization; energy company analytics

Chapter 7: Making the data available

  • Communication patterns (see Chapter 2)
    • Data sync: initial sync to have existing data; follow-up sync changes delta
    • RMI & RPC
    • Messaging
    • Publish-Subscribe
  • Protocols to use to send data to the client
    • Web-hooks: streaming data API registers a callback for the streaming client, and monitors the data stream; if there is a change, the API sends the data back by calling the client’s callback.
    • HTTP Long Polling: client maintains an open connection to the streaming API server; the server sends data back to client as they are available.
    • Server-sent events (SSE): similar to long polling, but client may decide to go to power saving mode and delegates events streaming to a push proxy
    • Web Sockets: open connection maintained between clients and server, wrapped around by TCP handshake massages, and messages in between the open/close handshakes are bi-directional.
Protocol Message Frequency Communication Direction Message Latency Efficiency Fault Tolerance / Reliability
Web-hooks Low Uni-directional (server to client) Average Low None
HTTP Long Polling Average Bi-directional Average Average None
Server-sent Events High Uni-directional Low High None by default
Web Sockets High Bi-directional Low High None by default
  • Filtering the stream
    • Static filtering: how to filter is per-determined. Like a view in a relation database
    • Dynamic filtering: filtering is decided at run-time and the streaming client can drive it.
  • Case study: the Meet-Up API

Chapter 8: Consumer device capabilities and limitations accessing the data

  • Three types of applications
    • Dashboard/application
    • Integrate with 3rd party system
    • Process stream
  • Design considerations
    • Reading data fast enough
    • Maintaining states
    • Mitigating data loss
    • Exactly-once processing
  • Case study: SuperMediaMarkets
  • Query language support at the streaming API: needs to build a proxy streaming API that handles the queries

Chapter 9: Analyzing Meet-up RSVPs in real time

This chapter builds the Meet-up streaming pipeline. It would be nice if I could have the time to actually build it as a small project. Skipping it for now.