Reading Notes for Streaming Data

Streaming Data: Understanding the real-time pipeline by Andrew G. Psaltis is a book that discusses different components in a data-streaming service. This post is a reading note for the book.
Chapter 1: Introducing streaming data
- Hard, soft, and near real-time systems are classified by their tolerance for delay.
- A streaming data system is a non-hard real-time system that makes its data available at the moment a client application needs it. Think of it as an in-the-moment system that delivers data at the point in time when it is needed.
- Architecturally, a streaming data system has the following components:
- Collection tier
- Message queuing tier
- Analysis tier
- Long-term storage tier
- In-memory data store tier
- Data access
- Vertical scaling: increase the capacity of an existing server; horizontal scaling: add servers.
Chapter 2: Getting data from clients: data ingestion
- Paradigms of getting data
- Request/response pattern. Can be synchronous. The client or server can be implemented asynchronously to make it “half-async”, or both the client and the server can be implemented asynchronously to make it “full-async”.
- Request/acknowledge pattern. Instead of returning a response to the client, the server returns an acknowledgment with a unique identifier that can be used later by the client to determine status of request or get results.
- Publish/subscribe pattern. Using topics and subscriptions to segregate publishers and consumers.
- One-way pattern. “Fire and forget.” Useful for example for metrics data gathering, or for high frequency but loss-tolerant data, etc.
- Streaming pattern. The client becomes the “stream source”: the server sends the “client” a request to start or stop consuming a stream of data from it.
- Example: http://stream.meetup.com/2/rsvps
- Scaling the interaction patterns
- Request/response optional pattern: Use horizontal scaling & make the service stateless.
- Stream pattern: Vertically scale up the collection node. Use a buffer in the collection layer. Horizontally scale the nodes that process data from the buffer.
- Fault tolerance
- Check-pointing & logging.
- Receiver-based message logging (RBML): once a message reaches the collection layer, log it first.
- Sender-based message logging (SBML): log after messages are processed by the collection layer and before sending out to downstream processing (in message-queuing layer).
- Hybrid message logging: RBML + SBML.
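
To make receiver-based message logging concrete, here is a minimal sketch, assuming a single collection node that appends every incoming message to a local log file before doing anything else. The `CollectionNode` class, the JSON-lines log format, and the `demo` helper are my own illustrations, not code from the book.

```python
import json
import os
import tempfile


class CollectionNode:
    """Hypothetical collection-tier node using receiver-based message logging."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.processed = []  # in-memory state, lost on a crash

    def receive(self, message):
        # RBML: persist the raw message first, so it survives a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Only then hand the message to the (here trivial) processing step.
        self.processed.append(message)

    def recover(self):
        # After a restart, replay every message found in the log.
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rbml.log")
        node = CollectionNode(path)
        node.receive({"id": 1, "payload": "rsvp"})
        node.receive({"id": 2, "payload": "rsvp"})
        # Simulate a crash: a fresh node has no in-memory state, only the log.
        restarted = CollectionNode(path)
        return restarted.recover()
```

Sender-based logging would apply the same idea on the way out: write the processed message to a log just before sending it to the message-queuing tier.
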
Chapter 3: Transporting the data from collection tier: decoupling the data pipeline
- For the decoupling of different components in the system.
- Producer -> Broker -> Consumer.
- Broker is backed by a message queue.
- Issues for consideration
- Durability of messages in the broker
- Message delivery semantics: at most once, at least once, exactly once, etc.
- Points of failures include: producer, connection to broker, broker itself, message queue, connection to consumer, and consumer.
- What if parts of the layer (i.e., brokers) crash?
- Examples: fraud detection, internet of things, recommendations.
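
The delivery semantics are easier to see with a toy example. The sketch below assumes an in-memory broker that keeps a message until it is acknowledged (at-least-once) and a consumer that remembers message ids so duplicates have no effect. `Broker`, `IdempotentConsumer`, and `demo` are hypothetical names for illustration only; a real broker would also persist messages.

```python
class Broker:
    """Hypothetical in-memory broker that redelivers until a message is acked."""

    def __init__(self):
        self.pending = {}  # message id -> payload, kept until acknowledged
        self.next_id = 0

    def publish(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload
        return msg_id

    def deliver(self):
        # At-least-once: every unacknowledged message is handed out again.
        return list(self.pending.items())

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)


class IdempotentConsumer:
    """Consumer that ignores message ids it has already processed."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return  # duplicate delivery, already counted
        self.seen.add(msg_id)
        self.total += payload


def demo():
    broker = Broker()
    consumer = IdempotentConsumer()
    for amount in (10, 20, 30):
        broker.publish(amount)
    # First pass: message 1 is processed, but its ack is "lost".
    for msg_id, payload in broker.deliver():
        consumer.handle(msg_id, payload)
        if msg_id != 1:
            broker.ack(msg_id)
    # Second pass: the broker redelivers message 1; the consumer skips it.
    redelivered = broker.deliver()
    for msg_id, payload in redelivered:
        consumer.handle(msg_id, payload)
        broker.ack(msg_id)
    return consumer.total, len(redelivered), len(broker.pending)
```

Without the `seen` set the redelivered message would be counted twice, which is the usual price of at-least-once delivery.
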
Chapter 4: Analyzing streaming data
- Data in-flight: from input source to the output to a client in next layer.
- Data at-rest: persisted data.
- In non-streaming systems, data are stored in a database and queried at intervals.
- In streaming systems, queries are evaluated continuously as new data arrive; the result of the query is pushed to the client.
- A generalized stream-processing architecture: a stream manager registers stream processors, where each stream processor is connected to multiple input/output streams. Examples (each is a slight variant): Apache Spark, Apache Storm, Apache Flink, Apache Samza.
- State management: ranges from simple in-memory state to more complex setups such as persistent storage, depending on the use case; e.g., the “ad impression” stream arrives before the “ad clicks” stream, so impressions should be saved in state for later queries.
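
The ad impression / ad click example can be sketched as a tiny stateful processor. This is only an illustration, assuming in-memory state keyed by an impression id; the `ClickAttributor` class and its field names are made up for the sketch.

```python
class ClickAttributor:
    """Hypothetical stream processor that keeps ad impressions in memory."""

    def __init__(self):
        self.impressions = {}  # impression id -> ad id

    def on_impression(self, impression_id, ad_id):
        # Save the impression so a later click can be matched against it.
        self.impressions[impression_id] = ad_id

    def on_click(self, impression_id):
        # Look up the earlier impression; None means no matching state.
        ad_id = self.impressions.get(impression_id)
        if ad_id is None:
            return None
        return {"impression_id": impression_id, "ad_id": ad_id, "clicked": True}
```

In a real pipeline this state would need an eviction rule and probably a persistent backing store, otherwise it grows without bound and is lost on restart.
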
Chapter 5: Algorithms for data analysis
- Constraints
- one-pass: only see data once
- concept-drift: stats of stream data drift over time
- resource constraints: can’t process all data
- domain constraints: e.g., 100MM users, squared (relationship between any two users)
- Time
- stream-time: time at which an event enters the streaming system
- event-time: when the event occurs
- time skew: the difference between the above two timestamps
- window: a certain amount of data on which computations can be performed
- trigger policy: when a window of data needs to trigger computation
- eviction policy: when a data point should be evicted from the window
- sliding window: one with trigger policy and eviction policy based on time, decided by window length (determines eviction) and sliding interval (determines trigger)
- tumbling window: the eviction policy is based on the window being full; the trigger policy is either count-based (triggers whenever X elements are in the window) or temporal (triggers when X seconds elapse)
- Summarization techniques
- Random sampling: keep a reservoir of \(k\) elements. When the \(n^\text{th}\) element arrives (\(n > k\)), draw a random number uniformly between 0 and 1 and compare it with \(\frac{k}{n}\). If the random number is smaller, keep the new element in the reservoir and evict a random one; otherwise discard the new element.
- Counting distinct elements: HyperLogLog algorithm.
- Frequency of elements: CountMin Sketch algorithm.
- Membership: Bloom filter.
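
The random sampling item above is reservoir sampling. A minimal sketch of the classic single-pass version follows, assuming the stream is any Python iterable; the function name and the optional `rng` parameter are my own choices.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        elif rng.random() < k / n:
            # Keep the n-th item with probability k/n,
            # evicting a uniformly random resident.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 10)` returns 10 values without ever holding more than 10 items in memory, which is what the one-pass constraint asks for.
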
Chapter 6: Storing the analyzed or collected data
- When to write data
- As data come
- Held in analysis tier before written in batches.
- Held in queuing layer and async written in batches.
- Direct writing: write data as they come / are done analyzing
- Indirect writing: held up and written to storage in separate processes
- Keeping data in-memory
- In-memory of stream processors
- Caching
- Read-through: read from a persistent storage at a cache miss
- Refresh-ahead: cache refreshes recently accessed data before it’s expired and evicted
- Write-through: cache directly writes updated data to backing store
- Write-around: cache writes to persistent store; other process responsible for updating other copies of caches
- Write-back: cache acknowledges an update, and updates the persistent store in the background
- In-memory databases: much faster than traditional databases (memory access vs OS file access / physical medium access)
- Use cases: in-session personalization; energy company analytics
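
Two of the caching strategies above, read-through and write-through, fit in a few lines. The sketch assumes a plain dict standing in for the persistent store; `ReadWriteThroughCache` is a hypothetical class, and it does no expiry or eviction.

```python
class ReadWriteThroughCache:
    """Hypothetical cache combining read-through and write-through."""

    def __init__(self, backing_store):
        self.store = backing_store  # a dict standing in for persistent storage
        self.cache = {}
        self.misses = 0

    def get(self, key):
        if key not in self.cache:
            # Read-through: on a miss, load from the backing store.
            # A key missing from the store raises KeyError.
            self.misses += 1
            self.cache[key] = self.store[key]
        return self.cache[key]

    def put(self, key, value):
        # Write-through: update the backing store synchronously,
        # then the cached copy.
        self.store[key] = value
        self.cache[key] = value
```

A write-back variant would acknowledge `put` after updating only `self.cache` and flush to `self.store` later in the background, trading durability for write latency.
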
Chapter 7: Making the data available
- Communication patterns (see Chapter 2)
- Data sync: an initial sync to fetch existing data; follow-up syncs transfer only the delta of changes
- RMI & RPC
- Messaging
- Publish-Subscribe
- Protocols to use to send data to the client
- Web-hooks: streaming data API registers a callback for the streaming client, and monitors the data stream; if there is a change, the API sends the data back by calling the client’s callback.
- HTTP Long Polling: client maintains an open connection to the streaming API server; the server sends data back to client as they are available.
- Server-sent events (SSE): similar to long polling, but the client may decide to go into power-saving mode and delegate event streaming to a push proxy.
- Web Sockets: an open TCP connection is maintained between client and server, bracketed by opening and closing handshake messages; messages exchanged between the two handshakes are bi-directional.
| Protocol | Message Frequency | Communication Direction | Message Latency | Efficiency | Fault Tolerance / Reliability |
|---|---|---|---|---|---|
| Web-hooks | Low | Uni-directional (server to client) | Average | Low | None |
| HTTP Long Polling | Average | Bi-directional | Average | Average | None |
| Server-sent Events | High | Uni-directional | Low | High | None by default |
| Web Sockets | High | Bi-directional | Low | High | None by default |
- Filtering the stream
- Static filtering: how to filter is predetermined, like a view in a relational database.
- Dynamic filtering: filtering is decided at run-time and the streaming client can drive it.
- Case study: the Meet-Up API
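
The difference between static and dynamic filtering can be sketched with two small generator functions. This assumes events are plain dicts; the `response` and `city` fields are hypothetical and are not taken from the actual Meet-Up API.

```python
def static_filter(events):
    """Predetermined filter: only pass RSVPs whose response is "yes"."""
    return (e for e in events if e.get("response") == "yes")


def dynamic_filter(events, criteria):
    """Filter chosen at run time: `criteria` maps field name -> required value,
    as supplied by the streaming client."""
    return (
        e for e in events
        if all(e.get(field) == value for field, value in criteria.items())
    )
```

A client could call `dynamic_filter(events, {"city": "Boston"})` today and pass different criteria tomorrow without any change on the server side, whereas changing `static_filter` means redeploying the API.
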
Chapter 8: Consumer device capabilities and limitations accessing the data
- Three types of applications
- Dashboard/application
- Integrate with 3rd party system
- Process stream
- Design considerations
- Reading data fast enough
- Maintaining states
- Mitigating data loss
- Exactly-once processing
- Case study: SuperMediaMarkets
- Query language support at the streaming API: needs to build a proxy streaming API that handles the queries
Chapter 9: Analyzing Meet-up RSVPs in real time
This chapter builds the Meet-up streaming pipeline. It would be nice if I could have the time to actually build it as a small project. Skipping it for now.