Chapter 1: Introducing streaming data

Hard, soft, and near real-time systems are classified by their tolerance for delay.
A streaming data system is a non-hard real-time system that makes its data available at the moment a client application needs it. Put another way, it is an in-the-moment system that delivers data at the point in time when it is needed.
Architecturally, a streaming data system has the following components:
- Collection tier
- Message queuing tier
- Analysis tier
- Long-term storage tier
- In-memory data store tier
- Data access
Vertical scale: increase capacity of existing server; horizontal scale: add servers

Chapter 2: Getting data from clients: data ingestion

Paradigms of getting data
- Request/response pattern. Can be synchronous. The client or server can be implemented asynchronously to make it “half-async”, or both the client and the server can be implemented asynchronously to make it “full-async”.
- Request/acknowledge pattern. Instead of returning a response to the client, the server returns an acknowledgment with a unique identifier that can be used later by the client to determine status of request or get results.
- Publish/subscribe pattern. Using topics and subscriptions to segregate publishers and consumers.
- One-way pattern. “Fire and forget.” Useful for example for metrics data gathering, or for high frequency but loss-tolerant data, etc.
- Streaming pattern. The client becomes the “stream source”: the server sends the “client” a request to start or stop consuming the stream of data from the “client”.
Example: http://stream.meetup.com/2/rsvps
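As one illustration of the paradigms above, the request/acknowledge pattern can be sketched with a toy in-memory server. This is only a sketch: the class `AckServer`, its method names, and the upper-casing "work" are hypothetical stand-ins, not part of any real API.

```python
import uuid


class AckServer:
    """Toy in-memory server for the request/acknowledge pattern (hypothetical)."""

    def __init__(self):
        self._requests = {}  # ack id -> payload still waiting to be processed
        self._results = {}   # ack id -> result of a finished request

    def submit(self, payload):
        # Return an acknowledgment id immediately instead of the result.
        ack_id = str(uuid.uuid4())
        self._requests[ack_id] = payload
        return ack_id

    def process_pending(self):
        # Stand-in for asynchronous work done after the acknowledgment was sent.
        for ack_id, payload in list(self._requests.items()):
            self._results[ack_id] = payload.upper()
            del self._requests[ack_id]

    def status(self, ack_id):
        if ack_id in self._results:
            return "done"
        if ack_id in self._requests:
            return "pending"
        return "unknown"

    def result(self, ack_id):
        # None until the request has been processed.
        return self._results.get(ack_id)
```

The client keeps only the acknowledgment id and uses it later to ask for the status or the result.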
Scaling the interaction patterns
- Request/response optional pattern: Use horizontal scaling & make the service stateless.
- Stream pattern: Vertically scale up the collection node. Use a buffer in the collection layer. Horizontally scale the nodes that process data from the buffer.
Fault tolerance
- Check-pointing & logging.
- Receiver-based message logging (RBML): once a message reaches the collection layer, log it first.
- Sender-based message logging (SBML): log after messages are processed by the collection layer and before sending out to downstream processing (in message-queuing layer).
- Hybrid message logging: RBML + SBML.
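A minimal sketch of receiver-based message logging, assuming a local append-only file as the log and JSON-serializable messages. The names `ReceiverLog` and `receive` are hypothetical; a real collection tier would likely use a more robust log.

```python
import json
import os


class ReceiverLog:
    """Append-only log written before a message is processed (RBML sketch)."""

    def __init__(self, path):
        self.path = path

    def append(self, message):
        # Persist first; flush and fsync so the entry survives a process crash.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # On recovery, re-read every logged message in arrival order.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def receive(log, message, process):
    log.append(message)  # 1. log the message as soon as it arrives
    process(message)     # 2. only then hand it to the collection-tier logic
```

If `process` fails, the message is still in the log and can be replayed after a restart. SBML would apply the same idea on the outgoing side, logging just before the message is sent downstream.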

Chapter 3: Transporting the data from collection tier: decoupling the data pipeline

The message queuing tier decouples the different components in the system.
Producer -> Broker -> Consumer.
Broker is backed by a message queue.
Issues for consideration
- Durability of messages in the broker
- Message delivery semantics: at most once, at least once, exactly once, etc.
Points of failure include: producer, connection to broker, broker itself, message queue, connection to consumer, and consumer.
What if parts of the layer (i.e., brokers) crash?
Examples: fraud detection, internet of things, recommendations.
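The delivery semantics listed above can be made concrete with a toy example. The sketch below assumes a broker that redelivers until it receives an acknowledgment (at-least-once) and a consumer that deduplicates by message id, so each message takes effect only once. `ToyBroker`, `IdempotentConsumer`, and the lost-ack simulation are hypothetical.

```python
class ToyBroker:
    """Toy broker that redelivers a message until it is acknowledged."""

    def __init__(self, messages):
        self._unacked = list(messages)  # (message_id, payload) pairs

    def deliver(self, consumer):
        # At-least-once: a message stays queued until the consumer acks it.
        while self._unacked:
            message_id, payload = self._unacked[0]
            if consumer(message_id, payload):
                self._unacked.pop(0)


class IdempotentConsumer:
    """Sums payloads, ignoring redeliveries of ids it has already processed."""

    def __init__(self, lose_first_ack_for=()):
        self.processed_ids = set()
        self.total = 0
        self.deliveries = 0
        self._lose_ack = set(lose_first_ack_for)

    def __call__(self, message_id, payload):
        self.deliveries += 1
        if message_id not in self.processed_ids:  # deduplicate redeliveries
            self.processed_ids.add(message_id)
            self.total += payload
        if message_id in self._lose_ack:
            self._lose_ack.discard(message_id)
            return False  # simulate an ack lost between consumer and broker
        return True


consumer = IdempotentConsumer(lose_first_ack_for={2})
ToyBroker([(1, 10), (2, 20), (3, 30)]).deliver(consumer)
# Message 2 is delivered twice, but it is counted only once in consumer.total.
```

Without the deduplication step, the redelivery of message 2 would be counted twice, which is the usual cost of at-least-once delivery.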

Chapter 4: Analyzing streaming data

Data in-flight: data moving from an input source toward output to a client in the next layer.
Data at-rest: persisted data.
In non-streaming systems, data are stored in a DB and queried at intervals.
In streaming systems, queries are evaluated continuously as new data arrive; the result of the query is pushed to the client.
A generalized stream-processing architecture: a stream manager registers stream processors, where each stream processor is connected to multiple input/output streams. Examples (each is a slight variant): Apache Spark, Apache Storm, Apache Flink, Apache Samza.
State management: ranges from simple in-memory state to complex setups backed by persistent storage. It also depends on the use case; e.g., the “ad impression” stream arrives before the “ad clicks” stream, so impressions should be saved in state for later queries.
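The ad example can be sketched with the simplest form of state, an in-memory dictionary. The class name, the event fields (`impression_id`, `ad`, `user`), and the output shape are assumptions made for illustration only.

```python
class ClickAttributor:
    """Keeps ad impressions in memory so later clicks can be matched to them."""

    def __init__(self):
        self._impressions = {}  # impression_id -> impression event

    def on_impression(self, event):
        # Save the impression as state; the matching click arrives later.
        self._impressions[event["impression_id"]] = event

    def on_click(self, event):
        # Look up the earlier impression; None means it was never seen or was lost.
        impression = self._impressions.get(event["impression_id"])
        if impression is None:
            return None
        return {"ad": impression["ad"], "user": event["user"]}
```

Because the state lives only in process memory, it disappears if the stream processor restarts, which is one reason more complex setups move state to persistent storage.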

Chapter 5: Algorithms for data analysis

Constraints
- one-pass: only see data once
- concept-drift: stats of stream data drift over time
- resource constraints: can’t process all data
- domain constraints: e.g., 100MM users, squared (relationship between any two users)
Time
- stream-time: time at which an event enters the streaming system
- event-time: when the event occurs
- time skew: difference between the above two timestamps
- window: a certain amount of data on which computations can be performed
  - trigger policy: when a window of data needs to trigger computation
  - eviction policy: when a data point should be evicted from the window
  - sliding window: one with trigger policy and eviction policy based on time, decided by window length (determines eviction) and sliding interval (determines trigger)
  - tumbling window: eviction policy is always “window full”; trigger policy is either count-based (triggers whenever X elements are in the window) or temporal-based (triggers when X seconds elapse)
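The two window types can be sketched as follows. This is a simplified illustration that assumes events arrive in timestamp order and that timestamps are plain numbers in seconds; the class names are hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Time-based sliding window: length drives eviction, interval drives triggers."""

    def __init__(self, length, interval):
        self.length = length      # window length in seconds (eviction policy)
        self.interval = interval  # sliding interval in seconds (trigger policy)
        self._items = deque()     # (timestamp, value) pairs, oldest first
        self._next_trigger = None

    def add(self, timestamp, value):
        """Add one event; return the window contents for every trigger it passes."""
        if self._next_trigger is None:
            self._next_trigger = timestamp + self.interval
        fired = []
        while timestamp >= self._next_trigger:
            cutoff = self._next_trigger - self.length
            # Evict everything that has fallen out of the window.
            while self._items and self._items[0][0] <= cutoff:
                self._items.popleft()
            fired.append([v for _, v in self._items])
            self._next_trigger += self.interval
        self._items.append((timestamp, value))
        return fired


class TumblingCountWindow:
    """Count-based tumbling window: triggers and empties when it holds `size` items."""

    def __init__(self, size):
        self.size = size
        self._items = []

    def add(self, value):
        self._items.append(value)
        if len(self._items) == self.size:
            full, self._items = self._items, []  # evict the whole window at once
            return full
        return None
```

In the sliding window an element can appear in several consecutive results when the interval is shorter than the length; in the tumbling window each element appears in exactly one result.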
Summarization techniques
- Random sampling (reservoir sampling): keep a reservoir of \(k\) elements. When the \(n^\text{th}\) element arrives (\(n > k\)), draw a random number between 0 and 1 and compare it with \(\frac{k}{n}\), which is \(\frac{k}{k+1}\) for the \((k+1)^\text{th}\) element. If the random number is smaller, keep the new element in the reservoir and evict a random one; otherwise discard the new element.
- Counting distinct elements: HyperLogLog algorithm.
- Frequency of elements: Count-Min Sketch algorithm.
- Membership: Bloom filter.
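Two of the techniques above are short enough to sketch directly. The reservoir sampler keeps the \(n^\text{th}\) element with probability \(\frac{k}{n}\). The Bloom filter is a deliberately simple version: the bit-array size, the number of hashes, and the use of SHA-256 with an index prefix to derive hash positions are illustrative assumptions, not tuned choices.

```python
import hashlib
import random


def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of up to k items from a one-pass stream."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)             # fill the reservoir first
        elif rng.random() < k / n:             # keep the n-th item with probability k/n
            reservoir[rng.randrange(k)] = item  # evict a random existing item
    return reservoir


class BloomFilter:
    """Set-membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self._bits[pos] for pos in self._positions(item))
```

Both structures use a fixed amount of memory regardless of how long the stream runs, which is what makes them fit the one-pass and resource constraints listed earlier.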

Chapter 6: Storing the analyzed or collected data

When to write data
- As data come
- Held in analysis tier before written in batches.
- Held in queuing layer and async written in batches.
Direct writing: write data as they come / are done analyzing
Indirect writing: held up and written to storage in separate processes
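A minimal sketch of the batched, indirect style of writing, assuming the storage layer is represented by any callable that persists a list of records. `BatchWriter` and its parameters are hypothetical names.

```python
class BatchWriter:
    """Buffers records and hands them to storage in batches (indirect writing)."""

    def __init__(self, sink, batch_size):
        self.sink = sink              # callable that persists a list of records
        self.batch_size = batch_size
        self._buffer = []

    def write(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered; a no-op when the buffer is empty.
        if self._buffer:
            self.sink(list(self._buffer))
            self._buffer.clear()
```

Direct writing corresponds to a batch size of 1. Larger batches reduce the number of storage calls, but any records still in the buffer are lost if the process crashes before the next flush.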
Keeping data in-memory
- In-memory of stream processors
- Caching
  - Read-through: read from a persistent storage at a cache miss
  - Refresh-ahead: cache refreshes recently accessed data before it’s expired and evicted
  - Write-through: cache directly writes updated data to backing store
  - Write-around: writes go to the persistent store; another process is responsible for updating the cached copies
  - Write-back: cache acknowledges an update and updates the persistent store in the background
- In-memory databases: much faster than traditional databases (memory access vs OS file access / physical medium access)
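Three of the caching strategies above (read-through, write-through, and write-back) can be sketched in one small class. A plain dict stands in for the persistent store, the background write of write-back is reduced to an explicit `flush()` call, and the class name is hypothetical.

```python
class Cache:
    """Toy cache in front of a dict that stands in for persistent storage."""

    def __init__(self, store, write_back=False):
        self.store = store
        self.write_back = write_back
        self._data = {}
        self._dirty = set()  # keys updated in the cache but not yet persisted

    def get(self, key):
        # Read-through: on a miss, load from the backing store and keep a copy.
        if key not in self._data:
            self._data[key] = self.store[key]
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        if self.write_back:
            self._dirty.add(key)     # write-back: acknowledge now, persist later
        else:
            self.store[key] = value  # write-through: persist before returning

    def flush(self):
        # Stand-in for the background write of the write-back strategy.
        for key in self._dirty:
            self.store[key] = self._data[key]
        self._dirty.clear()
```

Write-back gives the fastest acknowledgment but risks losing updates that have not been flushed yet; write-through trades that speed for durability.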
Use cases: in-session personalization; energy company analytics

Chapter 7: Making the data available

Communication patterns (see Chapter 2)
- Data sync: an initial sync transfers the existing data; follow-up syncs transfer only the changes (deltas)
- RMI & RPC
- Messaging
- Publish-Subscribe
Protocols to use to send data to the client
- Web-hooks: streaming data API registers a callback for the streaming client, and monitors the data stream; if there is a change, the API sends the data back by calling the client’s callback.
- HTTP Long Polling: client maintains an open connection to the streaming API server; the server sends data back to client as they are available.
- Server-sent events (SSE): similar to long polling, but the client may decide to go into power-saving mode and delegate event streaming to a push proxy.
- Web Sockets: an open connection is maintained between client and server over TCP, bracketed by opening and closing handshake messages; messages sent between the open and close handshakes are bi-directional.
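To make one of these protocols concrete, here is a simplified parser for the server-sent events wire format (`text/event-stream`), in which each event is a group of `field: value` lines ended by a blank line. The function name is hypothetical, and the sketch handles only the `data` and `event` fields, ignoring `id` and `retry`.

```python
import re


def parse_sse(text):
    """Parse a text/event-stream payload into a list of (event, data) tuples."""
    events = []
    event_type, data_lines = "message", []
    for line in re.split(r"\r\n|\n|\r", text):
        if line == "":
            # A blank line dispatches the event collected so far, if it has data.
            if data_lines:
                events.append((event_type, "\n".join(data_lines)))
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "data":
                data_lines.append(value)
            elif field == "event":
                event_type = value
    return events
```

An event that has not yet been terminated by a blank line is not returned, which matches the way a client waits for the rest of the event to arrive.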

Protocol	Message Frequency	Communication Direction	Message Latency	Efficiency	Fault Tolerance / Reliability
Web-hooks	Low	Uni-directional (server to client)	Average	Low	None
HTTP Long Polling	Average	Bi-directional	Average	Average	None
Server-sent Events	High	Uni-directional	Low	High	None by default
Web Sockets	High	Bi-directional	Low	High	None by default

Filtering the stream
- Static filtering: how to filter is predetermined, like a view in a relational database.
- Dynamic filtering: filtering is decided at run-time and the streaming client can drive it.
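The difference between the two kinds of filtering can be sketched as follows. The event fields (`country`, `topic`) and the names `static_filter` and `DynamicFilter` are made up for illustration.

```python
def static_filter(events):
    # The predicate is fixed when the API is built, like a database view.
    return (e for e in events if e["country"] == "us")


class DynamicFilter:
    """Filter whose predicate the streaming client can replace at run-time."""

    def __init__(self):
        self._predicate = lambda event: True  # pass everything until told otherwise

    def update(self, predicate):
        self._predicate = predicate

    def apply(self, events):
        for event in events:
            # The current predicate is looked up per event, so updates take
            # effect even in the middle of a stream.
            if self._predicate(event):
                yield event
```

With dynamic filtering the client could, for example, call `update(lambda e: e["topic"] == "python")` while the stream is already being consumed.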
Case study: the Meetup API

Chapter 8: Consumer device capabilities and limitations accessing the data

Three types of applications
- Dashboard/application
- Integrate with 3rd party system
- Process stream
Design considerations
- Reading data fast enough
- Maintaining states
- Mitigating data loss
- Exactly-once processing
Case study: SuperMediaMarkets
Query language support at the streaming API: requires building a proxy streaming API that handles the queries.

Chapter 9: Analyzing Meetup RSVPs in real time

This chapter builds the Meetup streaming pipeline. It would be nice if I could have the time to actually build it as a small project. Skipping it for now.