Event Streams in Action: Real-time event systems with Kafka and Kinesis
Book Details
Full Title: Event Streams in Action: Real-time event systems with Kafka and Kinesis
Authors: Alexander Dean; Valentin Crettaz
ISBN/URL: 9781617292347
Reading Period: 2020.11–2020.11.22
Source: Searching on Amazon for Kafka-related books
General Review
- This book provides a good mix of conceptual discussion and reference implementations, pitched at a fairly high level.
- For example, it provides a conceptual framework for modelling events as subject-verb-object (e.g., customer adds item to cart), but it does not cover alternative ways of modelling events or their relative strengths. A sketch of such an event is below.
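  For illustration, a minimal sketch of an event modelled as subject-verb-object; the field names here are my own assumptions, not the book's exact event grammar:

  ```python
  # Hypothetical "customer adds item to cart" event in subject-verb-object form.
  add_item_to_cart = {
      "subject": {"customer_id": "c-123"},                      # who performed the action
      "verb": "add",                                            # what they did
      "direct_object": {"item_id": "sku-456", "quantity": 1},   # what it was done to
      "indirect_object": {"cart_id": "cart-789"},               # the target of the action
      "occurred_at": "2020-11-22T10:15:00Z",                    # when it happened
  }
  ```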
- Another example (regarding the lack of technical depth) is that some code samples use soft links to reference the Avro schema files locally. A better approach might have been a function that mocks this out, stating clearly that in practice the schema would be obtained from a schema registry; a sketch follows.
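  A minimal sketch of what such a mock might look like; the fetch_avro_schema helper, subject name, and registry URL layout are assumptions for illustration, not the book's code:

  ```python
  import json

  def fetch_avro_schema(subject: str, version: str = "latest") -> dict:
      """Mocked schema lookup: returns a hard-coded Avro schema.

      In a real deployment this would issue an HTTP GET against a schema
      registry (e.g. /subjects/<subject>/versions/<version>) instead of
      relying on soft-linked local schema files.
      """
      return json.loads("""
      {
        "type": "record",
        "name": "CheckoutEvent",
        "fields": [
          {"name": "customer_id", "type": "string"},
          {"name": "order_total", "type": "double"}
        ]
      }
      """)
  ```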
Specific Takeaways
Chapter 10 Analytics-on-read
- Generally, the various tools and workflows for performing analytics on a unified log fall under one of two types: analytics-on-read and analytics-on-write.
Analytics-on-read
- Generally follows a two-step process (a minimal sketch follows this list):
  - Write all the events to an event store.
  - Read the events to perform an analysis.
- The implementation has three key parts:
  - A storage target to which the events will be written.
  - A schema, encoding, or format in which the events should be written.
  - A query engine or data processing framework to analyze the events by reading from the storage target.
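A minimal, self-contained sketch of the two-step flow, assuming a local JSON-lines file as a stand-in for the storage target and a hand-rolled scan as a stand-in for the query engine:

```python
import json
from collections import Counter
from pathlib import Path

EVENT_STORE = Path("events.jsonl")  # stand-in for S3/HDFS in this sketch

# Step 1: write every event to the event store, one JSON document per line.
def append_event(event: dict) -> None:
    with EVENT_STORE.open("a") as f:
        f.write(json.dumps(event) + "\n")

# Step 2: read the events back later to answer an ad hoc question
# (here: how many events of each verb have been seen).
def count_events_by_verb() -> Counter:
    counts: Counter = Counter()
    with EVENT_STORE.open() as f:
        for line in f:
            counts[json.loads(line)["verb"]] += 1
    return counts
```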
Analytics-on-write
- An analytics-on-write approach is able to provide the following characteristics:
  - Very low latency
  - Support for thousands of simultaneous users
  - High availability
- Some example use cases involve the following:
  - Providing "live" data to dashboards of KPIs, with latency measured in minutes rather than days.
  - "Live" status tracking / reporting.
- Generally follows a four-step process (a minimal sketch follows this list):
  - Read the events from the event stream.
  - Analyze the events by using a stream processing framework.
  - Write the summarized output of the analysis to a storage target.
  - Serve the summarized output into real-time dashboards or reports.
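A minimal sketch of the four-step flow, with an in-memory dict standing in for the key-value storage target and a plain function standing in for the stream processing framework:

```python
from collections import defaultdict

# Stand-in for a key-value storage target such as DynamoDB.
summary_store: dict = defaultdict(int)

# Steps 1-3: as each event is read off the stream, update the running
# aggregate and write the summarized value straight to the storage target.
def process_event(event: dict) -> None:
    key = f"orders-placed:{event['customer_id']}"
    summary_store[key] += 1  # pre-computed answer to a predetermined query

# Step 4: dashboards serve the already-summarized value with a single key
# lookup, so read latency stays low regardless of the raw event volume.
def orders_placed_by(customer_id: str) -> int:
    return summary_store[f"orders-placed:{customer_id}"]
```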
Comparison of Analytics-on-read vs Analytics-on-write
| Analytics-on-read | Analytics-on-write |
| --- | --- |
| Predetermined storage format | Predetermined storage format |
| Flexible queries | Predetermined queries |
| High latency | Low latency |
| Supports 10s–100s of users | Supports 10,000s of users |
| Simple (e.g., HDFS) or sophisticated (e.g., HP Vertica) storage target | Simple storage target (e.g., key-value store) |
| Sophisticated query engine or batch processing framework | Simple (e.g., AWS Lambda) or sophisticated (e.g., Apache Samza) stream processing framework |
Pros and Cons of Amazon Redshift
| Pros | Cons |
| --- | --- |
| Scales horizontally | High latency even for simple queries |
| Columnar storage allows for fast aggregation for data modelling and reporting | Retrieving individual rows is slow |
| Append-only data loading from Amazon S3 or DynamoDB is fast | Updating existing data is painful |
| Can prioritize important queries and users via workload management | Designed for tens of users, not hundreds or thousands |
Data volatility – Early join vs late join
- When designing a message schema, one of the decisions is between (a) including the actual data point within the message itself, and (b) including only a reference key and enriching the message with the actual data point at processing time.
  - An example of (a) in an e-commerce context would be to include the customer's name in the "Order Placed" event, whereas an example of (b) would be to include only the customer ID, with downstream processes obtaining the customer's name from the customer ID when required. (A sketch of both shapes follows this list.)
  - (a) is early joining; (b) is late joining.
- When deciding whether to use early or late joining, one thing to consider is the data's volatility: does the data change frequently, infrequently, or never?
  - The more volatile the data is, the earlier the joining should occur, so that the event captures the value as it was when the event took place.
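A small sketch of the two payload shapes; the field names are illustrative, not the book's schemas:

```python
# Early join: the (potentially volatile) customer name is resolved when the
# event is created, so the event records the value as of the moment of purchase.
order_placed_early = {
    "verb": "place_order",
    "customer": {"id": "c-123", "name": "Jane Doe"},
    "order_id": "o-789",
}

# Late join: only the stable reference key travels with the event; consumers
# look the name up at processing time and may therefore see a newer value.
order_placed_late = {
    "verb": "place_order",
    "customer_id": "c-123",
    "order_id": "o-789",
}
```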
Chapter 11 Analytics-on-write
- DynamoDB supports conditional writes (i.e., only write if certain conditions are satisfied). This is useful when events in a stream may be processed out of order; a sketch follows.
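A minimal boto3 sketch (the table and attribute names are made up, not the book's example) that drops any event older than what is already stored, so out-of-order delivery cannot overwrite newer state:

```python
from decimal import Decimal

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("vehicle-status")  # hypothetical table

def record_latest_position(vehicle_id: str, event_ts: int, lat: float, lng: float) -> None:
    """Conditionally upsert: apply the event only if it is newer than the stored one."""
    try:
        table.update_item(
            Key={"vehicle_id": vehicle_id},
            UpdateExpression="SET #ts = :ts, #lat = :lat, #lng = :lng",
            ConditionExpression="attribute_not_exists(#ts) OR #ts < :ts",
            ExpressionAttributeNames={
                "#ts": "last_event_ts",
                "#lat": "latitude",
                "#lng": "longitude",
            },
            ExpressionAttributeValues={
                ":ts": event_ts,
                ":lat": Decimal(str(lat)),
                ":lng": Decimal(str(lng)),
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            pass  # an older, out-of-order event arrived after a newer one; ignore it
        else:
            raise
```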
To Internalize Now
- N.A.
To Learn/Do Soon
- N.A.
To Revisit When Necessary
Chapter 10
Section 10.3
- Refer to this section for a brief discussion on how to structure tables in Amazon Redshift: a fat table or shredded tables.
- Refer to this section for an example usage of Amazon Redshift.
Chapter 11
- Refer to this chapter for an example of streaming data processing on Amazon Kinesis using AWS Lambda, with results persisted to Amazon DynamoDB using conditional writes.
Other Resources Referred To
- N.A.