
Streaming Architecture

Ted Dunning and Ellen Friedman

84 highlights

Highlights & Annotations

Data about the car’s function can be used for predictive maintenance or to alert insurance companies about the driver and vehicle performance.

Ref. 0C09-A

Knowing that there is a slow-down caused by an accident on a particular freeway during the morning commute is useful to a driver while the incident and its effect on traffic are happening.

Ref. FEF8-B

With a well-designed project, it is possible to monitor a large variety of things that take place in a system.

Ref. 60DB-C

These actions might include the transactions involving a credit card or the sequence of events related to logins for a banking website.

Ref. 9CB2-D

Some of the broader advantages require durability: you need a message-passing system that persists the event stream data in such a way that you can apply checkpoints to let you restart reading from a specific point in the flow.

Ref. 19D3-E

Equipment such as pumps or turbines are now loaded with sensors providing a continuous stream of event data and measurements of many parameters in real or near-real time, and many new technologies and services are being developed to collect, transport, analyze, and store this flood of IoT data.

Ref. 2345-F

In cases where an upstream queuing system for messages was used, it was perhaps thought of only as a safety buffer to temporarily hold event data as it was ingested, serving as a short-term insurance against an

Ref. A36E-G

While queuing is useful as a safety measure, with the right messaging technology, it can serve as so much more.

Ref. E00E-H

Being able to respond to life as it happens is a powerful advantage, and streaming systems can make that possible.

Ref. 5C8D-I

There has been a lot of excitement in recent years about low-latency in-memory frameworks, and understandably so.

Ref. 892B-J

An effective message-passing technology decouples the sources and consumers, which is a key to agility.

Ref. B6EF-K

The stream of medical test results data would not only include test outcomes, but also patient ID, test ID, and possibly equipment ID for the instrumentation used in the lab tests.

Ref. 5305-L

In the older style of working with streaming data, the data might have been single-purpose: read by the real-time application and then discarded. But with the new design of streaming architecture, multiple consumers might make use of this data right away, in addition to the real-time analytics program.

Ref. 79FC-M

Perhaps the most disruptive idea presented here is that streaming architecture should not be limited to specialized real-time applications. Instead, organizations benefit by adopting this streaming approach as an essential aspect of efficient, overall architecture.

Ref. AA87-N

the ways to build a streaming system to best advantage.

Ref. 838E-O

But suppose the analytics program has a temporary interruption or slowdown.

Ref. E217-P

You want some insurance, so you also plan for a message queue to serve as a safety buffer as you ingest data en route to the Spark-based application. This type of design for a single-purpose data path for real-time

Ref. 4738-Q

The idea that you can build applications to draw real-time insights from data before it is persisted is in itself a big change from traditional ways of handling data.

Ref. A214-R

“Data Integration means making available all the data that an organization has to all the services and systems that need it.” (Jay Kreps)

Ref. F1B3-S

An important advantage of a system designed to use streaming data as part of overall data integration is the ability to change the system quickly in response to changing needs.

Ref. 15CE-T

Decoupling dependencies between data sources and data consumers is one key to gaining this flexibility,

Ref. 71E2-U

First of all, notice that the results output from the real-time application now goes to a message stream that is consumed by the dashboard rather than reaching the dashboard directly.

Ref. AD59-V

One nice feature of this style of design is that the anomaly detector can be added as an afterthought.

Ref. CA06-W

The messaging system also makes the raw data available to non–real time processes, such as those needed to produce a monthly report or to augment data prior to archiving in a database or search document.

Ref. 0509-X

you think about how to build a streaming system and which technologies to choose, keep in mind the capabilities required to support this design. Tools and technologies change, and new ones are developed, particularly in response to the growing interest in these approaches. But the fundamental requirements of an effective streaming architecture are more constant, so it’s important to first identify the basic needs of the system as you consider what technologies you will use.

Ref. 2384-Y

And for our design, the messaging software must be able to provide messages to multiple consumers.

Ref. BD06-Z

A replayable sequence with strong ordering preserved in the stream of events

Ref. E94F-A
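As an annotation on what “replayable” means in practice, here is a minimal Python sketch (not from the book) of a log that preserves arrival order and lets a consumer restart from a saved offset, i.e., a checkpoint:

```python
class ReplayableLog:
    """Minimal sketch of a persistent, replayable event log: appends are
    strictly ordered, and a consumer can (re)read from any saved offset."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1      # offset assigned to the new event

    def read_from(self, offset):
        return self.events[offset:]      # replay everything since a checkpoint

log = ReplayableLog()
for e in ("e0", "e1", "e2"):
    log.append(e)

checkpoint = 1                           # a consumer's saved position
replayed = log.read_from(checkpoint)     # restart reading from that point
```

A restarted consumer simply calls `read_from` with its last committed offset and sees the same events in the same order.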

Messaging systems like Kafka work very differently than older message-passing systems such as Apache ActiveMQ or RabbitMQ.

Ref. 4735-B

One big difference is that persistence was a high-cost, optional capability for older systems and typically decreased performance by as much as two orders of magnitude.

Ref. 703D-C

One big reason for the large discrepancy in performance is that Kafka and related systems do not support message-by-message acknowledgement.

Ref. ADB7-D

Furthermore, Kafka is focused specifically on message handling rather than providing data transformations or task scheduling.

Ref. 173E-E

You’ll sometimes hear someone say, “Hadoop has been replaced by Spark,” or wonder why Hadoop is still needed. Likely the reason they say this is that in-memory computational engines such as Spark are, in fact, taking the place of the batch-based computational framework of Hadoop known as MapReduce for many applications.

Ref. BD70-F

Confusion arises because people often make little distinction between Hadoop’s MapReduce and the larger ecosystem of Hadoop-related technologies. Projects in the Hadoop ecosystem include Apache Spark, Apache Storm, Apache Flink, ElasticSearch, Apache Solr, Apache Drill, Apache Mahout, and more. These projects are leaders among very large-scale, cost-effective distributed systems.

Ref. AE17-G

For example, in financial applications such as credit card transactions, unintentionally processing an event twice is bad.

Ref. 0270-H

The microservices idea is simple: larger systems should be built by decomposing their functions into relatively simple, single-purpose services that communicate via lightweight and simple techniques.

Ref. E5C2-I

Data formats should be future-proofed by using JSON, Avro, Protobuf, Thrift, or a similarly flexible system to communicate.

Ref. E6A2-J
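A small illustrative sketch of the flexible-format idea using JSON, borrowing the field names from the medical-test highlight earlier; the `schema_version` field and the specific values are assumptions, not from the book:

```python
import json

# Hypothetical event record; field names follow the medical-test example.
event = {
    "patient_id": "p-1234",
    "test_id": "t-987",
    "equipment_id": "e-55",
    "result": 7.2,
    "schema_version": 1,   # a version field eases future format evolution
}

encoded = json.dumps(event).encode("utf-8")   # bytes ready for a message payload

# A consumer built against an older schema simply ignores unknown fields,
# so producers can add fields without breaking existing readers.
decoded = json.loads(encoded)
known_fields = {k: decoded[k] for k in ("patient_id", "result") if k in decoded}
```

Avro, Protobuf, or Thrift give the same forward-compatibility with a schema and a more compact encoding.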

synchronous call-and-response such as REST as well as asynchronous methods like message passing.

Ref. F99F-K

Also, Kafka, for instance, has a default maximum message size of only 1 MB. MapR Streams caps message size at 2 GB.

Ref. 79B5-L

Maintaining independence of microservices by using lightweight communications between services via a uniform message-passing technology (such as Apache Kafka or MapR Streams) that is durable and high performance

Ref. 8795-M

The current best practice for meeting these requirements is to use a replayable persistent messaging system such as Kafka or MapR Streams throughout a streaming system.

Ref. E442-N

While it is possible to write very large messages to messaging systems, it is usually considered bad practice.

Ref. 25D8-O

This means that objects larger than a few megabytes should be passed by alternative methods, such as a distributed file system.

Ref. 14A8-P
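A hedged sketch of the pass-by-reference pattern this highlight describes: small payloads travel inline in the message, while anything over an assumed ~1 MB threshold is written to shared storage and only a reference is sent. The names and the threshold are illustrative, and a local temp directory stands in for a distributed file system:

```python
import tempfile
from pathlib import Path

MAX_INLINE_BYTES = 1 << 20  # assumed ~1 MB cutoff, echoing Kafka's default cap

def make_message(payload: bytes, blob_dir: Path) -> dict:
    """Return a message dict: inline payload if small, else a file reference."""
    if len(payload) <= MAX_INLINE_BYTES:
        return {"inline": payload}                # small payloads go in-stream
    blob_path = blob_dir / "blob-0001.bin"        # name is illustrative only
    blob_path.write_bytes(payload)                # large payloads go to storage
    return {"ref": str(blob_path), "size": len(payload)}

blob_dir = Path(tempfile.mkdtemp())
small = make_message(b"tiny", blob_dir)
large = make_message(b"x" * (2 << 20), blob_dir)  # 2 MB: exceeds the threshold
```

Consumers that receive a `ref` message fetch the object from shared storage themselves, keeping the message stream fast and small.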

This requirement for persistence combined with the required throughput, however, made conventional messaging systems infeasible in exactly the way that we saw in the previous chapter on microservices.

Ref. 9931-Q

These design decisions mean that nonsequential reading or writing of files by a Kafka message broker is very, very rare, and that lets Kafka handle messages at very high speeds.

Ref. 8B90-R

The producer will buffer a number of messages before actually sending them to the Kafka broker. The degree to which messages are buffered before sending can be controlled by the producer by limiting either the number of messages to buffer or the time that messages are allowed to linger before being sent.

Ref. 8962-S

consumer ultimately reads these messages. In the simplest case, all messages sent to a topic are read by a single consumer in the order that the broker received them.

Ref. E51C-T

from a topic, but the ordering of messages in a topic will only be preserved within a single partition, and the number of threads cannot be larger than the number of partitions.

Ref. BEB8-U

The producer can control the assignment of messages to partitions directly by specifying a partition or indirectly by specifying a key whose hash determines the partition.

Ref. F136-V
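A toy model of that assignment rule: an explicit partition wins; otherwise the key's hash picks one. The byte-sum hash below is only a stable stand-in for Kafka's actual murmur2 hash (Python's built-in `hash()` is salted per run and so is unsuitable):

```python
NUM_PARTITIONS = 4  # illustrative partition count

def hash_key(key):
    # Deterministic stand-in for Kafka's murmur2 key hash.
    return sum(key.encode("utf-8"))

def assign_partition(key, partition=None):
    """Explicit partition if given; otherwise derive it from the key's hash."""
    if partition is not None:           # producer chose the partition directly
        return partition
    return hash_key(key) % NUM_PARTITIONS
```

Because the same key always hashes to the same partition, all messages for one key (say, one smart meter or one user session) keep their relative order.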

Another major difference between Kafka and traditional messaging systems is that persistence of messages is unconditional.

Ref. 3C29-W

Moreover, messages are retained or discarded in the order they were received according to the retention policy of the topic, without any regard paid to whether particular consumers have consumed the message yet.

Ref. C84C-X

The one exception is that when old messages in a topic are about to be deleted, they can be compacted instead.

Ref. 803E-Y

With compaction, a message is retained if no later message has been received with the same key, but deleted otherwise.

Ref. 3BDB-Z

The purpose of compaction is to allow a topic to store updates to a key-value database without unbounded growth.

Ref. F6C7-A

When a single key has been updated many times, only the latest update matters, since any previous update would have been overwritten by the last update (at least).

Ref. F945-B

With compaction, a topic cannot grow much larger than the table being updated, and since all access to the topic is in time order, very simple methods can be used to store the topic.

Ref. A1A6-C
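The compaction rule in the highlights above can be sketched in a few lines of Python. This is a simplification; real brokers compact segment by segment in the background, but the retention rule is the same:

```python
def compact(log):
    """Keep only the newest message per key, preserving the arrival order of
    the survivors. `log` is a list of (key, value) pairs."""
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset                 # later writes shadow earlier ones
    survivors = set(latest.values())
    return [msg for i, msg in enumerate(log) if i in survivors]

log = [("k1", "v1"), ("k2", "v2"), ("k1", "v3")]
compacted = compact(log)                     # k1's first update is dropped
```

After compaction the topic holds one entry per distinct key, which is why it cannot grow much larger than the key-value table it mirrors.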

Within topics, messages are assigned to partitions either explicitly or implicitly via the hashcode of the key associated with the message.

Ref. F7C9-D

All messages in Kafka are sent asynchronously; that is, the messages are not actually sent over the network to the broker until some time after they are given to the KafkaProducer to send.

Ref. 8548-E

Instead, they are buffered until the buffer fills (batch.size in the Kafka configuration) or until a specified time period has passed (linger.ms in the Kafka configuration). By default, the batch size and timeout are set quite low, which can impair throughput but tends to give fairly good latency.

Ref. 920E-F
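A toy model (not Kafka client code) of the batching behavior described here, using a manual clock so the size trigger and the linger trigger are easy to see:

```python
class BatchingProducer:
    """Toy model of producer batching: a batch is sent when its byte size
    reaches batch_size or when linger_ms has elapsed since the first
    buffered message (time is passed in explicitly as a manual clock)."""

    def __init__(self, batch_size=16384, linger_ms=0):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.buffer = []
        self.buffered_bytes = 0
        self.first_buffered_at = None
        self.sent_batches = []

    def send(self, msg, now_ms):
        if self.first_buffered_at is None:
            self.first_buffered_at = now_ms
        self.buffer.append(msg)
        self.buffered_bytes += len(msg)
        if (self.buffered_bytes >= self.batch_size
                or now_ms - self.first_buffered_at >= self.linger_ms):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sent_batches.append(list(self.buffer))
            self.buffer, self.buffered_bytes = [], 0
            self.first_buffered_at = None

p = BatchingProducer(batch_size=10, linger_ms=1000)
p.send(b"aaaa", now_ms=0)      # 4 bytes: under both limits, so it waits
p.send(b"bbbbbbb", now_ms=1)   # 11 bytes total: size trigger fires a batch

q = BatchingProducer()         # default linger of 0 ms
q.send(b"x", now_ms=0)         # flushes immediately: low latency, small batches
```

With the default linger of 0 ms every send flushes at once (good latency, impaired throughput); raising `batch_size` and `linger_ms` lets batches accumulate, which is the throughput tuning the benchmark highlight below is about.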

Kafka, there are differing degrees of durability guarantees that are possible via different configurations.

Ref. E210-G

At the lowest level, a message only needs to be sent before being acknowledged. Slightly better than this, you can require that a message be acknowledged by at least one broker.

Ref. 3E88-H

Generally, we would strongly recommend starting with min.insync.replicas=2 and acks=all. The result is that all acknowledged messages will be on all of the up-to-date copies of a topic, and there will always be at least two such brokers for all acknowledged messages.

Ref. A540-I
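The recommendation above maps onto real Kafka configuration names as follows. The settings are sketched as plain dicts rather than live client code (constructing a real producer requires a reachable broker), and the `retries` value is illustrative, not from the text:

```python
# Topic/broker-side setting for the recommended durability level.
topic_config = {
    "min.insync.replicas": "2",  # an acked write must be on >= 2 in-sync replicas
}

# Producer-side settings.
producer_config = {
    "acks": "all",               # leader waits for all in-sync replicas to ack
    "retries": 5,                # illustrative retry count, not from the text
}
```

Together these mean an acknowledged message is always on at least two up-to-date brokers, so losing any single broker cannot lose acknowledged data.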

For throughput-sensitive applications, increasing these parameters to 1 MB and 10 milliseconds, respectively, has a substantial impact on performance. For instance, in a small benchmark, creating 1 million records without sending them took 4 seconds, while creating and sending them with the default parameters took 24 seconds.

Ref. 09AC-J

On the other hand, when you are sending very few messages, it may improve latency a bit to explicitly call flush. flush also helps if you need to know

Ref. 8B1D-K

All messages go to all consumer groups that subscribe to a topic, but within a consumer group only one consumer handles each message.

Ref. B81B-L
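A toy simulation of that delivery rule: every subscribing group sees every message, but within a group each message is handled by exactly one consumer. (Real Kafka divides work by partition assignment, not per-message round-robin; round-robin here just illustrates the one-consumer-per-message property.)

```python
from collections import defaultdict
from itertools import cycle

def deliver(messages, groups):
    """Toy delivery model. `groups` maps group name -> list of consumer names.
    Every group receives every message; inside a group, messages are divided
    round-robin among that group's consumers."""
    handled = defaultdict(list)           # (group, consumer) -> messages seen
    for group, consumers in groups.items():
        picker = cycle(consumers)
        for msg in messages:
            handled[(group, next(picker))].append(msg)
    return handled

msgs = ["m1", "m2", "m3", "m4"]
out = deliver(msgs, {"dashboard": ["c1"], "archiver": ["a1", "a2"]})
```

The single-consumer group sees everything; the two-consumer group splits the stream, so adding consumers to a group scales processing without duplicating work.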

This can become a serious problem because Kafka does not automatically move partitions to balance the amount of load or space on brokers. Likewise, if you add new nodes to a cluster, you have to migrate data to these new nodes manually. This can be a tricky and labor-intensive task since it is difficult to estimate which partitions are likely to grow or cause high loads.

Mirroring of data between Kafka clusters can be done using a utility called MirrorMaker. When it is run, MirrorMaker starts a number of threads that each subscribe to the topics found in a Kafka cluster. Multiple MirrorMaker processes can be run on different

Ref. 29D4-M

In particular, as the number of topics increases, the amount of random I/O that is imposed on the broker increases dramatically because each topic partition write is essentially a separate file append operation. This becomes more and more problematic as the number of partitions increases and is very difficult to fix without Kafka taking over the scheduling of I/O.

Ref. 085C-N

For instance, if you are an electrical utility doing smart metering, it

Ref. 278A-O

would be plausible to design a system in which each meter has a separate topic. The question of such a design is moot with Kafka, however, because having millions to hundreds of millions of topics is just not plausible.

Ref. A42D-P

Partition replicas in Kafka must each fit on a single machine and cannot be split across multiple machines.

Ref. ED3D-Q

As partitions grow, it is expected that some machine in your Kafka cluster will have the bad luck of having multiple large partitions assigned to it.

Ref. 7DB8-R

Kafka doesn’t have any automated mechanism for moving these partitions around, however, so you have to manage this yourself. Monitoring disk space, diagnosing which partitions are causing the problem, and then determining a good place to move the partition are all manual management tasks that can’t be ignored for a production Kafka cluster.

Ref. 603C-S

The requirement that all replicas of a partition must fit on all of the brokers holding it also makes Kafka unattractive for long-term archiving of data.

Ref. A278-T

The mirroring system used in Kafka is very simple — and for many enterprise applications, a bit too simple. By simply forwarding messages to the mirror cluster, the offsets in the source cluster become useless in the destination. This means that producers and consumers cannot fail over from one cluster to a mirror. This ability to fail over is often considered required table stakes in enterprise systems and not having it may completely preclude getting the benefits of Kafka.

Ref. D6FA-U

Kafka does, however, require significant amounts of care and watering to manually manage storage space and distribution.

Ref. 13D0-V

is not considered good practice to have more than about a thousand partitions on any single Kafka broker.

Ref. 11FD-W

These projects are groundbreaking in terms of what they make possible, and it is a natural evolution for new innovations to be implemented in emerging technology businesses before they are mature enough to be adopted by large

Ref. 3CA6-X

This flexibility about which data goes into which topic, and how many topics you have, means that you can effectively sort data into sessions or device histories as it arrives.

Ref. DBD0-Y

A MapR stream can contain an astounding number of topics — up to millions.

Ref. 1C59-Z

By collecting topics together, it’s convenient to apply policies such as time-to-live to the whole group of topics.

Ref. B659-A

Goal 1: When a customer uses a credit card to do a transaction, the vendor needs a fast response to the question, “Is it fraud?”

Goal 2: We need to keep a history of the fraud decisions the system has made. That history should be available to other applications and services within the organization, as well as used to update the database within the fraud system.

Ref. 237B-B

We don’t implement this step using streaming because a query-response style is a more appropriate match for user expectations in this situation.

Ref. 413A-C

However, we do publish the output of the detector, including the card and transaction information and the fraud decision, to a message stream for later processing. This stream is shown as a horizontal tube labeled “card activity” in Figure 6-2

Ref. C925-D

Handle huge numbers of topics (hundreds of thousands or more with high throughput)

Ref. A455-E

stakeholder. What data would be assigned to separate topics? How many topics? Which topics would be grouped together into streams? When would you use geo-distributed stream replication?

Ref. 6BC2-F