Summary -

In this topic, we describe the Flume architecture in detail.

As discussed earlier, Apache Flume is an intermediary service that collects data from external sources and delivers it to a centralized data store. The external sources are the points where the data is generated.

Typical data generation points are social networking sites such as Facebook, Twitter, and LinkedIn. The centralized store is usually HDFS or HBase.

Apache Flume thus acts as a data collector: it gathers data from generators such as Facebook, Twitter, and LinkedIn, aggregates it, and sends it to a centralized store such as HDFS or HBase.

The diagram below illustrates the Flume architecture in detail.


Let’s discuss each entity in the architecture in detail.

Event -

An event is a byte payload with an optional set of string headers. The event is the unit of data that Flume transports from the data generators to centralized storage.

This unit of data is typically a single log entry. An event carries its payload as a byte array. The event structure is shown below.


A Flume deployment is a combination of agents that carry out the data collection.

Agent -

A Flume agent is a Java virtual machine (JVM) process running Flume. The agent hosts the components through which events flow from an external source to the next destination (hop).

An agent is an independent process that receives information from data generators such as Facebook, Twitter, and LinkedIn.

An agent can receive, store, and forward events to the next-hop destination. A Flume flow may involve more than one agent. Each agent is a collection of sources, channels, and sinks, as shown in the diagram below.


Within an agent, event flow starts at a source and ends at a sink. Let’s discuss sources, channels, and sinks in detail.
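As a concrete sketch, an agent's components are declared and wired together in a Flume properties file. The agent name `a1` and the component names below are illustrative placeholders, not names from this text:

```properties
# Name the components of agent "a1" (names are arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the wiring: a source may fan events out to several channels (plural `channels`), while a sink drains exactly one channel (singular `channel`).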

Source -

A source is an agent component that receives data from the data generators and transfers it to one or more channels. A source is a component that implements the source interface.

The source is the point where data enters Flume. A source consumes events delivered to it by an external source such as a web server. Sources can collect data in two ways: by waiting for data to be delivered to them, or by actively polling for it.

The external source sends the data in a format recognized by the target Flume source. Flume supports several types of sources, and each source receives events from a specific kind of data generator.

Examples of sources are the Avro source, the Thrift source, and the Twitter 1% source. When a Flume source receives an event, it stores it in one or more channels.
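For example, an Avro source that listens for events on a network port can be declared in the agent's properties file like this (the agent and component names, bind address, and port are illustrative):

```properties
# Avro source: accepts events sent by Avro clients or upstream Avro sinks
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Every source must be attached to at least one channel
a1.sources.r1.channels = c1
```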

Channel -

A channel is a store that holds the events received from a source. The channel buffers events until they are consumed by sinks. A channel acts as a bridge, or conduit, between sources and sinks.

A channel is a passive, transient store for events. Sources ingest events into the channel, and sinks drain it. A channel can work with any number of sources and sinks.

Channels play an important role in ensuring the durability of flows. Examples of channels are the memory channel, the JDBC channel, and the file channel.
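The choice of channel is a durability trade-off, which a configuration sketch makes concrete. The directory paths below are illustrative:

```properties
# In-memory channel: fast, but events are lost if the agent process dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# File channel: events are persisted to disk and survive an agent restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
```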

Sink -

A sink is the entity that delivers data to its destination. A sink receives events from a channel and delivers them to the centralized store. A range of sinks allows data to be streamed to a variety of destinations.

The destination of a sink may be another agent or a central store. A sink is an interface implementation that removes events from a channel and transmits them to the destination or to the next agent.

Sinks that transmit events to the final destination are called terminal sinks. A sink that forwards events to another agent is typically an Avro sink. Examples are the HDFS sink and the Avro sink.
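A terminal sink writing to HDFS might be configured as follows (the NameNode host, path, and component names are illustrative placeholders):

```properties
# Terminal sink: write events from channel c1 into HDFS
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

To forward events to a downstream agent instead, the sink type would be `avro` with a `hostname` and `port` pointing at that agent's Avro source.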

Note! A Flume agent can have multiple sources, sinks, and channels.

Flume Master -

The entire Flume flow is controlled by the Flume Master. The Flume Master is a separate service with knowledge of all the physical and logical nodes in a Flume installation.

The Master assigns configurations to logical nodes and is responsible for communicating configuration updates made by the user to all logical nodes. All logical nodes periodically contact the Master to share monitoring information and to check for updates to their configuration.