Apache Flume Introduction

Summary -

This topic covers the following sections -

Introduction

What is Apache Flume?

Apache Flume is a tool/service for efficiently collecting, aggregating and moving large amounts of log data. Flume is a distributed, reliable and highly available tool/service.

Flume acts as an intermediary: it collects data from external data sources and sends it to a centralized data store such as Hadoop HDFS or HBase.

Why Flume?

Flume is normally used for log data, which is also referred to as streaming data. Flume has a simple and flexible architecture based on streaming data flows.

Flume is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. Flume is designed to copy streaming data from various sources to HDFS.

Flume uses a simple data model that allows for online analytic applications. In general, almost every website generates logs continuously, and Flume is one of the most commonly used tools for transporting these logs to HDFS. Flume takes data from several kinds of sources, such as Avro, Syslog, etc.

Flume Components

Apache Flume is built from six important components:

Events
Sources
Sinks
Channels
Interceptors
Agents
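
To make the relationship between these six components more concrete, here is a minimal Python sketch of a single agent. It is only a toy model and not Flume's own API: the class and function names (Event, Agent, host_interceptor) and the host name "web-01" are assumptions made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Unit of data: a byte payload plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


def host_interceptor(event, host="web-01"):
    """Interceptor: tags the event in flight (the host name is made up)."""
    event.headers["host"] = host
    return event


class Agent:
    """Toy agent wiring one source, one channel and one sink together."""

    def __init__(self, interceptors=()):
        self.interceptors = list(interceptors)
        self.channel = deque()   # channel: buffer between source and sink
        self.delivered = []      # stands in for the sink's destination

    def source_receive(self, line):
        """Source: wraps incoming data in an event and puts it on the channel."""
        event = Event(body=line.encode("utf-8"))
        for intercept in self.interceptors:
            event = intercept(event)
        self.channel.append(event)

    def sink_drain(self):
        """Sink: takes events off the channel and writes them out."""
        while self.channel:
            self.delivered.append(self.channel.popleft())


agent = Agent(interceptors=[host_interceptor])
agent.source_receive("GET /index.html 200")
agent.source_receive("GET /missing 404")
agent.sink_drain()
for event in agent.delivered:
    print(event.headers, event.body.decode("utf-8"))
```

In this sketch the source and sink never talk to each other directly; the channel sits between them, which is the property the rest of this topic relies on.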

Flume Advantages

The basic advantage of Flume is that it is built for log data. Its other advantages are -

Flume is reliable, i.e., once an event is introduced into the Flume event processing framework, Flume guarantees that it won’t be lost.
Flume is scalable, i.e., capacity is increased by adding more processing agents, which are also called Flume agents.
Flume is manageable, customizable and offers high performance.
Flume offers declarative configuration of how events are routed, processed and transformed once they are introduced. The event flow can be dynamically configured.
Flume supports contextual routing through this dynamic configuration.
Flume is feature rich and fully extensible.
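
Contextual routing can be pictured as choosing a channel from a value carried in the event's headers. The Python sketch below illustrates only that idea and is not Flume's configuration syntax or API: the header name "log_type", the channel names "alerts" and "archive", and the function select_channel are all hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: header value -> channel name.
ROUTES = {"error": "alerts", "access": "archive"}
DEFAULT_CHANNEL = "archive"


def select_channel(headers, header_name="log_type"):
    """Pick a channel name from one header, falling back to a default."""
    return ROUTES.get(headers.get(header_name), DEFAULT_CHANNEL)


channels = defaultdict(list)
events = [
    ({"log_type": "error"}, "disk full"),
    ({"log_type": "access"}, "GET /index.html 200"),
    ({}, "unlabelled line"),
]
for headers, body in events:
    channels[select_channel(headers)].append(body)

print(dict(channels))
```

Because the routing table is plain data, changing where events go means editing the table rather than the code, which is the sense in which the routing is declarative.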

Flume Features

The main features are -

As discussed earlier, Flume has a flexible design based on streaming data flows.
Flume streams data from multiple sources.
Flume is fast and collects data from different web servers into Hadoop as soon as it is produced.
Flume insulates systems by buffering data between the data producers and the destination store.
Guaranteed data delivery: Flume NG uses channel-based transactions to guarantee reliable delivery.
Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'.
Flume scales horizontally to ingest additional data volumes and new data streams when required.
Flume supports a large number of source and destination types.
Flume carries data between sources and a centralized store, and data gathering can be either scheduled or event driven.
Possible centralized stores normally include HDFS and HBase.
Flume supports multi-hop flows, fan-in and fan-out flows, contextual routing, etc.
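
The channel-based transactions mentioned above can be sketched as follows: a sink takes a batch of events from the channel, and the take only becomes final if the write to the destination succeeds. This is a simplified Python model under stated assumptions, not Flume's implementation; the Channel class, the drain function and the batch size of 2 are hypothetical.

```python
from collections import deque


class Channel:
    """Toy channel whose takes are only final once committed."""

    def __init__(self, events):
        self._queue = deque(events)
        self._in_flight = []

    def take(self):
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        self._in_flight.clear()

    def rollback(self):
        # Return uncommitted events to the front, preserving their order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self):
        return len(self._queue)


def drain(channel, write, batch_size=2):
    """Take up to batch_size events, write them, then commit or roll back."""
    try:
        batch = [channel.take() for _ in range(min(batch_size, len(channel)))]
        write(batch)
        channel.commit()
        return True
    except OSError:
        channel.rollback()
        return False


def failing_write(batch):
    raise OSError("destination unavailable")


stored = []
channel = Channel(["e1", "e2", "e3"])
first = drain(channel, failing_write)    # write fails, batch is rolled back
second = drain(channel, stored.extend)   # write succeeds, batch is committed
print(first, second, stored, len(channel))
```

The failed attempt leaves all three events in the channel, so the retry delivers the same batch; nothing is dropped between the two attempts.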