Data Transfer in Hadoop
Big Data is a collection of datasets so large and complex that traditional computing techniques can't handle them. When you dig into it, Big Data can reveal genuinely useful insights. Hadoop is an open-source framework that stores and processes Big Data in a distributed way, using clusters of commodity computers and simple programming models.
Streaming or Log Data
Most of the data we want to analyze comes from sources like app servers, social media, cloud platforms, or enterprise systems. This data usually comes in the form of log files or events, which are basically records of actions or status updates. For example, when someone visits your website, that visit gets recorded in a log file. Analyzing log data helps you understand how an app performs or how users behave—insight that powers better decisions.
HDFS put Command
In Hadoop, the simplest way to move data into the system is using the hadoop fs -put command. You just run something like:
hadoop fs -put /local-path-to-log.txt /hdfs/path/
This copies a file from your machine into HDFS. It’s simple and works fine when you're dealing with occasional files that are already ready to go.
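For a bit more context, a typical manual session looks something like the sketch below. The local path /var/log/myapp/access.log and the HDFS directory /user/hadoop/logs are illustrative assumptions; substitute your own layout.

# Create a target directory in HDFS (assumed path)
hadoop fs -mkdir -p /user/hadoop/logs

# Copy a finished local log file into HDFS
hadoop fs -put /var/log/myapp/access.log /user/hadoop/logs/

# Confirm the file landed and check its size
hadoop fs -ls /user/hadoop/logs/

Each run copies one finished file; nothing here watches the source for new data, which is where the trouble starts for log streams.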
Problem with put Command
Using the put command has some downsides when working with constantly changing log data:
- One file at a time — It only handles individual files, but log data is generated continuously and fast.
- Needs ready-made files — You must have the data saved and organized before uploading, which is hard when logs are being written in real time.
- Too slow for real-time — Since you’re uploading packaged files manually, real-time analysis becomes difficult.
These issues make the put command unsuitable for streaming or live log ingestion.
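In practice, teams often paper over these limits with a scheduled batch script: let the application rotate its logs, then push each closed chunk into HDFS with put. A rough sketch, assuming rotated files matching /var/log/myapp/access-*.log and a cron job every few minutes (both are illustrative assumptions):

#!/bin/sh
# Illustrative batch uploader, run periodically from cron.
# Assumes the application has already rotated and closed files matching this pattern.
for f in /var/log/myapp/access-*.log; do
    [ -e "$f" ] || continue                     # nothing rotated yet; skip this run
    # Ship the closed chunk into HDFS, then move it aside so it isn't shipped twice
    hadoop fs -put "$f" /user/hadoop/logs/ && mv "$f" /var/log/myapp/shipped/
done

Even with this in place, events sit on local disk until the next run, so what lands in HDFS is always minutes behind, and the script has to track which chunks were already shipped.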
Problem with HDFS
HDFS isn't designed for files that stay open and receive a continuous stream of writes. The length HDFS reports for a file being written stays at zero until the file is closed, so if the network goes down or the writing job crashes before the close, the data that was being written is lost. Unlike a regular filesystem, where you can open a file and read its current contents even while another process is writing to it, HDFS doesn't let readers see the in-flight data.
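You can see this from the command line. The sketch below keeps an HDFS file open while streaming a growing local log into it; the paths are illustrative, and the exact behavior you observe depends on block size and client buffering.

# Terminal 1: stream a growing local log into an open HDFS file
# (put reads from stdin when the source is given as '-')
tail -F /var/log/myapp/access.log | hadoop fs -put - /user/hadoop/logs/live.log

# Terminal 2: while the writer above is still running
hadoop fs -ls /user/hadoop/logs/live.log
# The reported length typically stays at 0 until the writer closes the file,
# so a crash on the writing side before the close can lose everything sent so far.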
To handle real-time log data reliably, you'll need a smarter system that avoids these pitfalls.
Available Solutions
When it comes to sending streaming data such as log files and events from different sources to HDFS, we have a few tools to choose from:
- Facebook’s Scribe: A popular log aggregation service that collects logs from many servers and streams them centrally, even across massive clusters.
- Apache Kafka: Acts as a message broker. It’s great for high-speed, low-latency streaming where you can publish logs/events and have multiple consumers read them.
- Apache Flume: An open-source tool built for collecting, aggregating, and transporting large amounts of streaming data (logs and events) into stores like HDFS. It's reliable, distributed, and configurable, purpose-built to fill the gap left by the put command and HDFS's limitations; a minimal agent sketch follows this list.
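To give a feel for the Flume approach mentioned above, here is a minimal agent sketch that follows a log file and writes it into HDFS. The agent name a1, all paths, and the memory-channel sizing are illustrative assumptions, not production settings.

# Write a minimal Flume agent configuration (illustrative values throughout)
cat > tail-to-hdfs.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: follow a growing log file (the exec source is simple but not delivery-safe)
a1.sources.r1.type    = exec
a1.sources.r1.command = tail -F /var/log/myapp/access.log

# Channel: buffer events in memory between source and sink
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# Sink: roll events into HDFS as plain text files
a1.sinks.k1.type          = hdfs
a1.sinks.k1.hdfs.path     = hdfs://localhost:9000/user/hadoop/logs/flume
a1.sinks.k1.hdfs.fileType = DataStream

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
EOF

# Start the agent with that configuration
flume-ng agent --conf ./conf --conf-file tail-to-hdfs.conf --name a1

Unlike the put-based workflow, the agent keeps the event flow running continuously, which is exactly the piece that was missing for real-time log ingestion.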