What is batch processing?
Big data refers to the large volumes of data that often overwhelm companies and their operations. This constant influx has to be analyzed to gain insights about markets and trends, without which developing strategies becomes difficult. The incoming data can be either finite (bounded) or infinite (unbounded), and because it arrives mostly in raw form, it needs proper processing before it yields useful information. Extracting information from a finite amount of data is called batch data processing, or simply batch processing.
What is Flink and why should we use it?
Flink is a framework and distributed processing engine for both batch and stream data processing. Its architecture allows it to process finite (bounded) data sets as well as infinite (unbounded) streams of data.
Flink has several advantages:
- It provides a high-throughput, low-latency streaming engine
- It supports event-time processing and state management
- The system is fault-tolerant and guarantees exactly-once processing
- It provides source and sink connectors to read data from and write data to external systems such as Kafka, HDFS, and Cassandra
Overview of batch processing
Batch processing applications generally have a straightforward structure. First, we read data from an external data source, such as a distributed file system or a database. The Flink application then processes the data, and the result is written back to an external data sink, such as a distributed file system or a database.
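To make this structure concrete, here is a minimal sketch of a Flink batch job in Java that reads from a source, applies one transformation, and writes to a sink. The file paths and the upper-casing step are placeholders chosen only for illustration.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchJobSkeleton {

    public static void main(String[] args) throws Exception {
        // The execution environment is the entry point of every DataSet program.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1. Read data from an external source (placeholder path).
        DataSet<String> lines = env.readTextFile("hdfs:///input/data.txt");

        // 2. Process the data (a trivial transformation, for illustration).
        DataSet<String> upperCased = lines.map(String::toUpperCase);

        // 3. Write the result back to an external sink (placeholder path).
        upperCased.writeAsText("hdfs:///output/result");

        // writeAsText is a lazy sink; execute() actually runs the job.
        env.execute("Batch job skeleton");
    }
}
```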
DataSet API
A DataSet is a handle for processing distributed data in the Flink cluster. It holds elements of the same type and is immutable.
Read data
Flink provides several methods to read data into a DataSet. Here are some of the commonly used ones:
| Method | Description |
| --- | --- |
| readCsvFile | Read a file in CSV format |
| readTextFile | Read a raw text file |
| fromElements / fromCollection | Create a DataSet from a collection of records |
| readFile(inputFormat, file) | Read a file using a custom FileInputFormat |
| createInput(inputFormat) | Read data from a custom data source |
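A short sketch of how these read methods look in practice. The file path, the header handling, and the (id, title, singer) column layout are assumptions made purely for illustration.

```java
import java.util.Arrays;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class ReadExamples {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // readCsvFile: declare the field types to get a DataSet of tuples.
        DataSet<Tuple3<Integer, String, String>> songs = env
                .readCsvFile("songs.csv")
                .ignoreFirstLine()                 // skip the header row, if the file has one
                .types(Integer.class, String.class, String.class);

        // readTextFile: one String element per line of the file.
        DataSet<String> rawLines = env.readTextFile("songs.csv");

        // fromElements / fromCollection: small in-memory DataSets, handy for tests.
        DataSet<String> singers = env.fromElements("Taylor Swift", "Adele");
        DataSet<Integer> ids = env.fromCollection(Arrays.asList(1, 2, 3));

        // print() is an eager sink, so no explicit execute() is needed here.
        songs.print();
    }
}
```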
Transform data
Flink has several methods to transform elements within a DataSet. Here are some of the commonly used ones:
| Method | Description |
| --- | --- |
| map | Transform each input record into exactly one output record |
| filter | Keep or drop records based on a condition |
| flatMap | Transform a record into zero or more output records |
| groupBy | Group records based on a key extracted from each element |
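Here is a small word-count style sketch that exercises map, flatMap, filter, and groupBy together. The input lines and the aggregation are chosen purely for illustration.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class TransformExamples {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> lines = env.fromElements("To be or not to be", "Be yourself");

        // map: convert every record (here: lower-case each line).
        DataSet<String> normalized = lines.map(line -> line.toLowerCase());

        // flatMap: turn one line into zero or more (word, 1) pairs.
        DataSet<Tuple2<String, Integer>> pairs = normalized
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                // filter: keep only words longer than one character.
                .filter(pair -> pair.f0.length() > 1);

        // groupBy + sum: group by the word (field 0) and add up the counts (field 1).
        DataSet<Tuple2<String, Integer>> wordCounts = pairs.groupBy(0).sum(1);

        wordCounts.print();
    }
}
```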
Flink also provides methods that operate on multiple DataSets. Here are some of the commonly used ones:
| Method | Description |
| --- | --- |
| join | Join two DataSets on a common key |
| leftOuterJoin / rightOuterJoin / fullOuterJoin | Outer joins of two DataSets |
| cross | Cross-product of two DataSets |
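A minimal sketch of joining and crossing two in-memory DataSets on a shared key. The (songId, title) and (songId, singer) pairs are made-up sample data.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinExamples {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (songId, title) and (songId, singer) pairs.
        DataSet<Tuple2<Integer, String>> titles = env.fromElements(
                new Tuple2<>(1, "Blank Space"),
                new Tuple2<>(2, "Hello"));
        DataSet<Tuple2<Integer, String>> singers = env.fromElements(
                new Tuple2<>(1, "Taylor Swift"),
                new Tuple2<>(2, "Adele"));

        // join: match records of both datasets on the shared key (field 0 on each side).
        DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> joined =
                titles.join(singers).where(0).equalTo(0);

        // Outer joins (leftOuterJoin / rightOuterJoin / fullOuterJoin) are used the same
        // way, but additionally require a JoinFunction supplied via with().

        // cross: build the full cross-product of the two datasets.
        DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> crossed =
                titles.cross(singers);

        joined.print();
    }
}
```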
Write data
Flink has several methods to write a DataSet result to an external sink. Here are some of the commonly used ones:
| Method | Description |
| --- | --- |
| writeAsText | Write the result as text |
| writeAsCsv | Write the result in CSV format |
| print / printToErr | Write the result to standard output or standard error |
| write | Write the result to a file using a custom FileOutputFormat |
| output | Write the result to a custom sink using an OutputFormat |
| collect | Return the result as a local List |
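The following sketch shows these sinks side by side; the output paths are placeholders. Note that writeAsText and writeAsCsv are lazy and only run when execute() is called, whereas print() and collect() each trigger execution on their own.

```java
import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;

public class WriteExamples {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> result = env.fromElements(
                new Tuple2<>("Taylor Swift", 3),
                new Tuple2<>("Adele", 2));

        // writeAsText / writeAsCsv: lazy sinks (placeholder paths).
        result.writeAsText("/tmp/result-text", FileSystem.WriteMode.OVERWRITE);
        result.writeAsCsv("/tmp/result-csv", FileSystem.WriteMode.OVERWRITE);

        // Lazy sinks only run when execute() is called.
        env.execute("Write examples");

        // print(): eager sink that prints to stdout and triggers its own execution.
        result.print();

        // collect(): eager action that brings the result back as a local List.
        List<Tuple2<String, Integer>> local = result.collect();
        System.out.println("Collected " + local.size() + " records");
    }
}
```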
Processing workflow
Imagine we want to process a songs.csv file, where each record has song id, title, and singer fields, and we want to get the list of songs sung by ‘Taylor Swift’.
songs.csv: https://gist.github.com/sagargangurde/78c02b1370fb1cdeb8b3a4ec089ff4be
FilterSongs.java: https://gist.github.com/sagargangurde/d2ebbe257837c6909d0a824ec4d6df4e
First, we read songs.csv using the readCsvFile() method, which returns a DataSet containing all the songs. Next, we use the map() method to convert each record into a Song object. Then we apply the filter() method to keep only the songs sung by ‘Taylor Swift’, and finally we write the result in text format using the writeAsText() method.
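The gist above contains the actual implementation; the sketch below shows roughly what FilterSongs.java could look like, with the Song POJO and the CSV column order assumed from the description rather than taken from the gist.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class FilterSongs {

    // Simple POJO holding one song record (assumed layout: id, title, singer).
    public static class Song {
        public int id;
        public String title;
        public String singer;

        public Song() {}

        public Song(int id, String title, String singer) {
            this.id = id;
            this.title = title;
            this.singer = singer;
        }

        @Override
        public String toString() {
            return id + "," + title + "," + singer;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read songs.csv as (id, title, singer) tuples.
        DataSet<Tuple3<Integer, String, String>> records = env
                .readCsvFile("songs.csv")
                .types(Integer.class, String.class, String.class);

        // map: convert each tuple into a Song object.
        DataSet<Song> songs = records
                .map(record -> new Song(record.f0, record.f1, record.f2));

        // filter: keep only songs sung by Taylor Swift.
        DataSet<Song> taylorSwiftSongs = songs
                .filter(song -> "Taylor Swift".equals(song.singer));

        // Write the filtered songs out as text and run the job.
        taylorSwiftSongs.writeAsText("taylor-swift-songs.txt");
        env.execute("Filter songs by singer");
    }
}
```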
Conclusion
There are multiple tools that address batch processing problems, but the Flink DataSet API has specific strengths that make developers choose it more often than others. It is easy to work with, and its wide range of applications makes it an attractive option. Its flexibility and affordability are further reasons developers go for this option.