Editorial Summary:

Google suggests three options for streaming data ingestion. NiFi processes flow files, which are not shared or replicated between cluster nodes. Monitoring consistency requires gathering counts from Kafka topics and comparing them with Hive tables or HDFS files. Flink has built-in auto-retry based on checkpointing. It is fast and scalable, provides exactly-once processing out of the box, and works well with CI/CD pipelines. Flink jobs are easy to maintain with, e.g., the Ververica Platform, Kubernetes, or good old YARN. But thousands of files per minute will kill HDFS and Hive performance, and Flink SQL doesn't solve the problem. Set the checkpoint interval to 60 seconds and it's done! Not really, because Flink creates new files on every checkpoint; that is how it guarantees exactly-once processing. In a pessimistic scenario you get N files per checkpoint for every topic, where N is the sink parallelism, and those thousands of files will kill the performance of Hive and HDFS.
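To make the checkpoint-and-roll behaviour concrete, here is a minimal Java sketch of a Flink job writing to HDFS with a FileSink. It assumes Flink 1.15+ (the FileSink connector API); the HDFS path, the in-memory stand-in source, and the sink parallelism of 4 are illustrative assumptions, not the article's actual pipeline. Because the rolling policy rolls part files on every checkpoint, each of the N sink subtasks can open a new file per checkpoint.

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

public class CheckpointedFileSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds; pending part files are committed when
        // the checkpoint completes, which is how exactly-once is guaranteed.
        env.enableCheckpointing(60_000L);

        // Roll part files on every checkpoint (bulk formats such as ORC or
        // Parquet force this policy), so each sink subtask can emit one new
        // file per checkpoint: parallelism N means up to N files per checkpoint.
        FileSink<String> sink = FileSink
                .forRowFormat(new Path("hdfs:///warehouse/events"),   // illustrative path
                              new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();

        env.fromElements("a", "b", "c")   // stand-in for a Kafka source
           .sinkTo(sink)
           .setParallelism(4);            // up to 4 new files per checkpoint

        env.execute("checkpointed-file-sink-sketch");
    }
}
```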

Key Highlights:

  • Google suggests three options for streaming data ingestion.
  • NiFi is a tool of choice for data ingestion, but it does not guarantee at-least-once processing.
  • Monitoring consistency requires gathering counts from Kafka topics and comparing them with Hive tables.
  • Thousands of files per minute will kill HDFS and Hive performance.
  • Flink SQL doesn't solve the problem.
  • You can try to bypass the problem by writing to a Hive managed table and enabling compaction (see the sketch after this list).
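As a sketch of the compaction workaround in the last bullet, the following assumes a HiveServer2 endpoint reachable over JDBC; the URL, table name, and columns are hypothetical. Writing into a managed, transactional (ACID) ORC table lets Hive's background compactor merge the small delta files that streaming ingestion produces.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCompactionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default");
             Statement stmt = conn.createStatement()) {

            // A managed, transactional table stored as ORC: required for
            // Hive ACID, whose background compactor merges small delta files.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING) " +
                "STORED AS ORC " +
                "TBLPROPERTIES ('transactional'='true')");

            // Compaction normally runs automatically in the metastore; it can
            // also be requested on demand for a table or partition.
            stmt.execute("ALTER TABLE events COMPACT 'major'");
        }
    }
}
```

Note that the small files still get created first; this shifts the merge work to the Hive compactor rather than avoiding it.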

This editorial is based on content sourced from medium.com.

Read the full article.
