2 Apr 2025, Wed

Distributed Data Processing

Batch Processing Frameworks

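  • Apache Hadoop: Framework for distributed storage (HDFS) and batch processing (MapReduce, with YARN for resource management)
  • Apache Spark: Unified analytics engine for large-scale data processing
  • Apache Hive: Data warehouse software for reading, writing, and managing large datasets in distributed storage using SQL
  • Presto/Trino: Distributed SQL query engines for big data (Trino is a fork of Presto, formerly PrestoSQL)
  • Apache Pig: Platform for analyzing large datasets with scripts written in the Pig Latin data-flow language
  • Databricks: Managed unified analytics platform built on Spark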
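
The batch frameworks above mostly build on the same idea: split the input into partitions, process each partition independently, then combine results by key. A minimal single-process sketch of that map / shuffle / reduce pattern in plain Python follows. It does not use any of the listed frameworks' APIs; the function names and the sample partitions are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Map: emit a (word, 1) pair for every word in one partition of lines."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # In a real cluster each partition would be mapped on a different worker;
    # here the partitions are simply processed one after another.
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two partitions, each a list of text lines.
partitions = [
    ["spark hive spark"],
    ["hive pig", "spark"],
]
counts = word_count(partitions)
print(counts)
```

A real framework adds what this sketch leaves out: moving data between machines, retrying failed tasks, and spilling to disk when a partition does not fit in memory.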

Stream Processing Frameworks

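  • Spark Streaming: Micro-batch stream processing on Spark (the DStream API; Structured Streaming is its successor)
  • Apache Flink: Stateful stream and batch processing framework
  • Apache Beam: Unified programming model for batch and streaming pipelines, executed by runners such as Flink, Spark, or Google Cloud Dataflow
  • Apache Storm: Distributed real-time computation system
  • Apache Samza: Distributed stream processing framework, originally developed at LinkedIn
  • Apache Pulsar: Distributed pub-sub messaging and streaming platform (a message broker, typically the source or sink for a processing engine)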
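
Stream processors differ from batch jobs mainly in that the input never ends, so results are usually computed per time window. A rough sketch of a tumbling (fixed-size, non-overlapping) window count in plain Python follows. It assumes events arrive as (timestamp, key) pairs already ordered by timestamp, which real engines cannot assume, and the function name and sample events are hypothetical rather than any framework's API.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) each time a window closes.

    Assumes `events` is an iterable of (timestamp, key) pairs sorted by
    timestamp. Windows that receive no events are skipped.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # An event for a later window closes the current one.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Flush the last open window when the (finite) sample input ends.
        yield current_start, counts


# Hypothetical event stream: (seconds since start, event type).
events = [(0, "click"), (3, "view"), (9, "click"),
          (10, "click"), (14, "view"), (25, "click")]
windows = list(tumbling_window_counts(events, 10))
for start, window_counts in windows:
    print(start, dict(window_counts))
```

Handling late or out-of-order events, which this sketch ignores, is where engines such as Flink and Beam introduce watermarks and allowed lateness.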