2 Apr 2025, Wed

Data Lakes & File Standards

Data Lake Platforms

  • Amazon S3: Object storage service for data lakes
  • Azure Data Lake Storage: Scalable data lake solution for big data analytics
  • Google Cloud Storage: Object storage service, commonly used as the storage layer for data lakes on Google Cloud
  • Databricks Delta Lake: Open-source storage layer for reliability in data lakes
  • Cloudera Data Platform: Enterprise data cloud for data management
  • Dremio: SQL query engine for analytics directly on data lake storage

File Formats

  • Parquet: Columnar storage file format
  • ORC (Optimized Row Columnar): Columnar storage format for Hadoop
  • Avro: Row-based data serialization system
  • CSV: Comma-separated values format
  • JSON: JavaScript Object Notation format
  • Protocol Buffers: Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data
  • Feather: Fast on-disk format for data frames
  • Arrow: Language-independent columnar format for in-memory data, with libraries for many languages
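
The row-based versus columnar distinction in the list above can be sketched with the Python standard library alone. This is a minimal illustration, not how Parquet, ORC, or Avro encode data on disk; the sample records and the helper names (to_csv, to_json_lines, to_columns) are hypothetical and invented for this note.

```python
import csv
import io
import json

# Hypothetical sample records, invented for illustration only.
records = [
    {"id": 1, "city": "Oslo", "temp_c": 4.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 31.2},
]


def to_csv(rows):
    """Row-based text layout: a header line, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Row-based text layout: one self-describing JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows) + "\n"


def to_columns(rows):
    """Columnar layout: all values of one field are stored together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


csv_text = to_csv(records)
jsonl_text = to_json_lines(records)
columns = to_columns(records)

# An aggregate over one field only needs that field's list in the
# columnar layout; the row layouts require parsing every full record.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])

print(csv_text)
print(jsonl_text)
print(columns)
```

Reading a single field is the access pattern where columnar formats tend to help, while row-based formats suit writing or reading whole records one at a time.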

Table Formats

  • Apache Sedona: Cluster computing system for spatial data (a processing engine rather than a table format)
  • Apache Iceberg: High-performance table format for huge analytic datasets
  • Apache Hudi: Data lake platform with record-level updates and deletes
  • Delta Lake: Storage layer for ACID transactions on data lakes
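
The general idea behind a table format, a metadata layer that records which data files make up each version of a table, can be sketched as a toy model. This is an assumption-laden illustration and does not reproduce the actual protocol of Iceberg, Hudi, or Delta Lake; the ToyTable class, its methods, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical model: commits[i] is the set of data files visible at version i."""

    commits: list = field(default_factory=lambda: [frozenset()])

    def current_version(self):
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Return the files visible at a version (latest by default)."""
        if version is None:
            version = self.current_version()
        return self.commits[version]

    def commit(self, base_version, add=(), remove=()):
        """Optimistic commit: rejected if the table moved past base_version."""
        if base_version != self.current_version():
            raise RuntimeError(
                "conflict: table changed since version %d" % base_version
            )
        files = (self.commits[base_version] - frozenset(remove)) | frozenset(add)
        self.commits.append(files)
        return self.current_version()


table = ToyTable()
v1 = table.commit(0, add=["part-000.parquet"])
v2 = table.commit(v1, add=["part-001.parquet"])

# A record-level update, modelled here as copy-on-write: the file holding
# the record is rewritten and swapped in with a single commit.
v3 = table.commit(
    v2, add=["part-000-rewritten.parquet"], remove=["part-000.parquet"]
)

print(sorted(table.snapshot()))    # files in the latest version
print(sorted(table.snapshot(v1)))  # older versions stay readable

# A writer that started from a stale version is rejected, not merged.
try:
    table.commit(v1, add=["part-late.parquet"])
    stale_rejected = False
except RuntimeError:
    stale_rejected = True
```

Because readers only ever see the file set of a committed version, a half-finished write is never visible, which is the property the ACID and record-level update descriptions above refer to.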