
A Cloud-Native Clickstream Analytics Stack: Flink, Spark, Iceberg & Ray on Kubernetes

An open-source reference architecture for e-commerce clickstream analytics using Apache Flink for real-time streaming, Spark for batch ETL, Iceberg for the table format, and Ray for ML — all on Kubernetes.

Modern e-commerce analytics demands two things that have traditionally lived in separate systems: real-time insights for fraud detection and live recommendations, and batch-processed historical data for reporting and model training. The Clickstream Analyzer is an open-source reference architecture that runs both workloads on a unified Kubernetes platform, using best-in-class open-source components for each layer.

Architecture Overview

The platform layers five components, each deployed as a Kubernetes service within a dedicated namespace:

  • MinIO — S3-compatible object storage for the data lake
  • Nessie — Git-like version control for Apache Iceberg tables: branches, commits, and time-travel queries across the table catalog
  • Apache Spark — batch ETL and historical analytics
  • Apache Flink — the real-time streaming layer
  • Ray — distributed ML training and inference

Real-Time Layer: Apache Flink

Flink's JobManager/TaskManager cluster handles event stream processing for two primary use cases. Fraud detection runs continuously, evaluating clickstream events against behavioral patterns and scoring each transaction in near-real-time. Real-time recommendation updates consume user behavior events and produce updated recommendation scores as users browse — meaning product recommendations reflect what a user just viewed, not what they viewed last session.
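The scoring logic itself is independent of Flink's runtime. As an illustration of the kind of behavioral rules a fraud-detection operator might apply per keyed window, here is a minimal pure-Python sketch; the event fields, thresholds, and rule set are hypothetical, not the project's actual model.

```python
from dataclasses import dataclass


@dataclass
class ClickEvent:
    user_id: str
    event_type: str   # e.g. "view", "add_to_cart", "checkout"
    timestamp: float  # seconds since epoch


def fraud_score(events: list[ClickEvent], window_s: float = 60.0) -> float:
    """Score a user's recent activity; higher means more suspicious.

    Checks two simple behavioral patterns over a sliding window:
    an abnormally high event rate, and a checkout with no prior views.
    """
    if not events:
        return 0.0
    now = events[-1].timestamp
    recent = [e for e in events if now - e.timestamp <= window_s]
    score = 0.0
    # Pattern 1: burst of activity (bots click far faster than humans).
    if len(recent) > 30:
        score += 0.5
    # Pattern 2: checkout without any product views in the window.
    types = {e.event_type for e in recent}
    if "checkout" in types and "view" not in types:
        score += 0.5
    return score
```

In a real Flink job this function body would sit inside a keyed process function, with the window maintained by Flink state rather than a Python list.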

Batch Layer: Apache Spark

The Spark cluster (one master, two workers) handles nightly and on-demand batch workloads: ETL transformations from raw event logs to cleaned Iceberg tables, user and product attribute enrichment, and historical analytics aggregations. Spark writes directly to Iceberg tables registered in the Nessie catalog, enabling schema evolution and time-travel queries against historical snapshots.
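Wiring Spark to an Iceberg catalog backed by Nessie is mostly configuration. A sketch of the relevant `spark-defaults.conf` entries follows; the service hostnames, ports, catalog name, and warehouse path are assumptions about this deployment, not values taken from the repository.

```properties
# Illustrative Spark catalog config for Iceberg + Nessie on MinIO.
# Hostnames, ports, and paths are assumptions for this sketch.
spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
spark.sql.catalog.nessie               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl  org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri           http://nessie:19120/api/v1
spark.sql.catalog.nessie.ref           main
spark.sql.catalog.nessie.warehouse     s3a://warehouse/
spark.sql.catalog.nessie.io-impl       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.nessie.s3.endpoint   http://minio:9000
```

With this in place, Spark SQL can address tables as `nessie.<namespace>.<table>` and every write becomes a commit on the configured Nessie reference.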

Table Format: Apache Iceberg + Nessie

Apache Iceberg solves the classic data lake problem: files accumulate, schema changes break downstream queries, and there's no way to read last week's data reliably. Iceberg adds ACID transactions, schema evolution, and time-travel to object storage. Nessie adds version control on top — you can create a branch, run a Spark transformation on it, inspect the result, and merge it into main once you're confident it's correct. This workflow prevents the 'corrupted production table' incidents that plague teams writing raw Parquet files directly.
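The branch-then-merge workflow can be sketched in Spark SQL using the Nessie SQL extensions. This assumes a Spark catalog named nessie configured against the Nessie server; the branch, namespace, and table names below are hypothetical.

```sql
-- Work on an isolated branch of the catalog; main is untouched.
CREATE BRANCH IF NOT EXISTS etl_test IN nessie FROM main;
USE REFERENCE etl_test IN nessie;

-- Run the transformation against the branch (illustrative tables).
INSERT INTO nessie.analytics.daily_clicks
SELECT user_id, count(*) AS clicks, current_date() AS day
FROM nessie.raw.events
GROUP BY user_id;

-- Inspect the branch, then promote it atomically.
MERGE BRANCH etl_test INTO main IN nessie;
```

If the transformation produces garbage, the branch is simply dropped and main never sees it.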

ML Layer: Ray

Ray provides a distributed execution framework for ML workloads that don't fit neatly into Spark or Flink. Recommendation model training, hyperparameter tuning, and online serving all run on the Ray cluster, which auto-scales worker nodes to match pending task demand. KubeRay, the Kubernetes operator for Ray, manages cluster lifecycle and scaling.
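A KubeRay-managed cluster is declared as a RayCluster custom resource. Below is a minimal sketch of such a manifest with autoscaling enabled; the image tag, replica bounds, and resource figures are illustrative, not the project's actual values.

```yaml
# Minimal RayCluster sketch (image tag and resource numbers are illustrative)
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-cluster
spec:
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests: {cpu: "1", memory: 4Gi}
  workerGroupSpecs:
    - groupName: workers
      replicas: 1
      minReplicas: 1
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                requests: {cpu: "2", memory: 8Gi}
```

The autoscaler adds worker pods (up to maxReplicas) when queued tasks request more CPU or memory than the cluster currently has, and scales back down when they go idle.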

Deploying the Platform

```bash
git clone https://github.com/BorisBesky/ecommerce-platform.git
cd ecommerce-platform/k8s

# Check prerequisites (kubectl, helm, cluster resources)
./check-prerequisites.sh

# Deploy all components
./deploy-all.sh

# Validate deployment
./validate-deployment.sh

# Submit sample jobs
./submit-sample-jobs.sh
```

Minimum Requirements

  • Kubernetes v1.24+ (local: Docker Desktop, minikube, or kind; cloud: EKS, GKE, AKS)
  • 8 CPU cores and 16GB RAM minimum for development
  • kubectl and Helm v3.0+ installed
  • Recommended production: 5+ nodes, 8 CPU / 32GB RAM each, 500GB+ NVMe storage

The full source, including Kubernetes manifests, the Spark/Flink/Ray applications, and deployment scripts, is available at github.com/BorisBesky/ecommerce-platform.