“Scaling Machine Learning with Spark” Author Interview and Reading Notes

Sophia Yang, Ph.D.
13 min read · Oct 31, 2023

📚 Some people may know that I host a DS/ML book club where we read one book per month. This month, we read ‘Scaling Machine Learning with Spark’ by Adi Polak. This book covers distributed ML systems, Spark/PySpark basics, MLflow, Spark MLlib, and deep learning frameworks PyTorch and TensorFlow.

In this blog post, I’d like to share an interview we had with the author that’s full of insights. I’ll also be sharing my personal reading notes for those who are curious about this book.

🔗 Book link: https://amzn.to/3scTtcn

🔗 Github repo: https://github.com/adipolak/ml-with-apache-spark

Chapter 1 Distributed Machine Learning Terminology and Concepts

The Stages of the Machine Learning Workflow:

  • collect and load/ingest data; explore and validate the data; clean/preprocess the data; extract features/perform feature engineering; split the data into a training set and a validation set; train and tune the model; evaluate the model with test data; deploy the model; monitor the model

Tools and Technologies in the Machine Learning Pipeline

Distributed Computing Models

  • general-purpose models: MapReduce: in practice, every task is split into multiple map and reduce functions, preserving data locality as much as possible; MPI (Message Passing Interface): a low-level standard that models a parallel program running on a distributed-memory system and standardizes communication; Barrier: a synchronization method that makes a group of machines stop at a certain point and wait for the rest to finish; Shared memory.
  • dedicated distributed computing models: parameter server

Introduction to Distributed Systems Architecture:

  • network topologies: physical topologies describe how the computers are arranged and connected; logical topologies describe the way data flows in the system and how the computers exchange information over the network.
  • centralized systems: all nodes depend on a single node to make decisions; decentralized systems: nodes are independent and make their own decisions, are more tolerant of faults, and can benefit from a multicloud/hybrid cloud architecture
  • interaction models: client/server; peer-to-peer; geo-distributed: aim to solve issues like data privacy and resource allocation.
  • communication in a distributed setting: asynchronous: the underlying mechanism is a queue; synchronous.

Introduction to Ensemble Methods

  • distributed training topologies: centralized ensemble learning: all requests from the distributed models go through main servers; decentralized decision trees; centralized, distributed training with parameter servers; centralized, distributed training in a P2P topology: gossip learning

The Challenges of Distributed Machine Learning Systems

  • performance: data parallelism: data split into shards or partitions, which are distributed among nodes; the same code/logic is also distributed to all machines; model parallelism: the model itself is split and distributed across machines; combining data parallelism and model parallelism
  • resource management
  • fault tolerance: many distributed computation frameworks have built-in procedures, like replicating the data and writing info to disk between stages for faster recovery; faulty agents/adversaries
  • privacy: avoiding the restrictions of centralizing data, e.g., with federated learning
  • portability

Chapter 2 Introduction to Spark and PySpark

Apache Spark Architecture:

  • Driver program/Spark driver: executes the SparkSession, which encapsulates the SparkContext; holds the DAG scheduler, task scheduler, block manager, and everything needed to turn code into jobs.
  • Executor: a process launched for a particular Spark application on a worker node; a JVM process that communicates with the cluster manager and receives tasks to execute.
  • Worker node: responsible for executing the work.
  • Cluster manager: orchestrates the distributed system

Intro to PySpark:

  • Python (code), JVM (query optimization, computation, distribution of tasks to clusters), _gateway (passes the Py4J application to a JVM Spark server)
  • Py4J: a library written in Python and Java that enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM (see the sketch below)
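
To make the Python-to-JVM bridge concrete, here is a minimal sketch (not from the book) of starting a local PySpark session; the app name and local master setting are arbitrary choices.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; it wraps the SparkContext that talks to the JVM via Py4J
spark = (SparkSession.builder
         .appName("book-club-notes")   # placeholder name
         .master("local[*]")           # local mode for experimentation
         .getOrCreate())

print(spark.sparkContext.version)      # version reported by the JVM side
spark.stop()
```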

Apache Spark Software architecture:

  • API/libraries: work directly with Spark’s DataFrame or Dataset APIs; define a custom schema with StructType and save the schema to a JSON file (see the sketch after this list); MLlib, GraphFrames, Structured Streaming
  • DataFrames, Datasets, and RDDs are immutable storage
  • lazy execution
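
The StructType point above can be illustrated with a short, hedged sketch; the column names and file paths are invented, and an active SparkSession is assumed.

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Explicit schema instead of schema inference
schema = StructType([
    StructField("color", StringType(), nullable=True),
    StructField("weight", DoubleType(), nullable=True),
])
df = spark.read.format("csv").schema(schema).load("data/items.csv")  # placeholder path

# Persist the schema as JSON and restore it later
with open("schema.json", "w") as f:
    f.write(schema.json())
with open("schema.json") as f:
    restored = StructType.fromJson(json.load(f))
```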

PySpark and Functional Programming:

  • Spark borrows many concepts from functional programming like anonymous function
  • Two main principles: immutability, disciplined state
  • The PySpark shell is responsible for linking the Python API to the Spark core and initializing the SparkSession and SparkContext

pandas DataFrames vs. Spark DataFrames

  • pandas is not built for scale; it doesn’t have Spark’s distributed architecture. It also doesn’t adhere to functional programming principles: pandas DataFrames are mutable, and it doesn’t support lazy evaluation (see the sketch below).
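
A tiny sketch of the contrast, with made-up data: the pandas assignment mutates the DataFrame eagerly, while the Spark transformation returns a new DataFrame and runs only when an action is called.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({"x": [1, 2, 3]})
pdf["y"] = pdf["x"] * 2                      # mutable, eager

sdf = spark.createDataFrame(pdf)
sdf2 = sdf.withColumn("y", F.col("x") * 2)   # immutable: returns a new DataFrame, nothing runs yet
sdf2.show()                                  # the action triggers the lazy plan
```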

Scikit-Learn vs. MLlib:

  • scikit-learn: mutable datasets, data must fit in memory, model can be pickled to disk and reloaded via REST
  • MLlib: immutable datasets, distributed, supports parquet and snappy file formats

Chapter 3 Managing the Machine Learning Experiment Lifecycle with MLflow

Machine Learning Lifecycle

  • model development, model validation, wrapping model and testing in staging, model deployment and monitoring, archiving models
  • software development lifecycle (SDLC): planning, analysis, design, development, testing, implementation, maintenance
  • Machine Learning Lifecycle management requirement: reproducibility, code version control, data version control (lakeFS, DVC)

MLflow

  • Software components of the MLflow platform: storage (support for connecting multiple storage types), backend server (communicate info from database, UI, SDK, CLI to the rest of the components), frontend, API and CLI
  • Four main components: MLflow tracking, MLflow projects, MLflow models, MLflow model registry

MLflow Tracking:

  • record runs, mlflow.log_param/log_metric/log_artifact/log_model;
  • autolog logs the whole lifecycle
  • To log the dataset path and version together with the model name and path during training, use log_param (see the sketch after this list)
  • set tags on the run
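
A hedged sketch of the tracking calls above; the run name, parameters, metric values, and file path are placeholders.

```python
import mlflow

# mlflow.autolog()  # alternatively, autologging captures params/metrics for supported libraries

with mlflow.start_run(run_name="example-run"):
    mlflow.set_tag("team", "book-club")                       # tag the run
    mlflow.log_param("dataset_path", "s3://bucket/train/v2")  # dataset path + version
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
    mlflow.log_artifact("schema.json")                        # any local file (placeholder)
```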

MLflow Projects

  • standard format for packaging code
  • a directory of a set of code files and a descriptor YAML file to specify its dependencies and how to run the code

MLflow models

  • The MLflow Models component is used for packaging machine learning models in multiple formats, called flavors.
  • MLflow Models provides several standard flavors that MLflow’s built-in deployment tools support, such as a python_function flavor that describes how to run the model as a Python function.

MLflow Model Registry

  • centralized storage for models
  • transitioning between stages: the MlflowClient API’s transition_model_version_stage allows us to update the model stage and invoke the desired CI/CD scripts (see the sketch below)
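
A minimal sketch of a stage transition; the model name and version number are placeholders.

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",  # registered model name (placeholder)
    version=3,                # version to promote (placeholder)
    stage="Staging",
)
```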

Using MLflow at Scale

  • Two components for storage: a backend store for information about experiments and an artifact store for storing the models themselves and any related files, objects, model summaries etc.
  • You can store experiment-related entities in a database like PostgreSQL or SQLite
  • You can run MLflow on a local machine while the tracking server and stores reside remotely (see the sketch below)
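
A sketch of pointing a local client at a remote tracking server; the URI and experiment name are placeholders, and the server itself would be started separately (for example with mlflow server --backend-store-uri ... --default-artifact-root ...).

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.com:5000")  # placeholder server URI
mlflow.set_experiment("scaling-ml-with-spark")             # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("tracking", "remote")
```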

Chapter 4 Data Ingestion, Preprocessing, and Descriptive Statistics

Data Ingestion with Spark

  • work with batch and streaming data with the Spark batch API and either the legacy DStream API or the new Structured Streaming API.
  • spark.read.format
  • work with image files: MLlib converts compressed images into the OpenCV data format. It’s recommended to use the more efficient binary format.
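
A short sketch of the ingestion APIs above; the paths are placeholders and a SparkSession is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("parquet").load("data/events.parquet")     # generic batch read

images = spark.read.format("image").load("data/images/")          # decoded, OpenCV-compatible struct
binaries = spark.read.format("binaryFile").load("data/images/")   # raw bytes; the more efficient option
```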

Preprocessing Data

  • structured data, semi-structured data, unstructured
  • MLlib data types: vector (VectorUDT: DenseVector, SparseVector), matrix (MatrixUDT)
  • MLlib Transformers: pyspark.ml.feature, text data transformers (Tokenizer, RegexTokenizer, HashingTF, NGram, StopWordsRemover), categorical feature transformers (StringIndexer, IndexToString, OneHotEncoder, VectorIndexer), continuous numerical transformers (Binarizer, Bucketizer, MaxAbsScaler, MinMaxScaler, Normalizer, QuantileDiscretizer, RobustScaler, StandardScaler), and others (DCT, ElementwiseProduct, Imputer, Interaction, PCA, PolynomialExpansion, SQLTransformer, VectorAssembler); see the sketch after this list
  • image data: extract labels, transform labels to indices, extract image size (Pillow, @pandas_udf — a pandas user-defined function that can run in a distributed fashion on our Spark executor, optimized with Apache Arrow and faster for grouped operations)
  • save the data
  • avoid the small files problem: a small file is any file that’s significantly smaller than the storage block size; repartition creates entirely new partitions, shuffling the data over the network with the goal of distributing it evenly over the specified number of partitions; coalesce detects the existing partitions and then shuffles only the necessary data, and can only reduce the number of partitions
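
An illustrative sketch combining a few of the transformers listed above with repartition vs. coalesce on write; the columns, values, and output paths are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("red", 1.0), ("blue", 3.0), ("red", 5.0)], ["color", "size"])

indexed = StringIndexer(inputCol="color", outputCol="color_idx").fit(df).transform(df)
assembled = VectorAssembler(inputCols=["size"], outputCol="size_vec").transform(indexed)
scaled = MinMaxScaler(inputCol="size_vec", outputCol="size_scaled").fit(assembled).transform(assembled)

# Full shuffle into an explicit number of partitions
scaled.repartition(8).write.mode("overwrite").parquet("out/scaled_repartitioned")
# coalesce only merges existing partitions (can only decrease the count)
scaled.coalesce(1).write.mode("overwrite").parquet("out/scaled_single_partition")
```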

Descriptive Statistics

  • from pyspark.ml.stat import Summarizer
  • from pyspark.ml.feature import VectorAssembler
  • from pyspark.sql.functions import skewness
  • from pyspark.ml.stat import Correlation
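
A sketch tying those imports together on a toy DataFrame (invented data):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import skewness
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Summarizer, Correlation

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 10.0), (2.0, 20.0), (3.0, 35.0)], ["a", "b"])
features = VectorAssembler(inputCols=["a", "b"], outputCol="features").transform(df)

# Column-wise summary statistics over the assembled vector column
features.select(Summarizer.metrics("mean", "variance").summary(features.features)).show(truncate=False)
df.select(skewness("a")).show()
Correlation.corr(features, "features", "pearson").show(truncate=False)
```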

Chapter 5 Feature Engineering

  • handling missing data, extracting features from text, categorical encoding, feature scaling/data normalization

MLlib Featurization Tools:

  • extractors: TF-IDF, Word2Vec, CountVectorizer, FeatureHasher
  • selectors: VectorSlicer, RFormula, ChiSqSelector, UnivariateFeatureSelector

The Image Featurization Process:

  • RGB image vs Grayscale
  • defining image boundaries using image gradients: the Laplacian highlights areas of rapid intensity change

Extracting Features with Spark APIs

  • Spark pandas UDF types: SCALAR, SCALAR_ITER, GROUPED_MAP, GROUPED_AGG
  • applyInPandas operates over a grouped Spark DataFrame, while mapInPandas maps a function over an iterator of pandas DataFrames; both return a new Spark DataFrame (see the sketch after this list)
  • Compared to pandas DataFrame, Spark DataFrame supports parallel execution, lazy operation, and is immutable.
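
A hedged sketch of a scalar pandas UDF and applyInPandas; the column names and logic are illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 5.0)], ["key", "value"])

@pandas_udf("double")
def times_two(v: pd.Series) -> pd.Series:
    # SCALAR pandas UDF: receives Arrow-backed batches as pandas Series on the executors
    return v * 2

df.withColumn("doubled", times_two("value")).show()

def center(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives one group as a pandas DataFrame and returns a pandas DataFrame
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

df.groupBy("key").applyInPandas(center, schema="key string, value double").show()
```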

The Text Featurization Process

  • bag-of-words, tf-idf, n-gram, topic extraction

Feature stores

  • In 2017, Uber introduced feature stores with its ML platform, Michelangelo
  • store features in optimized, relatively fast storage, with comprehensive metadata, to allow fast access and low-latency querying by the ML algorithm
  • help with caching, deal with data types, and bridge multiple platforms

Chapter 6 Training Models with Spark MLlib

Supervised ML

Classification:

  • binary, multiclass, multilabel
  • APIs: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, MultilayerPerceptronClassifier, LinearSVC, OneVsRest, NaiveBayes, FMClassifier
  • Multilabel classifiers: not supported by MLlib; use PyTorch/TensorFlow, or train a binary classifier for each label
  • imbalanced class labels: filter the more representative classes and sample them to downsize the number of entries; use ensemble algorithms based on decision trees, such as GBTClassifier, GBTRegressor, or RandomForestClassifier, and set featureSubsetStrategy to auto (see the sketch after this list)
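
A minimal, assumed example (not from the book) of fitting one of the classifiers listed above with featureSubsetStrategy set to auto; the toy data is invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (0.0, Vectors.dense([0.1, 1.2])),
], ["label", "features"])

rf = RandomForestClassifier(numTrees=20, featureSubsetStrategy="auto")
model = rf.fit(train)
model.transform(train).select("label", "prediction").show()
```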

Regression:

  • simple, multiple, multivariate
  • FeatureHasher: to speed up the process of indexing and hashing; of type SparseVector
  • UnivariateFeatureSelector: select features automatically
  • AFTSurvivalRegression, DecisionTreeRegressor, GBTRegressor, FMRegression

Recommendation Systems

  • Content-based, Collaborative filtering, Neural networks
  • ALS for collaborative filtering
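
A hedged ALS sketch for collaborative filtering; the ratings data and hyperparameters are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(2).show(truncate=False)   # top-2 item recommendations per user
```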

Unsupervised ML

  • Frequent Pattern Mining
  • Clustering: LDA, GaussianMixture, KMeans, BisectingKMeans, PowerIterationClustering (the Lin and Cohen algorithm); compare the likelihood scores between models.

Evaluating

  • Supervised Evaluators: confusion matrix; BinaryClassificationEvaluator, MulticlassClassificationEvaluator (hammingLoss), MultilabelClassificationEvaluator (microPrecision, microRecall), RegressionEvaluator (mse, rmse), RankingEvaluator (RankingMetrics API)
  • Unsupervised Evaluators: ClusteringEvaluator computes Silhouette measure.

Hyperparameters and Tuning Experiments

  • Building a parameter grid: ParamGridBuilder()
  • by default, TrainValidationSplit uses 75% of the data for training and 25% for validation (see the sketch after this list)
  • CrossValidator
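
A sketch of the tuning APIs above; the estimator, grid values, and evaluator are arbitrary, and train_df is assumed to be a prepared DataFrame with label and features columns.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.maxIter, [10, 50])
        .build())
evaluator = BinaryClassificationEvaluator()

# k-fold cross-validation over the grid
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)

# Single split; trainRatio defaults to 0.75 (75% train / 25% validation)
tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)

# best_model = cv.fit(train_df).bestModel  # assuming a prepared train_df
```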

Machine Learning Pipelines:

  • Transformer and Estimator
  • Constructing a pipeline with a Pipeline object
  • Persistence: .write().save(path); you can also export the model to a portable format like ONNX (see the sketch below)
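
A small Pipeline sketch with persistence; the stages, columns, and path are illustrative, and train_df is assumed.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="color", outputCol="color_idx")                    # Estimator
assembler = VectorAssembler(inputCols=["color_idx", "size"], outputCol="features")  # Transformer
lr = LogisticRegression(featuresCol="features", labelCol="label")                   # Estimator

pipeline = Pipeline(stages=[indexer, assembler, lr])
# model = pipeline.fit(train_df)                        # assuming a prepared train_df
# model.write().overwrite().save("models/lr_pipeline")  # placeholder path
```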

Chapter 7 Bridging Spark and Deep Learning Frameworks

Two Clusters Approach

  • One dedicated cluster for running all Spark workloads; save the data into a distributed filesystem or object store and use it in the second cluster, the dedicated DL cluster.

Implementing a Dedicated Data Access Layer (DAL)

  • A DAL is responsible for the accessibility of the storage.
  • Features of a DAL: Scalable and support distributed systems; discoverability with a dedicated data catalog like Hive Metastore; support columnar formats; allow for row filtering; data versioning

Petastorm

  • Open source data access library that allows us to train and evaluate deep learning models directly using multiterabyte datasets in Apache Parquet format.
  • support efficient implementations of row filtering, data sharding, shuffling, access a subset of fields, and handle time series data
  • Two options to use Petastorm: 1) leverage it as a converter or translator and keep the data in strict Parquet format; 2) integrate the Petastorm format with the Apache Parquet store, saving the data into a dedicated Petastorm dataset
  • SparkDatasetConverter: reads data from the cache or persists it in Parquet. It caches intermediate files and cleans out the cache. Inputs: parquet_row_group_size_bytes, compression_codec, dtype. We can define additional preprocessing functions using TransformSpec (see the sketch after this list).
  • Petastorm as a Parquet store: define a new schema with UnischemaField; supports various codec types for images, scalar data, etc.
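
A hedged sketch of the first (converter) option; the cache directory and DataFrame are placeholders.

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

df = spark.range(100).withColumnRenamed("id", "feature")   # placeholder DataFrame
converter = make_spark_converter(df)                       # materializes the DataFrame as cached Parquet

with converter.make_tf_dataset(batch_size=16) as dataset:
    pass  # dataset is a tf.data.Dataset ready for model.fit(...)

converter.delete()  # clean up the cached files
```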

Project Hydrogen

  • improve Apache Spark’s support for deep learning/neural network distributed training
  • Barrier Execution Mode: create gates or barriers between sets of operations and make the operations across barriers sequential
  • Accelerator-Aware Scheduling: validate that the distributed system scheduler, responsible for managing the operations and available resources, is aware of the availability of GPUs for hardware acceleration

Horovod Estimator API

  • Horovod’s goal is to allow a single-GPU training script to be distributed across multiple GPUs
  • The Horovod Estimator hides the complexity of gluing Spark DataFrames to a deep learning training script.
  • Horovod helps us configure GPUs and define a distributed training strategy.

Chapter 8 TensorFlow Distributed Machine Learning Approach

Tensorflow Basics:

  • tf.Tensor (immutable multidimensional array), tf.Variable (mutable), tf.Operation (a node in a graph that performs some computation), tf.function (annotation that compiles a function into a callable TF graph), tf.Graph (graph execution: operations are added to a tf.Graph to be executed later), tf.Module, model.compile (configures a model for training, specifying the loss function, optimizer, and metrics)
  • Neural networks: tf.keras.layers.Layer, tf.keras.Model
  • TensorFlow cluster process roles and responsibilities: worker, parameter server (PS: keeps track of variables’ values and state), chief (similar to a worker but with additional responsibilities related to cluster health), evaluator

Loading Parquet Data into a TensorFlow DataSet

  • tf.data.Dataset. A TF dataset is an abstraction of the actual data.
  • TF doesn’t support loading Parquet out of the box.
  • Petastorm: use the make_petastorm_dataset function that creates a tf.data.Dataset instance, using an instance of petastorm.reader.Reader.
  • Use make_batch_reader to create the reader: it normalizes the dataset URL, locates filesystem paths, analyzes the Parquet metadata, validates the cache format, determines the reader pool type, and returns the reader instance (which encapsulates ArrowReaderWorker, which in turn wraps pyarrow); see the sketch below
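
A sketch of the Parquet-to-tf.data path described above; the dataset URL is a placeholder.

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader("file:///tmp/parquet_dataset") as reader:  # placeholder URL
    dataset = make_petastorm_dataset(reader)   # a tf.data.Dataset over Parquet row groups
    for batch in dataset.take(1):
        print(batch)
```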

TensorFlow’s Distributed Machine Learning Strategies (tf.distribute.Strategy)

  • ParameterServerStrategy: the oldest approach; each machine takes on the role of a worker or a parameter server, and TF divides the tasks into worker tasks and parameter server tasks. Rules: variables are stored on parameter servers and are read/updated by workers during training, each variable is stored on a single parameter server, workers perform tasks independently and communicate only with parameter servers via asynchronous remote procedure calls (RPCs), and there may be one or multiple parameter servers.
  • CentralStorageStrategy (one machine, multiple processors): one machine with multiple CPUs and GPUs where CPUs hold variables and GPUs execute operations.
  • MirroredStrategy (one machine, multiple processors, local copy): every processor that holds a replica of the training operations’ logic also holds its own local copy of every variable. Use an all-reduce algorithm to ensure identical updates across processors.
  • MultiWorkerMirroredStrategy (multiple machines, synchronous): synchronous training distributed across machines, each of which can have more than one processor. Each variable is replicated and synced across machines and processors. Good when there is good connectivity between machines.
  • TPUStrategy: for synchronous training on TPUs and TPU pods, supports communication between individual processors within a machine.
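
These strategies share the same usage pattern: create the strategy and build/compile the model inside its scope. A minimal sketch with MirroredStrategy (the model is an arbitrary example):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # or MultiWorkerMirroredStrategy(), etc.
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset)  # training then runs with variables mirrored across replicas
```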

Training APIs

  • Keras API
  • Custom Training Loop: TF’s custom training loop API provides granular control over training and evaluating loops built from scratch.
  • Estimator API: This is an old API that should not be used for new code.

Chapter 9 PyTorch Distributed Machine Learning Approach

PyTorch Overview

  • Computation Graph: dynamic; the graph is built at run time and execution starts before the graph is complete; automatic differentiation engine calculates the gradients automatically before the backward computation even starts
  • PyTorch Mechanics and Concepts: torch.Tensor, torch.autograd, AutogradMeta, Variable, torch.layout, torch.mm, torch.utils.data.DataLoader, torch.optim.Optimizer

PyTorch Distributed Strategies for Training Models

  • PyTorch’s Distributed Approach: torch.distributed, three main components: 1) distributed data-parallel training DDP: data parallelism at the application module level; 2) RPC-based distributed training: supports training process at a higher level and provides mechanisms to enable remote communication between machines; 3) collective communication c10d: expand the communication structure and support sending tensors across processes within a group
  • Distributed Data-Parallel Training (DDP): divides gradient communication into buckets, which hold the indices of where each value in the input belongs and can consist of one or multiple layers (see the sketch after this list)
  • RPC-Based Distributed Training: enables a program on the local machine to start a program on another machine; enables the implementation of a parameter server, model parallelism, and pipeline parallelism; APIs can be grouped into four categories: remote execution (init_rpc), remote references RRefs (a distributed shared pointer to an object), distributed autograd (stitch together the local autograd engines on all the machines so the backward pass can be run in a distributed fashion), distributed optimizer
  • Communication Topologies: collective communication is defined as any communication that involves a group of processes/machines; all_reduce, all_gather, peer-to-peer communication APIs; guidelines for which backend to use: rule of thumb use NCCL for distributed GPU training and use Gloo for distributed CPU training; collective communication APIs: scatter, gather, reduce, all_reduce, all_gather, broadcast; peer-to-peer communication: send, recv, isend, irecv
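
A hedged DDP sketch for one worker process; the model and data are toy examples, and the rendezvous environment variables (MASTER_ADDR, RANK, WORLD_SIZE, ...) are assumed to be set by a launcher such as torchrun.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker():
    dist.init_process_group(backend="gloo")   # rule of thumb: NCCL for GPUs, Gloo for CPUs
    model = torch.nn.Linear(8, 1)
    ddp_model = DDP(model)                    # gradients are all-reduced in buckets
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x, y = torch.randn(32, 8), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                           # triggers bucketed all-reduce across workers
    optimizer.step()
    dist.destroy_process_group()
```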

Loading Data with PyTorch and Petastorm:

  • open a Petastorm reader on the Parquet file directory and create a PyTorch DataLoader based on this reader (see the sketch after this list)
  • Limitations: NumPy strings, arrays of strings, object arrays, and object classes are not supported, need to process the data first
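
A sketch of that reader-plus-DataLoader flow, assuming the URL points to a Petastorm-format dataset (placeholder path):

```python
from petastorm import make_reader
from petastorm.pytorch import DataLoader

with DataLoader(make_reader("file:///tmp/petastorm_dataset"), batch_size=64) as loader:
    batch = next(iter(loader))   # a dict of column name -> torch tensor
```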

Troubleshooting Guidance

  • mismatched data types
  • straggling workers

PyTorch vs. TensorFlow

  • viz and debugging: TensorFlow has better tools
  • computation graph: dynamic vs. static
  • programming limitations: PyTorch is less generic; TensorFlow has lots of boilerplate code
  • language support for model loading: TensorFlow supports model loading in other languages like Java and C++
  • supported deployment options: TorchScript vs. TensorFlow Serving, a Flask web server, or mobile

Chapter 10 Deployment Patterns for Machine Learning Models

Deployment Patterns

  • Batch Prediction: Pros — easy to implement, scale easily to large data, provide results quickly, Cons — harder to scale with complex inputs, might not be able to compute all possible outputs in the time available, users might get outdated or stale predictions
  • Model-in-Service: the model is packaged up with the client-facing app and deployed to a web server; the server loads the model and makes predictions in real time. Issues: the web server backend may be written in a different language, coupling the app and the model means they share the same release schedule, the model may compete for resources with other functions, web server hardware is not optimized for models, and it is hard to scale
  • Model-as-a-Service: deploy the model itself as a service to avoid coupling with the backend server hardware and possibly conflicting scaling requirements. Drawbacks: add latency to every call, add the complexity of deploying a new service and managing the interaction with it, need to monitor this service
  • Which pattern to use: the batch processing method has the highest throughput and the highest latency, while real-time methods have the lowest throughput and the lowest latency; consider the level of guarantee around missing a deadline: hard, firm, or soft; real-time vs. near real-time

Production Software Requirements:

  • model application deployment, model package deployment, dependency management, model runtime environment, REST APIs, performance optimization, concurrency, model distillation/compression, caching layer, horizontal scaling, managed options

Monitoring ML in Production

  • Data Drift: instantaneous drift, gradual drift, periodic drift, temporary drift
  • Model Drift, Concept Drift
  • Distributional Domain Shift
  • What Metrics to Monitor? model metrics, business metrics, model predictions vs. actual behavior, hardware/networking metrics
  • How to Measure Change? define a reference, then measure the reference against fresh metric values: rule-based distance metrics, D1 distance, the Kolmogorov-Smirnov statistic, KL divergence

The Production Feedback Loop

  • two data pipelines: one compares recent data to the reference data to detect data drift, and one compares the model’s predictions with the actual results to detect model drift

Deploying with MLlib

  • MLlib’s model format holds metadata as well as the actual data of the model; all this information helps you load the model in a different environment
  • Structured Streaming: set up the pipeline with a schema, a stream reader, and the ML model (see the sketch after this list); you can capture predictions using two data streams and run a stream-stream join on the two streaming DataFrames, which is hard to do efficiently (think of it as micro-batch joins)
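
A hedged sketch of the basic streaming-inference setup (schema, stream reader, model); the schema, paths, and saved pipeline are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("size", DoubleType()), StructField("weight", DoubleType())])

stream = spark.readStream.schema(schema).parquet("incoming/")   # stream reader with explicit schema
model = PipelineModel.load("models/lr_pipeline")                # previously trained pipeline (placeholder)

predictions = model.transform(stream)                           # applied per micro-batch
query = (predictions.writeStream
         .format("memory").queryName("preds")
         .outputMode("append")
         .start())
```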

Deploying with MLflow

  • Two deployment options: as a microservice and as a UDF in a Spark flow
  • Define an ML wrapper: mlflow.pyfunc.PyFuncModel wraps the model and its metadata (the MLmodel file); the interface has three functions: __init__, load_context, and predict (predict must always be implemented)
  • Deploying the model as a microservice: mlflow.deployments.BaseDeploymentClient exposes APIs that enable deployment to custom serving tools
  • Loading the model as a Spark UDF: use the spark_udf function to combine the UDF with the Spark DataFrame and create a new Spark DataFrame (see the sketch below)
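
A sketch of the UDF option; the model URI and feature columns are placeholders.

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0)], ["f1", "f2"])

# Load a registered model as a vectorized Spark UDF and score a DataFrame with it
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn-classifier/Staging")
df.withColumn("prediction", predict_udf("f1", "f2")).show()
```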

Develop Your System Iteratively

  • Crawl: manual deployment operation
  • Walk: add automated testing and other automation
  • Run: add more testing, fine-tune the alerts, capture any changes in data, monitor throughput and results
  • Fly: connect the feedback loop of alerts and capturing data drift and changes in production together with triggering a new training process
