Machine Learning as a Flow, Continued: Kubeflow vs. Metaflow
Updated: Jul 17
After running both tools in our dev and production environments, here is my comparative analysis of Kubeflow (2018, Google) and Metaflow (2019, Netflix). At a high level, both Flows form the backbone of an ML Platform: they integrate ML components with storage components, support experiment logging, and make ML experiments reproducible.
Here are a few references that emphasize why we need MLOps:
- 3-min: https://youtu.be/sdbBcPuvw40 "Spell: Next Generation Machine Learning Platform"
- 33-min: https://youtu.be/lu5zHvpQeSI "Managing ML in Production with Kubeflow and DevOps - David Aronchick, Microsoft"
- 10-min Metaflow read: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4
If I had to explain in one word why you would need MLOps, it would be "SCALE". Do you want to add scale to your ML?
Different teams want different things from the Machine Learning Platform.
Data Science/ML Engineers:
- work toward improving models' accuracy
- need help scaling their experiments, such as compute and data-processing loads
MLOps/Platform Engineers:
- do not aim directly to improve models' accuracy or win a Kaggle competition
- do scale and facilitate the ML teams that want to win Kaggle competitions
Below is my comparative analysis.
Distributed computation
Both offer full support for pipelines and components running in parallel.
Cluster
I am a big fan of the Kubeflow design that provides Kubernetes 'under the hood'. That solves many problems: (1) deploy on any cloud, since k8s is open source; (2) DevOps/MLOps teams are already familiar with k8s, or willing to learn it; (3) a significant amount of third-party tooling is available for k8s monitoring.
Communication between pipeline components
(K) doesn't support communication between components beyond exchanging files, whereas (M) lets steps hand Python objects to one another as persisted artifacts. In my opinion, that is Metaflow's strongest selling point.
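To make the contrast concrete, here is a minimal sketch in plain Python (not the actual Kubeflow or Metaflow APIs; all names are illustrative): Kubeflow-style components exchange files on disk, while Metaflow-style steps persist arbitrary Python objects to a datastore and later steps load them back as objects.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Kubeflow-style exchange: one component writes a file, the next reads it.
def kfp_style_producer(out_path: Path) -> None:
    out_path.write_text(json.dumps({"rows": 3, "cols": 2}))

def kfp_style_consumer(in_path: Path) -> int:
    meta = json.loads(in_path.read_text())
    return meta["rows"] * meta["cols"]

# Metaflow-style exchange: objects assigned in one step are pickled to a
# datastore and arrive in later steps as objects (Metaflow does this
# automatically for attributes set on `self`; this class just mimics it).
class FakeDatastore:
    def __init__(self):
        self._blobs = {}
    def save(self, name, obj):
        self._blobs[name] = pickle.dumps(obj)
    def load(self, name):
        return pickle.loads(self._blobs[name])

store = FakeDatastore()

def metaflow_style_step_a(store):
    store.save("matrix", [[1, 2], [3, 4], [5, 6]])  # a real Python object

def metaflow_style_step_b(store):
    matrix = store.load("matrix")                   # arrives back as an object
    return sum(len(row) for row in matrix)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "meta.json"
    kfp_style_producer(path)
    print(kfp_style_consumer(path))   # 6

metaflow_style_step_a(store)
print(metaflow_style_step_b(store))   # 6
```

Both styles move the same data; the difference is that the object-passing style spares you from inventing file formats and serialization code for every step boundary.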
Goal (highly subjective)
(M) provides Python-based machine learning workflows for running ML experiments at scale. (K) does the same, but also runs Kubernetes "under the hood", which turns (K) into a powerful ML cluster.
Cloud
(M) is limited to AWS Batch and S3. (K) has unlimited deployment options: GCP, Azure, anything that runs k8s.
Serving (via external tooling)
(K) - AI Hub and TFX on Google Cloud Platform; (M) - SageMaker on AWS.
Horizontal scaling
(K) is a Kubernetes cluster, with powerful support for adding nodes and for deployment monitoring (TFX or custom). (M) relies on AWS ECS.
Costs
Subjective, but I tend to think that similar computational loads will cost less on Google Cloud than on AWS.
What you can get when your Machine Learning Platform is built on Kubeflow or Metaflow:
Both Kubeflow and Metaflow support the most common ML/DS scenarios by managing code, data, and dependencies for each experiment. Following the common scenarios in https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4, (K) and (M) at a high level deliver:
- (K) and (M) Collaboration: debug somebody else's error by pulling up their failed run state in your sandbox/laptop as-is.
- (M) Resuming a run: a run failed (or was stopped intentionally), you fixed the error in your code, and you want to restart the workflow from where it failed/stopped.
- (K) and (M) Hybrid runs: run one step of your workflow locally (maybe the data-load step, since the dataset is in your downloads folder), but run another, compute-intensive step (the model training) on the cloud.
- (K) and (M) Inspecting run metadata: three data scientists have been tuning hyperparameters to get better accuracy on the same model; now you want to analyze all their training runs and pick the best-performing hyperparameter set.
- (K) and (M) Multiple versions of the same package: you are free to choose the version, language, and packages for any step of your workflow.
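As a toy illustration of the "inspecting run metadata" scenario (plain Python; in practice you would query Metaflow's Client API or the Kubeflow Pipelines UI/SDK, and all records below are hypothetical), picking the winning hyperparameter set across colleagues' runs reduces to a query over logged run records:

```python
# Hypothetical run records, as a tracking backend might return them:
# each run logs who ran it, the hyperparameters used, and the metric.
runs = [
    {"user": "alice", "params": {"lr": 0.10, "depth": 4}, "accuracy": 0.87},
    {"user": "bob",   "params": {"lr": 0.01, "depth": 8}, "accuracy": 0.91},
    {"user": "carol", "params": {"lr": 0.05, "depth": 6}, "accuracy": 0.89},
]

# Analyzing "all their training runs" is then a one-liner:
best = max(runs, key=lambda r: r["accuracy"])
print(best["user"], best["params"])  # bob {'lr': 0.01, 'depth': 8}
```

The value of the platform is that these records exist at all: every run's code version, parameters, and results are captured automatically, so the comparison needs no manual bookkeeping.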
(K) Not locking into a particular cloud provider: (K) abstracts over open-source Kubernetes and Airflow/other orchestration. However, using TensorFlow Extended (TFX) will lock you into Apache Beam, provided as Dataflow on Google Cloud Platform. (M) currently locks you into AWS SageMaker/S3/Batch.
Running heavy parallel-distributed computational loads in a stable way: an example of unstable cluster management is when some 'memory-thirsty' process causes other processes to abort for lack of memory. Another example is resource utilization: GPU components from one pipeline should be able to run in parallel with CPU components from another pipeline. (K) uses Kubernetes to distribute the computational load; one potential solution is to declare memory/GPU requirements explicitly in the pipeline's components.
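In Kubernetes terms, declaring those requirements explicitly means putting resource requests and limits on the pod that runs the component. A hypothetical training-step spec (image and names are illustrative, not from any real pipeline) might look like:

```yaml
# Hypothetical pod spec for one pipeline component.
apiVersion: v1
kind: Pod
metadata:
  name: train-step
spec:
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # illustrative image name
    resources:
      requests:
        memory: "8Gi"        # reserved up front, so neighbours can't starve it
        cpu: "2"
      limits:
        memory: "8Gi"        # an overrunning pod is killed, not its neighbours
        nvidia.com/gpu: 1    # GPU steps and CPU-only steps can then coexist
```

With requests and limits declared, the Kubernetes scheduler can pack GPU and CPU components from different pipelines onto the cluster safely, which is exactly the stability property described above.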
Integration with various tools across cloud providers: (K) runs containers; consequently, any API service calls can be run as one pipeline, as long as they can be isolated into a Docker container.