Machine Learning as a Flow, Continued: Kubeflow vs. Metaflow
Updated: 4 days ago
After running both tools in dev and in production, here is my comparative analysis of Kubeflow (2018, Google) and Metaflow (2019, Netflix). At a high level, both frameworks serve as an ML Platform: they integrate ML components with storage components, support experiment logging, and make ML experiments reproducible.
I recommend these videos to illustrate the purpose of a Machine Learning Platform: 3 min: https://youtu.be/sdbBcPuvw40 "Spell: Next Generation Machine Learning Platform"; 33 min: https://youtu.be/lu5zHvpQeSI "Managing ML in Production with Kubeflow and DevOps - David Aronchick, Microsoft"
Also a 10-minute Metaflow read: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4
If I had to explain in one word why one needs MLOps, it would be "SCALE". Do you want to add "SCALE" to your ML?
Different teams want different things from a Machine Learning Platform.

Data Scientists/ML Engineers:
- work toward improving models' accuracy
- need help scaling their experiments, such as compute and data-processing loads

DevOps/Platform Engineers:
- do not work on improving models' accuracy
- maintain the clusters and run the production platform that supports the ML teams
Below is my comparative analysis.
Distributed computation
Both offer full support for pipelines and components running in parallel.
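As a minimal sketch of the fan-out/fan-in pattern both tools support, in plain Python (a thread pool stands in for real pipeline tasks; every name here is mine, not either tool's API):

```python
# Fan-out: run the same "component" over several input shards in parallel,
# then join the partial results. In Kubeflow these would be parallel
# pipeline tasks; in Metaflow, a foreach split followed by a join step.
from concurrent.futures import ThreadPoolExecutor

def train_shard(shard):
    # Stand-in component: "train" on one shard of data.
    return sum(shard)

shards = [[1, 2], [3, 4], [5, 6]]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(train_shard, shards))  # fan-out

print(partials)       # per-shard results
print(sum(partials))  # join step combines them -> 21
```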
Cluster
I like that Kubeflow provides Kubernetes 'under the hood'. That heads off future problems such as deploying on any cloud, since k8s is open-source. DevOps teams are often familiar with k8s, or willing to learn it, and there is a significant number of third-party k8s monitoring tools available.
Data exchange between pipeline components
(K) does not support direct communication between components; they exchange data through files. In my personal opinion, the strongest Metaflow feature is that the data passed between components is just Python objects.
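The contrast can be sketched in plain Python (neither Kubeflow nor Metaflow is required to run this; all names below are illustrative, not the tools' real APIs):

```python
import json
import os
import tempfile

# Kubeflow-style exchange: components run as isolated containers, so a
# step serializes its output to a file and the next step reads it back.
def kfp_style_producer(out_path):
    with open(out_path, "w") as f:
        json.dump({"rows": [1, 2, 3, 4]}, f)

def kfp_style_consumer(in_path):
    with open(in_path) as f:
        return sum(json.load(f)["rows"])

# Metaflow-style exchange: steps attach plain Python objects to the flow
# instance ("artifacts"); Metaflow pickles and stores them transparently.
class FlowState:
    pass

def metaflow_style(flow):
    flow.rows = [1, 2, 3, 4]     # step: start
    flow.total = sum(flow.rows)  # step: train
    return flow.total            # step: end

path = os.path.join(tempfile.mkdtemp(), "rows.json")
kfp_style_producer(path)
print(kfp_style_consumer(path))     # file-based exchange -> 10
print(metaflow_style(FlowState()))  # object-based exchange -> 10
```

The second style means no hand-written serialization code between steps, which is exactly the convenience being praised above.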
Goal (highly subjective)
(M) provides Python-level Machine Learning workflows for running ML experiments at scale and reproducibly.
(K) provides Docker-image-level Machine Learning workflows for running ML experiments at scale and reproducibly. In addition, Kubeflow also runs the Kubernetes cluster, which turns it into a powerful Machine Learning cluster.
Cloud
(M) locks you into AWS Batch and S3.
(K) allows unlimited cloud deployment: GCP, Azure, anything that runs k8s.
Model Serving
Via external tooling:
(K) - AI Hub, or TFX on Google Cloud Platform
(M) - SageMaker on AWS
Horizontal scaling
(K) is a Kubernetes cluster, with powerful support for scaling and monitoring.
(M) relies on AWS ECS.
Costs
Subjective, but I tend to think that similar computational loads may cost less on Google Cloud Platform than on AWS.
Here I want to summarize what one gets by deploying and utilizing a Machine Learning Platform built on Kubeflow or Metaflow:
Both Kubeflow and Metaflow support the most common Machine Learning scenarios, such as managing code, data, and dependencies for experiments. Popular Machine Learning use cases are described here: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4

At a high level, both (K) and (M) help with the following:
- Collaboration: keep track of experiments and share access to them.
- Resuming a run: a run failed (or was stopped intentionally); you can restart the workflow from where it failed/stopped.
- Hybrid runs: run one step of your workflow on high-memory CPUs (such as the data load and aggregation) and another, compute-intensive step (the model training) on low-memory GPUs.
- Inspecting experiments and using metadata: data scientists can tune hyperparameters on the same model and data.

Kubeflow only: running multiple versions of the same code at the same time on one cluster. One is free to choose the version, language, and packages for any step of the workflow; in Kubeflow this is done with components implemented as independent Docker images. This is not possible in Metaflow.
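The "hybrid run" idea boils down to each step declaring its own resource needs so it can be scheduled onto matching hardware. A minimal sketch: the @resources decorator below is a local stand-in modeled on Metaflow's real @resources step decorator; nothing here talks to an actual cluster.

```python
# Each step advertises its own resource needs; a scheduler (Metaflow via
# AWS Batch, Kubeflow via Kubernetes) would place it on matching hardware.
def resources(memory=4000, cpu=1, gpu=0):
    def wrap(fn):
        fn.resources = {"memory": memory, "cpu": cpu, "gpu": gpu}
        return fn
    return wrap

@resources(memory=64000, cpu=8)        # data load/aggregation: lots of RAM, no GPU
def load_and_aggregate():
    return list(range(10))

@resources(memory=8000, cpu=2, gpu=1)  # model training: GPU, modest memory
def train(rows):
    return sum(rows)

for step in (load_and_aggregate, train):
    print(step.__name__, step.resources)

print(train(load_and_aggregate()))     # 0+1+...+9 = 45
```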
Kubeflow: no lock-in to a particular cloud provider: it abstracts over open-source Kubernetes and Airflow (or other orchestration). However, using TensorFlow Extended (TFX) will lock you into Apache Beam, provided as Dataflow on Google Cloud Platform. (M) currently locks you into AWS SageMaker/S3/Batch.
Running heavy parallel, distributed computational loads in a stable way: an example of instability is a 'memory-thirsty' process aborting other processes by taking all the memory resources. Kubeflow uses Kubernetes to distribute the computational load.
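That stability comes from ordinary Kubernetes resource requests and limits: a container that exceeds its memory limit is OOM-killed instead of starving its neighbors. A hypothetical pod spec fragment (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-step                    # illustrative name
spec:
  containers:
  - name: trainer
    image: my-team/trainer:latest     # illustrative image
    resources:
      requests:                       # what the scheduler reserves
        memory: "8Gi"
        cpu: "2"
      limits:                         # hard cap; past this the container
        memory: "16Gi"                # is OOM-killed, not its neighbors
        cpu: "4"
```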