Required readings
What went wrong? Where? How to fix?
Version all models involved!
Example adapted from Jon Peck. Chaining machine learning models in production with Algorithmia. Algorithmia blog, 2019
Image: Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE. 2020.
Assume you are receiving complaints that a child gets many recommendations for R-rated movies
In a group, discuss how you could address this in your own system and post to #lecture, tagging team members:
K.G Orphanides. Children's YouTube is still churning out blood, suicide and cannibalism. Wired UK, 2018; Kristie Bertucci. 16 NSFW Movies Streaming on Netflix. Gadget Reviews, 2020
InquiryID,CustomerID,InquiryDate,LoanType,LoanAmount,AccountStatus,PaymentStatus
1001,001,2020-01-15,Mortgage,250000,Open,Current
1002,002,2020-02-20,Auto Loan,20000,Closed,Paid Off
1003,003,2020-03-05,Credit Card,5000,Open,Late (30 days)
1004,004,2020-04-10,Personal Loan,10000,Open,Current
1005,005,2020-05-15,Student Loan,30000,Closed,Paid Off
1006,001,2020-06-20,Mortgage,200000,Open,Current
1007,002,2020-07-25,Credit Card,7000,Open,Late (60 days)
1008,003,2020-08-30,Auto Loan,15000,Closed,Paid Off
1009,004,2020-09-10,Personal Loan,8000,Open,Current
1010,005,2020-10-15,Credit Card,10000,Open,Late (90 days)
(example customer data from the credit scenario)
createUser(id=5, name="Christian", dpt="SCS")
updateUser(id=5, dpt="ISR")
deleteUser(id=5)
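One versioning strategy for such mutable records is to keep the change operations themselves as an append-only log and reconstruct any past state by replaying them. A minimal sketch of that idea (the Event record and in-memory replay are illustrative, not tied to any particular database):

from dataclasses import dataclass, field

@dataclass
class Event:
    op: str                  # "create", "update", or "delete"
    user_id: int
    fields: dict = field(default_factory=dict)

def replay(events: list[Event]) -> dict[int, dict]:
    """Reconstruct the user table at any point by replaying the event log."""
    users: dict[int, dict] = {}
    for e in events:
        if e.op == "create":
            users[e.user_id] = dict(e.fields)
        elif e.op == "update":
            users[e.user_id].update(e.fields)
        elif e.op == "delete":
            users.pop(e.user_id, None)
    return users

log = [
    Event("create", 5, {"name": "Christian", "dpt": "SCS"}),
    Event("update", 5, {"dpt": "ISR"}),
    Event("delete", 5),
]
print(replay(log))        # {}  -- user 5 no longer exists
print(replay(log[:2]))    # state as of the second event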
Scott Chacon and Ben Straub. Pro Git. 2014
Usually no meaningful delta/compression, version as binary objects
Any system to track versions of blobs
Associate model version with pipeline code version, data version, and hyperparameters (see the sketch below)!
Pipelines depend on many frameworks and libraries
Ensure reproducible builds
Optionally: Version entire environment (e.g. Docker container)
Test build/pipeline on independent machine (container, CI server, ...)
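To associate a model version with the pipeline code version, data version, and hyperparameters mentioned above, one option is to write a small metadata record next to each trained model. A minimal sketch, assuming the training code lives in a git repository and the training data fits in a single file (file names and the version string are hypothetical):

import hashlib, json, subprocess
from datetime import datetime, timezone

def data_hash(path: str) -> str:
    """Hash the training data file so the exact data version is recorded."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def write_model_metadata(model_version: str, data_path: str, hyperparams: dict) -> None:
    metadata = {
        "model_version": model_version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        # pipeline code version: current git commit of the training code
        "code_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip(),
        "data_sha256": data_hash(data_path),
        "hyperparameters": hyperparams,
    }
    with open(f"model_{model_version}.metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

# e.g., after training:
# write_model_metadata("2024-03-01a", "data/train.csv", {"regularization": 0.5})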
Tracking data, pipeline, and model versions
Modeling pipelines: inputs and outputs and their versions
Often tracking also metadata about versions
# track the data directory with DVC (creates images.dvc to commit to git)
dvc add images
# declare a pipeline stage with its input data and output model
dvc run -d images -o model.p python cnn.py
# configure remote storage (default) and upload versioned data and models
dvc remote add -d myrepo s3://mybucket
dvc push
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - data/clean
    params:
      - levels.no
    outs:
      - features
    metrics:
      - performance.json
  training:
    desc: Train model with Python
    cmd:
      - pip install -r requirements.txt
      - python train.py --out ${model_file}
    deps:
      - requirements.txt
      - train.py
      - features
    outs:
      - ${model_file}:
          desc: My model description
    plots:
      - logs.csv:
          x: epoch
          x_label: Epoch
    meta: 'For deployment'
    # User metadata and comments are supported
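With such a dvc.yaml checked into the repository, dvc repro re-executes the pipeline and skips stages whose declared dependencies have not changed, while dvc push and dvc pull synchronize the versioned data and models with remote storage.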
Log information within pipelines: hyperparameters used, evaluation results, and model files
Many tools: MLflow, ModelDB, Neptune, TensorBoard, Weights & Biases, Comet.ml, ...
from verta import Client
client = Client("http://localhost:3000")
proj = client.set_project("My first ModelDB project")
expt = client.set_experiment("Default Experiment")
# log the first run
run = client.set_experiment_run("First Run")
run.log_hyperparameters({"regularization" : 0.5})
run.log_dataset_version("training_and_testing_data", dataset_version)
model1 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model1, validationData))
run.log_model(model1)
# log the second run
run = client.set_experiment_run("Second Run")
run.log_hyperparameters({"regularization" : 0.8})
run.log_dataset_version("training_and_testing_data", dataset_version)
model2 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model2, validationData))
run.log_model(model2)
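The same pattern works with MLflow, one of the other tools listed above; a minimal sketch, reusing the placeholder training code, accuracy helper, and validation data from the ModelDB example (the experiment name and model.p file are assumptions):

import mlflow

mlflow.set_experiment("My first MLflow experiment")

def log_run(regularization):
    with mlflow.start_run():
        mlflow.log_param("regularization", regularization)
        model = ...  # model training code goes here
        mlflow.log_metric("accuracy", accuracy(model, validationData))
        mlflow.log_artifact("model.p")  # store the serialized model file with the run

log_run(0.5)
log_run(0.8)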
Automatically derive data dependencies from system log files
Track metadata for each table
No manual tracking/dependency declarations needed
Requires homogeneous infrastructure
Similar systems for tracking inside databases, MapReduce, Spark, etc.
Decide which model version to run where
Track which prediction has been performed with which model version (logging)
Key goal: If a customer complains about an interaction, can we reproduce the prediction with the right model? Can we debug the model's pipeline and data? Can we reproduce the model?
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
Ensure all predictions are logged
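A minimal sketch of a wrapper that logs every prediction in the format above (the model object's predict method, the feature encoding, and the CSV log destination are illustrative assumptions):

import csv
from datetime import datetime, timezone

MODEL_NAME = "movie-recommender"     # assumed model identifier
MODEL_VERSION = "2024-03-01a"        # assumed deployed version

def predict_and_log(model, features: dict, log_path: str = "predictions.csv"):
    """Run a prediction and append <date>,<model>,<model version>,<feature inputs>,<output>."""
    output = model.predict(features)
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            MODEL_NAME,
            MODEL_VERSION,
            features,
            output,
        ])
    return output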
Historical record of data and its origin
Whenever a value is changed, record what changed, when, by whom, and through which process
Embedded as a feature in some databases or implemented in business logic
Possibly signed with cryptographic methods
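A minimal sketch of such a change record as an append-only log in business logic, with hash chaining as a simple cryptographic safeguard (field names and the chaining scheme are illustrative assumptions):

import hashlib, json
from datetime import datetime, timezone

provenance_log = []  # append-only; in practice a database table

def record_change(record_id: str, field: str, new_value, changed_by: str, process: str):
    """Append a provenance entry; chain hashes so later tampering is detectable."""
    prev_hash = provenance_log[-1]["hash"] if provenance_log else ""
    entry = {
        "record_id": record_id,
        "field": field,
        "new_value": new_value,
        "changed_by": changed_by,
        "process": process,
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        (prev_hash + json.dumps(entry, sort_keys=True, default=str)).encode()
    ).hexdigest()
    provenance_log.append(entry)

record_change("customer-001", "PaymentStatus", "Late (30 days)",
              changed_by="batch-import", process="monthly-bureau-update")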
Document all data sources
Identify all model dependencies and flows
Ideally model all data and processing code
Avoid "visibility debt"
(Advanced: Use infrastructure to automatically capture/infer dependencies and flows as in Goods)
How are features extracted from raw data?
Has feature extraction changed since the model was trained?
Recommendation: Modularize and version feature extraction code
Example?
Stores feature extraction code as functions, versioned
Catalog features to encourage reuse
Compute and cache features centrally
Use the same feature code in training and inference
Advanced: Immutable features -- never change existing features, just add new ones (e.g., creditscore, creditscore2, creditscore3)
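A minimal sketch of this recommendation with versioned, immutable feature functions (the registry and feature definitions are illustrative, not a specific feature-store API):

FEATURES = {}  # simple registry mapping feature name -> extraction function

def feature(name):
    """Register a feature extraction function under a fixed, versioned name."""
    def decorator(fn):
        FEATURES[name] = fn
        return fn
    return decorator

@feature("creditscore")
def creditscore(customer):
    # original definition: never modified once models depend on it
    return customer["payments_on_time"] / max(customer["payments_total"], 1)

@feature("creditscore2")
def creditscore2(customer):
    # improved definition added as a *new* feature; the old one stays available
    base = creditscore(customer)
    return base * (0.5 if customer["open_late_loans"] > 0 else 1.0)

# training and inference both look features up by name, so they stay in sync
def extract(customer, names):
    return [FEATURES[n](customer) for n in names]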
How was the model trained?
What data? What libraries? What hyperparameters? What code?
Ensemble of multiple models?
Example adapted from Jon Peck. Chaining machine learning models in production with Algorithmia. Algorithmia blog, 2019
Image: Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE. 2020.
Data provenance
Feature provenance
Model provenance
Assume you are receiving complaints that a child gets mostly recommendations for R-rated movies
Discuss again, updating the previous post in #lecture:
Replicability: ability to reproduce results exactly
Reproducibility: the ability of an experiment to be repeated with minor differences, achieving a consistent expected result
Juristo, Natalia, and Omar S. Gómez. "Replication of software engineering experiments." In Empirical software engineering and verification, pp. 60-88. Springer, Berlin, Heidelberg, 2010.
2019 Study of 1.4M notebooks on GitHub:
2020 Study of 936 executable notebooks:
🗎 Pimentel, João Felipe, et al. "A large-scale study about quality and reproducibility of Jupyter notebooks." In Proc. MSR, 2019. and 🗎 Wang, Jiawei, Tzu-Yang Kuo, Li Li, and Andreas Zeller. "Assessing and restoring reproducibility of Jupyter notebooks." In Proc. ASE, 2020.
Ability to generate the same research results or predictions
Recreate model from data
Requires versioning of data and pipeline (incl. hyperparameters and dependencies)
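Exact replication additionally requires keeping randomness and nondeterminism under control; a minimal sketch of pinning the common seed sources in Python (ML frameworks typically need their own seed calls on top of this):

import os, random
import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Pin the usual sources of randomness so reruns produce identical results."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only effective if also set before interpreter startup
    random.seed(seed)
    np.random.seed(seed)

set_seeds(42)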
Provenance is important for debugging and accountability
Data provenance, feature provenance, model provenance
Reproducibility vs replicability
Version everything!
See also Hulten. Building Intelligent Systems. Chapter 21
See also Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "On human intellect and machine failures: troubleshooting integrative machine learning systems." In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 1017-1025. 2017.
Example and image from: Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "On human intellect and machine failures: troubleshooting integrative machine learning systems." In Proc. AAAI. 2017.