Required reading: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Proceedings of IEEE Big Data (2017)
Recommended readings:
All steps to create (and deploy) the model
Parameterize and use nbconvert
Large amounts of data
Distributed data storage
Distributed processing and learning
Special hardware needs
Fault tolerance
Humans in the loop
Danger of "silent" mistakes in many phases
Examples?
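For example, a minimal sketch (data and column names are made up) of how data-preparation code can fail silently: invalid values are coerced to NaN and unmatched rows are dropped by a join, without any error being raised.
import pandas as pd

# made-up data: one delivery count was recorded as 'n/a'
df = pd.DataFrame({'id': [1, 2, 3], 'delivery_count': ['5', '7', 'n/a']})

# coercion silently turns the invalid entry into NaN instead of raising an error
df['delivery_count'] = pd.to_numeric(df['delivery_count'], errors='coerce')

# an inner join silently drops rows without a matching label
labels = pd.DataFrame({'id': [1, 2], 'label': [0, 1]})
merged = df.merge(labels, on='id')

print(len(df), len(merged))  # 3 2 -- a row vanished without a warning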
After exploration and prototyping, build a robust pipeline
One-off model creation -> repeatable, automatable process
Enables updates, supports experimentation
Explicit interfaces with other parts of the system (data sources, labeling infrastructure, training infrastructure, deployment, ...)
Design for change
From experimental notebook code to production code
Each stage as a function or module
Well tested in isolation and together
Robust to changes in inputs (automatically adapt or crash, no silent mistakes)
Use good engineering practices (version control, documentation, testing, naming, code review)
# typical data science code from a notebook
import pandas as pd
from scipy.stats import boxcox
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data.csv', parse_dates=True)
# data cleaning
# ...
# feature engineering
df['month'] = pd.to_datetime(df['datetime']).dt.month
df['dayofweek'] = pd.to_datetime(df['datetime']).dt.dayofweek
df['delivery_count'] = boxcox(df['delivery_count'], 0.4)
df.drop(['datetime'], axis=1, inplace=True)
dummies = pd.get_dummies(df, columns=['month', 'weather', 'dayofweek'])
dummies = dummies.drop(['month_1', 'hour_0', 'weather_1'], axis=1)
X = dummies.drop(['delivery_count'], axis=1)
y = pd.Series(df['delivery_count'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# training and evaluation
lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))
How to test?
Controllability: can we set the inputs and state needed for a test?
Observability: can we observe and check the outputs and effects?
def encode_day_of_week(df):
    if 'datetime' not in df.columns: raise ValueError("Column datetime missing")
    if df.datetime.dtype != 'object': raise ValueError("Invalid type for column datetime")
    df['dayofweek'] = pd.to_datetime(df['datetime']).dt.day_name()
    df = pd.get_dummies(df, columns=['dayofweek'])
    return df
# ...

def prepare_data(df):
    df = clean_data(df)
    df = encode_day_of_week(df)
    df = encode_month(df)
    df = encode_weather(df)
    df.drop(['datetime'], axis=1, inplace=True)
    return (df.drop(['delivery_count'], axis=1),
            encode_count(pd.Series(df['delivery_count'])))

def learn(X, y):
    lr = LinearRegression()
    lr.fit(X, y)
    return lr
def pipeline():
    train = pd.read_csv('train.csv', parse_dates=True)
    test = pd.read_csv('test.csv', parse_dates=True)
    X_train, y_train = prepare_data(train)
    X_test, y_test = prepare_data(test)
    model = learn(X_train, y_train)
    accuracy = eval(model, X_test, y_test)
    return model, accuracy
Dataflow frameworks like Luigi, DVC, Airflow, d6tflow, and Ploomber support distribution, fault tolerance, monitoring, ...
Hosted offerings include Databricks and AWS SageMaker Pipelines
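As a rough sketch (the DAG name, schedule, and pipeline_steps module are illustrative assumptions, not part of the original pipeline), the stages above could be expressed as an Airflow DAG:
# sketch of the pipeline as an Airflow DAG (Airflow 2.4+ API)
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# hypothetical module containing the pipeline stage functions
from pipeline_steps import ingest_data, prepare_data, train_model, evaluate_model

with DAG(dag_id="delivery_model_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_data)
    prepare = PythonOperator(task_id="prepare", python_callable=prepare_data)
    train = PythonOperator(task_id="train", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)

    # each stage runs only after the previous one succeeds
    ingest >> prepare >> train >> evaluate
Retries, scheduling, monitoring, and distribution are then handled by the framework instead of hand-written glue code.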
def encode_day_of_week(df):
    if 'datetime' not in df.columns: raise ValueError("Column datetime missing")
    if df.datetime.dtype != 'object': raise ValueError("Invalid type for column datetime")
    df['dayofweek'] = pd.to_datetime(df['datetime']).dt.day_name()
    df = pd.get_dummies(df, columns=['dayofweek'])
    return df
def test_day_of_week_encoding():
    df = pd.DataFrame({'datetime': ['2020-01-01', '2020-01-02', '2020-01-08'], 'delivery_count': [1, 2, 3]})
    encoded = encode_day_of_week(df)
    assert "dayofweek_Wednesday" in encoded.columns
    assert (encoded["dayofweek_Wednesday"] == [1, 0, 1]).all()

# more tests...
df['Join_year'] = df.Joined.dropna().map(
    lambda x: x.split(',')[1].split(' ')[1])
df.loc[idx_nan_age, 'Age'].loc[idx_nan_age] = df['Title'].loc[idx_nan_age].map(map_means)
df["Weight"].astype(str).astype(int)
df['Reviws'] = df['Reviews'].apply(int)
df["Release Clause"] = df["Release Clause"].replace(regex=['k'], value='000')
df["Release Clause"] = df["Release Clause"].astype(str).astype(float)
Breaking code into functions/modules
Supports reuse, separate development, and testing
Can test individual parts
Source: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Proceedings of IEEE Big Data (2017)
(from S20 midterm; assume cloud or hybrid deployment)
Post in #lecture, tagging group members: suggest what top 2 tests to implement and how
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class AdjacencyListTest {
    @Test
    public void testSanityTest() {
        // set up
        Graph g1 = new AdjacencyListGraph(10);
        Vertex s1 = new Vertex("A");
        Vertex s2 = new Vertex("B");
        // check expected results (oracle)
        assertEquals(true, g1.addVertex(s1));
        assertEquals(true, g1.addVertex(s2));
        assertEquals(true, g1.addEdge(s1, s2));
        assertEquals(s2, g1.getNeighbors(s1)[0]);
    }

    // use abstraction, e.g. common setups
    private int helperMethod…
}
Specification
Controlled environment
Test inputs (calls and parameters)
Expected outputs/behavior (oracle)
Working code, failing tests
"Works on my machine"
Tests break frequently
How to avoid?
Controllability: can we set the inputs and state needed for a test?
Observability: can we observe and check the outputs and effects?
Think about testing when writing code
Unit testing encourages you to write testable code
Separate parts of the code to make them independently testable
Abstract functionality behind interface, make it replaceable
Bonus: Test-Driven Development is a design and development method in which you always write tests before writing code
Automate all build, analysis, test, and deployment steps from a command line call
Ensure all dependencies and configurations are defined
Ideally reproducible and incremental
Distribute work for large jobs
Track results
Key CI benefit: Tests are regularly executed, part of process
Track quality indicators over time, e.g., model accuracy
Many tools: MLflow, ModelDB, Neptune, TensorBoard, Weights & Biases, Comet.ml, ...
from verta import Client
client = Client("http://localhost:3000")
proj = client.set_project("My first ModelDB project")
expt = client.set_experiment("Default Experiment")
# log a training run
run = client.set_experiment_run("First Run")
run.log_hyperparameters({"regularization" : 0.5})
model1 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model1, validationData))
# original implementation hardcodes external API
def clean_gender(df):
    def clean(row):
        if pd.isnull(row['gender']):
            row['gender'] = gender_api_client.predict(row['firstname'], row['lastname'], row['location'])
        return row
    return df.apply(clean, axis=1)
def test_do_not_overwrite_gender():
    df = pd.DataFrame({'firstname': ['John', 'Jane', 'Jim'],
                       'lastname': ['Doe', 'Doe', 'Doe'],
                       'location': ['Pittsburgh, PA', 'Rome, Italy', 'Paris, PA '],
                       'gender': [np.nan, 'F', np.nan]})
    out = clean_gender(df, model_stub)
    assert (out['gender'] == ['M', 'F', 'M']).all()
def clean_gender(df, model):
    def clean(row):
        if pd.isnull(row['gender']):
            row['gender'] = model(row['firstname'],
                                  row['lastname'],
                                  row['location'])
        return row
    return df.apply(clean, axis=1)
Replace the concrete API with an interface that the caller can parameterize
def test_do_not_overwrite_gender():
    def model_stub(first, last, location):
        return 'M'

    df = pd.DataFrame({'firstname': ['John', 'Jane', 'Jim'], 'lastname': ['Doe', 'Doe', 'Doe'], 'location': ['Pittsburgh, PA', 'Rome, Italy', 'Paris, PA '], 'gender': [np.nan, 'F', np.nan]})
    out = clean_gender(df, model_stub)
    assert (out['gender'] == ['M', 'F', 'M']).all()
(Mocking frameworks provide infrastructure for expressing such tests compactly.)
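For instance, a minimal sketch using Python's built-in unittest.mock (the stub behavior and call-count check are illustrative):
from unittest.mock import Mock
import numpy as np
import pandas as pd

def test_do_not_overwrite_gender_with_mock():
    # stub that always predicts 'M' and records how it was called
    model_stub = Mock(return_value='M')
    df = pd.DataFrame({'firstname': ['John', 'Jane'],
                       'lastname': ['Doe', 'Doe'],
                       'location': ['Pittsburgh, PA', 'Rome, Italy'],
                       'gender': [np.nan, 'F']})
    out = clean_gender(df, model_stub)
    assert (out['gender'] == ['M', 'F']).all()
    # the model is only consulted for the row with the missing gender
    assert model_stub.call_count == 1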
Avoid silent errors
Recover locally if possible, propagate error if necessary -- fail entire task if needed
Explicitly handle exceptional conditions and mistakes
Test correct error handling
If logging only, is anybody analyzing log files?
def test_invalid_day_of_week_data():
    df = pd.DataFrame({'datetime_us': ['01/01/2020'],
                       'delivery_count': [1]})
    with pytest.raises(ValueError):
        encode_day_of_week(df)

def test_learning_fails_with_missing_data():
    df = pd.DataFrame({})
    with pytest.raises(NoDataError):
        learn(df)
Use stubs to inject artificial faults
## testing retry mechanism
from retry.api import retry_call
import pytest

# stub of a network connection, sometimes failing
class FailedConnection(Connection):
    remaining_failures = 0

    def __init__(self, failures):
        self.remaining_failures = failures

    def get(self, url):
        print(self.remaining_failures)
        self.remaining_failures -= 1
        if self.remaining_failures >= 0:
            raise TimeoutError('fail')
        return "success"

# function to be tested, with recovery mechanism
def get_data(connection, value):
    def get():
        return connection.get('https://replicate.npmjs.com/registry/' + value)
    return retry_call(get,
                      exceptions=TimeoutError, tries=3, delay=0.1, backoff=2)

# 3 tests for no problem, recoverable problem, and not recoverable
def test_no_problem_case():
    connection = FailedConnection(0)
    assert get_data(connection, '') == 'success'

def test_successful_recovery():
    connection = FailedConnection(2)
    assert get_data(connection, '') == 'success'

def test_exception_if_unable_to_recover():
    connection = FailedConnection(10)
    with pytest.raises(TimeoutError):
        get_data(connection, '')
Is invalid data rejected / repaired?
Are missing data updates raising errors?
Are unavailable APIs triggering errors?
Are failing deployments reported?
Even when reported or mitigated, log the issue
Allows later analysis of frequency and patterns
Monitoring systems can raise alarms for anomalies
from prometheus_client import Counter

connection_timeout_counter = Counter(
    'connection_retry_total',
    'Retry attempts on failed connections')

class RetryLogger():
    def warning(self, fmt, error, delay):
        connection_timeout_counter.inc()

retry_logger = RetryLogger()

def get_data(connection, value):
    def get():
        return connection.get('https://replicate.npmjs.com/registry/' + value)
    return retry_call(get,
                      exceptions=TimeoutError, tries=3, delay=0.1, backoff=2,
                      logger=retry_logger)
class MyNotificationService extends NotificationService {
    public boolean receivedNotification = false;
    public void sendNotification(String msg) {
        receivedNotification = true;
    }
}

@Test
void test() {
    Server s = getServer();
    MyNotificationService n = new MyNotificationService();
    Monitor m = new Monitor(s, n);
    s.stop();
    s.request(); s.request();
    wait();
    assert(n.receivedNotification);
}
Like fire drills (manual tests may be okay!)
Manual tests in production, repeat regularly
Actually take down service or trigger wrong signal to monitor
{
  "version": "1.0.0",
  "title": "What is the impact of an expired certificate on our application chain?",
  "description": "If a certificate expires, we should gracefully deal with the issue.",
  "tags": ["tls"],
  "steady-state-hypothesis": {
    "title": "Application responds",
    "probes": [
      {
        "type": "probe",
        "name": "the-astre-service-must-be-running",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "os.path",
          "func": "exists",
          "arguments": {
            "path": "astre.pid"
          }
        }
      },
      {
        "type": "probe",
        "name": "the-sunset-service-must-be-running",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "os.path",
          "func": "exists",
          "arguments": {
            "path": "sunset.pid"
          }
        }
      },
      {
        "type": "probe",
        "name": "we-can-request-sunset",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "timeout": 3,
          "verify_tls": false,
          "url": "https://localhost:8443/city/Paris"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "swap-to-expired-cert",
      "provider": {
        "type": "process",
        "path": "cp",
        "arguments": "expired-cert.pem cert.pem"
      }
    },
    {
      "type": "probe",
      "name": "read-tls-cert-expiry-date",
      "provider": {
        "type": "process",
        "path": "openssl",
        "arguments": "x509 -enddate -noout -in cert.pem"
      }
    },
    {
      "type": "action",
      "name": "restart-astre-service-to-pick-up-certificate",
      "provider": {
        "type": "process",
        "path": "pkill",
        "arguments": "--echo -HUP -F astre.pid"
      }
    },
    {
      "type": "action",
      "name": "restart-sunset-service-to-pick-up-certificate",
      "provider": {
        "type": "process",
        "path": "pkill",
        "arguments": "--echo -HUP -F sunset.pid"
      },
      "pauses": {
        "after": 1
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "swap-to-vald-cert",
      "provider": {
        "type": "process",
        "path": "cp",
        "arguments": "valid-cert.pem cert.pem"
      }
    },
    {
      "ref": "restart-astre-service-to-pick-up-certificate"
    },
    {
      "ref": "restart-sunset-service-to-pick-up-certificate"
    }
  ]
}
Usually assume ML libraries are already tested (pandas, sklearn, etc.)
Focus on custom code
Consider tests of latency, throughput, memory, ... (see the latency sketch below)
Test correct detection of problems
def test_invalid_day_of_week_data():
    ...
Test correct error handling or repair of detected problems
def test_fill_missing_gender():
    ...
def test_exception_for_missing_data():
    ...
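Returning to the latency point above, a minimal sketch of such a performance test (the model, input size, and time budget are illustrative, not from the original material):
import time
import numpy as np
from sklearn.linear_model import LinearRegression

def test_prediction_latency():
    # illustrative model and data; in practice, load the trained model under test
    model = LinearRegression().fit(np.random.rand(100, 10), np.random.rand(100))
    X = np.random.rand(1000, 10)

    start = time.perf_counter()
    model.predict(X)
    elapsed = time.perf_counter() - start

    # illustrative budget: batch prediction should finish within 500 ms
    assert elapsed < 0.5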
num = data.Size.replace(r'[kM]+$', '', regex=True).astype(float)
factor = data.Size.str.extract(r'[\d\.]+([KM]+)', expand=False)
factor = factor.replace(['k', 'M'], [10**3, 10**6]).fillna(1)
data['Size'] = num * factor.astype(int)

data["Size"] = data["Size"].replace(regex=['k'], value='000')
data["Size"] = data["Size"].replace(regex=['M'], value='000000')
data["Size"] = data["Size"].astype(str).astype(float)
Execute training with small sample data
Ensure shape of model and data as expected (e.g., tensor dimensions)
Test error handling for detecting connection/data problems
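A minimal sketch of such a training smoke test (sample size and feature names are made up; learn is the function from the pipeline above, which returns a LinearRegression, so coef_ is available):
import numpy as np
import pandas as pd

def test_training_on_small_sample():
    # tiny synthetic sample with the expected number of feature columns
    X = pd.DataFrame(np.random.rand(20, 3), columns=['f1', 'f2', 'f3'])
    y = pd.Series(np.random.rand(20))

    model = learn(X, y)

    # the fitted model and its predictions have the expected shapes
    assert model.coef_.shape == (3,)
    assert model.predict(X).shape == (20,)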
Test larger units of behavior
Often based on use cases or user stories -- customer perspective
@Test
void gameTest() {
    Poker game = new Poker();
    Player p = new Player();
    Player q = new Player();
    game.shuffle(seed);
    game.add(p);
    game.add(q);
    game.deal();
    p.bet(100);
    q.bet(100);
    p.call();
    q.fold();
    assert(game.winner() == p);
}
Test combined behavior of multiple functions
def test_cleaning_with_feature_eng():
    d = load_test_data()
    cd = clean(d)
    f = feature3.encode(cd)
    assert no_missing_values(f["m"])
    assert max(f["m"]) <= 1.0
// making predictions with an ensemble of models
function predict_price(data, models, timeoutms) {
    // send asynchronous REST requests to all models
    const requests = models.map(model => rpc(model, data, {timeout: timeoutms}).then(parseResult).catch(e => -1))
    // collect all answers and return average if at least two models succeeded
    return Promise.all(requests).then(predictions => {
        const success = predictions.filter(v => v >= 0)
        if (success.length < 2) throw new Error("Too many models failed")
        return success.reduce((a, b) => a + b, 0) / success.length
    })
}
// test ensemble of models
const timeout = 500, M1 = "http://localhost:3000/predict", ...

beforeAll(() => {
    // launch model 1 API at address M1
    // launch model 2 API at address M2
    // launch model API with timeout at address M3
})
afterAll(() => { /* shut down all model APIs */ })

test("success despite timeout", async () => {
    const start = performance.now();
    const val = await predict_price(input, [M1, M2, M3], timeout)
    expect(performance.now() - start).toBeLessThan(2 * timeout)
    expect(val).toBeGreaterThan(0)
})

test("fail on too many timeouts", async () => {
    const start = performance.now();
    await expect(predict_price(input, [M1, M3, M3], timeout)).rejects.toThrow()
    expect(performance.now() - start).toBeLessThan(2 * timeout)
})
def test_pipeline():
    train = pd.read_csv('pipelinetest_training.csv', parse_dates=True)
    test = pd.read_csv('pipelinetest_test.csv', parse_dates=True)
    X_train, y_train = prepare_data(train)
    X_test, y_test = prepare_data(test)
    model = learn(X_train, y_train)
    accuracy = eval(model, X_test, y_test)
    assert accuracy > 0.9
Test the product as a whole, not just components
Click through user interface, achieve task (often manually performed)
Derived from requirements (use cases, user stories)
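As a minimal sketch (the endpoint URL, request payload, and response fields are hypothetical), a scripted system test could exercise the deployed prediction service end to end:
import requests

def test_delivery_prediction_end_to_end():
    # hypothetical endpoint and payload of the deployed prediction service
    response = requests.post("http://localhost:8080/predict",
                             json={"datetime": "2020-01-01 10:00", "weather": "sunny"},
                             timeout=5)
    assert response.status_code == 200
    body = response.json()
    # a plausible (non-negative) prediction is returned
    assert body["delivery_count"] >= 0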
Testing in production
Manual inspection of code
Modern code review: Incremental review at check-in
df['Join_year'] = df.Joined.dropna().map(
    lambda x: x.split(',')[1].split(' ')[1])
df.loc[idx_nan_age, 'Age'].loc[idx_nan_age] = df['Title'].loc[idx_nan_age].map(map_means)
df["Weight"].astype(str).astype(int)
df['Reviws'] = df['Reviews'].apply(int)
Automatic detection of problematic patterns based on code structure
if (user.jobTitle = "manager") {
    ...
}

function fn() {
    x = 1;
    return x;
    x = 3;
}
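Python linters such as pylint detect similar structural patterns; a minimal sketch (the function is made up) with two issues they flag, a mutable default argument and unreachable code:
# pylint reports dangerous-default-value and unreachable warnings here
def append_row(row, rows=[]):    # the default list is shared across calls
    rows.append(row)
    return rows
    rows.clear()                 # never executed: code after return is unreachable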
Examples:
Sadowski, Caitlin, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. "Lessons from building static analysis tools at google." Communications of the ACM 61, no. 4 (2018): 58-66.
Miscoding
Outliers and scaling
Packaging