Crunch 2022
Isabel Zimmerman, RStudio, PBC
September 20, 2022
if you develop models…
you can operationalize them
if you develop models…
you should operationalize them
well, some of them
information -> ✨ -> actions
information -> model -> actions
a set of practices to deploy and maintain machine learning models in production reliably and efficiently
import pandas as pd

# Superbowl ads data from TidyTuesday (2021-03-02)
raw = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv')
df = raw[["like_count", "funny", "show_product_quickly", "patriotic",
          "celebrity", "danger", "animals"]].dropna()
print(df)
     like_count  funny  show_product_quickly  patriotic  celebrity  danger  animals
0        1233.0  False                 False      False      False   False    False
1         485.0   True                  True      False       True    True    False
2         129.0   True                 False      False      False    True     True
3           2.0  False                  True      False      False   False    False
4          20.0   True                  True      False      False    True     True
..          ...    ...                   ...        ...        ...     ...      ...
241        10.0   True                 False       True       True   False     True
243       572.0  False                  True       True      False   False     True
244        14.0   True                 False      False       True    True    False
245        12.0   True                 False      False      False    True    False
246       334.0  False                 False      False       True   False    False

[225 rows x 7 columns]
from sklearn import model_selection, preprocessing, pipeline, ensemble

# hold out 20% of the ads for evaluating the model later
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df.drop(columns=['like_count']),
    df['like_count'],
    test_size=0.2
)
# encode the True/False features as numbers, then fit a random forest
oe = preprocessing.OrdinalEncoder().fit(X_train)
rf = ensemble.RandomForestRegressor().fit(oe.transform(X_train), y_train)
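A quick sanity check, not in the original deck: a sketch of scoring the held-out ads, using the names from the split above.

from sklearn import metrics

# rough holdout check on the 20% of ads we held back
preds = rf.predict(oe.transform(X_test))
print(metrics.mean_absolute_error(y_test, preds))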
# bundle the encoder and the model so they are versioned and deployed together
rf_pipe = pipeline.Pipeline([('ordinal_encoder', oe), ('random_forest', rf)])
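The payoff, as a sketch: the pipeline applies the encoder itself, so raw feature rows go straight in.

# no manual oe.transform() needed; the pipeline encodes internally before predicting
print(rf_pipe.predict(X_test)[:5])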
model
model_final
model_final_v2
model_final_v2_ACTUALLY
managing change in models
import pins

# a temporary board for demo purposes; reading pickled objects must be opted into
model_board = pins.board_temp(allow_pickle_read=True)
from vetiver import VetiverModel, vetiver_pin_write

# a deployable model object; the input prototype is inferred from the training data
v = VetiverModel(rf_pipe, "ads", ptype_data=X_train)
vetiver_pin_write(model_board, v)
model_board.pin_meta("ads")
Meta(title='ads: a pinned Pipeline object', description="Scikit-learn <class 'sklearn.pipeline.Pipeline'> model", created='20221003T221304Z', pin_hash='612d4b523ca8c0ef', file='ads.joblib', file_size=432866, type='joblib', api_version=1, version=Version(created=datetime.datetime(2022, 10, 3, 22, 13, 4), hash='612d4'), name='ads', user={'ptype': '{"funny": true, "show_product_quickly": true, "patriotic": false, "celebrity": false, "danger": false, "animals": false}', 'required_pkgs': ['vetiver', 'scikit-learn']})
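Each vetiver_pin_write() creates a new version; a minimal sketch of listing versions and reading the model back:

# list the versioned writes of the "ads" model
print(model_board.pin_versions("ads"))

# load the latest version back into a usable model object
v = VetiverModel.from_pin(model_board, "ads")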
where are these boards hosted?
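Anywhere pins can reach. A sketch, with illustrative paths and bucket names:

import pins

# a shared network drive (path is illustrative)
board = pins.board_folder("/mnt/shared/models", allow_pickle_read=True)

# cloud storage, e.g. an S3 bucket (bucket name is illustrative)
board = pins.board_s3("my-model-bucket", allow_pickle_read=True)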
putting a model in production
putting a model in production somewhere that is not on your local laptop
using REST APIs
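A minimal sketch of serving the pinned model as a REST API with vetiver (host and port are illustrative):

from vetiver import VetiverAPI

app = VetiverAPI(v, check_ptype=True)  # validate requests against the input prototype
app.run(port=8080)                     # serves a FastAPI app with a /predict endpoint

# from a client, with the API above running:
# vetiver.predict("http://127.0.0.1:8080/predict", X_test)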
import vetiver
from sklearn import metrics
from datetime import timedelta

# new_data: recently scored ads, with a "date" column, the observed
# like_count, and the model's predictions in a "preds" column
metric_set = [metrics.mean_absolute_error, metrics.mean_squared_error]
metrics_df = vetiver.compute_metrics(
    new_data,
    "date",
    timedelta(weeks=1),  # aggregate the metrics by week
    metric_set,
    "like_count",
    "preds"
)

m = vetiver.plot_metrics(metrics_df)
m.update_yaxes(matches=None)  # give each metric its own y-axis scale
m.show()
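Metrics can be versioned on the same board, so drift stays visible over time; a sketch assuming vetiver.pin_metrics:

# store this run's metrics alongside earlier ones (pin name is illustrative)
vetiver.pin_metrics(model_board, metrics_df, "ads_metrics")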
when things go wrong:
Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a starting point, generated with vetiver.model_card().
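Generating the template is one call:

import vetiver

vetiver.model_card()  # copies the Quarto model-card template into the working directory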
From Mitchell et al. (2019):
Therefore the usefulness and accuracy of a model card relies on the integrity of the creator(s) of the card itself.
you (and your team!) are unique!
(and able to do the MLOps tasks we want)
a set of practices to deploy and maintain machine learning models in production reliably and efficiently
versioning
deploying
monitoring
vetiver
can help with this for your R and Python models!
Documentation at https://vetiver.rstudio.com/
Recent screencast on deploying a model with Docker
End-to-end demos from RStudio Solutions Engineering in R and Python
These slides! Visit isabel.quarto.pub/crunch2022