A photo of a building that combines a very modern facade on one side of the building with a classic decorated facade on another side of the same building.

Chapter 9

Quality Attributes of ML Components

When decomposing the system into components, designers need to identify for each component which quality requirements are necessary to achieve the requirements of the overall system. At the same time, some components may have inherent qualities that make it difficult or even impossible to achieve the system requirements—in such cases, the overall design needs to be reconsidered. With the introduction of machine learning, software engineers need to understand the common relevant qualities of ML components and what expectations are realistic, and data scientists should be aware of what qualities might be relevant for the rest of the system, beyond just model accuracy. In this chapter, we survey common qualities of interest and how to identify constraints and negotiate trade-offs—both for individual ML components and for the system as a whole.

Scenario: Detecting Credit Card Fraud

While we will illustrate the range of different qualities with different use cases throughout this chapter, we will use one running example of automated credit card fraud detection offered as a service to banks. Fraudulent credit card transactions come in different shapes, but they often occur in patterns that can be detected. In response, criminals constantly explore new strategies to evade existing detection strategies.

Consider a company that develops a new way of detecting credit card fraud with high accuracy, using a combination of (a) classic anomaly detection with many handwritten features and (b) a novel deep neural network model that considers customer profiles built on third-party data from advertising networks beyond just the customer’s past credit-card transactions. The company offers its services to banks, who pay a small fee per analyzed transaction. In addition to automated detection of fraud, the company also employs a significant number of humans to manually review transactions and follow up with the bank’s customers. The banks using this service provide access to a real-time data feed of transactions and fraud claims.

From System Quality to Model and Pipeline Quality

A key part of requirements engineering in any software project is typically to identify the relevant quality requirements for the system, in addition to behavioral requirements. In traditional software projects, quality requirements may include scalability, response time, cost of operation, usability, maintainability, safety, security, and time to release. In our credit card scenario, we want to detect fraud accurately and quickly, react rapidly to evolving fraud schemes, and make a profit from the sheer volume of transactions and low human involvement.

As discussed in chapter From Models to Systems, machine-learning components—including the learned models, the training pipeline, and the monitoring infrastructure—are part of a larger system and those components need to support the system’s quality goals:

Our credit card fraud detection algorithm that needs to react quickly to changing fraud patterns will not be well supported by a model that takes weeks to (re-)train and deploy. The system’s quality requirement modifiability conflicts with the machine-learning pipeline’s training latency.
A recommendation algorithm on a shopping website that needs minutes to provide a ranking will not be usable in an interactive setting. The system’s quality requirement of fast average response times conflicts with the model’s inference latency.
A voice-activated smart home device is not well served by a monitoring infrastructure that sends all audio recordings to a cloud server for analysis, transferring massive amounts of audio data each day. The system’s quality requirements regarding privacy and operating cost conflict with the monitoring infrastructure’s goals of comprehensive monitoring.

These examples highlight some of the many possible quality goals for machine-learning components within a system, not just prediction accuracy. A key step in designing the system, when decomposing the system into components, is to identify which qualities of machine-learning components are important to achieve the system’s quality requirements. Understanding the system goals can influence what design decisions are feasible within machine-learning components and how to negotiate trade-offs between different qualities. For example, analyzing the system quality requirements of our fraud-detection system, we may realize that throughput without massive recurring infrastructure cost is essential, so we may need to make compromises when using deep learning to avoid excessive inference costs from large models or design a two-stage process with a faster screening model first and more expensive model only analyzing a subset of transactions (we will come back to this pattern in chapter Deploying a Model). Conversely, understanding limitations of machine-learning components can also inform design decisions for other parts of the system and whether the system as a whole is feasible at all with the desired qualities. For example, establishing accuracy estimates and per-transaction inference costs for fraud detection can inform how much human oversight is needed and whether a business model that relies only on very low per-transaction fees is feasible.
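The two-stage process mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the functions screening_score and deep_model_score, the transaction fields, and the threshold of 0.2 are all hypothetical stand-ins, not part of any real fraud-detection system.

```python
def screening_score(transaction):
    """Hypothetical cheap screening model: hand-written rules on amount and country."""
    score = 0.0
    if transaction["amount"] > 1000:
        score += 0.3
    if transaction["country"] != transaction["home_country"]:
        score += 0.3
    return score


def deep_model_score(transaction):
    """Hypothetical stand-in for an expensive model (e.g., a large neural network)."""
    return min(1.0, screening_score(transaction) + 0.1)


def fraud_score(transaction, threshold=0.2):
    """Run the cheap model first; call the expensive model only for suspicious cases."""
    quick = screening_score(transaction)
    if quick < threshold:
        return quick, "screening"
    return deep_model_score(transaction), "expensive"


# Two made-up transactions: one unremarkable, one large and foreign.
transactions = [
    {"amount": 25, "country": "US", "home_country": "US"},
    {"amount": 4200, "country": "FR", "home_country": "US"},
]
stages = [fraud_score(t)[1] for t in transactions]
print(stages)
```

If most transactions are cleared by the screening stage, the expensive model's inference cost is only paid for the small suspicious subset, which is the point of the pattern.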

On terminology. As mentioned in chapter Gathering Requirements, software engineers tend to speak of quality attributes, quality requirements, or non-functional requirements to describe desirable qualities of a system, such as latency, safety, and usability, and operators use service-level objectives. In contrast, data scientists sometimes speak of model properties when referring to accuracy, inference latency, fairness, and other qualities of a learned model, but that term is also used for capabilities of the learning algorithm. We will use the software-engineering term quality attribute to refer to the qualities of a component in a system, including machine-learning components.

Common Quality Attributes

Most discussions on quality attributes for machine learning focus on prediction accuracy to the exclusion of most other qualities. With the rise of large language models, many developers also have become painfully aware of inference cost and latency of models. In the following, we provide an overview of common quality attributes that may be relevant to consider for machine-learning algorithms and the models they learn.

Common Quality Attributes of Machine-Learned Models

The primary quality attribute considered by data scientists when building models is usually prediction accuracy, that is, how well the model learns the intended concepts for predictions. There are many ways to measure accuracy and break it down by different kinds of mistakes or subgroups, as we will discuss in chapter Model Quality.

In many production settings, time-related quality attributes are important. Inference latency is the measure of how long it takes to make a single prediction. Some models make near-instant predictions, such as the roughly log n decision evaluations in a balanced decision tree with n internal nodes. Other predictions require significant computational resources, such as evaluating the deep neural network of a large language model with millions or billions of floating point multiplications repeatedly to generate an answer to a prompt. Some models have very consistent and predictable inference latency, whereas for others latency depends on the specific input; hence, it is common to report the median latency or the 90th-percentile latency. If the model is hosted remotely, network latency is added to inference latency. Inference throughput is a related measure of how many predictions can be made in a given amount of time, for example, when applied during batch processing. Scalability typically refers to how throughput can be increased as demand increases, usually by distributing the work across multiple machines, say in a cloud infrastructure. In our credit card fraud scenario, latency is not critical as long as it is under a few seconds, but high throughput is vital given the large number of transactions to be processed.
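As a small illustration of how these time-related measures might be reported, the following sketch computes the median and 90th-percentile latency from a list of measurements, plus a simple throughput figure. The measurement values and function names are made up for the example; only Python's standard statistics module is used.

```python
import statistics


def latency_summary(latencies_ms):
    """Return median and 90th-percentile latency for a list of measurements (ms)."""
    # quantiles(..., n=10) returns the nine cut points between deciles;
    # the last cut point is the 90th percentile.
    deciles = statistics.quantiles(latencies_ms, n=10, method="inclusive")
    return {"median_ms": statistics.median(latencies_ms), "p90_ms": deciles[-1]}


def throughput_per_second(num_predictions, elapsed_seconds):
    """Predictions completed per second over a measurement window."""
    return num_predictions / elapsed_seconds


# Hypothetical measurements: mostly fast requests with a few slow outliers.
measurements = [12, 11, 13, 12, 14, 11, 12, 13, 95, 240]
summary = latency_summary(measurements)
print(summary)
print(throughput_per_second(num_predictions=1000, elapsed_seconds=2.0))
```

With input-dependent latency, the median can look healthy while the 90th percentile reveals the slow tail, which is why both are commonly reported.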

Several model quality attributes inform the hardware needed for inference, including model size and memory footprint. Model size, typically simply measured as the file size of the serialized model, influences how much data must be transmitted for every model update (e.g., to end users as an app update on a phone). Large file sizes for models can be particularly problematic when versioning models and thus having to store many large files. Again, decision trees tend to be comparably small in practice, whereas even small deep neural networks can be of substantial size. For example, a typical introductory example to classify images in the Fashion MNIST dataset (28 by 28 pixel grayscale images, 10 output classes) with a three-layer feed-forward network of 300, 100, and 10 neurons has 266,610 parameters; if each parameter is stored as a 4-byte float, this would require about 1 megabyte just for storing the serialized model parameters. State-of-the-art deep neural network models are much bigger. For example, OpenAI’s GPT-3 model from 2020 has 96 layers, about 175 billion weights, and needs about 700 gigabytes in memory (one order of magnitude more memory than even high-end desktop computers in 2020 usually had).
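The parameter count in the example above can be reproduced with a short calculation. The sketch assumes a plain fully connected network in which every neuron has one weight per input plus one bias term, and 4 bytes per parameter; the function name is chosen for illustration.

```python
def dense_parameter_count(input_size, layer_sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    previous = input_size
    for size in layer_sizes:
        # One weight per (input, neuron) pair plus one bias per neuron.
        total += previous * size + size
        previous = size
    return total


# 28 by 28 pixel inputs, layers of 300, 100, and 10 neurons.
params = dense_parameter_count(28 * 28, [300, 100, 10])
size_bytes = params * 4  # assuming 4-byte (32-bit) floats
print(params, size_bytes)

# The same arithmetic for about 175 billion 4-byte weights.
gpt3_bytes = 175_000_000_000 * 4
print(gpt3_bytes / 1e9, "gigabytes")
```

The first calculation yields 266,610 parameters and 1,066,440 bytes, roughly 1 megabyte; the second yields 700 gigabytes.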

In some settings, the energy consumption per prediction (on given hardware) is very relevant. Limits to available energy can seriously constrain what kind of models can be deployed in mobile phones and battery-powered smart devices. In data centers, energy costs and corresponding cooling needs for repeated inference on large language models can add up.

Furthermore, in some contexts, it can be useful to know that predictions are deterministic—that is, a model always makes the same prediction for the same input. Among others, deterministic inference can simplify monitoring and debugging. While many learning algorithms are nondeterministic, almost all models learned with machine-learning techniques are deterministic during inference. For example, a decision tree will always take the same path for the same inputs and a neural network will always compute the same floating point numbers for the same inputs, reaching the same prediction. Many generative models intentionally introduce nondeterminism by sampling from a probability distribution during inference to generate more diverse and creative outputs, but this is optional and can usually be disabled.
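The difference between deterministic inference and optional sampling can be illustrated with a toy example. This is a sketch under assumptions: the logits are made-up model outputs, and the greedy and sampled selection functions are simplified stand-ins for what generative models do when picking the next output.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """Deterministic: always return the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, temperature, rng):
    """Nondeterministic unless the random generator is seeded."""
    probabilities = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probabilities, k=1)[0]


logits = [2.0, 1.0, 0.5]  # hypothetical model outputs for three candidate outputs
greedy_choices = [pick_greedy(logits) for _ in range(5)]
seeded_a = random.Random(42)
seeded_b = random.Random(42)
run_a = [pick_sampled(logits, 1.0, seeded_a) for _ in range(20)]
run_b = [pick_sampled(logits, 1.0, seeded_b) for _ in range(20)]
print(greedy_choices, run_a == run_b)
```

Greedy selection repeats the same choice on every call, whereas sampling only repeats when the random generator is seeded identically, which is one way to recover reproducibility for debugging.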

Many of these model quality attributes, directly and indirectly, influence the cost of predictions through hardware needs and operating costs. Larger models require more expensive hardware, deep learning models relying heavily on floating point operations benefit from GPUs, and higher throughput demand can be served with more computational resources even for slower models. Some companies like Google and Tesla even develop specialized hardware to support the necessary computing power for the vast amount of floating point computations needed by deep neural models, while meeting latency or throughput requirements in the real-time setting of an automated system in a phone or car that receives a constant stream of sensor inputs. An often useful measure that captures the operating cost for a specific model is the cost per prediction (which also factors in costs for model training). If the benefits of the model in a production system (e.g., more sales, detected fraud, ad revenue) or the cost a client is willing to cover does not outweigh the cost per prediction of the model, it is simply not economically viable. For example, Microsoft reportedly initially lost an average of $20 per paying customer per month on their $10 GitHub Copilot subscription due to the high inference costs. In our credit card fraud detection scenario, the cost per prediction is an important measure, because revenue is directly related to the volume of transactions.

Beyond quality attributes relevant to serving the model, engineers are often interested in further model qualities that influence how the model can be used in production as part of a system and how the model supports or inhibits achieving system requirements. Interpretability and explainability are often important qualities for a model in a system, which describe to what degree a human can understand the internals of a model (e.g., for debugging and auditing) and to what degree the model can provide useful explanations for why it predicts a certain output for a given input. Model fairness characterizes various differences in accuracy or outcomes across different regions of the input, typically split by gender, race, or other protected attributes. Several further qualities are sometimes considered in the context of safety, security, and privacy: A model’s robustness characterizes to what degree a model’s predictions are stable when the input is changed in minor ways, whether due to random noise or intentional attacks. A model is called calibrated when the confidence scores the model produces reflect the actual probability that the prediction is correct. Some researchers have also suggested assessing privacy as a quality attribute, for example, measuring to what degree individual training data can be recovered from a model. We will discuss these quality concerns and how they relate to larger system design considerations in the responsible engineering chapters.
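To make the notion of calibration more concrete, the following sketch computes the expected calibration error, one common summary of how far confidence scores deviate from observed accuracy. The choice of this particular metric, the binning into ten equal-width bins, and the example data are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(num_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, is_correct))

    total = len(confidences)
    error = 0.0
    for entries in bins:
        if not entries:
            continue
        mean_confidence = sum(c for c, _ in entries) / len(entries)
        accuracy = sum(1 for _, ok in entries if ok) / len(entries)
        error += len(entries) / total * abs(mean_confidence - accuracy)
    return error


# Hypothetical predictions: 80 percent confidence and 8 of 10 correct (calibrated),
# versus 90 percent confidence and only 5 of 10 correct (overconfident).
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(calibrated, overconfident)
```

A value near zero indicates that confidence scores match observed accuracy; the overconfident example yields a gap of about 0.4.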

Most quality attributes of models can be severely influenced by the choice of machine-learning algorithm (and hyperparameters) used to train the model. Many, including accuracy, robustness, and fairness, also depend heavily on the training data used to train the model.

Common Quality Attributes of Machine-Learning Algorithms

In addition to quality attributes for the learned model, engineers also often need to make decisions that consider quality attributes of the learning process, especially if the model is to be retrained regularly in production.

A key concern about the training process is training latency, that is, how long it takes to train or retrain a model. Some machine-learning algorithms can be distributed more or less easily (learning is usually not as easily parallelized as serving, as we will briefly discuss in chapter Scaling the System). Some learning algorithms scale better to (1) larger datasets or (2) more features than others. Hardware requirements regarding memory, CPU, and GPU also vary widely between learning algorithms, such as deep learning benefitting substantially from GPUs for all the floating point arithmetic involved in learning. All this influences training cost, which in turn informs the amount of experimentation that is feasible for an organization and how frequently models can feasibly be retrained in production. For example, the GPT-3 model is estimated to have cost between 4 and 12 million US dollars for necessary computing resources alone (about 355 GPU-years) for a single training run. However, even very high training costs may be amortized if a model is used extensively and not frequently retrained, as many foundation models are. In our fraud-detection scenario, moderate to high training costs may be acceptable as they are amortized across many predictions; regular retraining to account for new fraud patterns will be necessary, but it will probably not need to occur much more often than daily or weekly.

The resources that organizations are able or willing to invest in (re-)training and experimentation vary widely. Especially when it comes to very expensive training of foundation models with extremely large datasets, large organizations with plenty of funding have a significant advantage over smaller organizations, raising concerns that a few large companies may dominate the market for certain kinds of models and hinder competition. In addition to costs, energy consumption during training has also received attention. In contrast to concerns about battery life during inference, concerns during training relate to high energy consumption and corresponding CO₂ emissions.

Some machine-learning algorithms allow incremental training, which can significantly reduce training costs in some settings, especially if more training data is added over time, say from the telemetry of the production system. For example, deep neural networks are trained incrementally to begin with and can be continuously trained with new data, whereas the standard decision-tree algorithm needs to look at all training data at once and hence needs to start over from scratch when the training data changes. If we can access live fraud claims in the credit card fraud scenario, incremental training on live data might surface new fraud patterns very quickly without spending the resources for retraining the entire model from scratch.
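The idea of incremental training can be sketched with a very small model. As an assumption for illustration, the example swaps in online logistic regression trained with stochastic gradient descent as a simple stand-in for any incrementally trainable model; the class name, the partial_fit method name, and the one-feature toy data are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Tiny logistic regression that can be updated batch by batch."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * v for w, v in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, batch):
        """Update the existing parameters with new examples; nothing is reset."""
        for features, label in batch:
            error = self.predict_proba(features) - label
            self.weights = [
                w - self.learning_rate * error * v
                for w, v in zip(self.weights, features)
            ]
            self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(num_features=1)
# Initial training data (toy example: positive feature value means fraud).
model.partial_fit([([1.0], 1), ([-1.0], 0)] * 50)
weights_after_first_batch = list(model.weights)
# Later, new labeled data arrives and only this new batch is processed.
model.partial_fit([([2.0], 1), ([-2.0], 0)] * 10)
print(weights_after_first_batch, model.weights)
```

Each call to partial_fit continues from the current parameters, so the cost of an update depends on the size of the new batch rather than on all data seen so far.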

In a production system, the cost per prediction may be dominated by training costs, which can, in some settings, dwarf the inference costs when serving the model in production—especially if extensive experimentation is involved in building the model, when models are frequently retrained, or when the volume of predictions is low.
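How training costs enter the cost per prediction can be made concrete with a small calculation. All numbers below are invented for illustration, and the function name is hypothetical; the sketch simply spreads total training cost across the predictions served in the same period.

```python
def amortized_cost_per_prediction(
    training_cost_per_run, training_runs, inference_cost_each, num_predictions
):
    """Return (cost per prediction, share of that cost caused by training)."""
    total_training = training_cost_per_run * training_runs
    total_inference = inference_cost_each * num_predictions
    total = total_training + total_inference
    return total / num_predictions, total_training / total


# Hypothetical month with daily retraining at $500 per run and $0.0001 per inference.
low_volume = amortized_cost_per_prediction(500.0, 30, 0.0001, 10_000_000)
high_volume = amortized_cost_per_prediction(500.0, 30, 0.0001, 1_000_000_000)
print(low_volume)
print(high_volume)
```

With 10 million predictions, training accounts for most of the cost per prediction (about $0.0016 each); with a billion predictions, the same training cost is amortized and inference dominates (about $0.000115 each).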

Beyond quality attributes related to cost and scalability, there are also a number of other considerations for training algorithms that are relevant in some settings. For example, some algorithms may work better on smaller datasets, some require less investment in feature engineering, some only learn linear relationships among features, some are more robust to noisy training data, and some are more stable and reproducible, possibly even deterministic in training. For instance, deep neural networks are highly nondeterministic in training (due to random initial seeds and timing and batching differences during distributed learning) and may produce models with substantial differences in accuracy even when using the exact same hyperparameters; in contrast, basic decision-tree algorithms are entirely deterministic and will reproduce the same model given the same data and hyperparameters.

Other Quality Attributes in ML-Enabled Systems

Beyond quality attributes of models and machine-learning algorithms, there are also quality attributes of other components that will influence design decisions to achieve system-wide quality goals. For example, if models are to be retrained regularly, automation, stability, reproducibility, and especially observability of the entire machine-learning pipeline become more important. We may want to push new models into production quickly (deployment latency) or support continuous learning and continuous experimentation in production. In our fraud detection scenario, we likely want to plan for regular model updates from the start and hence will value automation and observability. Similarly, we might care about quality attributes of our monitoring infrastructure, such as how much data is produced, whether private data is anonymized, how sensitive our monitoring instruments are, and how quickly anomalies can be reported. We will discuss these properties in chapters Automating the Pipeline and Planning for Operations.

A Checklist

The following list summarizes some of the most important quality attributes for ML components and can be used as a checklist:

Set minimum accuracy expectations, if possible. In many cases, there is an existing baseline below which the model does not make useful contributions to the system. Needs for robustness and calibration of a model in the context of the system should also be considered.
Identify runtime needs at inference time for the model. This may involve estimating the number of predictions needed in the production system, latency requirements, and a cost budget for operating the inference service. The deployment architecture will influence needs here significantly and will conversely be informed by the achievable qualities of the model, and hence may require negotiation between multiple teams.
Understand what data is available (quantity, quality, formats, provenance), which may inform what machine-learning techniques are feasible. This may also conversely inform other stakeholders whether more data needs to be collected to build a desired model.
Identify evolution needs for the model, that is, how often the model will need to be updated and what latency is needed for those updates. Again, different system designs can impose different model quality requirements, and there are opportunities to negotiate different system designs. Understanding the budget for training and experimentation will help to make informed design decisions at the model level. Evolution needs and the amount of drift to be expected also strongly influence observability needs for the model, the ML pipeline, and the system as a whole.
Identify explainability needs for the model in the system. The system design and user interface may impose specific requirements on explainability, and explainability may provide opportunities for different designs.
Identify protected characteristics and fairness concerns in the system, how the model relates to them, and what level of assurance or auditing will be needed. This may impose restrictions on possible model designs and machine-learning algorithms.
Identify how security and privacy concerns in the system relate to the model, including legal and ethical concerns. This may impose constraints on what data can be used and how it can be used or what machine-learning techniques can be applied to not leak private information.

Constraints and Trade-offs

Understanding quality requirements for machine-learning components will help data scientists make more informed decisions as they select machine-learning algorithms and develop models. A model will never excel at all qualities equally, so data scientists and software engineers will need to make decisions that trade off different qualities, for example, how much loss of accuracy is acceptable to cut the inference latency in half. Many trade-off decisions will be nonlocal and involve stakeholders of different teams, for example, (1) can the front-end team sacrifice some explainability of the model for improved accuracy, (2) can software engineers better assure safety with a model that is less accurate but calibrated, and (3) can the operations team provide the infrastructure to retrain the model daily and still have the capacity for experimentation? Being explicit about quality requirements both at the system and component level facilitates such discussions within and across teams.

It is often useful to consider possible decisions as a form of design space exploration. The central idea is that many possible design decisions (e.g., what machine-learning algorithm to use, what hyperparameters, what hardware resources, how much investment in feature engineering or additional data collection) interact and together form the design space of all possible designs. Identifying hard constraints that are not negotiable reduces the design space to those designs that are not obviously infeasible. For example, in our credit card fraud detection scenario, we may know that cost per prediction cannot exceed $0.001 to make a profit; in the real-time setting of analyzing video at 25 frames per second, inference latency cannot exceed 1/25 seconds (40 ms). Any solution that does not meet these hard constraints can be discarded, allowing us to focus on the remaining smaller design space.
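Filtering a design space by hard constraints can be sketched as a simple filter over candidate designs. The candidate designs and their numbers below are invented; the cost limit of $0.001 is taken from the fraud-detection example, while the 3,000 ms latency limit is an assumed reading of "a few seconds" for that scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    cost_per_prediction: float  # US dollars
    latency_ms: float
    error_rate: float


def feasible_designs(designs, max_cost, max_latency_ms):
    """Keep only designs that satisfy every hard constraint."""
    return [
        d
        for d in designs
        if d.cost_per_prediction <= max_cost and d.latency_ms <= max_latency_ms
    ]


# Hypothetical candidates for the fraud-detection service.
candidates = [
    Design("anomaly rules only", 0.0002, 20, 0.09),
    Design("deep network on GPU", 0.0030, 150, 0.02),
    Design("distilled network", 0.0008, 60, 0.03),
    Design("network plus third-party lookup", 0.0009, 4500, 0.025),
]
remaining = feasible_designs(candidates, max_cost=0.001, max_latency_ms=3000)
print([d.name for d in remaining])
```

Error rate is deliberately not part of the filter here: it is treated as a quality to trade off among the remaining designs rather than as a hard constraint.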

A drawing illustrating a space with multiple lines cutting through it at different angles indicating constraints, such as minimum accuracy and minimum inference latency. The inner space among all the lines is highlighted as the space of feasible designs. — Design space exploration: the space of all possible designs (dotted rectangle) is reduced by several constraints on quality attributes, leaving only a subset of designs for further consideration (highlighted center area).

The remaining feasible solutions in the design space meet the hard constraints but usually are not all equally desirable. Some may correspond to more accurate models, some to more explainable ones, some to lower training costs, and so forth. If a design is worse in all quality attributes than another design, it does not need to be considered further: it is dominated by the other design. The remaining designs are on the Pareto front, where each design is better than every other design on at least one quality attribute and worse on at least one other quality attribute.

A scatter plot with prediction error on one axis and the cost of prediction on the other axis shows several dots indicating different systems. Four dots that are closest to the axes are highlighted as being on the Pareto front. — Trade-offs among multiple design solutions along two dimensions (cost and error). Gray solutions are all dominated by others that are better both in terms of cost and error (e.g., solution D has worse error rates and worse cost than solution A). The remaining black solutions are each better than another solution on one dimension but worse on another—they are all Pareto optimal—which solution to pick depends on the relative importance of the dimensions.
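Identifying dominated designs can be automated once each design is described by numbers. The sketch below uses made-up cost and error values for five designs named A to E, and it uses the common definition of dominance (no worse on every quality and strictly better on at least one), which is slightly broader than being worse on all qualities.

```python
def dominates(a, b):
    """True if design a is at least as good as b everywhere and better somewhere.

    Both arguments are tuples of objectives where lower values are better.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(designs):
    """Return the designs that are not dominated by any other design."""
    return {
        name: objectives
        for name, objectives in designs.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in designs.items()
            if other_name != name
        )
    }


# Hypothetical designs as (cost per prediction in dollars, error rate).
designs = {
    "A": (0.0004, 0.05),
    "B": (0.0008, 0.03),
    "C": (0.0015, 0.02),
    "D": (0.0010, 0.08),  # worse than A on both cost and error
    "E": (0.0012, 0.04),  # worse than B on both cost and error
}
front = pareto_front(designs)
print(sorted(front))
```

In this example, designs A, B, and C remain: each is cheaper or more accurate than the others, so choosing among them requires weighing cost against error.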

Which design to choose on a Pareto front depends on the relative importance of the involved qualities, so a designer must now find a compromise. The designer could optimize for a single quality (e.g., prediction accuracy) or balance between multiple qualities. If the relative importance or a utility function for the different qualities were known, we could identify the sweet spot mathematically. In practice though, making such trade-off decisions typically involves (1) negotiating between different stakeholders and (2) engineering judgment. For example, is explainability or a marginal improvement in accuracy in the fraud-detection scenario worth a $0.0002 increase in the cost per prediction? Different team members may have different opinions and need to negotiate an agreement. Again, making the trade-offs explicit will help foster such negotiation as it highlights which quality attributes are in conflict and forces a discussion about which quality attributes are more important for achieving the system goals overall.
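If stakeholders could agree on weights, picking a design would reduce to maximizing a utility function. The following sketch assumes a simple weighted sum of cost and error as that function; the designs, their numbers, and the weights are invented, and real utility functions are rarely this easy to agree on.

```python
def best_by_utility(designs, weights):
    """Pick the design with the highest utility (lowest weighted sum of penalties)."""

    def utility(objectives):
        return -sum(w * value for w, value in zip(weights, objectives))

    return max(designs, key=lambda name: utility(designs[name]))


# Hypothetical Pareto-optimal designs as (cost per prediction in dollars, error rate).
designs = {
    "A": (0.0004, 0.05),
    "B": (0.0008, 0.03),
    "C": (0.0015, 0.02),
}

# The second weight expresses how many dollars one unit of error rate is worth.
cost_focused = best_by_utility(designs, weights=(1.0, 0.01))
balanced = best_by_utility(designs, weights=(1.0, 0.03))
accuracy_focused = best_by_utility(designs, weights=(1.0, 0.1))
print(cost_focused, balanced, accuracy_focused)
```

The three weightings select designs A, B, and C respectively, which illustrates that the disagreement between stakeholders is really a disagreement about the weights.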

While the general idea of trade-offs is straightforward and may seem somewhat simplistic, difficult trade-off decisions are common in production machine-learning projects. For example, many submissions to Netflix’s famous competition for the best movie-recommendation model produced excellent results, but Netflix engineers stated: “We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.” Similarly, with large language models, many organizations cut back on their use and opt for smaller and faster models to combat excessive inference costs despite lower accuracy.

As with all requirements and design work, this process is not easy and will go through multiple iterations. Typically this involves talking to various stakeholders (e.g., customers, operators, developers, data scientists, business experts, legal experts) to understand the problems and the needs for the system and its ML components. Ideally (though currently rarely happening in practice), identified and negotiated quality requirements for the system and the various components are explicitly documented to serve as a contract between teams. When teams cannot deliver components according to those contracts (or when system requirements change), those contracts and corresponding design decisions may need to be renegotiated.

Summary

When designing a system with ML components, desired system qualities and functionalities inform quality requirements for the various ML and non-ML components of the system, including machine-learned models, machine-learning pipelines, and monitoring infrastructure.

Data scientists face a large number of design decisions when training a model for a specific prediction problem, and these decisions influence various qualities, such as prediction accuracy, inference latency, model size, cost per prediction, explainability, and training cost. When designing a production system, it is usually necessary to pay attention to many quality attributes, not just model accuracy. Various stakeholders, including software engineers and data scientists, typically have flexibility in negotiating requirements, and the language of design space exploration (constraints, trade-offs) can help to identify requirements and facilitate negotiation between different stakeholders about component responsibilities.