**A Tutorial on Visual Predictive Checks**

Mats O. Karlsson & Nick Holford

Uppsala University and University of Auckland

The visual predictive check (VPC) is a model diagnostic that can be used to: (i) allow comparison between alternative models, (ii) suggest model improvements, and (iii) support the appropriateness of a model. The VPC is constructed from stochastic simulations from the model. All model components therefore contribute, and the VPC can help in diagnosing both structural and stochastic contributions. As the VPC is increasingly used as a key diagnostic to illustrate model appropriateness, it is important that its methodology, strengths and weaknesses be discussed by the pharmacometric community.

In a typical VPC, the model is used to repeatedly (usually n≥1000) simulate observations according to the original design of the study. Based on these simulations, percentiles of the simulated data are plotted versus an independent variable, usually time since start of treatment. It is then desirable that the same percentiles are calculated and plotted for the observed data to aid comparison of predictions with observations. With suitable data, a plot including the observations may be helpful by indicating the data density at different times and thus giving some indirect feel for the uncertainty in the percentiles. Apparently poor model performance where data are very sparse may not indicate model inadequacy as strongly as poor performance with dense data. A drawback of adding all observations to the VPC, in particular for large studies, is that it may cloud the picture without making data density obvious. A possible intermediate route is to plot a random sub-sample of all observations.
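The basic computation described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the mono-exponential model, its parameter values, the sampling times and the helper names (`percentile`, `simulate_study`, `vpc_percentiles`) are all assumptions made for the example, and the "observed" data are themselves simulated as a stand-in for a real data set. Only 200 replicates are used to keep the example fast; the text above suggests 1000 or more in practice.

```python
import math
import random

def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def simulate_study(times, n_subjects, rng):
    """Simulate one replicate of the study design: {time: [observations]}.

    Hypothetical model: mono-exponential decline with log-normal
    between-subject variability on the rate constant and proportional
    residual error.
    """
    data = {t: [] for t in times}
    for _ in range(n_subjects):
        k = 0.2 * math.exp(rng.gauss(0.0, 0.3))      # individual rate constant
        for t in times:
            pred = 100.0 * math.exp(-k * t)          # individual prediction
            data[t].append(pred * (1.0 + rng.gauss(0.0, 0.1)))
    return data

def vpc_percentiles(replicates, times, probs=(5, 50, 95)):
    """Pool all replicates at each time and return {time: {prob: value}}."""
    out = {}
    for t in times:
        pooled = [y for rep in replicates for y in rep[t]]
        out[t] = {p: percentile(pooled, p) for p in probs}
    return out

rng = random.Random(1)
times = [1, 2, 4, 8, 12]
observed = simulate_study(times, 50, rng)            # stand-in for real data
replicates = [simulate_study(times, 50, rng) for _ in range(200)]

sim_pct = vpc_percentiles(replicates, times)         # simulated percentiles
obs_pct = vpc_percentiles([observed], times)         # observed percentiles

for t in times:
    print(t, {p: round(v, 1) for p, v in sim_pct[t].items()},
          {p: round(v, 1) for p, v in obs_pct[t].items()})
```

Plotting `sim_pct` and `obs_pct` against `times` would give the simplest form of the VPC; the observations themselves, or a random sub-sample of them, can be overlaid to show data density.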

The percentiles chosen for plotting are often the 5^{th}, 50^{th} and 95^{th} percentiles. However, for small data sets, less extreme percentiles (e.g. 25^{th} and 75^{th}) may be more appropriate. The stochastic components of the model will be of particular importance for predicting extreme percentiles. Thus such percentiles would be expected to act as useful diagnostics for that part of the model. At the same time, with small data sets, large differences between extreme percentiles based on simulated and observed data can occur by chance. The trade-off between the choice of percentiles, the size of the data set and the model component to be diagnosed has not been clearly outlined. However, identification of the 5^{th} percentile is normally based on 1000 values or more. Between-subject variability is often the dominating variability component, and extreme percentiles such as the 5^{th} and 95^{th} should therefore probably be reserved for large studies (more than 500 subjects).

Stratification may be necessary to make a VPC illustrative when observations at the same value of the independent variable have different simulated distributions. Such differences will occur for: different response variables, different arms of a trial (placebo, varying doses, and active control), different routes of administration, different dose intervals, and whenever there are covariate relationships influencing any of the parameters of the model. However, stratification into many separate VPCs may also hamper diagnosis and a suitable balance has to be found. For some (small) data sets such a balance may be difficult to find and they may not be suitable for the VPC.
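A small sketch of why stratification matters is given below. The records, the field names (`dose`, `time`, `conc`) and the helper `stratify` are hypothetical and chosen only for illustration. The example shows that a median pooled across two dose arms describes neither arm, whereas medians computed per arm do.

```python
from collections import defaultdict
import statistics

def stratify(records, key):
    """Group records into {stratum: [records]} using a key function."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return dict(groups)

# Hypothetical observations from two dose arms.
records = [
    {"dose": 10, "time": 1, "conc": 2.1},
    {"dose": 10, "time": 1, "conc": 1.9},
    {"dose": 10, "time": 4, "conc": 1.0},
    {"dose": 50, "time": 1, "conc": 10.4},
    {"dose": 50, "time": 1, "conc": 9.6},
    {"dose": 50, "time": 4, "conc": 5.2},
]

# Median per dose arm and time point.
medians = {}
for dose, recs in stratify(records, key=lambda r: r["dose"]).items():
    by_time = stratify(recs, key=lambda r: r["time"])
    medians[dose] = {
        t: statistics.median(r["conc"] for r in rs) for t, rs in by_time.items()
    }

# Median at time 1 when the two arms are pooled.
pooled_t1 = statistics.median(r["conc"] for r in records if r["time"] == 1)

print(medians)
print(pooled_t1)
```

The same grouping would be applied to the simulated data, giving one VPC panel per stratum. With many strata the number of observations per panel shrinks, which is the balance referred to above.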

Binning is a procedure that may be necessary when observation times are heterogeneous between subjects. It involves grouping observations into time intervals so that a suitably large number of observed values can be used to define the percentiles of the observed distribution. If this is not performed, the resulting percentiles may be very noisy and give little helpful information. One possibility is to use nominal (protocol) times for all observations and simulations. This procedure assumes that observation and simulation distributions are similar for neighbouring times. An alternative is to allow all observations/simulations within a time interval to contribute to defining a prediction interval that is common for the time interval. Binning will always lead to a certain distortion of the relationship between model simulations and the observations. It may therefore be useful to assess the sensitivity of the resulting graph to the binning choices that have been applied.
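The interval-based form of binning can be sketched as below. The records, the bin edges and the function names are illustrative assumptions. Bins are taken as half-open intervals, and the same data are summarised with a coarse and a fine set of edges as a simple way to look at sensitivity to the binning choice.

```python
import bisect
import math

def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def bin_observations(records, edges):
    """Group (time, value) records into bins [edges[i], edges[i+1]).

    Records with times outside the outermost edges are dropped.
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for t, y in records:
        i = bisect.bisect_right(edges, t) - 1
        if 0 <= i < len(bins):
            bins[i].append(y)
    return bins

def binned_percentiles(records, edges, probs=(5, 50, 95)):
    """Return one summary row per bin; percentiles are None for empty bins."""
    rows = []
    for i, ys in enumerate(bin_observations(records, edges)):
        row = {"start": edges[i], "end": edges[i + 1], "n": len(ys)}
        for p in probs:
            row[p] = percentile(ys, p) if ys else None
        rows.append(row)
    return rows

# Hypothetical (time, concentration) records with irregular sampling times.
records = [(0.4, 9.1), (0.6, 8.7), (1.1, 7.9), (1.9, 6.0),
           (2.2, 5.8), (3.7, 3.1), (4.1, 2.9), (5.5, 1.6)]

coarse = binned_percentiles(records, [0, 2, 4, 6])
fine = binned_percentiles(records, [0, 1, 2, 3, 4, 5, 6])

for row in coarse:
    print(row)
```

With the fine edges most bins hold a single observation, so the "percentiles" collapse onto that value. This is the noisy situation that binning is meant to avoid. The same edges would be applied to the simulated data so that observed and simulated percentiles refer to the same intervals.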

Uncertainty in the discrepancy between the percentiles of the observed and simulated data can make it hard to conclude whether differences represent model misspecification or not. This can be investigated by inspecting the variability that would be expected due to random chance. The uncertainty in the prediction intervals from the simulated data is related to the number of simulations performed. Thus, the number of simulations should be sufficiently large for this variability not to play an important role. The observed data are a finite sample and may therefore deviate from the expected behaviour, even if the model is correct. A procedure for assessing how large the random variations can be expected to be involves calculating prediction intervals for each of the simulated data sets. From a suitably large number of simulated data sets, confidence intervals based on the replicate prediction intervals can then be calculated.
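The procedure in the last two sentences can be sketched for a single time bin as follows. The log-normal distribution, the number of observations per replicate and the function names are assumptions made for illustration. Each replicate has the same size as the observed data in that bin, a percentile is computed per replicate, and the spread of those replicate percentiles gives the confidence interval.

```python
import math
import random

def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def percentile_ci(replicates, p, level=95):
    """Confidence interval for the p-th percentile across replicates.

    replicates: list of lists, each one simulated data set for one bin.
    """
    stats = [percentile(rep, p) for rep in replicates]
    tail = (100 - level) / 2
    return percentile(stats, tail), percentile(stats, 100 - tail)

rng = random.Random(7)
n_obs = 40                                   # observations in this bin
replicates = [[rng.lognormvariate(2.0, 0.4) for _ in range(n_obs)]
              for _ in range(500)]
observed = [rng.lognormvariate(2.0, 0.4) for _ in range(n_obs)]  # stand-in

results = {}
for p in (5, 50, 95):
    lo, hi = percentile_ci(replicates, p)
    obs = percentile(observed, p)
    results[p] = (lo, obs, hi)
    print(f"P{p}: observed {obs:.2f}, simulated 95% CI [{lo:.2f}, {hi:.2f}]")
```

An observed percentile lying outside the corresponding interval is then a stronger indication of misspecification than a difference from the pooled simulated percentile alone. Changing `n_obs` shows how the interval widths depend on the amount of data in the bin.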

The VPC, like other simulation-based diagnostics, relies on the appropriateness of the simulations that are performed. This may be difficult to achieve when adaptive designs are employed. If the criteria for all adaptations are defined prospectively, creating simulations is "only" a technical difficulty. If, however, adaptations of doses, dosing intervals, observation frequency and other treatment aspects are based on subjective judgment by the physician or the patient, it can be extremely difficult to perform adequate simulations.

Missing data and protocol violations represent other situations which can be challenging with respect to simulations. Censoring of pharmacokinetic data due to data below the limit of quantification and censoring of pharmacodynamic data due to patient dropout are situations where a model should often be constructed for predicting the censoring. In such situations it is also important to assess whether the number and timing of the censored observations mimic those of the real data. If the model is adequate, the full distribution of uncensored data can be realistically recreated in the simulations.
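One simple way to check that the amount of censoring in the simulations mimics the real data is to compare the fraction of values below the limit of quantification. The sketch below does this for a single time bin. The limit `LLOQ`, the log-normal distribution and the helper names are hypothetical, and the "observed" data are simulated as a stand-in for real data.

```python
import random
import statistics

def blq_fraction(values, lloq):
    """Fraction of values below the lower limit of quantification."""
    return sum(1 for v in values if v < lloq) / len(values)

rng = random.Random(3)
LLOQ = 1.0        # hypothetical quantification limit
n_obs = 60        # observations in this time bin

def simulate_bin():
    """One simulated data set for the bin (hypothetical distribution)."""
    return [rng.lognormvariate(0.5, 0.6) for _ in range(n_obs)]

observed = simulate_bin()                      # stand-in for real data
sim_fractions = [blq_fraction(simulate_bin(), LLOQ) for _ in range(500)]

# n=40 yields cut points at 2.5%, 5%, ..., 97.5%; take the outer two.
cuts = statistics.quantiles(sim_fractions, n=40, method="inclusive")
lo, hi = cuts[0], cuts[-1]
obs_fraction = blq_fraction(observed, LLOQ)

print(f"observed BLQ fraction {obs_fraction:.2f}, "
      f"simulated 95% interval [{lo:.2f}, {hi:.2f}]")
```

Repeating the comparison per time bin addresses the timing of the censoring as well as its overall amount. An analogous comparison could be made for the fraction of subjects who have dropped out by a given time.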

The VPC is rapidly becoming one of the most important diagnostic tools available. Therefore it is important to fully understand when it is suitable and how it can be made as informative as possible, and to understand the decisions and trade-offs necessary when constructing it.