Basic Virus Model Fitting

Overview

This app illustrates how to fit a mechanistic dynamical model to data and how to use simulated data to evaluate if it is possible to fit a specific model.

Learning Objectives

The Model

Model Overview

The underlying model that is being fit to the data is the Basic Virus Model used in the app of that name (and some of the others). See that app for a detailed description of the model.

Model Diagram

The model is represented by this flow diagram:

[Figure: flow diagram of the basic virus model]

Model Equations

The model is implemented as a set of ordinary differential equations:

\[\begin{aligned} \dot U & = n - d_U U - bUV \\ \dot I & = bUV - d_I I \\ \dot V & = pI - d_V V - gb UV \end{aligned}\]
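
To make the equations concrete, here is a minimal sketch of how this system could be coded in R with the deSolve package. The parameter values below are illustrative placeholders, not the app’s defaults; DSAIRM’s own simulate_basicvirus_ode implements the same system.

    library(deSolve)

    # Right-hand side of the ODEs: y = c(U, I, V), parms holds the rates
    basicvirus_ode <- function(t, y, parms) {
      with(as.list(c(y, parms)), {
        dUdt <- n - dU * U - b * U * V          # uninfected cells
        dIdt <- b * U * V - dI * I              # infected cells
        dVdt <- p * I - dV * V - g * b * U * V  # free virus
        list(c(dUdt, dIdt, dVdt))
      })
    }

    y0    <- c(U = 1e6, I = 0, V = 1)
    parms <- c(n = 0, dU = 0, b = 1e-4, dI = 2, p = 10, dV = 4, g = 0)  # placeholder values
    out   <- ode(y = y0, times = seq(0, 8, by = 0.1),
                 func = basicvirus_ode, parms = parms)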

Model Concepts

Data

For this app, viral load data from patients infected with influenza are fit. The data are average log viral titers on days 1-8 post infection, taken from (Hayden et al. 1996), specifically the ‘no treatment’ group shown in Figure 2 of that paper.

Our simulation can serve as another source of ‘data’: we can use it to produce artificial observations.

Fitting Models

This app fits the log viral titer data to the virus kinetics produced by the model simulation. The fit is evaluated by computing the sum of squared residuals between data and model over all data points, i.e., \[ SSR= \sum_t (Vm_t - Vd_t)^2 \] where \(Vm_t\) is the log10 virus load predicted by the model simulation at days \(t=1,\ldots,8\) and \(Vd_t\) is the data, reported in the same (log10) units at the same time points. The underlying code varies the model parameters to bring the predicted viral load as close as possible to the data by minimizing the SSR. The app reports the final SSR for the fit.
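
In R, the SSR calculation itself is only a few lines. A minimal sketch, where both vectors hold log10 virus titers at days 1-8 (the numbers below are made up for illustration, not the Hayden et al. data):

    days   <- 1:8
    vdata  <- c(1.8, 3.5, 4.0, 3.6, 3.0, 2.2, 1.5, 0.8)  # hypothetical observed log10 titers
    vmodel <- c(2.0, 3.3, 4.1, 3.8, 2.9, 2.0, 1.4, 0.9)  # hypothetical model predictions
    ssr <- sum((vmodel - vdata)^2)                        # sum of squared residuals
    ssr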

In general, with enough data, one could fit/estimate every parameter in the model as well as the initial conditions. However, with just the virus load data available, the data are not rich enough to allow estimation of all model parameters (even for a model as simple as this one). The app is therefore implemented by assuming that most model parameters are known and fixed, and only 3 (the rate of virus production, p, the rate of infection of cells, b, and the rate of virus death/removal, dV) are estimated. In general, the choice of which parameters to fix needs to be justified: if you have good estimates for some parameters from outside sources, you can defend fixing them; if not, you might need to collect more data or simplify your model until you can estimate all the parameters you want/need to estimate. See the tasks for more on that.

Computer Routines for Fitting

A computer routine does the minimization of the sum of squares. Many such routines, generally referred to as optimizers, exist. For simple problems, e.g., fitting a linear regression model to data, any of the standard routines work fine. For the kind of minimization problem we face here, which involves a differential equation, the choice of numerical optimizer routine often matters. R has several packages for that purpose. In this app, we make use of the optimizer algorithms called COBYLA, Nelder-Mead and Subplex from the nloptr package. This package provides access to a large number of optimizers and is a good choice for many optimization/fitting tasks. For more information, see the help files for the nloptr package and especially the nlopt website.
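
To illustrate how such a fit could be wired together, here is a hedged sketch using deSolve and nloptr. The app’s internal code (simulate_basicvirus_fit) differs in its details, and the data values below are made up; the ODE function is repeated so this sketch stands alone.

    library(deSolve)
    library(nloptr)

    # model equations (same system as in the Model Equations section)
    basicvirus_ode <- function(t, y, parms) {
      with(as.list(c(y, parms)), {
        list(c(n - dU * U - b * U * V,
               b * U * V - dI * I,
               p * I - dV * V - g * b * U * V))
      })
    }

    # objective: run the model for candidate (p, b, dV) and return the SSR
    ssr_fn <- function(x, days, vdata, fixedparms, y0) {
      parms  <- c(p = x[1], b = x[2], dV = x[3], fixedparms)
      out    <- ode(y = y0, times = c(0, days), func = basicvirus_ode, parms = parms)
      vmodel <- log10(pmax(out[-1, "V"], 1e-10))  # guard against non-positive virus load
      sum((vmodel - vdata)^2)
    }

    days  <- 1:8
    vdata <- c(1.8, 3.5, 4.0, 3.6, 3.0, 2.2, 1.5, 0.8)  # hypothetical log10 titers
    y0    <- c(U = 1e6, I = 0, V = 1)
    fixed <- c(n = 0, dU = 0, dI = 2, g = 0)

    # NLOPT_LN_NELDERMEAD and NLOPT_LN_SBPLX are the other two algorithms mentioned
    fit <- nloptr(x0 = c(0.001, 0.1, 1), eval_f = ssr_fn,
                  lb = c(1e-4, 1e-3, 1e-2), ub = c(100, 10, 100),
                  opts = list(algorithm = "NLOPT_LN_COBYLA", maxeval = 20),
                  days = days, vdata = vdata, fixedparms = fixed, y0 = y0)
    fit$solution   # estimated p, b, dV
    fit$objective  # final SSR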

Notes

What To Do

The model is assumed to run in units of days.

Task 1

Start with 1e6 uninfected cells, no infected cells, 1 virion (assumed to be in the same units as the data, TCID50/ml), no uninfected cell birth or death, an infected-cell lifespan of 12 hours (make sure to convert this to a rate), and a unit conversion factor of 0. Set the virus production rate to p = 0.001 with lower and upper bounds of 0.0001 and 100; psim can be anything for now. Set the infection rate to b = 0.1 with lower/upper bounds of 0.001/10; bsim can be anything. Set the virus decay rate to dV = 1 with lower/upper bounds of 0.01/100; dVsim can be anything.

The parameters p, b and dV are being fit; the values we specify here are the starting values for the optimizer and the lower and upper bounds within which the parameters must remain. Note that if a starting value does not lie between its lower and upper bounds (inclusive), you will get an error message when you try to run the model. We can ignore the values for simulated data for now; to do so, set usesimdata = 0. This also means the noise variable is ignored, i.e., you can set it to anything.
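
The check behind that error message amounts to requiring lb <= start <= ub for each fitted parameter. A sketch of such a check (not the app’s actual code):

    start <- c(p = 0.001, b = 0.1, dV = 1)
    lb    <- c(p = 1e-4,  b = 1e-3, dV = 0.01)
    ub    <- c(p = 100,   b = 10,   dV = 100)
    stopifnot(all(lb <= start), all(start <= ub))  # fails -> error before fitting starts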

Start with a maximum of 1 iteration/fitting step for the optimizer and solvertype = 1. Choose to plot the y-axis on a log scale using either ggplot or plotly. Run the simulation.

Since you only do a single iteration, nothing is really optimized. We are just doing this so you can see the time-series produced with the starting conditions we picked for the parameters. Notice that the virus load predicted by the model and the data are already fairly close. Look at the SSR value reported underneath the plot. It should be 3.25. As we fit, if things work ok, this value will go down, indicating an improved fit.

Now choose iter = 20, i.e., fit for 20 iterations. Run the simulation. Look at the results. The plot shows the final fit. The model-predicted virus curve will be closer to the data. Record the SSR value; it should have gone down indicating a better fit. Also, printed below the figure are the values of the fitted parameters at the end of the fitting process. They should differ from the values you started with.

Record

Task 2

Repeat the same process, now fitting for 40 iterations. You should see some further improvement in the SSR, meaning a better fit. A fitting step or iteration is essentially one attempt by the underlying code to find a better model fit. Increasing the number of iterations usually improves the fit until the solver reaches a point where it can’t improve further. This is the best fit, though see the next tasks for a big caveat. In practice, one should not specify a fixed number of iterations; that is just done here so things run reasonably fast. Instead, one should let the solver run as long as it takes until it can’t find a way to further improve the fit (in our case, can’t further reduce the SSR). The technical expression for this is that the solver has converged to the solution. In this example, it happens quickly. Keep increasing the number of iterations until you find no further reduction in SSR.
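
In optimizer terms, ‘run until converged’ is usually expressed as a tolerance-based stopping rule rather than a fixed iteration count. With nloptr this might look like the following sketch (xtol_rel and maxeval are standard nloptr options):

    # stop when the relative change in the parameters drops below xtol_rel,
    # with maxeval set generously so the tolerance, not the cap, ends the run
    opts <- list(algorithm = "NLOPT_LN_NELDERMEAD",
                 xtol_rel  = 1e-10,
                 maxeval   = 1e5)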

Record

Task 3

The goal is to find the best fit, the one with the lowest SSR possible (for a given model and data set). While in theory there is always one (or maybe more than one) best fit, for the kind of models we are fitting here, it is often hard to numerically find this best fit (that’s a big difference compared to many standard statistical models, where the numerical routine is fairly straightforward).

To explore this, start with the same input values as for task 1, run the optimizer for 40 steps, but now use solver/optimizer type 2. Then repeat for type 3. You should find that after 40 steps, solver 2 ends up at an SSR above that of solver 1, while solver 3 finds a value below what solver 1 is able to find.

Record

Task 4

We just saw that after a few iterations, different numerical solvers/optimizers obtain different results. That’s maybe not surprising, and we would not care much about it (apart from trying to find the fastest one), as long as they can all find the same best fit given enough time/iterations. Unfortunately, this doesn’t always happen. Let’s explore this. Run each solver for 1000 iterations, then 2000. Depending on the speed of your computer, this might take a while, so be patient. You should find that each solver produces the same SSR for 1000 and 2000 steps, so each solver has converged (reached what it thinks is the overall best fit). Unfortunately, the solvers don’t agree on what the best fit is. In this example, solver 2 performs best, with an SSR slightly below 1. This illustrates a challenge when fitting the kind of models we have here, namely that we often have to explore different solvers (and starting conditions, we’ll get there) to be reasonably sure we have found the overall best fit. This general problem gets worse the more parameters you try to fit.
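
A sketch of this comparison in code, reusing ssr_fn and the data objects from the nloptr sketch in the Computer Routines for Fitting section:

    algs <- c("NLOPT_LN_COBYLA", "NLOPT_LN_NELDERMEAD", "NLOPT_LN_SBPLX")
    for (alg in algs) {
      fit <- nloptr(x0 = c(0.001, 0.1, 1), eval_f = ssr_fn,
                    lb = c(1e-4, 1e-3, 1e-2), ub = c(100, 10, 100),
                    opts = list(algorithm = alg, maxeval = 1000),
                    days = days, vdata = vdata, fixedparms = fixed, y0 = y0)
      cat(alg, "final SSR:", fit$objective, "\n")  # compare where each solver ends up
    }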

Record

Task 5

To explore the role that different starting values for the estimated parameters can have, set p=0.01, b=0.01, and dV=10. Run a single iteration with solver 1. SSR should be 3.54, similar but not the same as for the start of task 1.

Now, run each of the 3 solvers for 1000 iterations. You’ll find that all of them are able to find better fits than previously. Note that the scientific question has not changed: the model and the data are the same, and we are still trying to find the combination of parameter values that gives the best agreement between model and data. All we have changed is the starting position from which the optimizers search for the best fit. For many basic statistical models, you can start anywhere and always arrive at the same answer. This is not the case here.

For the kind of fitting problem we usually encounter with our simulation models, optimizers might not find the overall best fit, even if you run them until they converge. What can happen is that a solver converges to a local optimum: a combination of parameter values that produces a good result, such that every change to the parameters the solver tries makes the result worse, so it concludes it has found the best fit. Unfortunately, a completely different combination of parameters might give an even better fit, but the solver can’t find that solution; it is stuck in the local optimum. Many solvers, even so-called ‘global’ solvers, can get stuck. We never know for certain if we have found the best fit, and there is no one solver type that is best for all problems. So, in practice, one needs to try different starting values and different solver/optimizer routines; if many of them find a best fit with the same lowest SSR, it is quite likely (though not guaranteed) that we found the overall best fit (lowest SSR).
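
A sketch of such a multi-start strategy, again reusing the objects from the nloptr sketch in the Computer Routines for Fitting section. Starting points are drawn uniformly on a log scale, since the bounds span several orders of magnitude:

    set.seed(123)
    lb <- c(1e-4, 1e-3, 1e-2); ub <- c(100, 10, 100)
    # 5 random starting points for (p, b, dV), each within the bounds
    starts <- replicate(5, 10^runif(3, min = log10(lb), max = log10(ub)))
    ssrs <- apply(starts, 2, function(x0)
      nloptr(x0 = x0, eval_f = ssr_fn, lb = lb, ub = ub,
             opts = list(algorithm = "NLOPT_LN_SBPLX", maxeval = 1000),
             days = days, vdata = vdata, fixedparms = fixed, y0 = y0)$objective)
    ssrs  # a spread of final SSRs indicates local optima; the minimum is the best candidate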

Record

Task 6

One major consideration when fitting these kinds of mechanistic models to data is the balance between data availability and model complexity. The more and richer data one has (e.g., measurements for multiple model variables), the more parameters one can reliably estimate and therefore the more detailed a model can be. If one tries to ask too much from the data, it leads to the problem of overfitting, i.e., trying to estimate more parameters than can be robustly estimated for a given dataset (this is also known as the identifiability problem). This can show itself through the problems we’ve seen above: the solvers getting stuck in various local optima, or finding more or less the same SSR for large ranges of parameter values.

While there are mathematical ways to test whether a model’s parameters can be estimated given the data, these only work for simple problems and don’t account for the fact that most data are noisy. The most general, and usually best, approach to safeguard against overfitting is to test whether our model can in principle recover estimates in a scenario where the parameter values are known. To do so, we can use our model with specific parameter values to simulate data, then fit the model to these simulated data. If everything works, we expect that, ideally independent of the starting values for our solver, we end up with estimated best-fit parameter values that agree with the ones we used to simulate the artificial data. We’ll try this now with the app.
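
A sketch of this parameter-recovery check, reusing basicvirus_ode, ssr_fn, fixed, and y0 from the nloptr sketch in the Computer Routines for Fitting section:

    # 1. simulate 'data' from known parameter values
    true <- c(p = 0.001, b = 0.1, dV = 1)
    out  <- ode(y = y0, times = 0:8, func = basicvirus_ode, parms = c(true, fixed))
    simdata <- log10(pmax(out[-1, "V"], 1e-10))  # log10 virus load, days 1-8

    # 2. fit starting from different values and compare the estimates to 'true'
    fit <- nloptr(x0 = c(0.01, 0.01, 2), eval_f = ssr_fn,
                  lb = c(1e-4, 1e-3, 1e-2), ub = c(100, 10, 100),
                  opts = list(algorithm = "NLOPT_LN_NELDERMEAD", maxeval = 500),
                  days = 1:8, vdata = simdata, fixedparms = fixed, y0 = y0)
    fit$solution  # ideally close to the 'true' values 0.001, 0.1, 1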

Set everything as in task 1 (i.e., hit the Reset Inputs button). Then set psim = 0.001, bsim = 0.1, and dVsim = 1, i.e., to the same values as the values used for starting the fitting routine (p/b/dV). Set iteration to 1, solver type 1, y-axis on log scale. Run the simulation. It should be the same as in task 1. Take a close look at the data in the plot. Then set usesimdata = 1, run again for 1 fitting step. You should now see that the data has changed. Instead of the real data, we now use simulated data. Since the parameter values for the simulated data and the starting values for the fitting routine are the same, the time-series is on top of the data and the SSR is (up to rounding errors) 0.

Record

Task 7

Now we can test if we have a problem with overfitting (trying to estimate too many parameters given the data) by choosing starting values for our fitting routine that are different from those we used to produce the simulated data. We hope that after enough iterations, the solver will reach a best fit with SSR close to 0 and estimates for the parameters that agree with the ones we used to produce the data.

Set p = 0.01, b = 0.01, dV = 2. Don’t change the psim/bsim/dVsim. Run simulation for 1 iteration. You’ll see, as expected, a mismatch between model and simulated data with an SSR = 2.2.

Now, run each solver for 500 iterations. You should find that solver 2 finds a best fit with SSR approximately 0 and estimated parameters equal to those we used to generate the data. Solvers 1 and 3 can’t quite find it. (You can try more iterations, but it seems those solvers are stuck; they might find the best fit with different starting conditions.) This means that in this case, we seem to be able to estimate all parameters (as long as we play around with different solvers and starting conditions). You can experiment with the values for the data simulation parameters and with different starting values for the fitting routine. You’ll likely find that sometimes your solver can accurately determine the parameter values, sometimes not. If it works for at least some starting values and some solvers, it means you should be able to estimate your model parameters. Of course, this comes with the caveats described above, i.e., it might require a good bit of testing/searching.

Since real data are always noisy, there is an option to add noise to your simulated data by choosing a non-zero value for noise. Explore a bit what happens if you do that. The more noise you add, the more disagreement you will get between the best-fit estimates for the parameter values and the values you used to simulate the data. This is to be expected, since your data no longer come straight from the simulation but are modified by noise.
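
One plausible way to add such noise (the app’s exact noise model may differ) is normal perturbations on the log10 scale, applied to the simulated data from the recovery sketch above:

    noise_sd <- 0.2  # the larger this is, the further estimates drift from the true values
    simdata_noisy <- simdata + rnorm(length(simdata), mean = 0, sd = noise_sd)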

Record

Task 8

So far, we haven’t talked much about the parameter bounds. In theory, you shouldn’t need any bounds: if your model is a good one, you just start with some parameter values and your optimizer will find the best fit. As you already saw, in practice it’s not that easy. If you don’t give the optimizer any guidance as to what parameter values are reasonable, it might try all kinds of unrealistic values. This can slow down the fitting, make it more likely that you get stuck in a local optimum, or cause the underlying simulation function to fail because the optimizer asks it to run the model with nonsensical values. All of these problems suggest that giving bounds for the parameters is a good idea. Let’s explore the impact of bounds briefly.

Reset all inputs (that means we are switching back to the real data, i.e., usesimdata = 0). Then set p = b = dV = 0.1 as starting values. Run a single iteration to see that with these starting values, the model is far from the data (SSR = 91). Then run 1000 iterations of solver 2; you should find an okay fit with an SSR = 4.12.

Now, change the lower bounds to 1e-6 and the upper bounds to 1e6 for all 3 parameters. Re-run the fitting. You’ll find a much poorer fit with unreasonable parameter estimates. This is an example of how sensible bounds help the optimizer find the best fit. You can explore the impact of bounds further by changing their values, the starting values, the number of iterations and the solver type. You’ll find that sometimes wide bounds are okay and don’t impact the fitting much, but other times, things don’t work with wide bounds.

One problem can arise when the best-fit value reported by the optimizer is the same as the lower or upper bound for that parameter. This likely means that if you widen the bounds, the fit will get better. However, the parameters have biological meanings, and certain values do not make sense. For instance, a lower bound for the virus decay rate of 0.001/day would mean an average virus lifespan of 1000 days, or around 3 years, which is not reasonable for flu in vivo. That means if your best fit happens at the bounds, or in general at parameter values that make no biological sense, your underlying model needs to be modified.
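
A quick way to spot this situation in code (a sketch, using the lb, ub and fit objects as in the earlier sketches) is to flag estimates that sit numerically at a bound:

    at_bound <- abs(fit$solution - lb) < 1e-8 | abs(ub - fit$solution) < 1e-8
    at_bound  # TRUE entries mark parameters whose best-fit value hit a bound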

Record

Task 9

The following point is not directly related to the whole fitting topic but worth addressing: without much comment, I asked you to set the unit conversion factor to 0. That essentially means that we think this process of virions being lost due to entering infected cells is negligible compared to clearance of virus due to other mechanisms at rate dV.

While that unit conversion factor shows up in most apps, it is arguably not that important as long as we explore our model without trying to fit it to data. But for fitting purposes, it might be important. The experimental units are TCID50/mL, so in our model, virus load needs to have the same units. To make all units work, g needs to have those units, i.e., it needs to convert from infectious virions at the site of infection to experimental units. Unfortunately, how one relates to the other is not quite clear. (See (Handel, Longini, and Antia 2007) for a discussion of that.) If you plan to fit models to data you collected, you need to pay attention to units and make sure what you simulate and the data you have are in agreement. In general, we would probably want to fit g as well. Here, we briefly explore how things change if it is not 0. Reset all inputs, then set g = 1. Run a single iteration. You’ll find a very poor fit (SSR = 786).

Run solver 2 for 1000 iterations. You’ll find something that looks better, but not great. Play around with the starting values for the fitted parameters (p/b/dV) to see if you can get an ok-looking starting simulation. This is often required when fitting: if you start with parameter values that place the model very far from the data, the solvers might never be able to get close. Once you have a somewhat decent starting simulation, try the different solvers for different numbers of iterations and see if you can improve. One useful approach is to run for some iterations and use the reported best-fit values as new starting conditions, then do another fit with the same or a different solver. The best fit I was able to find had an SSR a little above 4. You might be able to find something better. Since the fit is not good, it suggests the rest of the model (either the model structure or the choices for the fixed parameters) is not right and requires tweaking.

Record

Task 10

Keep exploring. Fitting these kinds of models can be tricky at times, and you might find strange behavior in this app that you don’t expect. Try to get to the bottom of what might be going on. This is an open-ended exploration, so I can’t really give you much guidance, other than to re-emphasize that fitting mechanistic simulation models is much trickier than fitting more standard (e.g., generalized linear) models. Just try different things and try to understand as much as possible of what you observe.

Record

Further Information

For this app, the underlying function running the simulation is called simulate_basicvirus_fit. That function repeatedly calls simulate_basicvirus_ode.

This app (and all others) is structured such that the Shiny part (the graphical interface you are using) calls one or several underlying R functions which run the simulation for the model of interest and return the results. You can call these functions directly, without going through the Shiny app. Use the help() command for more information on how to use the functions directly. If you go that route, you need to take the results returned from the function and produce useful output (such as a plot) yourself.
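
For example (a hedged sketch; the arguments and the structure of the returned object are not shown here, so check the help file for the actual interface):

    library(DSAIRM)
    help('simulate_basicvirus_fit')      # documents the arguments and return value
    result <- simulate_basicvirus_fit()  # run the fit with the function's default settings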

You can also download all simulator functions and modify them for your own purposes. Of course to modify these functions, you’ll need to do some coding.

For examples on using the simulators directly and how to modify them, read the package vignette by typing vignette('DSAIRM') into the R console.

A good source for fitting models in R is (Bolker 2008); note, though, that its focus is on ecological data, and ODE-type models are only barely discussed. (Hilborn and Mangel 1997) has nice explanations of data fitting, model comparison, etc., but is more theoretical. Lots of good online material exists on fitting/inference. Most of it is explained in the context of static, non-mechanistic, statistical or machine learning models, but many of the principles apply equally to ODEs. A discussion of overfitting (also called the ‘identifiability problem’) for ODEs can be found in (Miao et al. 2011). Advanced functionality to fit stochastic models can be found in the pomp package in R. (If you don’t know what stochastic models are, check out the stochastic apps in DSAIRM.)

References

Bolker, Benjamin M. 2008. Ecological Models and Data in R. Princeton University Press.
Handel, Andreas, Ira M. Longini Jr, and Rustom Antia. 2007. “Neuraminidase Inhibitor Resistance in Influenza: Assessing the Danger of Its Generation and Spread.” PLoS Computational Biology 3 (12): e240. https://doi.org/10.1371/journal.pcbi.0030240.
Hayden, F. G., J. J. Treanor, R. F. Betts, M. Lobo, J. D. Esinhart, and E. K. Hussey. 1996. “Safety and Efficacy of the Neuraminidase Inhibitor GG167 in Experimental Human Influenza.” JAMA 275 (4): 295–99.
Hilborn, Ray, and Marc Mangel. 1997. The Ecological Detective: Confronting Models with Data. Monographs in Population Biology 28. Princeton, NJ: Princeton University Press.
Miao, Hongyu, Xiaohua Xia, Alan S. Perelson, and Hulin Wu. 2011. “On Identifiability of Nonlinear ODE Models and Applications in Viral Dynamics.” SIAM Review 53 (1): 3–39. https://doi.org/10.1137/090757009.