# Frequently asked questions

## Contents

- 1 How can I run the model?
- 2 How can I run a single sub-model?
- 3 What language is the model implemented in?
- 4 Why is the model implemented in MATLAB?
- 5 Is MATLAB absolutely required?
- 6 Can I use an older version of MATLAB? Do I really need MATLAB = 2009b?
- 7 What platform is required to run the software?
- 8 Is the knowledge base software required?
- 9 I'm trying to install the old WholeCellKB PHP software. However, the sql is incompatible with more recent version of MySQL. How can I use the software with more recent versions of MySQL?
- 10 What is the easiest way to use the knowledge base?
- 11 What are the requirements to run the whole-cell virtual machine?
- 12 How much computing resources are required to run a single simulation?
- 13 How complete is the model?
- 14 How were the antibiotics in the model curated?
- 15 Where can I find the references (e.g. PUB_XXXX)?
- 16 Can I download the simulations published in Karr et al., 2012?
- 17 Is there variation within the simulation initial conditions?
- 18 How can I set the initial conditions?
- 19 What does it mean to "bound" a reaction?
- 20 How are the reaction rates calculated throughout the model? For example, please explain Equation S20.
- 21 What is the structure of the transcriptional regulatory network?
- 22 Why are the copy numbers at the end of the cell cycle on average 2X that at the beginning of the cell cycle?
- 23 What are v0 and μ0 in Equation S15 of Karr et al., 2012?
- 24 What is vb in Algorithm S9 of Karr et al., 2012?
- 25 How were reaction kinetics curated?
- 26 How are Km and Keq used in the model?
- 27 How are the sub-models implemented?
- 28 How are the sub-models synchronized?
- 29 What is the purpose of the submodel "side effects"?
- 30 How were the single-gene disruption strains categorized in Figure S2 of Karr et al., 2012?
- 31 How were Csite and Cstate in Section 3.19 of Data S1 of Karr et al., 2012 fit?
- 32 How were the RNA polymerase transition probabilities chosen?
- 33 Are all 525 genes included in the model?
- 34 Is the model stochastic?
- 35 Can the stochasticity be turned off?
- 36 How many simulations are needed to sample the ensemble of cells?
- 37 What do instances of the Simulation class represent?
- 38 How can I seed the random number generator?
- 39 How can I obtain the metabolic submodel's stochiometry matrix?
- 40 How is reversibility encoded in the metabolic submodel?
- 41 How is reaction stoichiometry encoded in the stoichiometry (S) matrix?
- 42 How can I visualize the metabolic submodel's stoichiometry matrix?
- 43 How can I obtain the substrate and enzyme copy numbers for the metabolic submodel?
- 44 How can I obtain the predicted metabolic reaction fluxes?
- 45 Where can I view UML diagrams of the whole-cell software?
- 46 Where can I view documentation of the whole-cell model software?
- 47 How can I edit the structure of the model (e.g. add a gene, protein, etc.)?
- 48 How is the simulation output structured?
- 49 What are the units for the cell volume?
- 50 How can I animate a simulation?
- 51 I'm trying to run the old WholeCellKB PHP software. How can I increase the PHP memory? How can I display errors?

## How can I run the model?

Please see the tutorial.

## How can I run a single sub-model?

Please see the tutorial.

## What language is the model implemented in?

The simulation was written in MATLAB. The training data was organized in a MySQL database and edited with a UI implemented in Python. We used perl, shell, java, javascript, actionscript to run jobs on our cluster and for various other small things.

## Why is the model implemented in MATLAB?

We used MATLAB primarily because it has great support for scientific computation and because its easy to develop code rapidly in MATLAB. As of R2008a MATLAB has good OOP support; without which this wouldn't have been possible. MATLAB is also reasonably fast, comparable to Java or Python.

## Is MATLAB absolutely required?

Practically speaking MATLAB is needed to participate in the competition.

If you absolute insist on not using MATLAB, you can run the simulation using the free MATLAB Runtime (MCR):

- Download the compiled binary (R2012b, 64 bit, Linux) at at SimtTK.
- Use the online configurator generate an XML simulation configuration.
- Run simulation

bin/simulateHighthroughputExperiments/run_simulateHighthroughputExperiments.sh /path/to/runtime \ seed 1 \ parameterValsPath /path/to/parameterValsPath.xml \ outPath /path/to/output.mat

## Can I use an older version of MATLAB? Do I really need MATLAB = 2009b?

We recommend using the latest version of MATLAB possible, as MATLAB's performance has improved significantly over the last several versions.

That said, it is possible to use MATLAB 2008a-2009a. You'll need to replace every "~" with a variable name eg. "junk". There may be a small number of other minor things to correct for 2009a. Substantial recoding would be needed to use MATLAB < 2008a as modern object-oriented programming wasn't supported at that time.

After eliminating the tildes run the tests to verify compatibility:

runSmallTests(); runMediumTests();

## What platform is required to run the software?

No specific platform is required. MATLAB is available for Windows, Mac, and Linux. The model has been run successfully on all three platforms.

## Is the knowledge base software required?

No. The knowledge and its web interface simply provide a nice user interface to view and edit the model's parameters. These parameter values have been cached in a .mat file (data/Simulation_fitted.mat) and can easily be set programmatically in MATLAB (see competition instructions). The public version of WholeCellKB is freely available to browser the parameter values.

## I'm trying to install the old WholeCellKB PHP software. However, the sql is incompatible with more recent version of MySQL. How can I use the software with more recent versions of MySQL?

Replace TYPE=INNODB with ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci in /path/to/wholecell/knowledgebase/lib/biowarehouse/warehouse-mysql-create.sql

## What is the easiest way to use the knowledge base?

The easiest way to view the knowledge base is to visit http://www.wholecellkb.org. The easiest way to edit the knowledge base is to use the virtual machine posted on SimTK, which already has the knowledge base software installed and properly configured. Detailed installation instructions are available at SimTK if you wish to install the knowledge base software on your own machine.

## What are the requirements to run the whole-cell virtual machine?

At a minimum you need a 64bit computer and ~2GB RAM. You can edit the virtual machine's setting while it is powered off to reduce the RAM and video RAM settings. However, some model analysis (not the dynamical simulation itself) may require additional RAM.

## How much computing resources are required to run a single simulation?

Simulating 1 life cycle of 1 in silico cell requires approximately:

- 24 hours on a single core
- 1 GB RAM
- 110 MB disk space for an "in insilco high-throughput experiment", 0.5 GB to store the entire simulated dynamics

The simulation is not parallelized. Each simulation uses only a single core. A cluster can be used to run multiple simulations simultaneously.

## How complete is the model?

As comprehensive as the model is, the model is not 100% complete. The model is considered a draft. For example, the model doesn't predict protein structure, and therefore can't predict how protein function is affected by sequence modified. The focus of the model was on high-level phenomena, particularly how network properties arise from the level of molecules. We started from a basic, but incomplete description of those molecules including kinetic rate constants.

## How were the antibiotics in the model curated?

The antibiotics in the model were curated mostly from DrugBank and several papers. See Table S3Q and S3AQ of Karr et al., 2012 for more information. There are two main classes of antibiotics in the model: ribosome and topoisomerase inhibitors. Both are very crudely modeled. If antibiotics are present, the ribosomes/topoisomerases are marked as non-functional so that translation cannot proceed.

## Where can I find the references (e.g. PUB_XXXX)?

The references are listed in Table S3S of Karr et al., 2012. You can download the table from SimTK

Reference #1 corresponds to the row labeled PUB_0001, #2 to PUB_0002, etc.

Alternatively, you can use WholeCellKB to find the references, eg. http://www.wholecellkb.org/detail/Mgenitalium/PUB_0928. Simply search for references with the name PUB_xxxx, substituting xxxx for the reference number and zero-padded to four digits.

## Can I download the simulations published in Karr et al., 2012?

The full trajectories of the simulations published in Karr et al., 2012 are available here.

## Is there variation within the simulation initial conditions?

Yes. The cell mass is sampled from a Gaussian distribution controlled by the cellInitialMassVariation parameter of the CellMass class. The RNA and protein copy numbers are sampled from a multinomial distribution. The initial conditions are calculated by the initializeState method of the Simulation class.

## How can I set the initial conditions?

- Generate initial conditions from previous simulation

ic = sim.getTimeCourses()

- Set the initial conditions and run a new simulation

simulateHighthroughputExperiments('initialConditions', ic);

Note: The initial conditions are not error checked. You need to make sure to update the synchronized parts of the cell state correctly, for example the copy numbers and positions of DNA-bound proteins.

## What does it mean to "bound" a reaction?

Reactions fluxes are bound in the flux-balance analysis/constrained optimization sense. Reactions with observed kinetics are bounded by the product of the kcat and enzyme copy number. Reactions whose kinetics have not been observed are present in the model but their fluxes are unbounded, meaning their fluxes can take any value.

## How are the reaction rates calculated throughout the model? For example, please explain Equation S20.

A common issue we faced when developing the model was how to simulate groups of reactions which were competing for the same enzymatic and metabolic resources. Most models ignore this issue. For example, cell signaling models typically focus on protein state and ignore where ATP comes from. Similarly, metabolic models typically don't account for that fact that any one enzyme molecule can only participate in one reaction at time. We decided to handle the issue by calculating upper bounds on reaction rates, in part because it was difficult to find Km data at the scale we needed whereas kcat data was more readily available.

Considering enzyme kinetics (kcat) and expression (e) alone, you can place an upper bound on a reaction's flux: v = e/kcat. Similarly, if you consider the copy numbers of metabolites (mi) alone, you can bound a reaction's flux: v = mini { mi / Mi} where Mi is the stoichiometry of metabolite i in the reaction.

You can do something similar for macromolecule substrates. The only difference is that in the reactions that we considered their stoichiometry was always 1, so the term simplifies: v = p, where p is the copy number of the RNA/protein substrate in the reaction (the RNA/protein thats being modified).

So thats where the three terms come from. You can put them together, by taking their minimum, to calculate a bound on a reaction's flux. The rest of the notation has to do with keeping tracking of multiple reactions at once. For enzymatically catalyzed reactions these terms simplify to the product of the kcat and the protein copy number.

The last piece is that you need to sample the enzyme bound because were only considering whole numbers of reactions. Otherwise you'll systematically underestimate the reaction rate. For example if e/kcat=0.1. If you don't round then you can only do 0 reactions per second, and the reaction never proceeds. But what you would like to see is that the reaction happens on average once every 10 seconds. Sampling gives you this property.

Once you can calculate rates, that rest of the model is sampling from the rate distribution, updating the cell's state once reactions are picked, updating the rates, and repeating until the end of the time step. This roughly follows the logic of Gillespie's algorithm. However, where possible we simplified the algorithm to reduce the run-time.

## What is the structure of the transcriptional regulatory network?

The structure of the regulatory network is described in Table S3P of Karr et al., 2013.

## Why are the copy numbers at the end of the cell cycle on average 2X that at the beginning of the cell cycle?

The reason why you expect the copy number to double on average is that we're simulating the life cycle of cells. This means that the end of the simulations corresponds to the same point in the cell cycle as the beginning (excluding partitioning the contents of the mother cell into two daughters). Because we're simulating relatively short time scales compared to evolution (i.e. we're assuming the cells are not evolving), the mother cells and daughter cells should be statistically identical. One consequence of this is that the average copy number of every molecule should double by the end of the simulation. You can think of this as a pseudo-periodic constraint on the model's parameter values. To be clear, this is a constraint on the parameter values which is evaluated during model fitting for an in silico population. The individual in silico cells themselves are not subject to any such constraint.

## What are v0 and μ0 in Equation S15 of Karr et al., 2012?

v0 and μ0 are the initial (pre model fitting) FBA solution calculated using the initial average enzyme expression which we reconstructed based on the reported literature. μ* is the observed growth rate, also based on experiments. In practice we find that μ0 ≠ μ*, so we want to edit the enzyme expression such that the FBA algorithm will predict the observed growth rate. We do this using MOMA so that the flux distribution is changed minimally while correcting the growth rate. First, we find a nearby flux distribution. Then we divide the calculated fluxes by the reaction kcats to get the adjusted enzyme expression. Note: The cases for S15 and S16 in the published version of Data S1 should be flipped. If μ0 < μ* the bounds need to be relaxed by at most a factor of μ*/μ0. Also S15 and S16 should say argmin not argmax.

## What is vb in Algorithm S9 of Karr et al., 2012?

This is the normal FBA algorithm. The growth flux (μ) is a component of the flux vector, v. Specifically it is the component with the subscript b. After you calculate optimal flux distribution, v*, you can extract the growth flux, v*b.

## How were reaction kinetics curated?

The kcats were curated from SABIO-RK, BRENDA, and journal articles. Rate constants were curated for many, but not all metabolic reactions.

## How are Km and Keq used in the model?

Km and Keq are not used in the model. We curated this information for future use. Keq=0 indicates that the value is unmeasured.

## How are the sub-models implemented?

The main idea behind the model is to combine various individual modeling techniques together, of which ODEs is one technique. The metabolic model is implemented using flux balance analysis. The transcription model includes a Markov model. Most of the other submodels are stochastic (you can think of these as roughly like the Gillespie algorithm optimized for specific cases). Several submodels implement Boolean rules.

## How are the sub-models synchronized?

The evolveState method of the Simulation class synchronizes the sub-models in two ways:

- The submodels aren't executed in parallel. Instead they're executed in a random order chosen at each time step. Each process reads the current cell state, does some computation, and edits the cell state. This avoids needing to resolve conflicts among processes editing the same parts of the cell state.
- The metabolite copy number state is heavily edited by multiple processes, particularly common molecules like ATP and water. To avoid order dependent effects where the first process which is executed could consume all ATP, the pool of each metabolite is allocated among the processes. Then the processes can only read and write this allocated fraction of the pool of each metabolite. I'm working on generalizing this idea to all the states. This would truly isolate the processes from each other in line with our mathematical assumptions and allow you to execute the submodels in parallel. The challenge is how to extend the allocation idea to more complicated states such as the chromosome. Any suggestions here would be helpful.

## What is the purpose of the submodel "side effects"?

The side effects is a way to encode the impact of a process on parts of the state that the process doesn't "see". This arises with processes that interact with the chromosome. For example, the replication sub-model binds DNA polymerase to the chromosome. This can result in displacing other proteins (which the replication sub-model doesn't "see") bound to the chromosome. The side effects are a way to keep track of this information. One benefit of this way of organizing the information is that it makes it very clear what the impact of a process is on states that it doesn't see.The side effects also keep the processes from having to communicate with every state (keeps the processes and states as isolated as possible).

## How were the single-gene disruption strains categorized in Figure S2 of Karr et al., 2012?

- Low growth: μ = 0.1 μWT
- Med growth: 0.1 μWT < μ < 0.8 μWT
- Hi growth: μ = 0.8 μWT

## How were Csite and Cstate in Section 3.19 of Data S1 of Karr et al., 2012 fit?

The parameters were manually fit. Essentially we tried different combinations, ran simulations, and checked which are consistent with the experimentally measured cell cycle length. We did this first with reduced simulations containing just the replication initiation sub-model (in this case we compared the time to replication to the observed cell cycle length minus the simulated replication time minus the simulation cytokinesis time) and then with the full model. We began the search from the Csite and Cstate values used in previous computational models.

## How were the RNA polymerase transition probabilities chosen?

We decided to build a markov model. However, the only data we available to train the transition probabilities was the observation distribution of states. As I'm sure you know, the steady-state distribution corresponds to the eigenvector with eigenvalue 1. So the one constraint we had on the transition probability matrix was that the eigenvector with eigenvalue 1 should be equal to the observed steady-state distribution. The second constraint was on the transition probability out of the elongating state (line 1012) to the non-specifically bound state. This was set based on the rate of RNA synthesis required for cell division with a ~9h doubling time. The third and fourth constraints is no active transcription ends in free RNA pol, and RNA pol only progresses to the active state through the specifically bound state. This sets three matrix elements to zero. Still there are too many free parameters for the number of constraints. Next, I lumped the free and nonspecifically bound states together (line 1038). This reduces the number of parameters from 12 to 6, leaving you with 6 parameters and 6 constraints. The code you're referring to implements this logic (line 1020-1027). However, we ultimately manually tuned the values because we had a lot of trouble tuning the model parameters.

## Are all 525 genes included in the model?

Mycoplasma genitalium has 525 genes, of which 401 (or their homologs) have been studied experimentally. Of these 401, 284 are essential and 117 are non-essential. We modeled the functions of each of these 401 genes. The last 124 genes have not been studied experimentally. Consequently, we couldn't model the functions of these genes. These genes are highly biased toward membrane and lipoproteins, and include many partial ABC transporter cassettes. These genes are included in the model. They are replicated, transcribed, and translated.

## Is the model stochastic?

Yes. The model is stochastic.

## Can the stochasticity be turned off?

No. Stochasticity arises naturally from random RNA and protein synthesis. The only explicit noise terms which can be turned off appear in the calculation of the initial conditions.

## How many simulations are needed to sample the ensemble of cells?

We recommend running at least 8 simulations to sample the ensemble.

## What do instances of the Simulation class represent?

Instances of the Simulation class represent the instantaneous configuration of one cell. Because the model is stochastic, the code can be run multiple times to simulate multiple cells.

## How can I seed the random number generator?

Execute the applyOptions method of the Simulation class:

sim.applyOptions('seed', 1);

## How can I obtain the metabolic submodel's stochiometry matrix?

There are a couple ways to obtain the matrix:

- Access the properties of the metabolic submodel

met = sim.process('Metabolism'); met.reactionStoichiometryMatrix %matrix met.substrateWholeCellModelIDs %row labels met.reactionWholeCellModelIDs %col labels {'c', 'e', 'm'} %depth labels

- Export the stoichiometry matrix to Lindo .ltx format which can be read by the free program lpSolve

sim.process('Metabolism').formulateFBA([], [], false); edu.stanford.covert.cell.sim.util.FbaLPWriter.write_lindo(sim, 'path/to/outFile.ltx');

## How is reversibility encoded in the metabolic submodel?

Reversibility is encoded in the flux bounds (reactionBounds property, first col: lower bound, second col: upper bound, rows: reactions). Negative infinity lower bound and positive infinity upper bound indicates reversible. Zero lower bound and positive infinity upper bound indicates irreversible.

## How is reaction stoichiometry encoded in the stoichiometry (S) matrix?

The S-matrix is encoded in the same way as flux-balance analysis. The left-hand side has negative coefficients. The right-hand side has positive coefficients.

## How can I visualize the metabolic submodel's stoichiometry matrix?

We recommend using GraphViz to visualize the matrix. Several packages are available for use with MATLAB and R:

- MATLAB - GraphViz interface
- graphviz4matlab
- Rgraphviz

lpSolve is a useful packages for reading .ltx files. Bindings are available for both MATLAB and R.

## How can I obtain the substrate and enzyme copy numbers for the metabolic submodel?

The substrate and product copy numbers are stored by the counts property of the Metabolite class. The wholeCellModelIDs property stores the row labels. The columns correspond to different compartments, the first of which is the cytosol. Table S2E of Karr et al., 2012 lists the average predicted metabolite concentrations across a population of 128 wild-type cells.

## How can I obtain the predicted metabolic reaction fluxes?

The computed fluxes are stored in the fluxs property of the MetabolicReaction class. Positive indicates net flux in the forward direction. Negative indicates net flux in the reverse direction (only possible for reversible reactions).

## Where can I view UML diagrams of the whole-cell software?

Unfortunately, UML diagrams are not available. The m2html-generated documentation does include call diagrams.

## Where can I view documentation of the whole-cell model software?

Two automatically generated documentations are available:

- m2html
- doxygen

## How can I edit the structure of the model (e.g. add a gene, protein, etc.)?

The easiest way to edit the structure of the model to create a new subclass of SimulationRunner. See edu.stanford.covert.cell.sim.sim.runners.Repressilator for a complete example. Use the following to run a simulation with the modified network structure:

runSimulation('runner', 'Repressilator');

## How is the simulation output structured?

The complete output is stored under a separate directory for each simulation as a set of files with the names state-xxx.mat as well as several .mat containing important metadata.

Each state-xxx.mat file corresponding to 100 s of simulation. Within each file the counts property of the metabolite state will have dimensions ~600×6×100 for 600 molecules x 6 compartments (cytosol, membrane, etc.) x 100 s.

randStreamStates.mat contains the state of the random number generator at each time point of the simulation.

options.mat, parameters.mat, and fittedConstants.mat contain the values of the model's options and parameters.

metadata.mat contains important metadata including the name of the user who ran the simulation, the time when the simulation was run, and more.

summary.mat is a separate simulation log created by SummaryLogger.

## What are the units for the cell volume?

The units for the cell volume are liters (L).

## How can I animate a simulation?

- To generate our newest visualizations use WholeCellViz
- To generate our older animation published in Cell:

#import import edu.stanford.covert.cell.sim.analysis.OverviewAnimation #configure output outputPath = 'output'; simSubFolder = 1; animationFile = 'animation.avi'; #run simulation runSimulation('outDir', fullFile(outputPath, num2str(simSubFolder)), 'logToDisk', true) #make animation animator = OverviewAnimation(outputPath, simSubFolder, animationFile); animator.run();

## I'm trying to run the old WholeCellKB PHP software. How can I increase the PHP memory? How can I display errors?

Add/edit the following line to your PHP configuration file (e.g. /etc/php.ini on CentOS):

memory_limit = 256M error_reporting = E_ALL & ~E_NOTICE