CellML Discussion List

[cellml-discussion] Proposal: BCP for including external code in CellML models


  • From: ak.miller at auckland.ac.nz (Andrew Miller)
  • Subject: [cellml-discussion] Proposal: BCP for including external code in CellML models
  • Date: Mon, 19 Mar 2007 11:17:14 +1200

Matt wrote:
> I have often thought referencing external code through a clearly
> defined interface would be useful, and mostly because procedural code
> is another natural way to solve problems. But I have always banged my
> head up against validation. With procedural code this amounts to
> passing tests - good tests - and being confident that the code will
> break in useful ways when it does break. I don't see this as being any
> different to the intended outcome of valid CellML models that are
> purely declarative.
>
> At first glance it might seem that it is more taxing for a developer
> wanting to use CellML in their application if they need to handle
> external code; but this proposal for external code is very specific to
> the math declarations, and I think independent of whether the math is
> represented in MathML or as an external source of procedural code, the
> decisions of an application that is investigating the math are going
> to be difficult without sufficient annotation that tries to classify
> the math formulations in a way that a machine can filter what it is
> capable of and not capable of processing.
I deliberately don't address how to let the tools associate certain
external code with a given code-identifier URI. Initially, this would
have to be tool specific, but there could be another specification for
this process in the future.
> In some cases I imagine the
> application developer would welcome a particular math problem being
> already coded in a language that could be compiled and run. If that
> thought is continued, then there is a place for a model representation
> that has all math represented by external code, with the model
> structure being represented in CellML. This would obviously be under
> the assumption that some particular decisions for simulation of the
> model had been made; it is indeed a different scenario from the pure
> declarative model that seeks to explain the mathematical problem at a
> higher level and leave it to applications to resolve the simulation
> from this.
>
> At the moment we don't actually have a useful way for providing a
> cellml model with enough machine readable information for someone to
> rerun our model in exactly the same way as we had.
It is getting closer. PCEnv has its own non-standard meta-data for the
exact algorithm used, but we haven't been able to agree on a more
general standard way to represent algorithms with graceful fallback yet.
> By referencing
> and/or including external code, we allow the step of exchanging a
> model at the simulation level, which is actually not a bad thing if
> our goal is to promote collaboration of model building.
>
I don't know if this is true, as I am not suggesting that we allow the
stepping (integrator) algorithm itself to be exchanged (although that
could be a different, future specification if needed).
> I do think there is a possibility that people would abuse this; i.e.
> jump straight to binding bits of code here and there together with
> CellML; but if we maintain standards and best practice, then it should
> be easy to show them up. Also, perhaps we should trust people to
> evolve to only resorting to external code if it absolutely is the best
> way to solve their problem.
>
> There are a couple of things that we could possibly lose by bringing
> in external code:
> 1) producing human readable equations for publication that accurately
> reflect the mathematics in the model. Annotation of the algorithms or
> maths in the external code would help, but would not guarantee that
> the publication reflected exactly what was encoded in the model.
>
It is better that we have some equations in MathML than no equations at
all. In the cases which my BCP document is targeting, you probably would
want to represent the model in the paper as a mixture of equations (from
the MathML) and pseudo-code (which would probably be hand-written). The
CellML and machine-readable code would ideally be referred to in a
repository, or provided as a supplement.
> 2) ease of creating machine readable annotation for parts of the
> external code that would require it - for example under MIRIAM to bind
> each 'component' of a model to the relevant part of a reaction
> network. This is where you would be questioning the modeler as to
> whether their external code should be broken down and spread across
> models. But they may not have control over the external source, or,
> perhaps they are exchanging models that have necessarily lumped a lot
> of biological concepts into one piece of external code (library)
> because it's more efficient to solve that way; we now have a non
> MIRIAM compliant model.
>
> I would like to think including linking to external code in the CellML
> specification would push us to make a bigger effort on the procedures
> for model validation, and get more encouraging involvement of various
> modelers sitting out there with code that works; rather than thinking
> we will somewhere lose some high level elegance of CellML.
>
I agree. Validation (i.e. testing) procedures for a specific class of
CellML models containing external code (those containing machine
learning techniques) are part of my PhD project.
> Specific comments (quoted pieces from
> http://www.cellml.org/Members/miller/bcp-external-models/ are enclosed
> in triple quotes)
>
> """[CellML] models are very good at describing complete mathematical
> models in a format which can be exchanged between model authors and
> users. This adds significant value to a model representation, because
> third parties can take the model, and use it in their preferred
> software packages to reproduce any results the author published."""
>
> Need some clear examples of model types that cannot be expressed in
> CellML, i.e. some algorithms that are best (or only) expressible at
> the moment in procedural code. I know that various neural network
> models and genetic algorithm based learning systems have evolved
> mainly from procedural thought. I think we need to really consider
> that some problems would be much better understood by model authors if
> they are expressed in procedural code.
>
I'm not sure that we need this in the document, because the document is
intended to provide best current practice guidelines. We could make a
tutorial document (perhaps when the tools are better developed)
describing examples of external code and how to make them work in the tools.

Examples:
1) Most machine learning techniques.
2) Stochastic sub-models, where there might not be a closed-form
mathematical equation to go from the behaviour of individual parts of
the model to the concentrations of each state.
3) Certain mathematical functions (e.g. those arising from various
combinatorial problems) do not have a closed form, and will require
external code to perform a numerical solve.

> """Having part of a model expressed in CellML, and other parts
> expressed in some more generic language is still useful, because it
> means that the common part of the model can be re-used more easily,
> either by providing external code of a different kind, or, where
> possible by replacing the external code with MathML."""
>
> If external code can be replaced with MathML, then why wouldn't this
> have been in a CellML component in the first place?
>
Because the original model might need to use external code, but someone
has proposed a new model which no longer needs it.
> I see a pro and con where someone encodes most of a model in an
> external code block bound into a single component of a model. The pro
> would be that maybe this has helped promote someone actually bothering
> to use cellml - as a first step, they simply wrapped their existing
> code; in this case it would be up to repository maintainers to
> encourage a breakdown of the model. The con of course is that we lose
> model structure into the external code, and there is no way we can
> automatically extract that. It is therefore effectively hidden until
> broken down - if that ever makes sense for the model.
>
My guidelines try to discourage hiding model structure in external code
when it is possible to do otherwise.

I agree that providing the tools to include external code would allow an
incremental migration of models from procedural to CellML code. If this
process is convenient for people, I don't see a problem with having
transitional (non-published) models. I think that the current document
encourages anyone doing this to complete the process to the maximum
extent possible.
> """It is also hoped that this specification will encourage model
> developers to build up libraries of CellML accessible external code,
> which can be re-used in a range of CellML models, therefore increasing
> the range of modelling techniques available to CellML model
> authors."""
>
> I would see an open library of external code being very useful. There
> would need to be clear grading of that code, for example validating
> that code even compiles (if it needs to) and runs on x, y, z platforms.
>
There are lots of libraries like that out there already. Do you mean
something CellML specific? If so, I think that would be useful, but it
is a bit early now, because tools need to develop interfaces to external
code first, and start to standardise that a bit more. I don't think that
is a pre-requisite for agreeing on an external code document.
> """Best practice guidelines for CellML document authors"""
>
> """1. External code should be used only where a part of a model cannot
> be adequately expressed in CellML. External code is often
> non-portable, and using it reduces the re-usability of your model, and
> so it should only be used when needed."""
>
> yes
>
> """2. External code should only perform the calculations that CellML
> is unable to perform, with the rest of the calculations expressed as
> MathML, in the CellML model. This is important, because it increases
> the fraction of your model that can be more easily re-used by other
> modellers. It also means that CellML editing and visualisation
> software will allow your model to be edited and visualised better."""
>
> yes and no. I don't think representing in MathML offers any more ease
> for re-use unless you are all sharing a prescribed subset of MathML
> and agree on the acceptable forms of equations if algebraic
> manipulation is limited or not possible.
>
We do have a CellML-subset of MathML (which functions you can call). I
agree that the declarative use of MathML is somewhat limited by the fact
that most (all?) CellML tools can't do any symbolic algebra. However, at
the very least it becomes possible to perform Newton-Raphson solves (you
could potentially do this with procedural code too, of course, but you
would at least want the code to be broken up into minimal functions so
you don't have to do a multivariate solve, as requested in 3). However,
just re-ordering equations will allow us some potential for re-use,
especially for simple equations like x = y, or linear combinations of
other variables (which it would be quite reasonable for CellML software
to be able to manipulate, should there be sufficient demand for this).
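
As a rough illustration of the point about minimal functions, here is a
Python sketch (not taken from any CellML tool). It assumes a hypothetical
external function ext_y1(x1, x2) that has been kept separate from any y2
computation, so a tool that knows y1 and x2 can recover x1 with a scalar
Newton-Raphson iteration rather than a multivariate solve. The function
names, the finite-difference derivative and the tolerances are all
assumptions made for the example.

```python
def ext_y1(x1, x2):
    """Hypothetical external function: computes y1 from x1 and x2."""
    return x1 ** 3 + x2 * x1


def newton_solve(f, guess, tol=1e-10, max_iter=50, h=1e-7):
    """Find x with f(x) close to 0 by Newton-Raphson.

    The derivative is approximated by a central difference, since
    procedural external code gives us no symbolic derivative.
    """
    x = guess
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        dfx = (f(x + h) - f(x - h)) / (2 * h)
        if dfx == 0:
            raise ZeroDivisionError("derivative vanished; try another guess")
        x -= fx / dfx
    raise RuntimeError("Newton-Raphson did not converge")


def solve_x1(y1_known, x2_known, guess=1.0):
    """Re-order y1 = ext_y1(x1, x2) to obtain x1 from known y1 and x2."""
    return newton_solve(lambda x1: ext_y1(x1, x2_known) - y1_known, guess)
```

If ext_y1 had been fused with the y2 computation into one function, the
same re-ordering would need a solve over both outputs at once.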

> """3. Modellers should, where feasible, separate external code into as
> many different sub-functions as possible. For example, if you have
> external code to compute y1 from x1 and x2, and y2 from x1 and x2, you
> should write this as two separate external function applications,
> unless there is a compelling reason to do otherwise (such as is the
> case if it is much more efficient to compute them together). Doing
> this makes it easier to modify the CellML model in the future, and
> allows the CellML processing software to determine the order in which
> expressions are evaluated, making your model more flexible."""
>
> see above ... the compromise will always be the amount of information
> you can extract out of the model for other purposes - for example for
> model reuse, for simply visualizing and understanding the makeup of
> the model, for publication. It could be compelling enough for people
> to produce at least one highly broken down model along with the one
> fitted for optimization.
>
> """4. External code should, by itself, meet [MIRIAM] requirements 1
> and 2. This means that the external code should be encoded in a
> public, machine-readable format, and it should be valid and
> compilable."""
>
> It should meet all the criteria of MIRIAM compliance as part of being
> a model on the whole.
Together with guideline 5, that is still implicitly required. I added
this guideline in because we don't want to encourage people to publish
their CellML XML, but not include the external code in an adequate
format, and then claim that they have shared their model.

The external code by itself isn't a model, so can't comply with the rest
of the MIRIAM requirements. If people are publishing the model, the
model as a whole should be MIRIAM compliant, but that is a general
concern, and is not specific to this document (hence it wouldn't make
sense to give it as a guideline). I have tried to choose the
best-practice guidelines so that modellers who follow them, as well as
the CellML specification, will create a maximally useful, MIRIAM
compliant model. I think that the goal as it is now is useful for this.

> The test cases are going to be very important I
> think in assuring the quality of external code.
I'm not sure that testing the external code in isolation would always be
useful, simply because there may not be any data about what outputs the
external code should produce for a given input (i.e. the external code
can only be tested by how well it works in the context of the entire model).
> You might make the
> case the external code is wrapped in its own model which itself would
> need to be fully MIRIAM compliant. The MIRIAM document is a bit weak
> around the edges of things like validation and the annotation of
> 'components' of a model.
Hence why we need this guideline, which essentially requests that
external code is written in a real programming language, and is valid
under the rules of that programming language.
> I think we need to be clear about what
> validation is necessary for models that reference external code.
>
This is no different from the standard rules for validation, although in
the machine learning case we do of course need to be careful that we are
learning biology and not just the data we are testing with. However,
this is a concern for another document, and could still apply to pure
CellML models.
> I would still like more clarification of how important MIRIAM is to
> this; especially in that I think the requirements of MIRIAM haven't
> really been designed with typical procedural code examples in mind. I
> don't think MIRIAM can't cope with it.
>
> """5. The external code should be treated as part of the model. When a
> model represented in CellML is published, the external code should be
> published alongside it, unless it is part of a generally available
> library of external code."""
>
> The latter part worries me a little. Enter license bewilderment. But see 6.
>
> """6. The definitionURL used on csymbol elements should be a URL under
> the control of the author. It is not necessary for there to actually
> be a document accessible at the URL, as it is merely intended as a
> unique identifier."""
>
> What happens with multiple authors? Will an author always guarantee a
> method for creating a URL? I think this problem is related to 5. For
> example, if the source code for an external component is submitted to
> a repository and becomes licensed according to that, then the URL
> should probably be related to that. So I think ultimately the domain
> that wants to guarantee that the source is perpetually available
> should be the domain that forms the base of the URL.
>
I meant author of the procedural code, not author of the CellML model.
If the author of the procedural code doesn't provide a URL, the model
author may need to make one under their control. I guess it would be
worth clarifying that more.

If the code has been written by one author, and then subsequently taken
and modified by another, it should be the second author who specifies
the URL, because we want to identify the modified version, not the original.

The system is intended to simply provide a unique URI for the code.
There is not supposed to be any implication that the code can be fetched
from any location. The reason for requiring the domain be under the
control of the 'author' is simply to avoid collisions.

Perhaps:
"The definitionURL used on csymbol elements should be a URL which
uniquely identifies the external code being used. It is not necessary
for there to actually be a document accessible at the URL, as it is
merely intended as a unique identifier. If no existing URL has been
assigned for the particular version of the external code being
referenced, a new URL may be assigned. To avoid collisions, the person
or entity assigning a URL should choose a URL under their control."

That way, if a repository allocates URLs for external code, model
authors could use that URL, which would help to ensure that commonly
used external code has a single well-known URL.
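
Since the association between a code-identifier URI and the actual code
is tool specific for now, one way a tool might handle it is a simple
lookup table. The following Python sketch is only an illustration: the
ExternalCodeRegistry class, its methods and the example URL are
hypothetical, and nothing here fetches anything from the URL, which is
treated purely as an opaque identifier.

```python
class ExternalCodeRegistry:
    """Maps definitionURL strings (opaque identifiers) to callables."""

    def __init__(self):
        self._functions = {}

    def register(self, definition_url, function):
        # Refuse to silently rebind a URL: one URL should identify
        # exactly one version of the external code.
        if definition_url in self._functions:
            raise ValueError(f"{definition_url} is already registered")
        self._functions[definition_url] = function

    def resolve(self, definition_url):
        # The URL is never dereferenced; it is only a dictionary key.
        try:
            return self._functions[definition_url]
        except KeyError:
            raise LookupError(
                f"no external code known for {definition_url}"
            ) from None


registry = ExternalCodeRegistry()
# Hypothetical URL, chosen under a domain the assigner controls.
URL = "http://example.org/external-code/saturating-rate/v1"
registry.register(URL, lambda s, k: s / (k + s))
rate = registry.resolve(URL)(2.0, 2.0)
```

A modified version of the code would be registered under a new URL, so
that models referring to the original keep resolving to the original.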
> cheers
> Matt
>
>




