CellML Discussion List

[cellml-discussion] Proposal: BCP for including external code in CellML models


  • From: matt.halstead at auckland.ac.nz (Matt )
  • Subject: [cellml-discussion] Proposal: BCP for including external code in CellML models
  • Date: Mon, 19 Mar 2007 15:02:49 +1200

On 3/19/07, Andrew Miller <ak.miller at auckland.ac.nz> wrote:
> Matt wrote:
> > I have often thought referencing external code through a clearly
> > defined interface would be useful, and mostly because procedural code
> > is another natural way to solve problems. But I have always banged my
> > head up against validation. With procedural code this amounts to
> > passing tests - good tests - and being confident that the code will
> > break in useful ways when it does break. I don't see this as being any
> > different to the intended outcome of valid CellML models that are
> > purely declarative.
> >
> > At first glance it might seem that it is more taxing for a developer
> > wanting to use CellML in their application if they need to handle
> > external code; but this proposal for external code is very specific to
> > the math declarations. I think that, independent of whether the math is
> > represented in MathML or as an external source of procedural code, the
> > decisions of an application investigating the math are going
> > to be difficult without sufficient annotation that classifies
> > the math formulations in a way that lets a machine filter what it is
> > and is not capable of processing.
> I deliberately don't address how to let the tools associate certain
> external code with a given code-identifier URI. Initially, this would
> have to be tool specific, but there could be another specification for
> this process in the future.

Right. I don't think your specification needs to address this; I think
it is a more general problem of model exchange between software
systems where at present there can be quite a bit of work required by
the system to figure out if it can interpret the math. At the moment
the model-interchange tools we know of - PCEnv, COR, JSim - all interpret
models that are systems of ordinary differential equations and mostly
model the same kinds of biological processes in similar ways. This
is very convenient. I don't think the external code proposal
complicates this.

> > In some cases I imagine the
> > application developer would welcome a particular math problem being
> > already coded in a language that could be compiled and run. If that
> > thought is continued, then there is a place for a model representation
> > that has all math represented by external code, with the model
> > structure being represented in CellML. This would obviously be under
> > the assumption that some particular decisions for simulation of the
> > model had been made; it is indeed a different scenario from the pure
> > declarative model that seeks to explain the mathematical problem at a
> > higher level and leave it to applications to resolve the simulation
> > from this.
> >
> > At the moment we don't actually have a useful way for providing a
> > CellML model with enough machine-readable information for someone to
> > rerun our model in exactly the same way as we did.
> It is getting closer. PCEnv has its own non-standard meta-data for the
> exact algorithm used, but we haven't been able to agree on a more
> general standard way to represent algorithms with graceful fallback yet.
> > By referencing
> > and/or including external code, we allow the step of exchanging a
> > model at the simulation level, which is actually not a bad thing if
> > our goal is to promote collaboration of model building.
> >
> I don't know if this is true, as I am not suggesting that we allow the
> stepping (integrator) algorithm itself to be exchanged (although that
> could be a different, future specification if needed).

Hmm, not necessarily the integrator algorithm, but perhaps the math of
individual components that are computed. People may share a generic
CellML integrator application, but find it easier to exchange model
descriptions that have already had the math reformulated into library
calls, where they share the libraries (and possibly develop them in
parallel) without the need to hook them up to the integration
environment through an automatic interpretation of the MathML. That
could be a productive win, but at the cost of quite possibly having
the MathML version out of sync with the external code/library version.

Allowing external code is going to allow this workflow to happen.

One of the values of using MathML was to allow us to publish the math
in a model and know that it faithfully represents the code that was
used to generate the results published alongside it. I think this
falls over a little where we, say, represent something awkwardly in
MathML just for the sake of keeping it in MathML - e.g. mega huge
piecewise functions for representing perturbations - would we really
want to render these out for publication? You allude to something
similar in your response below. What would be the most appropriate way
to produce these parts for publication while still remaining as
error-free in translation as possible? I think, as you suggest,
publishing the code that is actually used - or at least the set of
tests the library passed to be considered applicable to your context -
would be the minimum needed to ensure that a publication can be
validated as exactly representing what was run to produce the
results. You will still always have the
possibility of error between the algorithm written by hand and the
behaviour of what is actually coded.
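
To make the piecewise point concrete, here is a rough sketch of a
single stimulus pulse in content MathML (the variable names are made
up, and I have left the cellml:units attributes off the cn elements):

  <apply><eq/>
    <ci>i_stim</ci>
    <piecewise>
      <piece>
        <cn>20</cn>
        <apply><and/>
          <apply><geq/><ci>time</ci><cn>10</cn></apply>
          <apply><leq/><ci>time</ci><cn>10.5</cn></apply>
        </apply>
      </piece>
      <otherwise>
        <cn>0</cn>
      </otherwise>
    </piecewise>
  </apply>

A real perturbation protocol repeats that piece element for every
step, which is exactly the sort of thing I would not want rendered out
in a publication.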

Edmund wondered the other day about a voting system for published
models. Everyone votes +1 for published models that actually write
things out without error and work when you try to implement them. This
is a less formal way of achieving a similar level of quality in
publications, and perhaps is a useful fallback for publications where
we have to assume that the hand-written algorithms accurately reflect
the actual implementation.


> > I do think there is a possibility that people would abuse this; i.e.
> > jump straight to binding bits of code here and there together with
> > CellML; but if we maintain standards and best practice, then it should
> > be easy to show them up. Also, perhaps we should trust people to
> > evolve to only resorting to external code if it absolutely is the best
> > way to solve their problem.
> >
> > There are a couple of things that we could possibly lose by bringing
> > in external code:
> > 1) producing human readable equations for publication that accurately
> > reflect the mathematics in the model. Annotation of the algorithms or
> > maths in the external code would help, but would not guarantee that
> > the publication reflected exactly what was encoded in the model.
> >
> It is better that we have some equations in MathML than no equations at
> all. In the cases which my BCP document is targeting, you probably would
> want to represent the model in the paper as a mixture of equations (from
> the MathML) and pseudo-code (which would probably be hand-written). The
> CellML and machine-readable code would ideally be referred to in a
> repository, or provided as a supplement.

Sure. See comment above.

> > 2) ease of creating machine readable annotation for parts of the
> > external code that would require it - for example under MIRIAM to bind
> > each 'component' of a model to the relevant part of a reaction
> > network. This is where you would be questioning the modeler as to
> > whether their external code should be broken down and spread across
> > models. But they may not have control over the external source, or,
> > perhaps they are exchanging models that have necessarily lumped a lot
> > of biological concepts into one piece of external code (a library)
> > because it's more efficient to solve that way; we now have a
> > non-MIRIAM-compliant model.
> >
> > I would like to think including linking to external code in the CellML
> > specification would push us to make a bigger effort on the procedures
> > for model validation, and encourage more involvement from the various
> > modellers sitting out there with code that works, rather than thinking
> > we will somehow lose some high-level elegance of CellML.
> >
> I agree. Validation (i.e. testing) procedures for a specific class of
> CellML models containing external code (those containing machine
> learning techniques) are part of my PhD project.
> > Specific comments (quoted pieces from
> > http://www.cellml.org/Members/miller/bcp-external-models/ are enclosed
> > in triple quotes)
> >
> > """[CellML] models are very good at describing complete mathematical
> > models in a format which can be exchanged between model authors and
> > users. This adds significant value to a model representation, because
> > third parties can take the model, and use it in their preferred
> > software packages to reproduce any results the author published."""
> >
> > Need some clear examples of model types that cannot be expressed in
> > CellML, i.e. some algorithms that are best (or only) expressible at
> > the moment in procedural code. I know that various neural network
> > models and genetic algorithm based learning systems have evolved
> > mainly from procedural thought. I think we need to really consider
> > that some problems would be much better understood by model authors if
> > they are expressed in procedural code.
> >
> I'm not sure that we need this in the document, because the document is
> intended to provide best current practice guidelines. We could make a
> tutorial document (perhaps when the tools are better developed)
> describing examples of external code and how to make them work in the tools.
>

I think to get this proposal accepted (or rejected) it needs to
provide some end-to-end examples that are readable and point out the
application of the guidelines. Perhaps something simple - e.g. some
simple algebraic equations so that they can be compared directly to the
pure declarative CellML form (I know these should be obvious, but
still, having some written down helps people). But also something more
complicated - e.g. provide the CellML model with the external code
stub in the MathML and a hand-written algorithm for something that
just isn't going to be expressible in MathML.
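
As a strawman of the simple case (made-up variable names, a
placeholder definitionURL, and assuming the csymbol is used as the
operator of an apply, which is how I read your document), the pure
declarative form:

  <apply><eq/>
    <ci>y</ci>
    <apply><plus/>
      <ci>x1</ci>
      <ci>x2</ci>
    </apply>
  </apply>

and the same relationship delegated to an external code stub:

  <apply><eq/>
    <ci>y</ci>
    <apply>
      <csymbol definitionURL="http://www.example.org/someone/code/add-v1">add</csymbol>
      <ci>x1</ci>
      <ci>x2</ci>
    </apply>
  </apply>

A couple of these side by side, plus one genuinely hard case with a
hand-written algorithm next to the external code stub, would go a long
way.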

Perhaps what I am also asking here is to clearly write out an example
of your motivation for this proposal; at the moment we can only
interpret this as a thought for the future in your work (and I think a
valid thought), but we tend not to want to complicate CellML on the
premise that it may be useful in the future, nor to think we can
adequately conceptualise this and resolve it to specification-level
detail without explicit examples to walk through.

> Examples:
> 1) Most machine learning techniques.
> 2) Stochastic sub-models, where there might not be a closed-form
> mathematical equation to go from the behaviour of individual parts of
> the model to the concentrations of each state.
> 3) Certain mathematical functions (e.g. those arising from various
> combinatorial problems) do not have a closed form, and will require
> external code to perform a numerical solve.

Something from 2 or 3 or both would be nice.

>
> > """Having part of a model expressed in CellML, and other parts
> > expressed in some more generic language is still useful, because it
> > means that the common part of the model can be re-used more easily,
> > either by providing external code of a different kind, or, where
> > possible by replacing the external code with MathML."""
> >
> > If external code can be replaced with MathML, then why wouldn't this
> > have been in a CellML component in the first place?
> >
> Because the original model might need to use external code, but someone
> has proposed a new model which no longer needs it.

Can you explain 'need' in this context and how at some point the
'need' disappears?

> > I see a pro and con where someone encodes most of a model in an
> > external code block bound into a single component of a model. The pro
> > would be that maybe this has helped promote someone actually bothering
> > to use cellml - as a first step, they simply wrapped their existing
> > code; in this case it would be up to repository maintainers to
> > encourage a breakdown of the model. The con of course is that we lose
> > model structure into the external code, and there is no way we can
> > automatically extract that. It is therefore effectively hidden until
> > broken down - if that ever makes sense for the model.
> >
> My guidelines try to discourage hiding model structure in external code
> when it is possible to do otherwise.

Right, the guidelines do discourage it; it would also be nice to be
able to show it up when it happens. Perhaps some metrics should be
defined, e.g. counting how many variables in a component's interface
receive their source values from a single piece of external code.

>
> I agree that providing the tools to include external code would allow an
> incremental migration of models from procedural to CellML code. If this
> process is convenient for people, I don't see a problem with having
> transitional (non-published) models. I think that the current document
> encourages anyone doing this to complete the process to the maximum
> extent possible.
> > """It is also hoped that this specification will encourage model
> > developers to build up libraries of CellML accessible external code,
> > which can be re-used in a range of CellML models, therefore increasing
> > the range of modelling techniques available to CellML model
> > authors."""
> >
> > I would see an open library of external code being very useful. There
> > would need to be clear grading of that code, for example validating
> > that code even compiles (if it needs to) and runs on x, y, z platforms.
> >
> There are lots of libraries like that out there already. Do you mean
> something CellML specific?

Yes.

> If so, I think that would be useful, but it
> is a bit early now, because tools need to develop interfaces to external
> code first, and start to standardise that a bit more. I don't think that
> is a pre-requisite for agreeing on an external code document.

I agree. It would be nice to make it as painless as possible for
people to understand what they need to provide for a model to achieve
a good repository rating. While this is definitely a more general
CellML issue, I imagine there would need to be a best practice for
providing tests, metadata, and code (e.g. minimum documentation, build
instructions, test code, URL, etc.), and a description of the algorithm.
Perhaps there is a skeleton test harness API that people need to
implement.
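
Just to sketch the kind of metadata I mean (the vocabulary below is
completely made up, not an existing CellML or MIRIAM vocabulary, and
the URLs are placeholders), something along these lines could sit in
the model's RDF:

  <rdf:Description rdf:about="http://www.example.org/someone/code/add-v1">
    <!-- hypothetical properties, for illustration only -->
    <ex:language>C</ex:language>
    <ex:documentation rdf:resource="http://www.example.org/someone/code/add-v1/README"/>
    <ex:buildInstructions rdf:resource="http://www.example.org/someone/code/add-v1/INSTALL"/>
    <ex:testSuite rdf:resource="http://www.example.org/someone/code/add-v1/tests/"/>
    <ex:algorithmDescription rdf:resource="http://www.example.org/someone/code/add-v1/algorithm"/>
  </rdf:Description>

The skeleton test harness would then just be whatever minimal
interface a tool needs in order to run the referenced test suite and
report pass/fail.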

> > """Best practice guidelines for CellML document authors"""
> >
> > """1. External code should be used only where a part of a model cannot
> > be adequately expressed in CellML. External code is often
> > non-portable, and using it reduces the re-usability of your model, and
> > so it should only be used when needed."""
> >
> > yes
> >
> > """2. External code should only perform the calculations that CellML
> > is unable to perform, with the rest of the calculations expressed as
> > MathML, in the CellML model. This is important, because increasings
> > the fraction of your model can be more easily re-used by other
> > modellers. It also means that CellML editing and visualisation
> > software will allow your model to be edited and visualised better."""
> >
> > yes and no. I don't think representing in MathML offers any more ease
> > for re-use unless you are all sharing a prescribed subset of MathML
> > and agree on the acceptable forms of equations if algebraic
> > manipulation is limited or not possible.
> >
> We do have a CellML-subset of MathML (which functions you can call). I
> agree that the declarative use of MathML is somewhat limited by the fact
> that most (all?) CellML tools can't do any symbolic algebra. However, at
> the very least it becomes possible to perform Newton-Raphson solves (you
> could potentially do this with procedural code too, of course, but you
> would at least want the code to be broken up into minimal functions so
> you don't have to do a multivariate solve, as requested in 3). However,
> just re-ordering equations will allow us some potential for re-use,
> especially for simple equations like x = y, or linear combinations of
> other variables (which it would be quite reasonable for CellML software
> to be able to manipulate, should there be sufficient demand for this).
>
> > """3. Modellers should, where feasible, separate external code into as
> > many different sub-functions as possible. For example, if you have
> > external code to compute y1 from x1 and x2, and y2 from x1 and x2, you
> > should write this as two separate external function applications,
> > unless there is a compelling reason to do otherwise (such as is the
> > case if it is much more efficient to compute them together). Doing
> > this makes it easier to modify the CellML model in the future, and
> > allows the CellML processing software to determine the order in which
> > expressions are evaluated, making your model more flexible."""
> >
> > see above ... the compromise will always be the amount of information
> > you can extract out of the model for other purposes - for example for
> > model reuse, for simply visualizing and understanding the makeup of
> > the model, for publication. It could be compelling enough for people
> > to produce at least one highly broken down model along with the one
> > fitted for optimization.
> >
> > """4. External code should, by itself, meet [MIRIAM] requirements 1
> > and 2. This means that the external code should be encoded in a
> > public, machine-readable format, and it should be valid and
> > compilable."""
> >
> > It should meet all the criteria of MIRIAM compliance as part of being
> > a model on the whole.
> Together with guideline 5, that is still implicitly required. I added
> this guideline in because we don't want to encourage people to publish
> their CellML XML, but not include the external code in an adequate
> format, and then claim that they have shared their model.
>
> The external code by itself isn't a model, so can't comply with the rest
> of the MIRIAM requirements.

Anyone is free to isolate the external code part of the math into its
own component and put this into its own model. At this point, the
model would have some degree of MIRIAM compliance. I would think this
is a useful test of whether a component with some external code fouls
a model's MIRIAM compliance. How much of testing MIRIAM
compliance should be automatable?
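
For example (reusing the made-up add function and placeholder URL from
above, with units set to dimensionless just to keep the sketch short),
the external part could be pulled out into its own component and then
wrapped in a minimal model of its own:

  <component name="external_add">
    <variable name="x1" units="dimensionless" public_interface="in"/>
    <variable name="x2" units="dimensionless" public_interface="in"/>
    <variable name="y" units="dimensionless" public_interface="out"/>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply><eq/>
        <ci>y</ci>
        <apply>
          <csymbol definitionURL="http://www.example.org/someone/code/add-v1">add</csymbol>
          <ci>x1</ci>
          <ci>x2</ci>
        </apply>
      </apply>
    </math>
  </component>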

> If people are publishing the model, the
> model as a whole should be MIRIAM compliant, but that is a general
> concern, and is not specific to this document (hence it wouldn't make
> sense to give it as a guideline). I have tried to choose the
> best-practice guidelines so that modellers who follow them, as well as
> the CellML specification, will create a maximally useful, MIRIAM
> compliant model. I think that the goal as it is now is useful for this.
>
> > The test cases are going to be very important I
> > think in assuring the quality of external code.
> I'm not sure that testing the external code in isolation would always be
> useful, simply because there may not be any data about what outputs the
> external code should produce for given inputs (i.e. the external code
> can only be tested by how well it works in the context of the entire model).

I need some more detail here. If someone following the best practices
in the guidelines provides an external function, then shouldn't they be
able to describe the valid domain and range of that function and be
able to demonstrate in isolation that it performs as expected? That's
all I was getting at. A full integration test within the whole model
is something else.

> > You might make the
> > case that the external code is wrapped in its own model, which itself
> > would need to be fully MIRIAM compliant. The MIRIAM document is a bit weak
> > around the edges of things like validation and the annotation of
> > 'components' of a model.
> Hence we need this guideline, which essentially requests that
> external code is written in a real programming language, and is valid
> under the rules of that programming language.

You were quite theoretical about what a real programming language was.
In the practical sense it's going to be those languages that we can
compile across the platforms required. If someone produces a
Turing-complete language and implements a compiler for it for the Z80
instruction set, then this is not so helpful to us. Should we
encourage something more practical, such as any 'real' (whatever that
is deemed to be) language for which there is a freely available
compiler for at least one of some agreed set of platforms?

> > I think we need to be clear about what
> > validation is necessary for models that reference external code.
> >
> This is no different from the standard rules for validation (although we
> do of course need to be careful in the machine learning case that we are
> learning biology and not the same data we are testing with. However,
> this is a concern for another document, and could still apply to pure
> CellML models).
> > I would still like more clarification of how important MIRIAM is to
> > this; especially in that I think the requirements of MIRIAM haven't
> > really been designed with typical procedural code examples in mind.
> > That's not to say MIRIAM can't cope with it.
> >
> > """5. The external code should be treated as part of the model. When a
> > model represented in CellML is published, the external code should be
> > published alongside it, unless it is part of a generally available
> > library of external code."""
> >
> > The latter part worries me a little. Enter license bewilderment. But see
> > 6.
> >
> > """6. The definitionURL used on csymbol elements should be a URL under
> > the control of the author. It is not necessary for there to actually
> > be a document accessible at the URL, as it is merely intended as a
> > unique identifier."""
> >
> > What happens with multiple authors? Will an author always guarantee a
> > method for creating a URL? I think this problem is related to 5. For
> > example, if the source code for an external component is submitted to
> > a repository and becomes licensed according to that, then the URL
> > should probably be related to that. So I think ultimately the domain
> > that wants to guarantee that the source is perpetually available
> > should be the domain that forms the base of the URL.
> >
> I meant author of the procedural code, not author of the CellML model.
> If the author of the procedural code doesn't provide a URL, the model
> author may need to make one under their control. I guess it would be
> worth clarifying that more.
>
> If the code has been written by one author, and then subsequently taken
> and modified by another, it should be the second author who specifies
> the URL, because we want to identify the modified version, not the original.
>
> The system is intended to simply provide a unique URI for the code.
> There is not supposed to be any implication that the code can be fetched
> from any location. The reason for requiring the domain be under the
> control of the 'author' is simply to avoid collisions.
>
> Perhaps:
> "The definitionURL used on csymbol elements should be a URL which
> uniquely identifies the external code being used. It is not necessary
> for there to actually be a document accessible at the URL, as it is
> merely intended as a unique identifier. If no existing URL has been
> assigned for the particular version of the external code being
> referenced, a new URL may be assigned. To avoid collisions, the person
> or entity assigning a URL should choose a URL under their control."
>
> That way, if a repository allocates URLs for external code, model
> authors could use that URL, which would help to ensure that commonly
> used external code has a single well-known URL.

Sure.
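
So, with that wording, if I took someone's routine and modified it I
might mint something like this (placeholder domain, just to show the
shape):

  <csymbol definitionURL="http://www.example.org/matt/code/markov-solver-2007-03-19">markov_solver</csymbol>

and if a repository later allocates its own URL for that version of
the code, new models would use the repository's URL instead.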

> > cheers
> > Matt
> >
> >
>
> _______________________________________________
> cellml-discussion mailing list
> cellml-discussion at cellml.org
> http://www.cellml.org/mailman/listinfo/cellml-discussion
>

cheers
Matt



