CellML Discussion List



[cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?


  • From: ak.miller at auckland.ac.nz (Andrew Miller)
  • Subject: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?
  • Date: Thu, 26 Apr 2007 11:58:04 +1200

Randall Owen wrote:
> In message <462D5FD0.4080701 at auckland.ac.nz> CellML Discussion List
> <cellml-discussion at cellml.org> writes:
>
>> Hi,
>>
>> I am working on developing a CellML model (using external code) of
>> transcriptional control in yeast which is 23 MB in size. I hope to
>> eventually do a similar thing for organisms which have much more
>> complicated sets of interactions, in which case this size may grow
>> substantially.
>>
>> If anyone on this list is interested in similar problems (I presume
>> similar issues come up in a range of systems biology problems, whether
>> you are working with CellML or SBML), I would welcome your feedback and
>> suggestions, and perhaps we could collaborate.
>>
>> This creates some unique issues for CellML processing tools:
>> 1) Just parsing the CellML model (especially with a DOM-type parser
>> which stores all the nodes into a tree, but probably with any type of
>> parser) is very slow.
>> 2) The CellML model might not all fit in memory at the same time,
>> especially if the model gets to be multi-gigabyte. It might be possible
>> to make use of swap to deal with this, but if the algorithms don't have
>> explicit control over when things are swapped in and out, it will be
>> hard to work with such a model.
>> 3) The CellML model is much larger than it needs to be, which makes it
>> inconvenient to exchange with third parties.
>> 4) The current CellML API implementation has been designed for maximum
>> flexibility ('one size fits all'), but this flexibility (e.g. supporting
>> live iterators, access to arbitrary extension elements, and so on) is
>> expensive for very large models. Much of this expensive functionality is
>> probably unnecessary for most tools, although what is and is not
>> necessary depends on the tool being used.
>>
>
> This might be an area in which lazy evaluation and
> parametrically-polymorphic programming might offer some advantages. In
> particular, with lazy evaluation, data structures are constructed and
> arguments evaluated only when needed.
The code currently uses top-down dynamic programming, which is
essentially the same thing here: it computes things like iterators
exactly as they are needed, but caches any data it can until the cache
is invalidated by a change to a global serial number.
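
As a rough illustration, the caching scheme amounts to something like
the following Python sketch (the real implementation is C++, and all
names here are invented):

# Top-down caching guarded by a global serial number.
class Model:
    def __init__(self):
        self.serial = 0       # bumped on every change to the model
        self._cache = {}      # key -> (serial at compute time, value)

    def change(self):
        self.serial += 1      # invalidates every cached result at once

    def cached(self, key, compute):
        entry = self._cache.get(key)
        if entry is not None and entry[0] == self.serial:
            return entry[1]   # cached result is still valid
        value = compute()     # computed on demand (top-down)
        self._cache[key] = (self.serial, value)
        return value
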
> In fact, demand-driven programming allows finite results to be obtained
> from infinite data structures. This might alleviate having to store the
> entire model in memory at one time. In addition, in many cases intermediate
> data structures, though implied in the program, may not have to actually
> exist.
>
The way that these things currently work internally is as follows:
1) The entire XML file is parsed into our internal DOM representation,
all in one go, synchronously, at load time. To improve
on this, we would have to keep referring back to the XML file (which
could be a network resource) every time we required more data from the
model. We could cache the file to avoid parsing it, but XML is not a
very good format for random access, so I suspect this would cause worse
performance problems, unless we had a binary format optimised for this.

2) A single top-level object for the model is created. This object is
created in constant time when the model is loaded.

Using this top-level object, it is then possible to iterate over the
contents of the model. At this point, CellML element objects are
created, and a mapping is set up between them and their corresponding
DOM element objects (see the sketch after step 3 below). This mapping is
permanent in the sense that it persists even if the model is changed
(although it is removed when the CellML element is deleted).

3) Additional objects are created and cached as required, but this cache
is subject to expiry if the latest serial number (bumped up each time a
change is made) doesn't match the serial number on the cache.
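
For concreteness, the mapping in step 2 could be sketched in Python as
follows (the real implementation is C++; the class and method names are
invented):

# Lazy DOM-element -> CellML-element mapping.
class CellMLElement:
    def __init__(self, dom_node):
        self.dom_node = dom_node    # the underlying DOM element

class ElementMap:
    def __init__(self):
        self._wrappers = {}         # id(dom_node) -> CellMLElement

    def wrap(self, dom_node):
        # Create the CellML element object on first access only, then
        # reuse it, so the mapping persists even if the model changes.
        key = id(dom_node)
        if key not in self._wrappers:
            self._wrappers[key] = CellMLElement(dom_node)
        return self._wrappers[key]

    def unwrap(self, dom_node):
        # Drop the mapping when the CellML element is deleted.
        self._wrappers.pop(id(dom_node), None)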

I think any approach to do better has to be based upon bottom-up dynamic
programming. This means that we need to know something about what
information we will need before we use it (which is not so great for a
general API, which is why we need a second, slimmed-down version).
Things we can do:
a) Not storing information which is not needed for a particular tool
(we would need to specify somehow what information is needed, or perhaps
more simply, just target the slimmed-down API at certain classes of
tools, excluding e.g. editors, which need all information so they can
re-serialise the model without loss); see the streaming sketch after
this list.
b) Performing some calculations early, with some view of what
information is required. This allows us, for example, to throw away any
data we know we won't need.
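
As an illustration of (a), a single streaming pass which keeps only the
information one class of tool needs, and discards everything else, might
look like this (a Python sketch; the handler and the choice of what to
keep are hypothetical):

import xml.sax

# One pass over the file, recording only variable names per component.
class VariableCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.component = None
        self.variables = {}   # component name -> list of variable names

    def startElement(self, name, attrs):
        local = name.split(':')[-1]   # ignore any namespace prefix
        if local == 'component':
            self.component = attrs.get('name')
            self.variables.setdefault(self.component, [])
        elif local == 'variable' and self.component is not None:
            self.variables[self.component].append(attrs.get('name'))

    def endElement(self, name):
        if name.split(':')[-1] == 'component':
            self.component = None

handler = VariableCollector()
xml.sax.parse('model.cellml', handler)   # constant memory, one pass

Everything not captured by the handler is simply never stored, which is
what an editor could not tolerate but a simulator would not miss.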

The current API has been built assuming that the model could be changed
at any point and that any changes need to be re-serialised out, so it
tries not to do any more work in terms of saving data than it needs to.

Examples of major issues:
=> Not much indexing is feasible with the current API (it does index by
names, but doesn't build any more complex indices, because in tools
which keep changing the model, this would be too slow). Building indices
is expensive, but it can give significant gains in the long run if we
know that we will make good use of the index. The slimmed down API could
probably make this assumption, while the normal, read-write API cannot.
=> Elements can appear in any order in the CellML file, and there are
no separate container elements for different types of elements. This
means that we often have to do a lot of linear-time searches through a
long list of elements (the API does build tree-based indices for looking
up names, but it is not feasible to do much more if the model is prone
to constant changes and iterators must remain live). If we can assume
there are no changes, we could build an index of elements, and look them
up efficiently.
=> The graph of connected variables is very expensive to deal with if
you can't assume that it is worthwhile to build a data structure for
it. For example, if you are iterating over all variables which are
connected to a given variable, it becomes necessary to take into account
the possibility that a new variable will be added or a variable will be
removed while the iterator is still being used, which makes the
algorithm very expensive. On the other hand, if you can assume no
changes, you can just build a disjoint-set data structure starting with
each variable in its own set, then process the connections, merging sets
as you go (i.e. roughly linear time to process all variables in the
model).
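
A minimal Python sketch of that disjoint-set approach (the input
representation, a list of variables plus a list of connection pairs, is
invented for illustration):

# Group connected variables using a disjoint-set (union-find) structure.
def connected_variable_sets(variables, connections):
    parent = {v: v for v in variables}  # each variable in its own set

    def find(v):
        # Find the root of v's set, compressing the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in connections:
        # Merge the two sets joined by each connection.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for v in variables:
        # Collect the variables belonging to each set.
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

Once built, 'iterate all variables connected to this one' is just a
lookup into the precomputed group.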

>
>
>> In practice, nearly all existing CellML-specific tools handle the file
>> badly. For example, PCEnv runs out of memory if you try to load the
>> file, while Jonathan Cooper's CellML validator just sits at 100% of a
>> single CPU for a long time (at least 15 minutes on my system, and still
>> running at the time of writing, but the time will obviously depend on
>> system speed).
>>
>> There are some possible ways to improve on this:
>> A) There are ways to generate ASN.1 schemata from the XML schemata. This
>> could be used to produce an efficient (in terms of both data size and
>> parse time) binary representation of CellML, with the possibility to
>> convert back to CellML. For example, Fast Infoset, and the similar (but
>> non-ASN.1-based) BiM.
>> B) A database-based representation of a large CellML model could be
>> used, either through an XML-enabled database, or more likely, some
>> mapping layer. This would allow the model to be loaded into the database
>> once, and the relevant parts retrieved from the on-disk database as
>> required, in an algorithmically sensible way. It is worth noting that my
>> model is generated using data from a relational database (a process
>> which takes up to a minute), but I would like the next step of my
>> pipeline to generalise to other CellML inputs.
>> C) Another, leaner, read-only CellML API (perhaps based on the same
>> IDLs, but with certain functionality, like the ability to modify the
>> model or set mutation event listeners, unavailable). We could add a
>> SAX-style event dispatcher instead, to allow users to save any
>> information they do want from extension elements, which would
>> otherwise not be kept in the model. Comments, white-space, and so on
>> would all be stripped, unlike in the current CellML API
>> implementation. Tools which currently use the full CellML API but only
>> require read-only access (e.g. the CCGS) might be able to just 'flick
>> the switch' and benefit from the leaner API.
>>
>> I would welcome any opinions, comments, suggestions, or collaborations
>> on this.
>>
>
> Perhaps this is where a Haskell API might be very useful.
>
While this would be useful for anyone developing CellML tools in
Haskell, I'm not sure that this would really help with the performance
issues, given that similar techniques are already used, with some
explicit optimisation of the algorithms.

Best regards,
Andrew




