CellML Discussion List

[cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?


  • From: ak.miller at auckland.ac.nz (Andrew Miller)
  • Subject: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?
  • Date: Tue, 24 Apr 2007 13:39:28 +1200

Hi,

I am working on developing a CellML model (generated using external
code) of transcriptional control in yeast; the model file is 23 MB. I
hope eventually to do the same for organisms which have much more
complicated sets of interactions, in which case this size may grow
substantially.

If anyone on this list is interested in similar problems (I presume
similar issues come up in a range of systems biology problems, whether
you are working with CellML or SBML), I would welcome your feedback and
suggestions, and perhaps we could collaborate.

A model of this size creates some difficult issues for CellML
processing tools:
1) Just parsing the CellML model (especially with a DOM-style parser
which stores all the nodes in a tree, but probably with any type of
parser) is very slow.
2) The CellML model might not all fit in memory at the same time,
especially if the model gets to be multi-gigabyte. It might be possible
to make use of swap to deal with this, but if the algorithms don't have
explicit control over when things are swapped in and out, it will be
hard to work with such a model. (A streaming parse, sketched just after
this list, at least keeps memory use bounded.)
3) The CellML model is much larger than it needs to be, which makes it
inconvenient to exchange with third parties.
4) The current CellML API implementation has been designed for maximum
flexibility ('one size fits all'), but this flexibility (supporting
live iterators, access to arbitrary extension elements, and so on) is
expensive for very large models. Much of this expensive functionality
is probably unnecessary for most tools, although exactly what is and is
not necessary depends on the tool.
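
To make the parsing and memory points concrete, here is a minimal
sketch of a streaming parse in Python (the file name is hypothetical,
and the namespace assumes a CellML 1.0 document). It counts components
and variables without ever holding the whole document tree in memory:

    import xml.etree.ElementTree as ET

    NS = "{http://www.cellml.org/cellml/1.0#}"  # assumes CellML 1.0

    def count_elements(path):
        counts = {"components": 0, "variables": 0}
        context = ET.iterparse(path, events=("start", "end"))
        _, root = next(context)  # the root <model> element
        for event, elem in context:
            if event != "end":
                continue
            if elem.tag == NS + "variable":
                counts["variables"] += 1
            elif elem.tag == NS + "component":
                counts["components"] += 1
                root.clear()  # drop finished components from memory
        return counts

    print(count_elements("yeast_transcription.xml"))

A DOM parse of the same file would build every node up front; this
version only ever holds one component's subtree at a time.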

In practice, nearly all existing CellML-specific tools handle the file
badly. For example, PCEnv runs out of memory if you try to load the
file, while Jonathan Cooper's CellML validator just sits at 100% of a
single CPU for a long time (at least 15 minutes on my system, and still
running at the time of writing, though the exact time will obviously
depend on system speed).

There are some possible ways to improve on this:
A) There are ways to generate ASN.1 schemata from XML schemata. These
could be used to produce an efficient (in terms of both data size and
parse time) binary representation of CellML, with the possibility of
converting back to standard CellML. Examples include Fast Infoset and
the similar (but not ASN.1-based) BiM.
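
To illustrate the principle (this is not the Fast Infoset format
itself, just a rough Python sketch of the idea; the file name is
hypothetical, and attributes and text are ignored for brevity): most
of the size win comes from replacing repeated element names with small
integer indices into a vocabulary table.

    import struct
    import xml.etree.ElementTree as ET

    def encode(elem, names, out):
        # First occurrence of a tag writes the string; later
        # occurrences write only a two-byte index into the table.
        if elem.tag not in names:
            names[elem.tag] = len(names)
            data = elem.tag.encode("utf-8")
            out.append(struct.pack(">BH", 0, len(data)) + data)
        out.append(struct.pack(">BH", 1, names[elem.tag]))
        for child in elem:
            encode(child, names, out)
        out.append(struct.pack(">B", 2))  # end-of-element marker

    tree = ET.parse("yeast_transcription.xml")
    out, names = [], {}
    encode(tree.getroot(), names, out)
    print(len(names), "distinct names,", len(b"".join(out)),
          "bytes of element structure")

As I understand it, a real Fast Infoset encoder does the same interning
for attribute names and values as well, and decodes back to an
equivalent XML infoset.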
B) A database-based representation of a large CellML model could be
used, either through an XML-enabled database, or more likely, some
mapping layer. This would allow the model to be loaded into the database
once, and the relevant parts retrieved from the on-disk database as
required, in an algorithmically sensible way. It is worth noting that my
model is generated using data from a relational database (a process
which takes up to a minute), but I would like the next step of my
pipeline to generalise to other CellML inputs.
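
As a sketch of what that mapping layer might look like (the schema,
table names, and file names here are mine and purely illustrative;
Python and SQLite stand in for whatever database is actually used),
one streaming pass loads components and variables into tables, and
later passes query only what they need:

    import sqlite3
    import xml.etree.ElementTree as ET

    NS = "{http://www.cellml.org/cellml/1.0#}"  # assumes CellML 1.0

    db = sqlite3.connect("model.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS component (id INTEGER PRIMARY KEY,
                                              name TEXT);
        CREATE TABLE IF NOT EXISTS variable  (component_id INTEGER,
                                              name TEXT, units TEXT);
    """)

    # One streaming pass loads the model into the database...
    for _, elem in ET.iterparse("yeast_transcription.xml",
                                events=("end",)):
        if elem.tag == NS + "component":
            cid = db.execute("INSERT INTO component (name) VALUES (?)",
                             (elem.get("name"),)).lastrowid
            db.executemany("INSERT INTO variable VALUES (?, ?, ?)",
                           [(cid, v.get("name"), v.get("units"))
                            for v in elem.iter(NS + "variable")])
            elem.clear()
    db.commit()

    # ...and later pipeline steps retrieve only the parts they need.
    for (name,) in db.execute(
            "SELECT name FROM variable WHERE units = 'second'"):
        print(name)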
C) A leaner, read-only CellML API (perhaps based on the same IDLs, but
with certain functionality, such as the ability to modify the model or
to set mutation event listeners, unavailable). Extension elements would
not be kept in the model; instead, a SAX-style event dispatcher could
let users capture whatever information they do want from them as the
file is parsed. Comments, white-space, and so on would all be stripped,
unlike in the current CellML API implementation. Tools which currently
use the full CellML API but only require read-only access (e.g. the
CCGS) might be able to just 'flick the switch' and benefit from the
leaner API.
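
To make the event-dispatch idea concrete, here is a sketch using
Python's xml.sax as a stand-in for whatever interface the leaner API
would actually expose (the file name is hypothetical). Extension
elements stream past, and nothing is retained unless the handler
chooses to keep it:

    import xml.sax
    import xml.sax.handler

    CELLML_NS = "http://www.cellml.org/cellml/1.0#"
    MATHML_NS = "http://www.w3.org/1998/Math/MathML"

    class ReadOnlyHandler(xml.sax.ContentHandler):
        # Keeps only what this particular tool asks for; everything
        # else streams past and is never stored.
        def __init__(self):
            super().__init__()
            self.components = []
            self.extension_tags = set()

        def startElementNS(self, name, qname, attrs):
            ns, local = name
            if ns == CELLML_NS and local == "component":
                # Unprefixed attributes are keyed by (None, localname).
                self.components.append(attrs.getValue((None, "name")))
            elif ns not in (CELLML_NS, MATHML_NS):
                self.extension_tags.add((ns, local))

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    handler = ReadOnlyHandler()
    parser.setContentHandler(handler)
    parser.parse("yeast_transcription.xml")
    print(len(handler.components), "components;",
          len(handler.extension_tags), "distinct extension element types")

A tool like the CCGS would register handlers for exactly the elements
it needs and ignore the rest, which is where the saving comes from.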

I would welcome any opinions, comments, suggestions, or collaborations
on this.

Best regards,
Andrew




