CellML Discussion List



[cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?


  • From: matt.halstead at auckland.ac.nz (Matt)
  • Subject: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?
  • Date: Thu, 26 Apr 2007 13:37:17 +1200

So the main issues you are describing so far are:

- the time taken to deserialise a model into a programming environment
- the time taken to perform operations over the model, especially when
changes can happen to the model while those operations are being
performed

My assumption is that you want to maintain the current behaviour of
the CellML API, which assumes that changes can happen at any time, but
does not assume (though it does advertise) that a model has been
completely loaded when we are dealing with imports.

A couple of questions I have:

1) How important is it to you to have speedy deserialisation? For
example, if the model were stored in a database, then you might hardly
ever need to serialise and deserialise the CellML/XML form.

2) How important is it that the model can be updated at any time? For
example, if the model were write-locked during these time-expensive
operations, or during a call to a generator, would that be too
restrictive?
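
As a rough sketch of what I mean by a write-lock (purely illustrative;
none of these names are part of the CellML API), Python's threading
primitives would be enough:

    import threading

    class ModelStore:
        # Hypothetical model wrapper guarded by a single write lock.
        def __init__(self):
            self._lock = threading.Lock()
            self._components = {}

        def add_component(self, name, definition):
            # Writers block while a long operation holds the lock.
            with self._lock:
                self._components[name] = definition

        def run_expensive_operation(self, operation):
            # Freeze the model for the duration of the operation.
            with self._lock:
                return operation(self._components)

A call to a generator would then go through run_expensive_operation(),
and any attempted update would simply block until it finished.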

My own inclination is that a database backend is the natural next step
if you want this to scale. I presume you want your upper maximum in
tests to be in the gigabyte range, with an obvious path towards
terabytes?

A database layer would make a nice addition to the implementation of
the API as an optional storage backend.
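
To make this concrete, here is a minimal sketch of the kind of mapping
layer I have in mind, using SQLite purely for illustration (the tables
and columns are invented, not a proposal for an actual schema):

    import sqlite3

    conn = sqlite3.connect("model.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS component (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS variable (
            id           INTEGER PRIMARY KEY,
            component_id INTEGER NOT NULL REFERENCES component(id),
            name         TEXT NOT NULL,
            units        TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS variable_by_component
            ON variable(component_id);
    """)

    # Only the rows an algorithm actually touches are read from disk:
    query = ("SELECT v.name, v.units FROM variable v "
             "JOIN component c ON v.component_id = c.id "
             "WHERE c.name = ?")
    for name, units in conn.execute(query, ("membrane",)):
        print(name, units)

The point is that the database then does the indexing and paging for
us, rather than the API holding a multi-gigabyte DOM in memory.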

Addressing write-locks would become a more general problem to resolve.

cheers
Matt


On 4/26/07, Andrew Miller <ak.miller at auckland.ac.nz> wrote:
> Randall Owen wrote:
> > In message <462D5FD0.4080701 at auckland.ac.nz> CellML Discussion List
> > <cellml-discussion at cellml.org> writes:
> >
> >> Hi,
> >>
> >> I am working on developing a CellML model (using external code) of
> >> transcriptional control in yeast which is 23 MB in size. I hope to
> >> eventually do a similar thing for organisms which have much more
> >> complicated sets of interactions, in which case this size may grow
> >> substantially.
> >>
> >> If anyone on this list is interested in similar problems (I presume
> >> similar issues come up in a range of systems biology problems, whether
> >> you are working with CellML or SBML), I would welcome your feedback and
> >> suggestions, and perhaps we could collaborate.
> >>
> >> This creates some unique issues for CellML processing tools:
> >> 1) Just parsing the CellML model (especially with a DOM-type parser
> >> which stores all the nodes into a tree, but probably with any type of
> >> parser) is very slow.
> >> 2) The CellML model might not all fit in memory at the same time,
> >> especially if the model gets to be multi-gigabyte. It might be possible
> >> to make use of swap to deal with this, but if the algorithms don't have
> >> explicit control over when things are swapped in and out, it will be
> >> hard to work with such a model.
> >> 3) The CellML model is much larger than it needs to be, which makes it
> >> inconvenient to exchange with third parties.
> >> 4) The current CellML API implementation has been designed for maximum
> >> flexibility ('one size fits all'), but this flexibility (e.g. supporting
> >> live iterators, access to arbitrary extension elements, and so on) is
> >> expensive for very large models. Much of this expensive functionality is
> >> probably unnecessary for most tools, although what is and is not
> >> necessary depends on the tool being used.
> >>
> >
> > This might be an area in which lazy evaluation and
> > parametrically-polymorphic programming might offer some advantages. In
> > particular, with lazy evaluation, data structures are constructed and
> > arguments evaluated only when needed.
> The code currently uses top-down dynamic programming, which is
> essentially the same thing here (i.e. it computes things like iterators
> exactly as they are needed, but it caches any data it can until the
> cache is invalidated by a change to a global serial number).
> > In fact, demand-driven programming allows finite results to be obtained
> > from infinite data structures. This might alleviate having to store the
> > entire model in memory at one time. In addition, in many cases
> > intermediate data structures, though implied in the program, may never
> > actually have to exist.
> >
> The way that these things currently work internally is as follows:
> 1) The entire XML file is parsed into our internal DOM representation.
> This step happens synchronously, all in one go, at load time. To improve
> on this, we would have to keep referring back to the XML file (which
> could be a network resource) every time we required more data from the
> model. We could cache the file to avoid parsing it, but XML is not a
> very good format for random access, so I suspect this would cause worse
> performance problems, unless we had a binary format optimised for this.
>
> 2) A single top-level object for the model is created. This object is
> created in constant time when the model is loaded.
>
> Using this top-level object, it is then possible to iterate over the
> contents of the model. At this point, CellML element objects are
> created, and a mapping is set up between them and their
> corresponding DOM element objects. This mapping is permanent in the
> sense that it persists even if the model is changed (however, the
> mapping is removed when the CellML element is deleted).
>
> 3) Additional objects are created and cached as required, but this cache
> is subject to expiry if the latest serial number (bumped up each time a
> change is made) doesn't match the serial number on the cache.
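>
> A minimal sketch of that serial-number scheme (the names are invented
> for illustration, not the actual internals):
>
>     class SerialCache:
>         def __init__(self):
>             self.serial = 0            # bumped on every model change
>             self._cache = {}           # key -> (serial, value)
>
>         def bump(self):
>             self.serial += 1           # implicitly expires every entry
>
>         def get(self, key, compute):
>             entry = self._cache.get(key)
>             if entry is not None and entry[0] == self.serial:
>                 return entry[1]        # cache entry still valid
>             value = compute()          # recompute on demand
>             self._cache[key] = (self.serial, value)
>             return value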
>
> I think any approach to do better has to be based upon bottom-up dynamic
> programming. This means that we need to know something about what
> information we will need before we use it (which is not so great for a
> general API, which is why we need a second, slimmed down version).
> Things we can do:
> a) Not storing information which is not needed for a particular tool
> (we would need to specify somehow what information is needed, or
> perhaps more simply, just target the slimmed-down API at certain
> classes of tools, and exclude e.g. editors, which need all information
> so they can re-serialise the model without loss).
> b) Performing some calculations early, with some view of what
> information is required. This allows us, for example, to throw away
> any data we know we won't need.
>
> The current API has been built assuming that the model can be changed
> at any point and that any changes need to be re-serialised out, so it
> tries not to do any more work in terms of saving data than it needs to.
>
> Examples of major issues:
> => Not much indexing is feasible with the current API (it does index by
> names, but doesn't build any more complex indices, because in tools
> which keep changing the model, this would be too slow). Building indices
> is expensive, but it can give significant gains in the long run if we
> know that we will make good use of the index. The slimmed down API could
> probably make this assumption, while the normal, read-write API cannot.
> => Elements can appear in any order in the CellML file, and there are
> no separate container elements for different types of elements. This
> means that we often have to do a lot of linear-time searches through a
> long list of elements (the API does build tree-based indices for looking
> up names, but it is not feasible to do much more if the model is prone
> to constant changes and iterators must remain live). If we can assume
> there are no changes, we could build an index of elements, and look them
> up efficiently.
> => The graph of connected variables is very expensive to deal with if
> you can't assume that it is worthwhile to build a data structure for
> it. For example, if you are iterating over all variables which are
> connected to a variable, it becomes necessary to take into account the
> possibility that a new variable will be added or a variable will be
> removed while the iterator is still being used, which makes the
> algorithm very expensive. On the other hand, if you can assume no
> changes, you can just build a disjoint set data-structure starting with
> each variable in its own set, then process the connections, merging sets
> as you go (i.e. linear time to process all variables in the model).
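>
> As a sketch, that disjoint-set pass could look like this (a
> hypothetical helper, assuming the connections have already been
> flattened into pairs of variable identifiers):
>
>     def connected_variable_sets(connection_pairs):
>         parent = {}
>
>         def find(v):                   # find the root, halving the path
>             parent.setdefault(v, v)
>             while parent[v] != v:
>                 parent[v] = parent[parent[v]]
>                 v = parent[v]
>             return v
>
>         for a, b in connection_pairs:  # merge the two variables' sets
>             root_a, root_b = find(a), find(b)
>             if root_a != root_b:
>                 parent[root_a] = root_b
>
>         groups = {}
>         for v in parent:               # collect each set by its root
>             groups.setdefault(find(v), []).append(v)
>         return list(groups.values())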
>
> >
> >
> >> In practice, nearly all existing CellML-specific tools handle the file
> >> badly. For example, PCEnv runs out of memory if you try to load the
> >> file, while Jonathan Cooper's CellML validator just sits at 100% of a
> >> single CPU for a long time (at least 15 minutes on my system, and still
> >> running at the time of writing, but the time will obviously depend on
> >> system speed).
> >>
> >> There are some possible ways to improve on this:
> >> A) There are ways to generate ASN.1 schemata from the XML schemata. This
> >> could be used to produce an efficient (in terms of both data size and
> >> parse time) binary representation of CellML, with the possibility to
> >> convert back to CellML. Fast Infoset, and the similar (but
> >> non-ASN.1-based) BiM, are examples of this approach.
> >> B) A database-based representation of a large CellML model could be
> >> used, either through an XML-enabled database, or more likely, some
> >> mapping layer. This would allow the model to be loaded into the database
> >> once, and the relevant parts retrieved from the on-disk database as
> >> required, in an algorithmically sensible way. It is worth noting that my
> >> model is generated using data from a relational database (a process
> >> which takes up to a minute), but I would like the next step of my
> >> pipeline to generalise to other CellML inputs.
> >> C) A leaner, read-only CellML API (perhaps based on the same IDLs,
> >> but with certain functionality, like the ability to modify the
> >> model or to set mutation event listeners, unavailable). We could add
> >> a SAX-style event dispatcher instead, to allow users to save any
> >> information they do want from extension elements, which would
> >> otherwise not be kept in the model. Comments, white-space, and so on
> >> would all be stripped, unlike in the current CellML API
> >> implementation. Tools which currently use the full CellML API but
> >> only require read-only access (e.g. the CCGS) might be able to just
> >> 'flick the switch' and benefit from the leaner API (a rough sketch
> >> of the dispatcher idea follows below).
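> >>
> >> Here is that sketch, using Python's xml.sax purely for illustration
> >> (the real API is IDL-based; the handler class and the 'myext:'
> >> prefix are invented):
> >>
> >>     import xml.sax
> >>
> >>     class ExtensionCollector(xml.sax.ContentHandler):
> >>         # Keep only the extension elements the tool cares about;
> >>         # everything else streams past and is discarded.
> >>         def __init__(self, prefix):
> >>             super().__init__()
> >>             self.prefix = prefix
> >>             self.collected = []
> >>
> >>         def startElement(self, name, attrs):
> >>             if name.startswith(self.prefix):
> >>                 self.collected.append((name, dict(attrs)))
> >>
> >>     handler = ExtensionCollector("myext:")
> >>     xml.sax.parse("model.xml", handler)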
> >>
> >> I would welcome any opinions, comments, suggestions, or collaborations
> >> on this.
> >>
> >
> > Perhaps this is where a Haskell API might be very useful.
> >
> While this would be useful for anyone developing CellML tools in
> Haskell, I'm not sure that this would really help with the performance
> issues, given that similar techniques are already used, with some
> explicit optimisation of the algorithms.
>
> Best regards,
> Andrew
>
> _______________________________________________
> cellml-discussion mailing list
> cellml-discussion at cellml.org
> http://www.cellml.org/mailman/listinfo/cellml-discussion
>



