CellML Discussion List



[cellml-discussion] Auto-generate HDF5 from CellML?


  • From: jonovik at gmail.com (Jon Olav Vik)
  • Subject: [cellml-discussion] Auto-generate HDF5 from CellML?
  • Date: Sat, 15 Nov 2008 21:25:44 +0000 (UTC)

David Nickerson <david.nickerson at ...> writes:

> On Sat, Nov 15, 2008 at 3:55 PM, Jon Olav Vik <jonovik at ...> wrote:
[snip]
> > Yes, but there needs to be *some* way of identifying the information in
> > the HDF5 file, like "using parameter values as indexes". A purist solution
> > might be to have each simulation result annotated with the URI for that
> > particular parameter set and model. However, any analysis would then
> > require running back and forth between the CellML model (DOM API,
> > metadata, ...) and the huge output files (e.g. HDF5). Until the CellML
> > tools (DOM, code generation, ...) fit seamlessly into more mainstream
> > tools, I'd prefer not to lug around the CellML DOM API everywhere I take
> > my data. (No offense.)
>
> but doesn't this get back to the issue of needing all the information
> like units in the HDF5 data file also?

First of all, thanks for a very constructive answer.

> If you'd prefer to have an HDF5
> data file that can be unambiguously interpreted without reference to
> the source CellML models and/or simulations,

No, that's not what I meant. The *required* information in the HDF5 file
would be something like:
a) unambiguous identification of model
b) unambiguous specification of parameter values
c) unambiguous identification of output variables
For a), the URI to the model should be an ideal "canonical" reference (but
I'd still appreciate a human-readable, autogenerated, text-formatted
reference as annotation).
For b), the URI to a CellML simulation spec might be an ideal canonical
reference.
For c), I guess the HDF5 array names should simply be the variable names from
the CellML model.
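
As a rough illustration of the a)/b)/c) layout above, here is a minimal Python sketch. It is not an existing CellML tool: the function names, attribute names, URIs and values are all made up for illustration. The layout is first described as plain nested dicts, and only write_hdf5 touches HDF5, through the third-party h5py package, which is an assumed choice.

```python
def build_layout(model_uri, simulation_uri, citation, parameters, outputs):
    """Describe the file as nested dicts: annotations plus one array per variable."""
    return {
        "attrs": {
            "model_uri": model_uri,            # a) canonical model reference
            "model_citation": citation,        # human-readable convenience copy
            "simulation_uri": simulation_uri,  # b) canonical parameter specification
        },
        # Redundant, convenient copy of the parameter values used.
        "parameters": dict(parameters),
        # c) arrays named after the CellML variable names.
        "outputs": {name: list(values) for name, values in outputs.items()},
    }


def write_hdf5(layout, path):
    """Write the layout with h5py (imported here so the rest runs without it)."""
    import h5py

    with h5py.File(path, "w") as f:
        for key, value in layout["attrs"].items():
            f.attrs[key] = value
        params = f.create_group("parameters")
        for name, value in layout["parameters"].items():
            params.attrs[name] = value
        outputs = f.create_group("outputs")
        for name, values in layout["outputs"].items():
            outputs.create_dataset(name, data=values)


layout = build_layout(
    model_uri="http://example.org/models/some_model.cellml",  # placeholder URI
    simulation_uri="http://example.org/simulations/run_1",    # placeholder URI
    citation="Placeholder citation for the model",
    parameters={"g_Na": 13.0, "g_K": 0.5},                    # made-up values
    outputs={"time": [0.0, 0.1, 0.2], "V": [-82.4, -82.1, -60.3]},
)
```

Calling write_hdf5(layout, "results.h5") would then produce a file that a generic HDF5 browser can display, with the URIs and the citation visible as attributes on the root group.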

> then that is a whole lot
> more data that would need to be in the data file....

I do intend for the HDF5 file to be interpreted in conjunction with the
CellML model. However, things like a human-readable citation in addition to
the URI would be *convenient*, as would a copy of the parameter values used
in the simulation.

Regarding redundancy, it would be understood that the *official* parameter
values were those in the CellML simulation specification. Consistency could
be verified automatically (e.g. by a CRC checksum) whenever desired. It would
be an error to change those parameter values after that part of the HDF5
structure was begun.
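
The consistency check could look roughly like the following sketch, using only the Python standard library. Serializing the parameters as sorted-key JSON before taking the CRC is an assumption made here to get a reproducible byte string; the function names and parameter values are hypothetical.

```python
import json
import zlib


def parameter_checksum(parameters):
    """CRC-32 of a canonical (sorted-key) JSON serialization of the parameters."""
    canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    return zlib.crc32(canonical.encode("utf-8"))


def is_consistent(stored_parameters, stored_checksum, official_parameters):
    """True if the copy in the data file still matches the official values."""
    return (
        parameter_checksum(stored_parameters) == stored_checksum
        and stored_checksum == parameter_checksum(official_parameters)
    )


official = {"g_Na": 13.0, "g_K": 0.5}       # made-up values from the simulation spec
checksum = parameter_checksum(official)     # stored once, when the copy is written

untouched_copy = {"g_K": 0.5, "g_Na": 13.0}  # same values, different key order
edited_copy = {"g_Na": 13.5, "g_K": 0.5}     # someone changed a value afterwards

ok = is_consistent(untouched_copy, checksum, official)
broken = is_consistent(edited_copy, checksum, official)
```

Note that this compares textual representations, so 13 and 13.0 would count as different; a real tool would need to fix a number format first.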

Regarding storage space, the 180 kB required for e.g. the Bondarenko model
are negligible compared to the megabytes of output for even a modest
exploration of parameter combinations.

> if you want to interpret the data in the file without needing to go
> back and forth with the CellML models then I'd guess you probably want
> to add some tool-specific data to the HDF5 group that gets generated
> by the proposed tool/service...or not.

Yes, that's about what I had in mind. Similar in some ways to the various
tabs on the repository webpages of cellml.org, or the autogenerated code in
different programming languages: certainly redundant, but convenient.

> Maybe the below has convinced
> me that this could be done in a nice way...
>
> > I was thinking of this extra annotation as "write once, read many", just
> > labelling the boxes. There exist external tools for exploring HDF5 files,
> > http://www.hdfgroup.org/hdf-java-html/hdfview/
> > and these will be a lot less useful if the data structure doesn't indicate
> > which parameters a result is for. (That said, it might be useful to
> > verify the integrity of the link between model, parameters and output
> > e.g. by some kind of hashing.)
>
> This sounds more like you are after a complete translation of the
> source models and simulations into HDF5.

I don't know about "into" HDF5; that's just a vehicle for storing numbers and
has no concept of mathematical functions, etc. (...and I know that you know
this better than I do 8-) I'm just after human-readable annotation to aid in
navigation and exploration of the data. (Well, maybe some machine-readable
annotation too; units would be nice to give meaning to the numbers.)

> For a given model you'd have
> a list of all the "unique" variables in the model annotated with a
> string containing the full expansion of the variable's units into the
> set of base units, and the variable's value field - which would be a
> scalar for constant parameters and an array for dynamic variables. I
> guess you'd also want some kind of reference to the index field (i.e.,
> time). Not sure if you'd also want to keep track of all the actual
> variables in the model that are used for each of the unique variables
> in the simulation instantiation, but that could be done.
>
> In such a tool you'd still lose a lot of the annotation in the source
> CellML models. But I guess if you simply want an optimised data store
> the above should give you everything you need and if required in
> special cases you can also link back to the CellML models as there
> should still be some URI's stored somewhere in the HDF5 data file. Of
> course, if you want to do all this nice and quickly you'd likely
> ignore the units anyway if you know that all your simulations are in
> compatible or identical units so they can be left back in the CellML
> model and can be looked up if needed.

To me this sounds very promising. I'd be interested to hear what others think.
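
To make the "full expansion of the variable's units into the set of base units" concrete, here is a small sketch. It is an assumption-laden toy, not how any CellML tool does it: it handles only two derived SI units, ignores prefixes such as milli and scale factors, and the names DERIVED, BASE, expand and format_units are invented for the example.

```python
from collections import Counter

# The seven SI base units.
BASE = {"kilogram", "metre", "second", "ampere", "kelvin", "mole", "candela"}

# A few derived SI units as exponents of base units.
DERIVED = {
    "volt": {"kilogram": 1, "metre": 2, "second": -3, "ampere": -1},
    "siemens": {"kilogram": -1, "metre": -2, "second": 3, "ampere": 2},
}


def expand(units):
    """Map {unit name: exponent} to {base unit name: exponent}."""
    total = Counter()
    for name, exponent in units.items():
        if name in BASE:
            total[name] += exponent
        else:
            for base, base_exponent in DERIVED[name].items():
                total[base] += base_exponent * exponent
    # Drop base units whose exponents cancelled out.
    return {name: exp for name, exp in total.items() if exp != 0}


def format_units(expanded):
    """Render the expansion as a string suitable for an HDF5 attribute."""
    return " ".join(f"{name}^{exp}" for name, exp in sorted(expanded.items()))


# Conductance per area, siemens / metre^2.
conductance_density = format_units(expand({"siemens": 1, "metre": -2}))
```

The resulting string could be attached to each stored variable, so that two files can be checked for compatible units without opening the CellML model.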

> One consideration with such a solution is that I have found the HDF5
> packet table interface to be about the most efficient way to stream
> simulation data to a persistent store. I have one packet table per
> simulation and use the model variable URI's to set up a mapping into
> that packet table for each dynamic variable. So rather than using the
> variable field of dynamic variables for an array, it is probably more
> efficient to set it up as an index or something into the packet
> table....sounds like it should be workable :)

I must admit I do not yet know what a "packet table" is. I think it might be
very helpful if you could write up a toy example of how you currently use
HDF5 with CellML models.
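
For orientation only: as far as the HDF5 high-level library documents it, a packet table is an append-only table of fixed-size records. The pure-Python stand-in below mimics just that idea together with the URI-to-column mapping described in the quoted text. It does not use HDF5 at all and is not David's code; the class name, method names and URIs are hypothetical.

```python
class PacketTableSketch:
    """In-memory stand-in: one fixed-width record appended per time step."""

    def __init__(self, variable_uris):
        # Map each dynamic variable's URI to its column in the record.
        self.column_of = {uri: i for i, uri in enumerate(variable_uris)}
        self.records = []

    def append(self, values_by_uri):
        """Append one record; every mapped variable must have a value."""
        record = [None] * len(self.column_of)
        for uri, column in self.column_of.items():
            record[column] = values_by_uri[uri]
        self.records.append(tuple(record))

    def series(self, uri):
        """Return the full time series of one variable by its URI."""
        column = self.column_of[uri]
        return [record[column] for record in self.records]


# Placeholder URIs for two dynamic variables of some model.
TIME = "http://example.org/models/some_model.cellml#time"
V = "http://example.org/models/some_model.cellml#V"

table = PacketTableSketch([TIME, V])
for t, v in [(0.0, -82.4), (0.1, -82.1), (0.2, -60.3)]:  # made-up output
    table.append({TIME: t, V: v})

voltage = table.series(V)
```

In this picture the "value" stored for a dynamic variable is just its column index, and the numbers themselves live in the one table for the simulation.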

[As for myself and my short-term hacks, I'm currently leaning towards NumPy
ndarrays or recarrays (allowing reference to array "columns" by name) stored
in HDF5 via PyTables.
http://thread.gmane.org/gmane.comp.python.numeric.general/22250/ ]

Best regards,
Jon Olav





