From: Patrick Moran <pmoran@nas.nasa.gov>

Date: Tue Sep 2, 2003  4:29:46 PM US/Pacific

To: John Shalf <jshalf@lbl.gov>

Cc: diva@lbl.gov

Subject: Re: DiVA Survey (Please return by Sept 10!)

Reply-To: patrick.j.moran@nasa.gov

 

On Wednesday 27 August 2003 03:33 pm, you wrote:

John,

 

Here's my couple of cents' worth:

 

=============The Survey=========================

 

Please answer the attached survey with as much or as little verbosity

as you please and return it to me by September 10.  The survey has 3

mandatory sections and 4 voluntary (bonus) sections.  The sections are

as follows:

Mandatory:

         1) Data Structures

         2) Execution Model

         3) Parallelism and Load-Balancing

Voluntary:

         4) Graphics and Rendering

         5) Presentation

         6) Basic Deployment and Development Environment Issues

         7) Collaboration

We will spend this workshop focusing on the first 3 sections, but I

think we will derive some useful/motivating information from any

answers to questions in the voluntary sections.

 

I'll post my answers to this survey on the diva mailing list very soon.

You can post your answers publicly if you want to, but I am happy to

regurgitate your answers as "anonymous contributors" if it will enable

you to be more candid in your evaluation of available technologies.

 

1) Data Structures/Representations/Management==================

The center of every successful modular visualization architecture has

been a flexible core set of data structures for representing data that

is important to the targeted application domain.  Before we can begin

working on algorithms, we must come to some agreement on common methods

(either data structures or accessors/method  calls) for exchanging data

between components of our vis framework.

 

There are two potentially disparate motivations for defining the data

representation requirements.  In the coarse-grained case, we need to

define standards for exchanging data between components in this

framework (interoperability).  In the fine-grained case, we want to

define some canonical data structures that can be used within a

component -- one developed specifically for this framework.  These two

use-cases may drive different sets of requirements and implementation

issues.

         * Do you feel both of these use cases are equally important or should

we focus exclusively on one or the other?

 

I think both cases are important, but agreeing upon the fine-grained access

will be harder.

 

         * Do you feel the requirements for each of these use-cases are aligned

or will they involve two separate development tracks?  For instance,

using "accessors" (method calls that provide abstract access to

essentially opaque data structures) will likely work fine for the

coarse-grained data exchanges between components, but will lead to

inefficiencies if used to implement algorithms within a particular

component.

         * As you answer the "implementation and requirements" questions below,

please try to identify where coarse-grained and fine-grained use cases

will affect the implementation requirements.

 

I think the focus should be on interfaces rather than data structures.  I

would advocate this approach not just because it's the standard

"object-oriented" way, but because it's the one we followed with FEL,

and now FM, and it has been a big win for us.  It's a significant benefit

not having to maintain different versions of the same visualization

technique, each dedicated to a different method for producing the

data (i.e., different data structures).  So, for example, we use the same

visualization code in both the in-core and out-of-core cases.  Assuming up front that an interface-based approach would be too slow is, in my humble opinion, classic premature optimization.
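
To make the interfaces-versus-data-structures point concrete, here is a minimal sketch of the kind of accessor interface I have in mind (the names are hypothetical, not the actual FM API): the visualization code is written once against the interface, and the in-core and out-of-core cases simply provide different implementations behind it.

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical accessor interface -- a sketch only, not the FM API.
    class ScalarField {
    public:
        virtual ~ScalarField() {}
        virtual std::size_t numNodes() const = 0;
        virtual double valueAt(std::size_t node) const = 0;
    };

    // In-core implementation: values live in a plain array.
    class InCoreField : public ScalarField {
    public:
        explicit InCoreField(const std::vector<double>& values) : values_(values) {}
        std::size_t numNodes() const { return values_.size(); }
        double valueAt(std::size_t node) const { return values_[node]; }
    private:
        std::vector<double> values_;
    };

    // Out-of-core implementation: values fetched from a file on demand.
    // (A real version would read and cache whole blocks.)
    class OutOfCoreField : public ScalarField {
    public:
        OutOfCoreField(const std::string& path, std::size_t n)
            : file_(path.c_str(), std::ios::binary), numNodes_(n) {}
        std::size_t numNodes() const { return numNodes_; }
        double valueAt(std::size_t node) const {
            double v = 0.0;
            file_.seekg(static_cast<std::streamoff>(node * sizeof(double)));
            file_.read(reinterpret_cast<char*>(&v), sizeof(double));
            return v;
        }
    private:
        mutable std::ifstream file_;
        std::size_t numNodes_;
    };

    // One copy of the visualization code serves both cases.
    double fieldMaximum(const ScalarField& f) {
        if (f.numNodes() == 0) return 0.0;
        double best = f.valueAt(0);
        for (std::size_t i = 1; i < f.numNodes(); ++i) {
            double v = f.valueAt(i);
            if (v > best) best = v;
        }
        return best;
    }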

 

What are the requirements for the data representations that must be supported by a common infrastructure?  We will start by answering Pat's questions about representation requirements and follow up with personal experiences involving particular domain scientists' requirements.

         Must: support for structured data

 

Structured data support is a must.

 

         Must/Want: support for multi-block data?

 

Multi-block is a must.

 

         Must/Want: support for various unstructured data representations?

(which ones?)

 

We have unstructured data, mostly based on tetrahedral or prismatic meshes.

We need support for at least those types.  I do not think we could simply

graft unstructured data support on top of our structured data structures.

 

         Must/Want: support for adaptive grid standards?  Please be specific

about which adaptive grid methods you are referring to.  Restricted

block-structured AMR (aligned grids), general block-structured AMR

(rotated grids), hierarchical unstructured AMR, or non-hierarchical

adaptive structured/unstructured meshes.

 

Adaptive grid support is a "want" for us currently, probably eventually

a "must".  The local favorite is CART3D, which consists of hierarchical

regular grids.  The messy part is that CART3D also supports having

more-or-less arbitrary shapes in the domain, e.g., an aircraft fuselage.

I expect handling the shape description and all the "cut cell" intersections will be a pain.

 

         Must/Want: "vertex-centered" data, "cell-centered" data?

other-centered?

 

Most of the data we see is still vertex-centered.  FM supports other

associations, but we haven't used them much so far.

 

         Must: support time-varying data, sequenced, streamed data?

 

Support for time-varying data is a must.

 

         Must/Want: higher-order elements?

 

Occasionally people ask about it, but we haven't found it to be a "must".

 

         Must/Want: Expression of material interface boundaries and other special treatment of boundary conditions?

 

We don't see this so much.  "Want", but not must.

 

         * For commonly understood datatypes like structured and unstructured,

please focus on any features that are commonly overlooked in typical

implementations.  For example, often data-centering is overlooked in

structured data representations in vis systems and FEM researchers

commonly criticize vis people for co-mingling geometry with topology

for unstructured grid representations.  Few data structures provide

proper treatment of boundary conditions or material interfaces.  Please

describe your personal experience on these matters.

 

One thing left out of the items above is support for some sort of "blanking"

mechanism, i.e., a means to indicate that the data at some nodes are not

valid.  That's a must for us.  For instance, with Earth science data we see

the use of some special value to indicate "no data" locations.
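
As a rough sketch of how blanking could fit into an accessor-style interface (hypothetical names again, not the FM API): an explicit validity query alongside the value accessor, with a documented sentinel value as an alternative convention.

    #include <cstddef>

    // Sketch of a blanking-aware accessor (hypothetical names).
    class BlankableField {
    public:
        virtual ~BlankableField() {}
        virtual std::size_t numNodes() const = 0;
        virtual double valueAt(std::size_t node) const = 0;
        // Explicit validity query; an alternative convention is a documented
        // sentinel value, like the "no data" values common in Earth science files.
        virtual bool isValidAt(std::size_t node) const = 0;
    };

    // Example use: skip blanked nodes when computing a mean.
    double meanOfValidNodes(const BlankableField& f) {
        double sum = 0.0;
        std::size_t count = 0;
        for (std::size_t i = 0; i < f.numNodes(); ++i) {
            if (!f.isValidAt(i)) continue;   // ignore "no data" locations
            sum += f.valueAt(i);
            ++count;
        }
        return count ? sum / count : 0.0;
    }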

 

         * Please describe data representation requirements for novel data

representations such as bioinformatics and terrestrial sensor datasets.

  In particular, how should we handle more abstract data that is

typically given the moniker "information visualization".

 

"Field Model" draws the line only trying to represent fields and the meshes

that the fields are based on.  I not really familiar enough with other types

of data to know what interfaces/data-structures would be best.  We haven't

see a lot of demand for those types of data as of yet.  A low-priority "want".

 

What do you consider the most elegant/comprehensive implementation for

data representations that you believe could form the basis for a

comprehensive visualization framework?

         * For instance, AVS uses entirely different data structures for structured, unstructured and geometry data.  VTK uses class inheritance to express the similarities between related structures.  Ensight treats unstructured data and geometry nearly interchangeably.  OpenDX uses more

vector-bundle-like constructs to provide a more unified view of

disparate data structures.  FM uses data-accessors (essentially keeping

the data structures opaque).

 

Well, as you'd expect, as the primary author of Field Model (FM) I think it's

the most elegant/comprehensive of the lot.  It handles structured and

unstructured data.  It handles non-vertex-centered data.  I think it

should be able to handle adaptive data, though it hasn't actually been

put to the test yet.  And of course every adaptive mesh scheme is a little

different.  I think it could handle boundary condition needs, though that's

not something we see much of.

 

         * Are there any of the requirements above that are not covered by the

structure you propose?

 

Out-of-core?  Derived fields? Analytic meshes (e.g., regular meshes)?

Differential operators?  Interpolation methods?

 

         * This should focus on the elegance/usefulness of the core

design-pattern employed by the implementation rather than a

point-by-point description of the implementation!

 

I think if we could reasonably cover the (preliminary) requirements above,

that would be a good first step.  I agree with Randy that whatever we

come up with will have to be able to "adapt" over time as our understanding

moves forward.

 

         * Is there information or characteristics of particular file format

standards that must percolate up into the specific implementation of

the in-memory data structures?

 

In FM we tried hard to keep file-format-specific stuff out of the core model.

Instead, there are additional modules built on top of FM that handle

the file-format-specific stuff, like I/O and derived fields specific to

a particular format.  Currently we have PLOT3D, FITS, and HDFEOS4

modules that are pretty well filled out, and other modules that are

mostly skeletons at this point.

 

We should also be careful not to assume that analyzing the data starts

with "read the data from a file into memory, ...".  Don't forget out-of-core,

analysis concurrent with simulation, among others.

 

One area where the file-format-specific issues creep in is with metadata.

Most file formats have some sort of metadata storage support, some much

more elaborate than others.  Applications need to get at this metadata,

possibly through the data model, possibly some other way.  I don't have

the answer here, but it's something to keep in mind.
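
One possible shape for this, purely as a sketch and not a committed design, would be a simple string-keyed metadata query that the format-specific modules populate from whatever the file provides.

    #include <map>
    #include <string>

    // Sketch of a generic metadata query interface (hypothetical, not FM API).
    class MetadataSource {
    public:
        virtual ~MetadataSource() {}
        // Returns true and fills 'value' if the named attribute exists.
        virtual bool getAttribute(const std::string& key, std::string* value) const = 0;
    };

    // A format-specific module (PLOT3D, FITS, HDFEOS4, ...) would populate a
    // table like this from whatever metadata conventions the format uses.
    class SimpleMetadata : public MetadataSource {
    public:
        void set(const std::string& key, const std::string& value) {
            table_[key] = value;
        }
        bool getAttribute(const std::string& key, std::string* value) const {
            std::map<std::string, std::string>::const_iterator it = table_.find(key);
            if (it == table_.end()) return false;
            *value = it->second;
            return true;
        }
    private:
        std::map<std::string, std::string> table_;
    };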

 

For the purpose of this survey, "data analysis" is defined broadly as

all non-visual data processing done *after* the simulation code has

finished and *before* "visual analysis".

         * Is there a clear dividing line between "data analysis" and "visual

analysis" requirements?

 

 

Your definition excludes concurrent analysis and steering from

"visualization".  Is this intentional?  I don't think there's a clear dividing

line here.

 

         * Can we (should we) incorporate data analysis functionality into this

framework, or is it just focused on visual analysis?

 

I think you would also want to include feature detection techniques.  For

large data analysis in particular, we don't want to assume that the scientist

will want to do the analysis by visually scanning through all the data.

 

         * What kinds of data analysis typically need to be done in your

field?  Please give examples and how these functions are currently

implemented.

 

Around here there is interest in vector-field topology feature detection

techniques, for instance, vortex-core detection.

 

         * How do we incorporate powerful data analysis functionality into the

framework?

 

 

Carefully :-)?  By striving not to make a closed system.

 

 

2) Execution Model=======================

It will be necessary for us to agree on a common execution semantics

for our components.  Otherwise, we might have compatible data structures but incompatible execution requirements.  Execution semantics is akin to the role a protocol plays in the context of network serialization of data structures.  The motivating questions are as follows:

         * How is the execution model affected by the kinds of

algorithms/system-behaviors we want to implement?

 

In general I see choices where at one end of the spectrum we have

simple analysis techniques where most of the control responsibilities

are handled from the outside.  At the other end we could have more

elaborate techniques that may handle load balancing, memory

management, thread management, and so on.  Techniques towards

the latter end of the spectrum will inevitably be intertwined more

with the execution model.

 

         * How then will a given execution model affect data structure

implementations?

 

Well, there are always thread-safety issues.

 

         * How will the execution model be translated into execution semantics

on the component level?  For example, will we need to implement special control-ports on our components to implement particular execution models, or will the semantics be implicit in the way we structure the method calls between components?

 

Not sure.

 

What kinds of execution models should be supported by the distributed

visualization architecture?

         * View dependent algorithms? (These were typically quite difficult to

implement for dataflow visualization environments like AVS5).

 

Not used heavily here, but would be interesting.  A "want".

 

         * Out-of-core algorithms

 

A "must" for us.

 

         * Progressive update and hierarchical/multiresolution algorithms?

 

A "want".

 

         * Procedural execution from a single thread of control (i.e., using a commandline language like IDL to interactively control a dynamic or large parallel back-end)

 

Scripting support is a "must".

 

         * Dataflow execution models?  What is the firing method that should be

employed for a dataflow pipeline?  Do you need a central executive like

AVS/OpenDX, or a completely distributed firing mechanism like that of

VTK, or some sort of abstraction that allows the modules to be used

with either executive paradigm?

 

Preferably a design that does not lock us in to one execution model.

 

         * Support for novel data layouts like space-filling curves?

 

Not a pressing need here, as of yet.

 

         * Are there special considerations for collaborative applications?

         * What else?

 

Distributed control?  Fault tolerance?

 

How will the execution model affect our implementation of data

structures?

         * how do you decompose a data structure such that it is amenable to

streaming in small chunks?

 

Are we assuming streaming is a requirement?

 

How do you handle visualization algorithms where the access patterns

are not known a priori?  The predominant example: streamlines and streaklines.

Note the access patterns can be in both space and time.  How do you avoid

having each analysis technique need to know about each possible data

structure in order to negotiate a streaming protocol?  How do you add another

data structure in the future without having to go through all the analysis

techniques and put another case in their streaming negotiation code?

 

In FM the fine-grained data access ("accessors") is via a standard

interface.  The evaluation is all lazy.  This design means more

function calls, but it frees the analysis techniques from having to know

access patterns a priori and negotiate with the data objects.  In FM

the data access methods are virtual functions.  We find the overhead

not to be a problem, even with relatively large data.  In fact, the overhead

is less of an issue with large data because the data are less likely to be

served up from a big array buffer in memory (think out-of-core, remote

out-of-core, time series, analytic meshes, derived fields, differential-

operator fields, transformed objects, etc., etc.).
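
A rough sketch of the flavor of this (hypothetical names, not the actual FM interface): data access goes through a virtual evaluate call that takes a position and a time, and wrapper fields such as derived quantities evaluate lazily by calling through to whatever field they wrap.

    // Sketch of lazy, interface-based field evaluation (hypothetical names).
    struct Point3 { double x, y, z; };

    class Field {
    public:
        virtual ~Field() {}
        // One signature for static and time-varying data; a static field
        // simply ignores the time argument.
        virtual double evaluate(const Point3& p, double t) const = 0;
    };

    // A derived field: computed lazily at evaluate time, so it never needs
    // to know how the input data are stored (in-core, out-of-core, remote,
    // analytic, ...).
    class ScaledField : public Field {
    public:
        ScaledField(const Field& input, double scale)
            : input_(input), scale_(scale) {}
        double evaluate(const Point3& p, double t) const {
            return scale_ * input_.evaluate(p, t);
        }
    private:
        const Field& input_;
        double scale_;
    };

    // An analysis technique written against Field needs no streaming
    // negotiation: it just asks for values where and when it needs them.
    double averageAlongLine(const Field& f, Point3 a, Point3 b, int n, double t) {
        if (n <= 0) return 0.0;
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            double s = (n > 1) ? double(i) / (n - 1) : 0.0;
            Point3 p = { a.x + s * (b.x - a.x),
                         a.y + s * (b.y - a.y),
                         a.z + s * (b.z - a.z) };
            sum += f.evaluate(p, t);
        }
        return sum / n;
    }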

 

The same access-through-an-interface approach could be done without

virtual functions, in order to squeeze out a little more performance, though

I'm not convinced it would be worth it.  To start with you'd probably end up

doing a lot more C++ templating.  Eliminating the virtual functions would

make it harder to compose things at run-time, though you might be able

to employ run-time compilation techniques a la SCIRun 2.
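
For contrast, here is roughly what the non-virtual variant might look like (again just a sketch, building directly on the previous one): the input field becomes a template parameter, which removes the virtual dispatch but fixes the composition at compile time.

    // Sketch of the non-virtual alternative (reuses Point3 from the sketch
    // above).  No virtual dispatch, but the pipeline is composed at compile
    // time rather than run time.
    template <class FieldT>
    class ScaledFieldT {
    public:
        ScaledFieldT(const FieldT& input, double scale)
            : input_(input), scale_(scale) {}
        double evaluate(const Point3& p, double t) const {
            return scale_ * input_.evaluate(p, t);
        }
    private:
        const FieldT& input_;
        double scale_;
    };

    // Analysis code must be templated on the field type as well, so swapping
    // in a new field type means recompiling rather than recomposing at run time.
    template <class FieldT>
    double sampleAt(const FieldT& f, const Point3& p, double t) {
        return f.evaluate(p, t);
    }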

 

         * how do you represent temporal dependencies in that model?

 

In FM, data access arguments include a time value; the field interface is the same for both static and time-varying data.

 

         * how do you minimize recomputation in order to regenerate data for

view-dependent algorithms.

 

Caching?  I don't have a lot of experience with view-dependent algorithms.

 

What are the execution semantics necessary to implement these execution

models?

         * how does a component know when to compute new data? (what is the

firing rule)

         * does coordination of the component execution require a central executive, or can it be implemented using only rules that are local to a particular component?

         * how elegantly can execution models be supported by the proposed

execution semantics?  Are there some things, like loops or

back-propagation of information, that are difficult to implement using a

particular execution semantics?

 

The execution models we have used have kept the control model in

each analysis technique pretty simple, relying on an external executive.

The one big exception is with multi-threading.  We've experimented with

more elaborate parallelism and load-balancing techniques, motivated in

part by a desire for latency hiding.

 

How will security considerations affect the execution model?

 

More libraries to link to?  More latency in network communication?

 

 

3) Parallelism and load-balancing=================

Thus far, managing parallelism in visualization systems has been tedious and difficult at best.  Part of this is due to a lack of powerful

abstractions for managing data-parallelism, load-balancing and

component control.

 

Please describe the kinds of parallel execution models that must be

supported by a visualization component architecture.

         * data-parallel/dataflow pipelines?

         * master/slave work-queues?

         * streaming update for management of pipeline parallelism?

         * chunking mechanisms where the number of chunks may be different from

the number of CPUs employed to process those chunks?

 

We're pretty open here.  Mostly straightforward work-queues.

 

         * how should one manage parallelism for interactive scripting

languages that have a single thread of control?  (e.g., I'm using a

commandline language like IDL that interactively drives an arbitrarily

large set of parallel resources.  How can I make the parallel back-end

available to a single-threaded interactive thread of control?)

 

I've used Python to control multiple execution threads.  The (C++) data objects are thread-safe, and the minimal provisions for thread-safe objects in Python haven't been too much of a problem.

 

Please describe your vision of what kinds of software support /

programming design patterns are needed to better support parallelism

and load balancing.

 

         * What programming model should be employed to express parallelism?  (UPC, MPI, SMP/OpenMP, custom sockets?)

         * Can you give some examples of frameworks or design patterns that you

consider very promising for support of parallelism and load balancing.

(e.g., PNNL Global Arrays or Sandia's Zoltan)

                       http://www.cs.sandia.gov/Zoltan/

                       http://www.emsl.pnl.gov/docs/global/ga.html

         * Should we use novel software abstractions for expressing parallelism

or should the implementation of parallelism simply be an opaque

property of the component? (ie. should there be an abstract messaging

layer or not)

         * How does the NxM work fit in to all of this?  Is it sufficiently

differentiated from Zoltan's capabilities?

 

I don't have a strong opinion here.  I'm not familiar with Zoltan et al.

Our experience with parallelism tends to be more shared-memory than

distributed memory.

 

 

===============End of Mandatory Section (the rest is

voluntary)=============

 

 

Laziness prevails here, so I'll stop for now :-).

 

Pat