Multi-Source Data Analysis

Sam Uselton
MRJ Technology Solutions
NASA Ames Research Center

An Increasingly Common Requirement

At least three factors are converging to make multi-source data analysis pervasive in the near future. Digital data acquisition is becoming easier and cheaper. Computational simulations are gaining fidelity and detail while becoming more practical to compute. And everything is becoming networked so data from many sources can be reached by a single user or application.

From cross validation of computational and experimental models to steering computational simulations with real world observations, bringing data from multiple sources together is much more powerful than using each source separately. And computer systems can provide support for users in situations where they would be overwhelmed by volume or complexity without the support. But multi-source data analysis is harder than single source data analysis, and designing, building and deploying tools for others to use for it is very hard.

Challenges

Supporting multi-source data analysis well requires a system that controls and overcomes difficulties of several kinds. I have divided these difficulties into four categories; this is not the only way these ideas could be organized, nor is this list exhaustive. The categories I have selected are (1) data difficulties, (2) user and usage difficulties, (3) software engineering and development difficulties and (4) visualization and analysis difficulties.

Multi-source data is complex, heterogeneous, dynamic, distributed and very large. Data from computational simulations is extremely complex in many ways, from the geometry of grid systems to the variety of data types and the variety of phenomena being modeled. Looking at such data from several sources, or combining it with complementary data of other kinds, multiplies this complexity. A particularly significant aspect of this complexity is the heterogeneity of the data, from file formats to numerical representations, units, sampling densities, and validity rules. This data is distributed broadly across many computer systems and many geographic locations. Much of this data is dynamic, especially "current" observations, whether from remote sensing satellite instruments or ground-based weather measurements. And if individual computational data sets are large, then a collection of even a few of them, together with thousands or millions of smaller related items, is truly massive.
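
To make the heterogeneity concrete, the sketch below shows one simple way a tool might attach a self-describing record to each incoming field so that mismatches in units, grids, or numeric representation are detected before data from two sources is combined. The types and field names are invented for illustration; they are not from FEL or any other NAS library.

    // Hypothetical sketch: a self-describing descriptor attached to each field,
    // so downstream code can detect mismatches instead of guessing conventions.
    // All names here are illustrative, not from any existing NAS library.
    #include <iostream>
    #include <string>

    enum class GridKind { StructuredCurvilinear, Unstructured, UniformLatLon };
    enum class NumericType { Float32, Float64, Int16 };

    struct FieldDescriptor {
        std::string source;         // e.g. "overset CFD solution" or "satellite imagery"
        std::string quantity;       // physical quantity, e.g. "temperature"
        std::string units;          // e.g. "K" vs "degF" -- must be reconciled
        GridKind    grid;           // geometry/topology of the sampling
        NumericType storage;        // on-disk numeric representation
        double      nominalSpacing; // rough sample spacing, in meters
    };

    // Before two sources are combined, a tool can at least detect mismatches
    // and decide whether unit conversion or resampling is required.
    bool directlyComparable(const FieldDescriptor& a, const FieldDescriptor& b) {
        return a.quantity == b.quantity
            && a.units == b.units
            && a.grid == b.grid;
    }

    int main() {
        FieldDescriptor cfd{"CFD run", "temperature", "K",
                            GridKind::StructuredCurvilinear, NumericType::Float32, 0.01};
        FieldDescriptor obs{"wind-tunnel probe", "temperature", "degF",
                            GridKind::Unstructured, NumericType::Float64, 0.05};
        std::cout << (directlyComparable(cfd, obs) ? "directly comparable\n"
                                                   : "needs unit/grid reconciliation\n");
    }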

Oddly enough, the same terms describe the difficulties related to users and the ways they want to use multi-source data. The users are distributed broadly, but unevenly. They are heterogeneous in their training, their backgrounds and (especially important) their intended uses of the data. The group of users is dynamic and complex, and as the group becomes large, additional difficulties arise. Sharing data requires the usual locks and protections, but collaborative sessions, in which distributed groups of users all interact with the same data at the same time, have much tougher requirements.

The central software engineering problem is that no one piece of software can hope to do everything that all these users will ever want to do to the data available to them. But users don't want to spend time learning many software packages just because no single set of tools is complete. The same five terms apply. The complete set of relevant software will be large and complex. It will need to run on a variety of platforms and will sometimes be distributed across several. Expect the software to come from companies, research labs, universities and the users themselves, so it will be very heterogeneous. And useful new pieces will arise constantly, so it is dynamic too. What is needed is a system that can flexibly accommodate a dynamic collection of software, and that lets users customize how complexity is hidden from them and how controls are presented to them.
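
As a rough illustration of the kind of flexibility intended, the sketch below registers analysis modules from arbitrary origins behind one small common interface, so new pieces can be added without rebuilding the framework. The registry and interface are hypothetical simplifications, not part of any product described here.

    // Hypothetical sketch: a registry that lets a framework accommodate a growing,
    // heterogeneous collection of analysis modules behind one small interface.
    // The interface and registry are illustrative only.
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Every module, wherever it comes from, is wrapped to look the same to users.
    struct AnalysisModule {
        std::string description;
        std::function<void(const std::vector<double>&)> run;
    };

    class ModuleRegistry {
        std::map<std::string, AnalysisModule> modules_;
    public:
        void add(const std::string& name, AnalysisModule m) { modules_[name] = std::move(m); }
        void run(const std::string& name, const std::vector<double>& data) const {
            auto it = modules_.find(name);
            if (it != modules_.end()) it->second.run(data);
            else std::cout << "no module named " << name << "\n";
        }
    };

    int main() {
        ModuleRegistry registry;
        // A vendor tool, a research prototype, or a user script could all register here.
        registry.add("mean", {"average of the samples", [](const std::vector<double>& d) {
            double s = 0; for (double v : d) s += v;
            std::cout << "mean = " << (d.empty() ? 0.0 : s / d.size()) << "\n";
        }});
        registry.run("mean", {1.0, 2.0, 3.0});
    }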

By now it should be obvious that the visualization and analysis difficulties can also be described by the same five key terms. When the number of data items is large, it soon exceeds the number of pixels on a screen. Complex collections of heterogeneous data require careful re-thinking of visualization methods. Several fields, each customarily displayed in the same way, cannot all be displayed together without confusion or substantial changes. Animation helps with some dynamics, but generally one must limit the parameters of the visualization being animated. And large, distributed data sets mean that new methods must be considered to overcome bandwidth and latency limitations.

Approaches

The Data Analysis Group of the NAS Systems Division at NASA Ames Research Center is working on several projects that address some of the difficulties described above. One idea that is sufficiently developed to have begun to bear fruit concerns the software engineering problems. For quite a while we have seen the potential benefits of code reuse and have therefore discussed object-oriented software design. The needs of the group range from platforms for testing new visualization ideas to tools for building customized applications for small collections of users. Devising a single object hierarchy that fit all our needs seemed impossible. We eventually settled on an idea championed by Michael Gerald-Yamasaki: developing a collection of libraries, each targeted at a different portion of a complete visualization system. These libraries are called Horizontal Products. Applications built entirely or substantially from these libraries are called Vertical Products. The development of each Vertical Product is intended to use as much of the Horizontal Products as possible, while allowing easy integration of additional code specialized for the application.
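
The sketch below illustrates the layering idea: two stand-in "horizontal" pieces (a field class and a visualization technique) and a tiny "vertical" application that is mostly composition plus an application-specific choice. The class names are invented for illustration and are not the actual FEL or Visualization Technique Library interfaces.

    // Hypothetical sketch of Horizontal Products composed into a Vertical Product.
    // These classes are illustrative stand-ins, not the real library interfaces.
    #include <iostream>
    #include <string>

    // --- horizontal layer 1: field access (stand-in for a field library) ---
    class ScalarField {
    public:
        explicit ScalarField(std::string name) : name_(std::move(name)) {}
        double sample(double /*x*/, double /*y*/, double /*z*/) const { return 1.0; }
        const std::string& name() const { return name_; }
    private:
        std::string name_;
    };

    // --- horizontal layer 2: a visualization technique ---
    class IsosurfaceTechnique {
    public:
        void apply(const ScalarField& f, double level) const {
            std::cout << "isosurface of " << f.name() << " at " << level << "\n";
        }
    };

    // --- vertical product: mostly glue, plus application-specific defaults ---
    int main() {
        ScalarField pressure("pressure");
        IsosurfaceTechnique iso;
        iso.apply(pressure, 101325.0);   // application-specific choice of level
    }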

These Horizontal Products are still under development. The first one, called the Field Encapsulation Library (FEL), was the topic of a paper presented at Visualization '96. We learned a great deal in that initial effort, and those lessons have informed a thorough revision of FEL as well as the development of a Visualization Technique Library and the beginnings of other Horizontal Products.
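
The core idea of field encapsulation can be suggested with a small sketch: visualization code asks a field for a value at a point and never sees whether the data lives on a curvilinear grid, an unstructured mesh, or in paged storage. The abstract interface below is an invented simplification, not the published FEL API.

    // Hypothetical sketch of the field-encapsulation idea: one abstract interface,
    // many possible backing stores. This is not the published FEL interface.
    #include <cmath>
    #include <iostream>
    #include <memory>

    class Field {
    public:
        virtual ~Field() = default;
        virtual double valueAt(double x, double y, double z) const = 0;
    };

    // One concrete backing store; others (curvilinear, unstructured, paged, remote)
    // would implement the same interface without changing any visualization code.
    class AnalyticTestField : public Field {
    public:
        double valueAt(double x, double y, double z) const override {
            return std::sin(x) * std::cos(y) + z;   // stand-in for real interpolation
        }
    };

    // A technique written once against Field works for every backing store.
    double probe(const Field& f, double x, double y, double z) {
        return f.valueAt(x, y, z);
    }

    int main() {
        std::unique_ptr<Field> f = std::make_unique<AnalyticTestField>();
        std::cout << "value = " << probe(*f, 0.5, 0.5, 0.0) << "\n";
    }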

We have not stopped working on applications while these tools are incomplete. Example applications have been used as Vertical Product proxies to drive the development process. The software specifically developed with multi-source data in mind, exVis and VISOR, has been used to unearth requirements for FEL and the Visualization Technique Library. The Virtual Wind Tunnel has been completely converted to rely on the new FEL. But the strength of the strategy is shown in the ability to create new tools quickly and reliably. These benefits are just beginning to be demonstrated, in features added quickly to existing applications and in several small applications described in papers and case studies submitted to Visualization '98. Another useful capability, and one of our long-term goals, is the ability to integrate software from diverse sources into applications that can be packaged and made available to users.

We have also done some work which addresses large data set issues. The first step simply involved reorganizing the data from its initial format (notable primarily for the ease of writing it from Cray Fortran) to a format that groups together values that are frequently accessed together. Noticing that many of our common visualization techniques need access to only small portions of a dataset led to work on application-controlled paging, presented at Visualization '97. Multiresolution techniques are of interest, but are particularly difficult to do well with complex data. Feature extraction is another strategy we are pursuing with some success, although those methods tend to be more application specific.
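
The paging idea can be sketched simply: split the data set into fixed-size blocks on disk and read a block only when a technique first touches a value inside it, so a streamline or slice that visits a small region loads only a small fraction of the file. The class and layout below are an invented simplification, not the scheme from the Visualization '97 paper.

    // Hypothetical sketch of application-controlled paging: values are read in
    // fixed-size blocks, and a block is loaded only on first access.
    // The block layout here is illustrative, not the published scheme.
    #include <cstddef>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    class PagedScalarVolume {
    public:
        PagedScalarVolume(std::size_t nValues, std::size_t blockSize)
            : nValues_(nValues), blockSize_(blockSize) {}

        double value(std::size_t i) {
            if (i >= nValues_) return 0.0;               // out-of-range guard
            std::size_t block = i / blockSize_;
            auto it = cache_.find(block);
            if (it == cache_.end())
                it = cache_.emplace(block, loadBlock(block)).first;  // "page fault"
            return it->second[i % blockSize_];
        }

        std::size_t blocksLoaded() const { return cache_.size(); }

    private:
        // Stand-in for a disk read of one contiguous block of values.
        std::vector<double> loadBlock(std::size_t block) const {
            std::vector<double> data(blockSize_);
            for (std::size_t j = 0; j < blockSize_; ++j)
                data[j] = static_cast<double>(block * blockSize_ + j);
            return data;
        }

        std::size_t nValues_, blockSize_;
        std::unordered_map<std::size_t, std::vector<double>> cache_;
    };

    int main() {
        PagedScalarVolume vol(1u << 20, 4096);   // about a million values, 4K-value blocks
        double v = vol.value(42);                // touches exactly one block
        std::cout << v << ", blocks loaded: " << vol.blocksLoaded() << "\n";
    }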

We have begun work to address problems of large data collections as well. The initial thrust is to adapt database technology to our needs. Work done at Stanford University on heterogeneous database access seems particularly promising, and we have established a joint effort with a group there. We expect to differ from traditional database applications in many ways, including methods for subsetting and summarizing data and for controlling the location where derived data is computed. Methods for finding data sets which meet certain criteria exist, but they must be adapted for scientific and engineering purposes. Required modifications include handling floating-point numerical representations and application-sensitive notions of matching or nearly matching.
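
One way to picture "nearly matching" is to give each numeric attribute of a data set an application-chosen tolerance and accept any record that falls within all of them, as in the sketch below. The schema and tolerances are invented for illustration; they are not drawn from the Stanford work or from any existing catalog.

    // Hypothetical sketch of near-match search criteria over numeric metadata:
    // each attribute carries its own tolerance instead of requiring exact equality.
    // The record fields and values are illustrative only.
    #include <cmath>
    #include <iostream>
    #include <string>
    #include <vector>

    struct DataSetRecord {
        std::string name;
        double machNumber;       // flow condition of a CFD run
        double reynoldsNumber;
    };

    // A record matches if every numeric attribute is within its tolerance.
    bool nearlyMatches(const DataSetRecord& r, double mach, double machTol,
                       double re, double reTol) {
        return std::fabs(r.machNumber - mach) <= machTol
            && std::fabs(r.reynoldsNumber - re) <= reTol;
    }

    int main() {
        std::vector<DataSetRecord> catalog = {
            {"wing_a", 0.84, 5.0e6},
            {"wing_b", 0.70, 1.2e7},
        };
        for (const auto& r : catalog)
            if (nearlyMatches(r, 0.85, 0.02, 5.0e6, 1.0e6))
                std::cout << "candidate: " << r.name << "\n";
    }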

Our approach to the diversity of users and their uses of the data is to develop software for applications that have visible users, to get as many of those users as feasible involved during the prototype development stage, and to listen for what they are trying to accomplish, not just what they think they want to see.

As a visualization researcher, I regard visualization and analysis difficulties as opportunities. So I'm excited by the range of possibilities available for research here, for example in systems which support users in avoiding visual conflicts in their data presentations. Comparison of data from multiple sources, or from a single source at different times, also provides a wide range of possibilities, many of which may be useful in one context or another.

Descriptions of much of the work mentioned can be found at http://science.nas.nasa.gov/Groups/VisTech/