SDI: Minutes for July 8 meeting in Bethesda
Notes from NIH Roadmap Software and Data Integration (SDI) meeting: July 8, 2005
The meeting agenda, which is also the entry point to the wiki discussion on SDI, is at: http://www.na-mic.org/Wiki/index.php/SDI:_Agenda_for_July_8_meeting_in_Bethesda
8:00 – 10:00 AM Initial Presentations from Four Centers
Mike Sherman (Simbios): General solution doesn’t benefit an individual. Separate modeling from computation. For delivery, build centralized layers: Applications, modeling, computation. Strategy for software repository—federation, Gforge is likely effective for this. They want to (have to) involve collaborators at different levels of openness. Curation of software is huge service. Key observation: open source is not the same as open development—this is basically the same as having a small number of committers. Application developer doesn’t know how to make models. Users will have to receive fully-formed applications and not be expected to understand workings of all components—at least early on. Sharing hardware is difficult. High performance computing—DOE DARPA.
Bill Lorensen (NA-MIC): The framework (defined as the superstructure of the application) is a demand-driven data processing pipeline. Shows the new architecture of Slicer 3. The goal is to use a process that will avoid having to throw software away. They have worked with other Centers, e.g., using Slicer to do segmentation of hand and knee at Stanford, and working on COPD with staff at I2B2. Challenges: NA-MIC identity problem (how will they/NCBC/NIH get credit?). Licensing is an issue. Combining teams with mixed skills is an issue. NA-MIC is a distributed team. They are thinking about their long-term business model.
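Illustration (not from the presentation): a minimal Python sketch of what "demand-driven" means for a data processing pipeline. A stage does no work until a downstream consumer asks for its output, and it re-executes only if something upstream has changed. The class names Source and Filter, the counter-based mtime, and the runs attribute are assumptions made for this sketch; they are not the Slicer, ITK, or VTK API.

```python
class Source:
    """Start of the pipeline: holds data and counts modifications."""

    def __init__(self, data):
        self._data = data
        self.mtime = 1  # modification counter (not wall-clock time)

    def set(self, data):
        self._data = data
        self.mtime += 1  # marks everything downstream as out of date

    def update(self):
        return self._data


class Filter:
    """Stage that recomputes only when asked and only if its input changed."""

    def __init__(self, func, upstream):
        self.func = func
        self.upstream = upstream
        self._cache = None
        self._built_at = 0  # upstream mtime at the last execution
        self.runs = 0       # number of executions, kept for illustration

    @property
    def mtime(self):
        # A filter is as new as its input (filter parameters are not modeled).
        return self.upstream.mtime

    def update(self):
        data = self.upstream.update()  # the request travels upstream first
        if self._built_at != self.upstream.mtime:
            self._cache = self.func(data)
            self._built_at = self.upstream.mtime
            self.runs += 1
        return self._cache


src = Source([1, 2, 3])
doubled = Filter(lambda xs: [2 * x for x in xs], src)
total = Filter(sum, doubled)

total.update()    # first request executes both filters
total.update()    # nothing changed, so nothing re-executes
src.set([1, 1])
total.update()    # the upstream change triggers re-execution
print(doubled.runs, total.runs)  # prints: 2 2
```

The point of the design is that cost follows demand: adding a stage costs nothing until its output is requested, and repeated requests reuse cached results.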
Ivo Dinov (CCB): Subject areas of research: Non-affine volumetric registration; shape; level-set methods; conformal mapping; volumetric image segmentation; biosequence analysis. Challenges: Brain mapping (integration across species, modalities, …); software and hardware engineering; infrastructure and computing; scientific validation. Software development: stages, active development, alpha-beta distribution. Use Gforge for software development and engineering. Grid data pipeline. Interactions with other Centers: CCB-NA-MIC SW Integration, Slicer-LONI pipeline. CCB-I2B2: Hive cells; pipeline models. The strategy for building out to the national infrastructure should involve a staged process: first get the intra-Center working; then inter-Center; then full-blown build-out with collaborating R01s and other collaborators.
Henry Chueh (I2B2 part 1): Key initial effort--clinical research chart (CRC). CRC is combination of genotype and phenotype. Frameworks to allow development of application services in maximally decoupled effort. Complex choreography. Transactions and semantics. Exposing cells—WSDL integrators. Expose function to investigators—automators. The Center is investigating the use of Kepler as pipeline. Interoperability across NCBCs: web services, consider sharing data sets.
Shawn Murphy (I2B2 part 2): Web services manages versioning and data provenance.
Questions: Russ Altman: how about data that isn’t massaged into CRC? They manage pipeline. Run their analytic workflow off that. Russ: are there any problems with performance of web services flow? Shawn: web services is good for low performance. Isaac Kohane: when the data involves images then XML cost is less. Jennie Larkin: how are ontologies developed and shared? Shawn: is really hierarchy of vocabularies. Is very labor intensive. Bill Lorensen: there is open source Protégé. Shawn: the complexity is significant. Steve Wong: do they have images? Shawn: Not many images. Will get images in same way as Steve, using BIRN. Isaac: it will be important to use ontologies.
10:15 – Noon bio-Software Engineering:
Talking Points:
• Are there any organizing principles across the heterogeneous Centers?
• SE workshop and how to disseminate cumulative and unique knowledge of biosoftware engineering, balanced with bottom up interoperability workshop.
• What does it take to federate software repositories? Share, make components common. Minimum standards.
• What is interoperability (co-location of data? cholesky or Frameworks? Pipeline?).
• Gestalt four centers:
o NA-MIC: classical open source; open development/frameworks. Data BIRN
o I2B2: LGPL, but seek HIVE cells/maximally decoupled effort, flow control and pipeline. Web services. Do not use pre-built applications. Use web services. Data clinical research chart mined from Partners data (IRB?); Ontologies: should we develop our own or interoperate with wrappers?
o Simbios: open source not necessarily open development; will possibly evolve to more open development. Pre-built applications. Data is more a back end issue.
o CCB: open source but not necessarily open development, pipeline interoperability. Data BIRN?
• How far across NCBCs will we go with open source? Geographically separate investigators working on the same software components. Forge type repository in neutral ground for applications (think Slicer)
The idea of a clearinghouse. Ron Kikinis thinks NIH would need to be involved. Low risk. Can be real. All NCBCs can contribute.
How do we build out to National Centers: intra-Centers, inter-Centers, build out with R01s and other funded effort. NIH forge. What should NIH do? Ron thinks PIs can get consensus—with SDIWG. caBIG? Don’t duplicate. Should this be done with a contract? Terry Yoo: Consolidate NIfTI with caBIG is tough enough. We need to operate first before we interoperate.
Segue into NIH forge. Should we pay a contractor to do it at NIH? This is developer level. Eventually there will have to be a cross-Center place where people can go. Facilitate from developer to user community. Grace Peng: which community? Here is where the way the NCBCs use ‘forges’ can guide the NIH community.
Bill Lorensen: Forge federation across centers is already happening. LONI interim releases. They have already had cases where developers have earned commit rights, e.g., a scientist at Stanford has commit privileges with Slicer. CCB-NAMIC for LONI pipeline. Simbios uses VTK.
Bill Lorensen: Open source development can be hierarchical, e.g., Slicer at the application level while ITK is at the component level. Mike Sherman also says that the issues that Simbios is facing regarding federation of repositories are similar to what will be faced across Centers and with the larger community.
Arthur Toga: we won’t put our arms around all biocomputing software engineering. What are the low hanging fruit? Common software repositories where it makes sense? Interoperability based on pairwise interactions.
Yuan Liu: clearinghouse/NIH forge. Two levels (i) developers (ii) provides to non-sophisticated users. NIH is interested in both. Can you (NCBC staff) provide technical help at inter-Center level and intra-Center? Sharing best practices. Heterogeneous. There are layers of sophistication that need to be respected. Yuan: there are heterogeneous efforts. Peter Highnam: caBIG professionally managed over 500 (not just a contractor throw over the fence). Need to distinguish between large number of people working out of single CVS or a federation of CVSs? Ron Kikinis: why isn’t there an intramural component?
Cherri Pancake: Real goal is to make systemic changes. Includes intramural. Not forcing (standardization?). Could be inexpensive. Managing risk—version control (across Centers?) web based. Not cobbled. Jennie Larkin: NCBC NIH-forge, how? Bill Lorensen: you will linked … says it is federated.
Russ Altman: big NIH forge may fall over.
Discussion: Pubmed is a trusted resource. Is there a way of certification of quality?
12:30 – 1:30 PM IP Discussion
Each institution has a license which has been placed on their directory in the SDIWG wiki. GPL guarantees people won’t use it, due to the viral issue. LGPL is better (less viral; is that the only issue?). There is an additional need to deal with the liability issue. Warranty is a problem. Putting unreliable software out there is a problem. An important issue in the development cycle is deciding when software should go open source. Russ: there is huge variance in when scientists are willing/want to make their software open. Isaac: Botstein microarray data open. Should we make/demand all software open? Some think: if it’s not open source it doesn’t exist. ITK has no patented software in core of system. Keeps the license very simple. But are there any instances of some important software not being included (lose customers)? VTK had patented directory. There are some cases where a few customers have been lost. Susanne Churchill: what about collaborating R01s and their existing patent portfolios? Need to resolve. Arthur Toga: we shouldn’t promote a uniform approach. Should we catalogue various levels of license? Promote open source as carrot. CCB has same issues as Simbios with legacy code. Matlab and Python debate: they are ‘just’ software, right? The national infrastructure will include proprietary code. There will be toll roads. Try to make the main highways (causeways) open.
1:30 – 3:00 National Infrastructure Demonstrations
Suggestion: Should we frame a potential demonstration in terms of the Grand vision: from knowledge of SNP to molecular to phenotype? We will probably have to wait until the last three Centers are named in order to know what we’re working with. We will need to make infrastructure that is useful to non-power users. Like Google?
Discussion: In addition to Demonstrations, we need to put in place mechanism for credit, e.g., papers. Art Toga: Acknowledge that some connections are spontaneous. Henry Chueh: Issue of top-down vs. bottom up—we need to value both. How to enable the serendipitous interoperability? Chuck Friedman asks: does bottom up have to be spontaneous? What can we do to maximize the possibility of generating bottom-up interactions? SDI liaisons are chief bottom-up officer. Do something analogous to programmer’s meeting across Centers. Action: invite people to each other’s programmer’s workshops.
Issue of top-down (DARPA-like) demo. 18 months demo. Sometimes there is very little to carry along afterwards. There is a danger that result won’t be useful, and detracts from other efforts.
Young investigators: are impressionable to new connections. Hence we should foster new connections. Synthesize what we’re doing.
We should not limit our efforts to four Centers, e.g., it would be good to link the soon-to-be-funded Interagency multiscale modeling (MSM) efforts to Simbios. Ron Fedkiw represents another link to another agency effort. In general should record interagency interactions.
Karen Skinner: We should capture how the world is different.
Don’t necessarily equate top-down with demonstration.
Sum up: catalogue and record things that we think are in the future. Open access publication. Pubchem. Simbios is publishing a magazine.
Short term: Sharing of software practices. (i) Develop a web site with codified links to software, a la the Internet Analysis Tools Registry http://www.cma.mgh.harvard.edu/iatr/. Is this the logical start for NIH-forge?
Medium term: (i) NIH-forge: From the Z-gram: Develop a prototype high-throughput global search and analysis system that integrates genomic and other biomedical databases and software.
(ii) Demonstrations: Here is an incomplete list of pairwise interactions from our web site http://na-mic.org/Wiki/index.php/SDIWG:Ongoing_Discussion#Discussion_Points:_Interaction_Matrix--Current_list_of_possible_interaction_among_the_Centers_based_on_Computer_science_and_Domain_science_areas_that_overlap
Kikinis-Altman: Extending ITK/VTK approach for software development to SimTK. Links with imaging and modeling.
Kikinis-Toga: Neuroinformatics shared interest. Mostly dealing with LONI pipeline.
Kikinis-Kohane: Using imaging as characterization of phenotype (lung: COPD or asthma, DiGeorge syndrome). Database is also common ground through Partners (Glasser), and data sharing connection with BIRN.
Altman-Toga: Database theory. Modeling.
Altman-Kohane: Standards and database. Genotype-phenotype studies.
Toga-Kohane: Genemap and Huntington’s. Image as phenotype.
What could we do if we take the issue of funding off the table?
Long term: (i) Continued effort on NIH-forge: Close the loop between developers and users (ii) Demonstrations
3:00 – 3:30 Wrap Up
Wrap Up action items:
1. NIH-forge: federation of repositories. Can we have geographically diverse people working on same components? We could volunteer to inform NIH-forge, e.g., examples.
2. SE symposium?
3. Interoperability demos?
4. Portal
Karen Skinner: asks what are the concrete components of NIH forge? Look at current (SourceForge, CollabNet, Gforge). A place to put code. Like a Pubmed central. Could be virtual or actual. NCBI is model for computational infrastructure. Terry Yoo: Pubmed is probably not the model. But more like Genbank (or Pubmed central). Russ: it was downhill and obvious to check software into source. NIH forge got people more enthusiastic than interoperability demos.
Strawman action items for future.
1. NIH-forge ‘lite’ can be generated by a series of developer’s workshops that lead to exchange of commit privileges among Centers (i.e., where developers have to earn commit privilege through professional contacts).
2. An infrastructure effort along the lines of yellow pages (cf., IATR model http://www.cma.mgh.harvard.edu/iatr/info.php, or for ontologies the OBO http://obo.sourceforge.net/)
a. component,
b. applications,
c. combine components and applications to make pipelines/workflows.
3. Down the road, evolution toward an “NIH forge” type effort. See below.
4. Demonstrations. To be determined.
Definition of NIH forge.
1. Yellow pages of available tools to users (IATR model http://www.cma.mgh.harvard.edu/iatr/info.php, or for ontologies the OBO http://obo.sourceforge.net/)
a. component,
b. applications,
c. combine components and applications to make pipelines/workflows
2. Restaurant rating
a. user evaluation for ease of use etc,
b. techie evaluation benchmarking testing against datasets.
3. Environment for users communicating needs to developers (NA-MIC programmers week model). The developers should probably retain as close to their native work environment as possible, i.e., federate!
4. Developer environment, e.g., based on programmer’s week.
5. Expansion of developer environment (Dashboard): geographically separated developers working on the same software to avoid duplication. Probably federated.
6. Code bank (cf. Genbank model), probably virtual.
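Illustration (not from the meeting): items 1 and 2 of the definition above can be sketched as a small data structure, assuming a registry that stores only links to each tool's home repository (federated, so the code stays where it is developed) plus two kinds of ratings. The names Entry and Registry, the example.org URLs, and the 1 to 5 rating scale are hypothetical choices for this sketch.

```python
from dataclasses import dataclass, field
from statistics import mean

KINDS = ("component", "application", "pipeline")


@dataclass
class Entry:
    name: str
    kind: str       # one of KINDS
    home_url: str   # link to the home repository; no code is copied
    uses: list = field(default_factory=list)         # names of entries it builds on
    user_scores: list = field(default_factory=list)  # ease-of-use ratings
    tech_scores: list = field(default_factory=list)  # benchmark/testing ratings

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind}")


class Registry:
    """Yellow pages: a catalogue of links and ratings, not a code store."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        missing = [u for u in entry.uses if u not in self._entries]
        if missing:
            raise KeyError(f"unregistered dependencies: {missing}")
        self._entries[entry.name] = entry

    def rate(self, name, user=None, tech=None):
        entry = self._entries[name]
        if user is not None:
            entry.user_scores.append(user)
        if tech is not None:
            entry.tech_scores.append(tech)

    def find(self, kind):
        return sorted(e.name for e in self._entries.values() if e.kind == kind)

    def summary(self, name):
        e = self._entries[name]
        return {
            "user": mean(e.user_scores) if e.user_scores else None,
            "tech": mean(e.tech_scores) if e.tech_scores else None,
        }


reg = Registry()
reg.add(Entry("ITK", "component", "https://example.org/itk"))
reg.add(Entry("Slicer", "application", "https://example.org/slicer", uses=["ITK"]))
reg.rate("Slicer", user=4)
reg.rate("Slicer", user=2, tech=5)
print(reg.find("component"), reg.summary("Slicer"))
```

Keeping user and technical scores separate mirrors the two rating audiences in item 2, and storing only URLs keeps the registry compatible with the federated model in items 3 to 6.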