Download taming-works10.pdf

File Info : Taming Complex Bioinformatics Workflows with Weaver, Makeflow, and Starch

Contents : Taming Complex Bioinformatics Workflows with Weaver, Makeflow, and Starch

Andrew Thrasher, Rory Carmichael, Peter Bui, Li Yu, Douglas Thain, and Scott Emrich
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN

Abstract

In this paper, we discuss challenges of common bioinformatics applications when deployed outside their initial development environments. We propose a three-tiered approach to mitigate some of these issues by leveraging an encapsulation tool, a high-level workflow language, and a portable intermediary. As a case study, we apply this approach to refactor a custom EST analysis pipeline. The Starch tool encapsulates program dependencies to simplify task specification and deployment. The Weaver language provides abstractions for distributed computing and naturally encourages code modularity. The Makeflow workflow engine provides a batch system agnostic engine to execute compiled Weaver code. To illustrate the benefits of our framework, we compare implementations, show their performance, and discuss benefits derived from our new workflow approach relative to traditional bioinformatics development.

I. INTRODUCTION

The rapid production of bioinformatics applications by academic institutions has led to the development of many powerful and useful tools. Many of these tools require significant computational resources for any nontrivial task. Further, increasing numbers of research groups are deploying externally developed bioinformatics applications in their own computing environments and modifying or building upon them.

Unfortunately, deployment and modification of bioinformatics applications has proven challenging. Commonly encountered challenges (Section II) include dependence on specific distributed computing resources, complex program dependencies, and intermingling of conceptually distinct tasks within code. The first two problems severely impede deployment efforts, while the third undermines customization, debugging, and code maintenance.

Here we propose to separate these problems and address each with a different layer of a software stack. Using a production pipeline described in Section III, we implement the three tiers described in Section IV: an encapsulation layer to reduce complex webs of coordinated programs and libraries, a high-level workflow language to concisely and intuitively describe this pipeline, and a portable low-level workflow language and execution environment.

In Section V, we describe three tools developed by the University of Notre Dame's Cooperative Computing Lab which mitigate the previously mentioned problems. Starch provides a method for packaging complex program dependencies into a single executable archive. Weaver is a Python-based high-level workflow description language with support for concise implementation of many common distributed computing patterns. Weaver programs are compiled into make-like low-level workflow descriptions that can be executed through the highly portable Makeflow workflow engine. In Sections VI and VII, we leverage these tools to refactor an internally developed EST analysis pipeline into a maintainable and easily deployed workflow, and compare this refactored pipeline to its predecessor with respect to its conciseness, maintainability, robustness, and portability.

II. COMMON CHALLENGES IN BIOINFORMATICS APPLICATIONS

A. Portability

When leveraging the parallelism in their software, many development sites focus on a particular distributed resource because it is the only resource accessible to them and they don't possess the resources or expertise to develop interfaces for more systems. As a result, few applications are developed with the flexibility to utilize a variety of distributed resources. This introduces serious challenges when organizations attempt to deploy computationally intensive tools without access to the same computing resources as the original developers.

B. Software Maintainability

Bioinformatics users often need to handle hypotheses and data outside of the application's original scope. Such difficulties are particularly pronounced in code bases that lack modularity, implement their control logic at very low levels (rather than describing it through higher-level work patterns or abstractions), or perform their function indirectly through the runtime generation of code. Some programs achieve a modest degree of modularity but are implemented with low-level control and programmatically generated intermediate executables [1], [2].

C. Dependency Management

Many bioinformatics applications feature tasks with a high degree of natural parallelism. Naturally, bioinformaticians (people developing tools for biologists) take advantage of this by running work on many nodes, often in grid or cloud settings. The execution of such workloads depends on the ability to transport the required dependencies for each task to each worker node. However, many bioinformatics tool
  • File Type : .pdf
  • Length : 6 pages
  • File Size: 107.8 kb
  • Virus Tested : No
  • Verified : 2012-08-09
  • Source: www.cse.nd.edu
INFO HASH : 55e639c9c7bd132458c50719229653db987a9a4b