DAISY Pipeline Next Generation - Arguments and Principles
Editors: Markus Gylling, Romain Deltour
Note: This page is now deprecated. Development of the Pipeline 2 project has started and is now hosted on Google Code.
Introduction
The DAISY Pipeline project, started by the DAISY Consortium in 2005, originally defined as its overarching scope 1 2
"to provide a component-based framework for file and file-set conversion. The primary target user groups of the system are tool providers and content producers associated with the ANSI-NISO Z39.86 standard."
As of mid-2008, the DAISY Pipeline has been introduced in production environments all over the world as a standalone application. Pipeline services are also appearing integrated in other applications such as MS Word, Open Office and Audacity. The subproject PipeOnline 4 is, in collaboration with NLB Norway, building a framework for deploying the DAISY Pipeline as a web service on both intranets and extranets.
The increasing use of the Pipeline in varied and heterogeneous environments raises new issues of cross-platform integration. In the meantime, some technologies that were nonexistent or bleeding edge when the Pipeline debuted may now be mature enough to be adopted in a Pipeline Next Generation concept.
The Pipeline Next Generation Concept
When we say "Pipeline Next Generation" we are referring to an architectural redesign of the Pipeline core system. The following provides a motivation for such a redesign.
The original Pipeline design 3 defined principles such as
- Loosely coupled contract-based transformation entities (via Java Interfaces called "Transformers")
- The ability to chain Transformers together in Scripts to build advanced compound transformations
- Ability to introduce and load new Transformers without recompilation or application restart
- Desktop and Web service deployment options
- Cross platform support
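The first two principles can be illustrated with a minimal Java sketch. The `Transformer` interface, the `Script` class, and their method names are hypothetical and chosen for illustration only; they are not the actual Pipeline API, and real Transformers operate on files and file sets rather than on strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical contract: a transformation entity only has to honor this
// interface, which keeps implementations loosely coupled.
interface Transformer {
    String transform(String input);
}

// Hypothetical script: an ordered chain of transformers in which the
// output of one transformer becomes the input of the next.
class Script {
    private final List<Transformer> chain = new ArrayList<>();

    Script then(Transformer next) {
        chain.add(next);
        return this;
    }

    String run(String input) {
        String current = input;
        for (Transformer t : chain) {
            current = t.transform(current);
        }
        return current;
    }
}

class ScriptDemo {
    public static void main(String[] args) {
        Script script = new Script()
            .then(String::trim)                      // stand-in for a "fixer"
            .then(s -> s.toUpperCase(Locale.ROOT))   // stand-in for a converter
            .then(s -> "<out>" + s + "</out>");      // stand-in for a packager
        System.out.println(script.run("  dtbook  ")); // prints <out>DTBOOK</out>
    }
}
```

Because the script depends only on the interface, a new transformer can be added to a chain without touching the existing ones.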
While the current DAISY Pipeline largely succeeds in achieving these goals, its overall architecture and cross-platform viability can be significantly improved by:
1. Reusing established technologies
At the time of creation of the original Pipeline Core design, there were no existing or established technologies, frameworks or standards that could be used and built upon to achieve the desired feature set. Now, in 2008, the domain of automated, SOA and XML-centric transformation services has expanded, and there are several technologies that are viable candidates for a Pipeline Core redesign effort. We believe that a Pipeline architecture that builds upon mainstream technologies and standards is preferable since it is more future-proof, and since it could decrease the effort needed for infrastructural inclusion at the individual organization level.
2. Abstracting a platform-neutral core from its implementation(s)
Whereas the current Pipeline design satisfies the cross-platform requirements (based on Java, it is currently known to run on Windows, Mac OSX, IBM AIX and Linux), there is reason to think about the cross-platform compatibility in other terms than supported runtime environments. As will be discussed further in the technology section below, the use of platform-neutral technologies such as XProc, coupled with extended use of XSLT 2.0, would allow us to achieve a separation between a more abstract Pipeline API and its (possibly multiple, parallel) implementation(s).
3. Providing cross-platform interfaces via built-in bridge support
As an alternative to having parallel implementations as discussed in item 2 above, one could envision explicit framework bridges being given first-class citizen status in a single reference implementation. In contrast to the solution presented in item 2, a bridge wrapper approach would eliminate the risk of duplication of effort while still achieving the desired organizational infrastructure transparency.
The Pipeline 2.0 core will very likely be based on the Java platform, to leverage the legacy code and skill set acquired with the Pipeline 1.0 project.
Emerging Technologies - A Primer
In this section we intend to describe a few technologies which would be at the heart of the Pipeline Next Generation: XSLT 2.0 (along with XPath 2.0), XProc, and OSGi.
XSLT 2.0 and XPath 2.0
As a fair amount of the Pipeline functionality involves conversion or manipulation of XML data, it seems natural to focus particularly on XPath and XSLT:
- XSLT
- XSLT is a template-based markup language for transforming XML documents into other XML documents.
- XPath
- XPath is an expression language for selecting nodes in an XML document, designed to be embedded in a host language (XSLT or XQuery).
Both XPath 1.0 and XSLT 1.0 recommendations were first issued in 1999. While they quickly became popular technologies, a few technical shortcomings sometimes led stylesheet developers to use tedious workarounds or proprietary extensions to achieve their goals, which impacted their overall productivity. These issues have been significantly improved by XSLT 2.0 5 and XPath 2.0 6, developed by the XSL and XQuery Working Groups and published as W3C Recommendations in January 2007.
The following is a non-exhaustive list of the improvements featured in XSLT 2.0 and XPath 2.0:
- generalization of every XPath 2.0 value as a sequence of items
- replacement of XSLT 1.0 result tree fragments by temporary trees that can be navigated like any other nodes
- support for grouping nodes
- support for multiple output documents
- improved control structures (for loops, conditional tests)
- support for regular expressions
- possibility to define reusable functions
- introduction of many new built-in functions
- backward compatibility with XSLT 1.0
The DAISY Pipeline already relies on XSLT stylesheets for core functionality such as DTBook to XHTML conversion, DTBook forward migration, DTBook fixing, etc. The ability to easily integrate an existing stylesheet as a Pipeline transformer has always been one of the major selling points of the Pipeline to the community. The XSLT 2.0 and XPath 2.0 improvements in terms of code quality and productivity will allow us to promote the use of stylesheets even further in the next generation Pipeline. If a piece of functionality can be implemented via XSLT, then XSLT will be the preferred choice over any alternative solution (including Java).
XProc
- XProc
- XProc is a markup language for describing processing pipelines for operations on XML documents. It is developed by the XML Processing Model Working Group and currently published as a W3C Working Draft 7, at the Last Call stage at the time of this writing (September 2009).
The concepts behind XProc are fairly simple: pipelines are made up of steps, each of which can have zero or more inputs and zero or more outputs. Steps can be either atomic (i.e. perform a unit of XML processing) or compound, that is, composed of other steps (where the output of one becomes the input of another). Additionally, steps may be configured with options and parameters.
Here is a very basic example of an XProc document, taken from the XProc specification:
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <p:choose>
    <p:when test="/*[@version &lt; 2.0]">
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="v1schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:when>
    <p:otherwise>
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="v2schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:otherwise>
  </p:choose>
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="stylesheet.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>
In this example, the "choose" step evaluates an XPath expression against the source input document. Depending on the result, one of two branches, each consisting of a single validate step, is executed, and its output is chained to a final XSLT transformation step. Note that XProc implicitly defines a "source" input and a "result" output for steps, which reduces the verbosity of the syntax.
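For comparison, the same control flow can be written imperatively with the standard Java XML APIs (JAXP). This is only a sketch under stated assumptions: the class and method names are hypothetical, the validation step is reduced to selecting the schema file name (the schema files of the specification example are not available here), and an inline XSLT 1.0 identity stylesheet stands in for stylesheet.xsl.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ChooseThenTransform {

    // Assumption: placeholder for stylesheet.xsl (an identity transform).
    static final String IDENTITY_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template></xsl:stylesheet>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse source document", e);
        }
    }

    // Counterpart of p:choose: same XPath test, returns the schema to validate with.
    static String chooseSchema(Document source) {
        try {
            boolean isV1 = (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate("/*[@version < 2.0]", source, XPathConstants.BOOLEAN);
            return isV1 ? "v1schema.xsd" : "v2schema.xsd";
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Counterpart of p:xslt: applies a stylesheet to the document.
    static String applyStylesheet(Document source, String xsl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(source), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<book version='1.5'/>");
        System.out.println("validate with " + chooseSchema(doc));
        System.out.println(applyStylesheet(doc, IDENTITY_XSL));
    }
}
```

The imperative version has to spell out parsing, error handling and the hand-over between steps, all of which the XProc document leaves to the engine.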
The XProc language is quite terse. It uses a small vocabulary of core elements that provide the ancillary support for declaring steps and connecting them in sub-pipelines. In addition to this language, the XProc specification fully defines a standard step library that covers common XML processing operations: element insertion, deletion, replacing, renaming, wrapping, XSLT, etc.
The DAISY Pipeline framework, with its script and transformer paradigm, is conceptually very close to XProc. We thus believe that we can leverage the XProc specification to bring a more formal and robust core to the Pipeline Next Generation. The main advantages of this approach are:
- XProc is a declarative and platform-neutral language: XProc Pipelines can potentially be executed by any XProc engine, independently of its underlying implementation platform (Java, .Net, Python, etc).
- XProc is a formal and well-documented specification: this ensures good readability and facilitates the authoring of XProc Pipelines.
- The wide adoption of W3C standards ensures robustness and maintainability: the specification benefits from the "design check" induced by collaborative editing and peer review.
- The XProc standard step library provides a strong and broad functional core: the standard steps are by contract implemented by all conforming XProc engines.
- The popularity of W3C standards increases the attractiveness factor: implementing or extending an XProc engine could potentially raise interest in our developments beyond the DAISY community.
As mentioned in the XML Processing Model requirements 8:
[The XProc specification] is concerned with the conceptual model of XML process interactions, the language for the description of these interactions, and the inputs and outputs of the overall process. This specification is not generally concerned with the implementations of actual XML processes participating in these interactions.
This gives good insight into the approach that could be taken by the Pipeline Next Generation, which would use XProc as an underlying framework and bring custom functionality via:
- language extensions in a custom namespace: to introduce Pipeline-related metadata (such as parameter data types) and informative content (such as nicenames, description, documentation)
- extensions to the step library: to provide built-in support for reusable pieces of functionality (file set generation, audio encoding, TTS-based narration, etc)
OSGi
The last technology described here is less linked to the high-level concepts of the Pipeline Next Generation than to their underlying implementation. This technology - OSGi - provides a technical framework for the development of flexible, modular, component-based software.
- OSGi
- The OSGi Service Platform is an open specification 10 of a dynamic module system for Java, published and maintained by the OSGi Alliance 11.
OSGi defines a component model where every piece of software is a module, also known as a bundle. Each bundle is a jar (Java library) or directory and has a manifest file that defines:
- the version number: enables fine-grained management of bundle compatibility and improves the overall system maintainability.
- visibility rules: describe which parts of the bundle are made available to other bundles.
- dependencies: define the runtime requirements of the bundle, to enable dynamic loading of bundles.
A simple example of such a bundle manifest is:
Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-Name: DAISY Pipeline
Bundle-SymbolicName: org.daisy.pipeline
Bundle-Version: 1.0.3
Bundle-Vendor: DAISY Consortium
Require-Bundle: system.bundle
Import-Package: org.daisy.util
Export-Package: org.daisy.pipeline, org.daisy.pipeline.core
...
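Since a bundle manifest uses the standard jar manifest syntax, its headers can be inspected with the `java.util.jar.Manifest` class of the JDK. The sketch below reads a subset of the headers shown above from an in-memory string. The `BundleManifestReader` class is hypothetical, and the comma split is deliberately naive: an actual OSGi runtime also interprets attributes and directives such as version ranges, which this example does not attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class BundleManifestReader {

    // Subset of the example manifest; the trailing newline is required.
    static final String SAMPLE =
        "Manifest-Version: 1.0\n"
      + "Bundle-ManifestVersion: 2\n"
      + "Bundle-SymbolicName: org.daisy.pipeline\n"
      + "Bundle-Version: 1.0.3\n"
      + "Import-Package: org.daisy.util\n"
      + "Export-Package: org.daisy.pipeline, org.daisy.pipeline.core\n";

    // Parses the manifest text and returns its main section headers.
    static Attributes read(String manifestText) {
        try {
            Manifest manifest = new Manifest(
                new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            return manifest.getMainAttributes();
        } catch (IOException e) {
            throw new IllegalArgumentException("Malformed manifest", e);
        }
    }

    // Naive split of a package list header; returns an empty list if absent.
    static List<String> packages(Attributes attrs, String header) {
        List<String> result = new ArrayList<>();
        String value = attrs.getValue(header);
        if (value != null) {
            for (String name : value.split(",")) {
                result.add(name.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Attributes attrs = read(SAMPLE);
        System.out.println(attrs.getValue("Bundle-SymbolicName")
            + " " + attrs.getValue("Bundle-Version"));
        System.out.println("imports " + packages(attrs, "Import-Package"));
        System.out.println("exports " + packages(attrs, "Export-Package"));
    }
}
```

The import and export lists are what allows a runtime to decide, before loading any class, whether the dependencies of a bundle can be satisfied by the bundles already installed.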
Additionally, a bundle may provide services that it publishes to a system-wide registry, and can conversely subscribe to services provided by other bundles. Technically, services are contracts described as Java interfaces and implemented by concrete classes.
          __________________
         | Service Registry |
         |__________________|
              ^          ^
             /            \
   publish  /              \  discover
           /                \
          v                  v
 __________________    ___________________
| Service Provider |  | Service Requester |
|__________________|  |___________________|
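The publish and discover interactions of this diagram can be mimicked in plain Java. The following is a toy in-memory registry, not the OSGi API: in a real OSGi runtime, bundles register and look up services through the framework's `BundleContext`. The `ToyServiceRegistry` class and the `Narrator` contract are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy registry keyed by service contract (a Java interface).
class ToyServiceRegistry {
    private final Map<Class<?>, List<Object>> services = new ConcurrentHashMap<>();

    // Called by a service provider to make an implementation available.
    <T> void publish(Class<T> contract, T implementation) {
        services.computeIfAbsent(contract, k -> new CopyOnWriteArrayList<>())
                .add(implementation);
    }

    // Called when a provider goes away, e.g. when its bundle is stopped.
    <T> void withdraw(Class<T> contract, T implementation) {
        List<Object> providers = services.get(contract);
        if (providers != null) {
            providers.remove(implementation);
        }
    }

    // Called by a service requester; returns the first provider, or null.
    <T> T discover(Class<T> contract) {
        List<Object> providers = services.get(contract);
        if (providers == null || providers.isEmpty()) {
            return null;
        }
        return contract.cast(providers.get(0));
    }
}

// Hypothetical service contract, described as a Java interface.
interface Narrator {
    String narrate(String text);
}

class RegistryDemo {
    public static void main(String[] args) {
        ToyServiceRegistry registry = new ToyServiceRegistry();
        Narrator tts = text -> "[spoken] " + text;

        registry.publish(Narrator.class, tts);              // provider side
        Narrator found = registry.discover(Narrator.class); // requester side
        System.out.println(found.narrate("Chapter 1"));

        registry.withdraw(Narrator.class, tts);
        System.out.println(registry.discover(Narrator.class)); // no provider left
    }
}
```

The requester depends only on the `Narrator` interface and never on the class that implements it, which is what lets providers come and go at runtime.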
This OSGi component model, combined with its service-oriented architecture and managed by the OSGi runtime, offers many benefits:
- a strongly dynamic nature: OSGi precisely defines the bundle life cycle, how bundles can be installed, started, stopped, updated, and uninstalled without bringing down the whole system, and how services can be registered and consumed at runtime.
- facilitated reusability: the dependency model promotes the reuse and sharing of libraries as bundles.
- great adaptability: the dynamic service model allows a service consumer to choose between several alternative implementations of the same piece of functionality.
- overall operational control: a new system can be configured by merely associating several existing bundles; the dynamic runtime and dependency model ensure easy deployment and great extensibility of such a system.
Several lightweight and fast implementations of the OSGi Service Platform Release 4 are currently available, such as the open source Eclipse Equinox project 13 and Apache Felix project 14. The use of OSGi is widespread across several heavyweight Java software communities such as Eclipse and Spring, and it also serves as the basis of an increasing number of server-side Java platforms (GlassFish v3, JBoss, BEA WebLogic, JOnAS).
The DAISY Pipeline project already relies on the OSGi component model for its Eclipse RCP-based desktop graphical user interface (the Pipeline GUI). The Pipeline Next Generation could leverage the Pipeline legacy Java code base and the OSGi platform to implement a highly modular and dynamic software framework: imagine the flexibility of a system where every transformer (XProc step?) would be defined as a service in an OSGi bundle, dynamically loaded when required or after a software update...
Deployment Scenarios
Based on the technologies previously introduced, we can imagine several scenarios for the deployment of the Pipeline Next Generation system.
A platform neutral core with several parallel implementations
Note: the diagram above features both Java and .Net implementations but can be generalized to any other potential platform (e.g. Python)
In this scenario, the Pipeline Next Generation consists of a platform-neutral core which allows the definition of scripts executable by several engine implementations. The platform-neutral core relies on the XProc syntax augmented with the Pipeline extensions, and on a set of steps coming from the XProc standard step library and the Pipeline custom step library. Many steps are already implemented with (platform-neutral) XSLT 2.0. Each implementation, whatever platform it is based on (Java, .Net, Python, etc), must feature a genuine XProc engine, must support the Pipeline-extended syntax, and must provide implementations of the Pipeline custom steps (e.g. TTS narration, audio encoding, etc). When these requirements are met, the implementations can execute the Pipeline XProc scripts with consistent results.
This approach lets organizations introduce and implement the Pipeline in their infrastructure without necessarily having to migrate or bridge from their in-house preferred platform (.Net, Java, etc), at the cost of duplicating the implementation effort.
A main implementation manipulated through built-in bridge support
Note: the diagram above features a bridge to a .Net system but can be generalized to any other potential platform (e.g. Python)
In this scenario, the Pipeline Next Generation consists of a main Java-based reference implementation, covering the platform-neutral XProc syntax with the Pipeline extensions, the step libraries, and the underlying runtime engine. The implementation is based on OSGi and is made up of several bundles that can be installed or removed to adjust the system feature set. On top of this main system, the reference implementation can provide several built-in interfaces to the outside world: the native Java API, a commercial Java/.Net bridged API, and a web service layer (with a RESTful API and/or a WS-* API). Organizations can choose their preferred interface to access and manipulate the Pipeline functionality.
This approach lets organizations reuse a single reference implementation of the Pipeline from their in-house preferred platform (.Net, Java, Python, etc), possibly at the cost of coarser-grained control over the functionality.
Further Reading
1. DAISY Pipeline Original System Requirements
2. DAISY Pipeline Original High Level Requirements Final Version
3. DAISY Pipeline Original Design and Implementation constraints
4. The PipeOnline Project
5. XSLT 2.0 - W3C Recommendation
6. XPath 2.0 - W3C Recommendation
7. XProc: An XML Pipeline Language - W3C Working Draft
8. XML Processing Model Requirements and Use Cases
9. Discovering XProc, IBM Article from James R. Fuller
10. The OSGi Service Platform Release 4
11. The OSGi Alliance Home
12. Overview of the OSGi Technology
13. Eclipse Equinox
14. Apache Felix
Attachments
- parallel-implems.jpg (57.4 KB), added by romain-deltour 3 years ago: new diagram created by Greg Kearney
- bridge-support.jpg (47.6 KB), added by romain-deltour 3 years ago: new diagram created by Greg Kearney


