Talk MDP: Modular toolkit for Data Processing (and its new features)

Presented by Niko Wilbert in Scientific track 2010 on 2010/07/10 from 15:00 to 15:15 in room Dussane
authors: Pietro Berkes, Rike-Benjamin Schuppner, Niko Wilbert, Tiziano Zito

The Modular toolkit for Data Processing (MDP) is a library of widely used data processing algorithms, together with a framework for combining them according to a pipeline analogy. This makes it possible to build complex data processing software in a modular fashion. MDP also helps users implement new algorithms, which can then be used in the framework. The modularity of MDP has enabled the addition of new capabilities, like hierarchical networks and parallelism. MDP is also designed to be embedded in or used by other libraries (current examples are PyMVPA, PyMCA, and ORGANIC).

The core of MDP is a collection of supervised and unsupervised learning algorithms and other data processing units, encapsulated in nodes with a standardized interface. The nodes can be combined into data processing sequences (flows). Given a set of input data, MDP takes care of successively training or executing all nodes in the flow. This allows the user to construct complex algorithms as a series of simpler data processing steps. MDP also provides training options such as training in chunks, which reduces memory consumption for large data sets. The set of available algorithms is steadily growing and includes, to name only the most common, Principal Component Analysis, several Independent Component Analysis algorithms, Slow Feature Analysis, Gaussian Classifiers, Restricted Boltzmann Machines, and Locally Linear Embedding.

We have been expanding this basic core in various directions, and MDP has gained several subpackages that extend the framework. In this presentation, I'm going to explore these extensions in depth (especially the ones added since the last EuroSciPy), including real-life scientific examples.
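
The node and flow idea described above can be illustrated with a minimal, standalone sketch. It is plain Python and deliberately does not use MDP itself: the class names (MeanRemovalNode, ScalingNode, Flow) and their methods are hypothetical, chosen only to show how nodes with a common interface might be trained one after another, in chunks, and then executed as a pipeline.

```python
class MeanRemovalNode:
    """Hypothetical node: learns the mean of its training data, then subtracts it."""

    def __init__(self):
        self._total = 0.0
        self._count = 0
        self.mean = None

    def train(self, chunk):
        # May be called once per chunk, so the full data set never has to be in memory.
        self._total += sum(chunk)
        self._count += len(chunk)

    def stop_training(self):
        self.mean = self._total / self._count

    def execute(self, data):
        return [x - self.mean for x in data]


class ScalingNode:
    """Hypothetical node: learns the largest absolute value, then divides by it."""

    def __init__(self):
        self.scale = 0.0

    def train(self, chunk):
        self.scale = max([self.scale] + [abs(x) for x in chunk])

    def stop_training(self):
        if self.scale == 0.0:
            self.scale = 1.0  # avoid division by zero for all-zero training data

    def execute(self, data):
        return [x / self.scale for x in data]


class Flow:
    """Hypothetical flow: trains its nodes one after another, then runs them in sequence."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def train(self, chunks):
        chunks = [list(c) for c in chunks]
        for i, node in enumerate(self.nodes):
            for chunk in chunks:
                # Each chunk first passes through the nodes that are already trained.
                for trained in self.nodes[:i]:
                    chunk = trained.execute(chunk)
                node.train(chunk)
            node.stop_training()

    def execute(self, data):
        for node in self.nodes:
            data = node.execute(data)
        return data


flow = Flow([MeanRemovalNode(), ScalingNode()])
flow.train([[1.0, 2.0], [3.0, 6.0]])   # two chunks of training data
result = flow.execute([1.0, 3.0, 6.0])
```

In this sketch the second node is trained on data that has already been centered by the first one, which is the "successive training" behavior the abstract refers to.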

  • The hinet subpackage adds components for the construction of hierarchical networks. This has been used, for instance, to create large multi-layer networks for object recognition. Due to its modular nature, one can implement general feed-forward graphs with this package. For the visualization of complex networks, MDP offers automatic generation of HTML-based representations (to be viewed in a browser or embedded into a custom GUI).
  • The parallel subpackage offers a parallel implementation of basic nodes and flows, enabling convenient parallelization for embarrassingly parallel problems. The parallelization can utilize multiple cores or machines, and requires practically no code changes on the user's end. MDP only depends on an abstract scheduler interface, making it possible to use custom scheduler solutions if required. In addition to our own thread- and process-based MDP schedulers, we offer an adapter for the Parallel Python library, and more adapters might be added in the future (e.g., using a message passing framework via mpi4py).
  • Adding new capabilities to nodes (e.g., for parallelization) generally requires that the new code is somehow injected into the existing classes. Normal inheritance does not scale well for this, and can quickly create an inheritance nightmare. Therefore, MDP introduced a node extension mechanism, so that new capabilities can be implemented in one place and enabled as needed. This mechanism is based on ideas from aspect-oriented programming and enables users to add entirely new capabilities to MDP nodes if needed.
  • The most recent and largest addition to MDP is the BiMDP package, which extends the purely feed-forward flow processing of MDP with bidirectional data transfer. This makes it possible to use the MDP framework for a much larger class of algorithms, like neural networks with backpropagation or deep belief networks. The hinet and parallel subpackages are also compatible with BiMDP. Bidirectional data flow in complex networks can be painful to debug, and therefore BiMDP includes a browser-based visual inspector to step through the flow. The inspector can be customized by users and is useful for visualization as well (e.g., to show plots of intermediate data).
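
The node extension mechanism mentioned in the list above can be pictured with a small, standalone sketch. This is not MDP's implementation: the registry, the function names (extension_method, activate_extension, deactivate_extension) and the example classes are assumptions made for illustration. The sketch only shows the general idea of defining a capability in one place and injecting it into existing classes on demand, instead of subclassing each node.

```python
# Illustrative registry, not MDP code.
_EXTENSIONS = {}  # extension name -> list of (class, attribute name, function)
_ACTIVE = {}      # extension name -> list of (class, attribute name, had_original, original)


def extension_method(ext_name, cls):
    """Register the decorated function as a method of cls for the given extension."""
    def decorator(func):
        _EXTENSIONS.setdefault(ext_name, []).append((cls, func.__name__, func))
        return func
    return decorator


def activate_extension(ext_name):
    """Inject all registered methods into their classes, remembering what they replace."""
    if ext_name in _ACTIVE:
        return
    saved = []
    for cls, attr, func in _EXTENSIONS.get(ext_name, []):
        had_original = attr in cls.__dict__
        saved.append((cls, attr, had_original, cls.__dict__.get(attr)))
        setattr(cls, attr, func)
    _ACTIVE[ext_name] = saved


def deactivate_extension(ext_name):
    """Remove the injected methods and restore any attributes they replaced."""
    for cls, attr, had_original, original in reversed(_ACTIVE.pop(ext_name, [])):
        if had_original:
            setattr(cls, attr, original)
        else:
            delattr(cls, attr)


class Node:
    def execute(self, data):
        return list(data)


class DoublingNode(Node):
    def execute(self, data):
        return [2 * x for x in data]


# The new capability is written in one place, separate from the node classes.
@extension_method("describe", Node)
def describe(self):
    return "generic node"


@extension_method("describe", DoublingNode)
def describe(self):
    return "doubles its input"


node = DoublingNode()
activate_extension("describe")
description = node.describe()        # the capability exists only while active
deactivate_extension("describe")
```

Because the extra methods are attached and removed at runtime, the node classes themselves stay unchanged, and several such capabilities could be combined without multiplying subclasses.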