Talk mlpy - Machine Learning Py - A High-Performance Python/NumPy Based Package for Machine Learning

Presented by Bruno Kessler in Scientific Applications 2008 on 2008/07/27 from 14:15 to 15:00
Abstract
Obtaining honest performance estimates from a machine learning experiment usually requires completing a complex pipeline of simpler tasks. These steps can be organized inside a Data Analysis Protocol (DAP) tailored by the researcher to the problem under investigation, typically a predictive classification or regression task. As a very basic example, a binary classification experiment can be structured as a k-fold cross-validation with internal feature ranking performed at each split. We propose mlpy as an Open Source package collecting several modules that implement different flavours of the machine learning functions required in classification, feature-ranking and feature-list analysis experiments. In particular, mlpy provides high-level procedures which guarantee high modularity and ease of use. These features allow researchers, even those not particularly inclined to programming, to construct their own methodological procedure while still maintaining good computational efficiency. Although mlpy is suited for general-purpose machine learning tasks, its primary application field is bioinformatics and, in particular, the analysis of high-throughput data such as genomics and proteomics, where input data can easily reach thousands of samples described by up to one million features (e.g. SNP array data). Furthermore, the modularity can be used to alleviate the computational burden by distributing the processes on an HPC facility such as a cluster or a grid infrastructure. The modular structure of mlpy also makes it easy to add new algorithms in each category. The mlpy package makes intensive use of the NumPy module: its strong support for integration with C code has allowed us to implement the parts with the highest computational cost as internal C functions.
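The basic example mentioned in the abstract, a k-fold cross-validation with feature ranking performed inside each split, can be sketched in plain NumPy as follows. This is only an illustration of the protocol, not mlpy's API: the function names (`kfold_indices`, `rank_features`, `nearest_centroid_predict`, `run_dap`), the t-like ranking score and the nearest-centroid classifier are all assumptions chosen to keep the sketch short. The point it illustrates is that features are ranked on the training part of each split only, so the held-out samples never influence the selection.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def rank_features(X, y):
    """Return feature indices sorted by a simple t-like score, best first.

    Hypothetical ranking criterion: absolute difference of class means
    divided by the sum of class standard deviations.
    """
    a, b = X[y == 0], X[y == 1]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0))
    score /= a.std(axis=0) + b.std(axis=0) + 1e-12  # avoid division by zero
    return np.argsort(score)[::-1]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class (0 or 1) with the closer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def run_dap(X, y, k=5, n_top=10, seed=0):
    """k-fold cross-validation with feature ranking redone at every split.

    Assumes binary labels coded as 0/1 and both classes present in
    every training fold.
    """
    rng = np.random.default_rng(seed)
    accuracies, selected = [], []
    for train_idx, test_idx in kfold_indices(len(y), k, rng):
        # Rank on the training samples only, then keep the n_top features.
        top = rank_features(X[train_idx], y[train_idx])[:n_top]
        pred = nearest_centroid_predict(
            X[train_idx][:, top], y[train_idx], X[test_idx][:, top]
        )
        accuracies.append(np.mean(pred == y[test_idx]))
        selected.append(top)
    return np.array(accuracies), selected
```

Calling `run_dap(X, y, k=5, n_top=10)` on a samples-by-features array `X` with 0/1 labels `y` returns one accuracy per fold together with the feature indices selected in that fold; comparing the per-fold lists is a simple starting point for the kind of feature-list analysis the abstract refers to.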