Talk scikits.learn, machine learning in Python

Presented by Fabian Pedregosa in Advanced tutorial track 2010 in room Dussane
Abstract


Introduction

scikits.learn is a Python module integrating classical machine learning algorithms, providing simple and efficient solutions to various learning problems that are accessible to everybody and reusable in various contexts.

This module is mainly written in Python and makes extensive use of the NumPy/SciPy API for its numerical computations. Cython is also used to bind legacy C code and to speed up critical algorithms.

The project was started in 2007 by David Cournapeau, and has since had various maintainers and dozens of contributions. In recent months there has been renewed interest in the project, with a major redesign and new contributions.

Overview

In this talk we present an overview of classical machine learning problems (handwriting recognition, image denoising, clustering, etc.) and how they have shaped a pragmatic API that responds to common use cases. We present some concrete examples and show how to solve them using scikits.learn.
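As a rough illustration of what a pragmatic learning API can look like, the sketch below implements a toy classifier with separate training and prediction steps. It is written in plain Python and is not taken from scikits.learn: the class name NearestCentroid, the method names fit and predict, and the sample data are assumptions made for this example only.

```python
# Illustrative only: a toy estimator with separate fit and predict steps.
# NearestCentroid is a hypothetical class written for this sketch,
# not code from scikits.learn.

class NearestCentroid:
    """Classify a sample by the closest class mean (squared Euclidean distance)."""

    def fit(self, X, y):
        # Group the training samples by label.
        groups = {}
        for sample, label in zip(X, y):
            groups.setdefault(label, []).append(sample)
        # Store one centroid (per-feature mean) for each label.
        self.centroids_ = {
            label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in groups.items()
        }
        return self  # returning self allows NearestCentroid().fit(X, y)

    def predict(self, X):
        def sq_dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

        # For each sample, pick the label whose centroid is nearest.
        return [
            min(self.centroids_,
                key=lambda label: sq_dist(sample, self.centroids_[label]))
            for sample in X
        ]


# Made-up data: two well-separated groups in two dimensions.
X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y_train = ["a", "a", "b", "b"]

clf = NearestCentroid().fit(X_train, y_train)
predictions = clf.predict([[0.1, 0.0], [1.0, 0.9]])
```

Keeping training and prediction as two small methods on one object means different algorithms can be swapped in without changing the calling code.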

Some machine learning problems (classification, regression) have robust algorithms implemented in scikits.learn, while others (most notably unsupervised problems) are still very much a work in progress. We give a brief overview of the algorithms implemented and the planned roadmap for the coming year.

Challenges

Machine learning algorithms present several challenges. First of all, as happens with most scientific Python packages, a pure Python module would not meet our speed requirements, so some parts must be coded in C. Along the way we faced several problems: portability, interfacing with legacy C code and linking to optimized BLAS/LAPACK, to mention a few.

Also, the input data is often sparse, so sparse models must be used, which requires the algorithms to work with both sparse and dense data.
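The requirement to handle both kinds of input can be sketched with a single function that accepts either representation. This is a minimal standard-library illustration, not the approach used in scikits.learn: the function name dot and the choice of a dict mapping feature index to value as the sparse format are assumptions made for this example.

```python
# Illustrative only: one routine that accepts dense or sparse input.
# The dict-based sparse format is a stand-in chosen for this sketch.

def dot(weights, x):
    """Dot product of dense weights with x, where x is dense (list) or sparse (dict)."""
    if isinstance(x, dict):
        # Sparse input: only the stored non-zero entries contribute.
        return sum(weights[i] * value for i, value in x.items())
    # Dense input: visit every feature, including the zeros.
    return sum(w * value for w, value in zip(weights, x))


weights = [0.5, -1.0, 2.0, 0.0]
dense_x = [0.0, 3.0, 0.0, 4.0]   # every feature stored
sparse_x = {1: 3.0, 3: 4.0}      # same vector, non-zero entries only

dense_result = dot(weights, dense_x)
sparse_result = dot(weights, sparse_x)
```

Both calls return the same value, but the sparse path does work proportional to the number of non-zero entries rather than to the total number of features.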

At other times, the calculations are so heavy that high-level parallelization is needed, which forces us to blend well with Python parallelization modules, notably Parallel Python. In such cases, special care must be taken when writing modules and interfacing with C code.

We discuss our solutions and give an overview of current trends: mixing parallelization strategies (OpenMP and Parallel Python), cross-compilation and Cython optimization.

Other packages

scikits.learn is not the only package for machine learning in Python, so we briefly present similar Python software and compare it with our package in terms of features and speed.