Talk New developments with Scikit-learn: machine learning in Python

Presented by Jaques Grobler in Scientific track 2012 on 2012/08/25 from 16:00 to 16:30

The scikit-learn library is a general-purpose toolkit for applying machine-learning algorithms in Python. Here we give an update on the latest additions to the library.

Overview of scikit-learn

Scikit-learn combines classic machine-learning algorithms with the existing scientific programming packages for Python, such as numpy, scipy and matplotlib. With its focus on ease of use, it aims to provide simple and efficient solutions to many learning problems, solutions that can be reused in various contexts. It can be used by people who do not come from a scientific-programming background, while remaining a versatile machine-learning tool for both science and engineering.

Given the breadth of machine learning and its applications, the scikit-learn package offers a wide array of tools that are easy to use and to customize to one's specific needs. It covers well-known supervised learning problems, with tools ranging from regressors and feature selection to discriminant analysis, as well as the clustering and signal-decomposition problems that fall under unsupervised learning. It also provides model-selection tools such as cross-validation, and allows estimators to be chained together in pipelines, giving users the freedom to experiment easily with combinations of algorithms on their problems. The package also gives access to several datasets for testing, tools for pre-processing data and much more.
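As an illustration of chaining estimators and cross-validating the result, here is a minimal sketch. The dataset (iris), the particular steps and their parameters are assumptions chosen for the example, not taken from the talk, and the import paths are those of recent scikit-learn releases (the cross-validation helpers lived in a different module at the time of the 0.9 to 0.11 releases).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One of the small datasets bundled with the package.
X, y = load_iris(return_X_y=True)

# Chain pre-processing, feature selection and a classifier into one estimator.
# The step names ("scale", "select", "classify") are arbitrary labels.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=2)),
    ("classify", SVC(kernel="linear")),
])

# 5-fold cross-validation refits the whole chain on each training split,
# so the scaling and feature selection never see the held-out fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy: %.3f" % scores.mean())
```

Because the pipeline behaves like any other estimator, swapping one step for another is a one-line change, which is what makes experimenting with combinations of algorithms cheap.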

An important feature of the package is its emphasis on being easy to understand: it aims to keep its interface simple and consistent. People can apply its algorithms to their problems without needing a thorough understanding of machine learning or scientific programming, as the documentation, interface and tutorials make the package easy to learn and use.

New developments

A valuable asset to the scientific-programming community, the library continues to grow and improve as many machine-learning enthusiasts get involved.

Specifically, we detail below the major features that scikit-learn has gained in the last year (releases 0.9, 0.10 and 0.11).

Non-linear prediction models

Such models are useful for complex prediction problems, but may need a large amount of training data. For this reason, they have been implemented in scikit-learn with special attention to computational efficiency. They are commonly used in settings such as computer vision.
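The abstract does not name the specific models, so the sketch below uses a random forest purely as one example of a non-linear estimator available in scikit-learn. The synthetic two-moons data and all parameter values are assumptions made for illustration. It compares a linear classifier with the non-linear one on data whose classes cannot be separated by a straight line.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two interleaving half-circles: not linearly separable.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear baseline and a non-linear model share the same fit/score interface.
linear = LogisticRegression().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train
)

linear_score = linear.score(X_test, y_test)
forest_score = forest.score(X_test, y_test)
print("linear: %.3f  forest: %.3f" % (linear_score, forest_score))
```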

Dealing with unlabeled instances
  • Semi-supervised learning: using unlabeled observations for better prediction (Label propagation).
  • Outlier/novelty detection: detect deviant observations.
  • Manifold learning: discover a non-linear low-dimensional structure in the data.
  • Clustering that scales to very large datasets using an online approach: fitting small portions of the data one after the other (Mini-batch K-means).
  • Dictionary learning: learning patterns in the data that represent it sparsely, so that each observation is a combination of a small number of patterns.

Sparse models: when very few descriptors are relevant
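A minimal sketch of the idea, using the Lasso as one example of a sparse linear model (the abstract does not say which estimators the talk covers). The synthetic data, in which only 3 of 50 descriptors carry signal, and the value of `alpha` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features = 100, 50
X = rng.randn(n_samples, n_features)

# Only the first three descriptors influence the target.
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.randn(n_samples)

# The L1 penalty drives the coefficients of irrelevant descriptors to zero.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("descriptors kept by the model:", selected)
```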

Below are a few examples from the above-mentioned developments.

For more information, see here