Scikits.learn (http://scikit-learn.sourceforge.net/) is a Python toolkit for machine learning that has gained a lot of popularity in recent months. In particular, it can be used for text mining and large-scale data mining.
CubicWeb (http://www.cubicweb.org/), on the other hand, is a Python-based framework for semantic web applications that has been used in various application fields (libraries, museums, conferences, intranet applications).
The aim of this talk is to present how these tools can be used together for semantic data mining of RSS feeds (clustering, prediction), and for building a news aggregator similar to Google News.
In a first step, we load part of DBPedia (http://dbpedia.org/About) into a database using CubicWeb. This gives easy and fast access from Python to the names and categories of a large number of entities described in Wikipedia (e.g. people, countries, ...).
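The lookup this database provides can be sketched with a plain Python mapping. This is only an illustration: in the real setup the data lives in a CubicWeb-managed database and is queried through RQL, and the labels and categories below are made-up examples.

```python
# Sketch of the entity lookup backed by DBPedia (hypothetical sample data;
# the real store is a CubicWeb database queried with RQL).
DBPEDIA_ENTITIES = {
    "pakistan": ("http://dbpedia.org/resource/Pakistan", "Country"),
    "osama bin laden": ("http://dbpedia.org/resource/Osama_bin_Laden", "Person"),
}

def lookup(label):
    """Return the (URI, category) pair for a known entity label, else None."""
    return DBPEDIA_ENTITIES.get(label.lower())
```

Any string that maps to a DBPedia URI is treated as a named entity; everything else is ignored.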
We use this database to extract named entities (i.e. entities that are recognized as part of Wikipedia and that can be characterized by a unique URI) from the articles of RSS feeds. This operation relies on the text extraction module of Scikits.learn and the text comparison functions of PostgreSQL. The named entities of an article several dozen lines long can be extracted in a few seconds. At the end of this step, we obtain a matrix X, where each row is an article and each column is a named entity with a unique URI in DBPedia/Wikipedia.
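The construction of X can be sketched as follows. This self-contained version matches entity labels with plain substring search instead of the Scikits.learn text module and PostgreSQL similarity search used in the real pipeline, and the articles and labels are toy examples.

```python
# Hedged sketch: build the articles-by-entities matrix X by matching known
# DBPedia entity labels inside each article (toy data, naive matching).
ENTITY_URIS = {
    "pakistan": "http://dbpedia.org/resource/Pakistan",
    "osama bin laden": "http://dbpedia.org/resource/Osama_bin_Laden",
}

def build_matrix(articles, entity_labels):
    """Return (X, columns): X[i][j] == 1 iff entity columns[j] occurs in article i."""
    columns = sorted(entity_labels)
    X = [[1 if label in article.lower() else 0 for label in columns]
         for article in articles]
    return X, columns

articles = [
    "Twin suicide bombings in Pakistan killed at least 80 people.",
    "The death of Osama bin Laden was announced in May.",
]
X, columns = build_matrix(articles, ENTITY_URIS)
```

Each column of X is tied to a DBPedia URI, so the features remain interpretable after clustering.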
This approach can be viewed as a robust feature extraction method for text mining, and it drastically reduces the dimensionality of the data. Indeed, with a commonly used tokenizer, hundreds of articles can yield more than 100,000 tokens, whereas our approach reduces the dimensionality to only a few thousand recognized entities.
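The dimensionality gain can be illustrated on a toy example. The sketch below compares a naive word tokenizer with entity-based features; the data and labels are invented for illustration, not drawn from the actual corpus.

```python
import re

# Toy corpus and a hypothetical set of known DBPedia labels.
ARTICLES = [
    "Twin suicide bombings in Pakistan killed at least 80 people.",
    "The death of Osama bin Laden triggered fears of retaliation.",
]
KNOWN_LABELS = {"pakistan", "osama bin laden"}

def token_features(articles):
    """Naive tokenizer: every distinct lowercase word becomes a feature."""
    return {w for a in articles for w in re.findall(r"[a-z]+", a.lower())}

def entity_features(articles, known_labels):
    """Entity features: only labels present in the DBPedia-backed store count."""
    return {l for l in known_labels if any(l in a.lower() for a in articles)}

n_tokens = len(token_features(ARTICLES))
n_entities = len(entity_features(ARTICLES, KNOWN_LABELS))
```

Even on two sentences the token vocabulary is an order of magnitude larger than the entity vocabulary; on real corpora the ratio is far more pronounced.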
Example of named entities recognition from http://www.washingtonpost.com/:
SHABQADA, Pakistan (http://dbpedia.org/resource/Pakistan) - Twin suicide bombings outside a paramilitary training center in Pakistan's (http://dbpedia.org/resource/Pakistan) northwest killed at least 80 people early Friday, in what appeared to be militants' first major retaliatory attack since the death of Osama bin Laden (http://dbpedia.org/resource/Osama_bin_Laden).
In a second step, we apply the hierarchical clustering of Scikits.learn to the data matrix X, resulting in a hierarchical tree of articles that are aggregated based on the presence of particular entities.
We do not set a specific number of clusters, as we are only interested in the hierarchical tree, which can later be explored through a Web interface. This allows us to focus on particular clusters of news that are well characterized by the entities shared by the articles of a given cluster. Moreover, keeping named entities as features is a good way to synthesize information while preserving the primary meaning of the text.
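The clustering step can be sketched as follows. Since the Scikits.learn hierarchical-clustering API of that era has since evolved, this sketch uses SciPy's stable `scipy.cluster.hierarchy` module instead; the toy matrix and the distance threshold are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy X: 4 articles x 3 named entities (binary occurrence matrix).
X = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
])

# Ward linkage builds the full hierarchical tree; no cluster count is needed.
tree = linkage(X, method="ward")

# For display only: cut the tree at an arbitrary distance threshold.
labels = fcluster(tree, t=1.0, criterion="distance")
```

The `tree` array encodes the whole merge hierarchy, which is what the Web interface exposes for exploration; cutting it is only done on demand.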
In a single day, more than 1500 articles can be received from a set of a dozen RSS feeds from English-language newspapers (e.g. http://feeds.bbci.co.uk, http://www.guardian.co.uk, …), yielding a matrix X with a typical size of (1500, 1000). The hierarchical clustering runs in a few seconds.
In conclusion, we present a new approach to aggregating news from RSS feeds that is based on the named entities listed by DBPedia. This method drastically reduces the dimensionality of the data while preserving the primary meaning of the text. The whole pipeline is implemented within a full Python framework, using Scikits.learn and CubicWeb.