PyPLN: a Natural Language Processing Pipeline in Python
Flávio C. Coelho, Renato Rocha Souza and Alvaro Justen
With the growing availability of large masses of unstructured textual data, created in many of today's technology-mediated human activities, natural language processing (NLP) techniques have seen extraordinary growth and now have many fields of application. As market demand for this kind of analysis grows, the development of new computational tools in NLP is shaped by specific demands.
Despite the availability of a myriad of commercial and very good quality free open source solutions (e.g., GATE, OpenNLP, UIMA), and even complete programming frameworks for NLP (e.g., NLTK for Python, NLP tasks for R), there is no accessible solution that combines the ease of use of Python with powerful and scalable implementations of common NLP tasks.
At the center of any text mining solution is the need to quickly put together custom distributed analytical workflows. Custom solutions are key because these tools are so widely applicable: for every new data set there is a large set of original analyses which can be proposed. Though the number of possible analytical scenarios is very large, they can be constructed by recombining relatively simple, smaller analytical steps. For example, hardly any textual analysis can be performed without some form of tokenization and part-of-speech tagging. Fortunately, the Python ecosystem already provides excellent libraries for this kind of analysis. One key example is NLTK, which provides most of these simpler analytical tools as building blocks for more complex analyses. Unfortunately, merely having a good library for doing NLP is not enough. The most interesting problems in NLP involve large volumes of data which can be neither stored nor processed on a single PC.
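The idea of building an analysis by recombining small steps can be illustrated with a minimal sketch. It uses only the Python standard library: the regular-expression tokenizer is a simplified stand-in for the tokenizers a library such as NLTK provides, and the function names (`tokenize`, `remove_stopwords`, `count_frequencies`, `compose`) and the tiny stopword list are illustrative assumptions, not part of PyPLN.

```python
import re
from collections import Counter
from functools import reduce


def tokenize(text):
    """Split text into lowercase word tokens (simplified regex stand-in)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of", "and", "is"})):
    """Drop very common words; the stopword list here is illustrative only."""
    return [token for token in tokens if token not in stopwords]


def count_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def compose(*steps):
    """Chain steps left to right, like commands joined by a Unix pipe."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


# A new analysis is just a new combination of the same small steps.
word_frequencies = compose(tokenize, remove_stopwords, count_frequencies)
result = word_frequencies("The cat and the hat: the cat is back.")
print(result.most_common(2))
```

Swapping one step for another (for instance, replacing `tokenize` with a library tokenizer) changes the analysis without touching the remaining steps.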
PyPLN is a distributed pipeline for text analysis. Applying the philosophy of Unix pipeline processing to distributed computing in a cluster environment, we provide a tool capable of handling large volumes of data while remaining simple to deploy and to program for customized analyses. PyPLN implements distributed processing by relying on ZeroMQ to handle communication across the cluster. Job specifications and data are farmed out to worker processes in the pipeline as JSON messages, which are load-balanced across the cluster. Every step of the pipeline can be parallelized as long as it applies to more than one document. Finer-grained parallelization is also possible through the implementation of custom workers. Workers are typically Python scripts, which can in turn call any executable or library available on the system. A common bottleneck in the analysis of large collections of documents is disk I/O. PyPLN relies on MongoDB to store all of its data and analytical results. Even a single instance of MongoDB proves much more performant than direct file system storage, but if even more speed is necessary the database can easily be replicated and/or sharded. Any collection of documents in the PyPLN database is full-text searchable through integration with the Sphinx search engine.
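The job-farming scheme described above can be sketched in a few lines. This is a single-machine illustration under stated assumptions, not PyPLN's actual implementation: a standard-library thread pool stands in for the ZeroMQ sockets and cluster nodes, and the message fields (`worker`, `document`, `result`), the worker registry, and all function names are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical workers: each takes a document dict and returns a result dict.
def tokenizer_worker(document):
    return {"tokens": document["text"].split()}


def word_count_worker(document):
    return {"word_count": len(document["text"].split())}


# Hypothetical registry mapping job names to worker functions.
WORKERS = {"tokenizer": tokenizer_worker, "word_count": word_count_worker}


def make_job(worker_name, document):
    """Serialize a job specification and its data as one JSON message."""
    return json.dumps({"worker": worker_name, "document": document})


def handle_job(message):
    """Decode a JSON job, run the requested worker, reply with JSON."""
    job = json.loads(message)
    result = WORKERS[job["worker"]](job["document"])
    return json.dumps({"id": job["document"]["id"],
                       "worker": job["worker"],
                       "result": result})


def run_step(worker_name, documents, pool_size=4):
    """Apply one pipeline step to many documents in parallel.

    The thread pool is an assumption standing in for cluster workers;
    each document becomes an independent message, so the step
    parallelizes whenever there is more than one document.
    """
    messages = [make_job(worker_name, doc) for doc in documents]
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        replies = list(pool.map(handle_job, messages))
    return [json.loads(reply) for reply in replies]


documents = [
    {"id": 1, "text": "PyPLN is a distributed pipeline"},
    {"id": 2, "text": "Workers receive JSON messages"},
]
counts = run_step("word_count", documents)
print(counts)
```

Because jobs and replies are plain JSON strings, the same `handle_job` logic could sit behind a network socket instead of a local pool; results would then be written to a document store rather than returned in memory.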
PyPLN also offers a web-based IDE for interactive text analysis, which can also be used to design and run analytical pipelines.
PyPLN is free software licensed under GPLv3.