(1, 2) Applied Mathematics School - Fundação Getúlio Vargas.
Law School - Fundação Getúlio Vargas.
Large collections of text data pose a substantial challenge: extracting the relevant bits of information to feed later statistical analysis and visualization pipelines.
The peculiarities of the knowledge domain of such a task often require the implementation of custom code. The expressivity of the Python language, combined with its vast ecosystem of packages for handling almost any task imaginable, made it the best tool for the job among the alternatives considered. Moreover, the high development productivity possible with Python was unmatched by any other solution we evaluated.
Our challenge started with the parsing of approximately 1.2 million legal case reports in HTML format, scraped from the Brazilian Supreme Court website. For this stage, urllib2 and Beautiful Soup were the main tools. The documents were parsed, segmented, and stored in multiple tables of a MariaDB database.
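The original pipeline ran on Python 2's urllib2 and Beautiful Soup; as a rough stdlib sketch of the segmentation step (the sample markup and field contents here are invented, not taken from the actual court reports), paragraph-level splitting of a fetched page could look like:

```python
from html.parser import HTMLParser


class ReportParser(HTMLParser):
    """Collect the text of every <p> block of a case report page."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.fragments.append("")  # start a new text fragment

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.fragments[-1] += data


# Invented sample page; the real documents were scraped with urllib2.
page = "<html><body><p>EMENTA: habeas corpus.</p><p>Relator: Min. X.</p></body></html>"
parser = ReportParser()
parser.feed(page)
fragments = [f.strip() for f in parser.fragments]
# Each fragment would then be inserted as a row in a MariaDB table.
```

Beautiful Soup makes the same traversal far more robust against the malformed markup typical of scraped pages, which is why it was preferred over hand-rolled parsing.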
The next big problem was to extract specific information, such as names, dates, locations, and legal citations, from the even larger collection of text fragments derived from the original documents. These bits would form the basis of the quantitative analysis to come. Here, the main tools were the re module and the MySQLdb module, as we took advantage of MySQLdb's ability to pipe highly optimized "raw" SQL queries to the MariaDB server. In fact, the lightness of the MySQLdb module allowed us to develop almost seamlessly in both Python and SQL, keeping the heavier data-reorganization tasks in the database and thereby maintaining the performance of our Python pipelines.
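As an illustration of this extraction stage, a minimal sketch with the re module follows; the two patterns and the sample fragment are invented for the example, and the real pipeline used a much larger battery of expressions:

```python
import re

# Hypothetical patterns: Brazilian-style dates (dd/mm/yyyy) and a few
# common Supreme Court case-type abbreviations followed by a number.
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")
CITATION_RE = re.compile(r"\b(?:HC|RE|ADI)\s*\d+\b")

fragment = "Julgado em 15/03/2007, conforme HC 82424 e RE 511961."

dates = DATE_RE.findall(fragment)          # list of (day, month, year) tuples
citations = CITATION_RE.findall(fragment)  # list of citation strings
```

The extracted tuples would then be written back to MariaDB with parameterized MySQLdb queries (e.g. `cursor.executemany("INSERT INTO citations VALUES (%s, %s)", rows)`), leaving joins and aggregations to the SQL side.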
Another branch of our analysis, still in development, is the application of natural language processing techniques to elicit and "understand" the "meaning" of some portions of the texts. For that we intend to make heavy use of tools such as the Natural Language Toolkit (NLTK) and full-text search engines such as Sphinx, which has a Python API and is integrated into MariaDB.
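As a rough stdlib stand-in for the kind of token-level analysis NLTK automates (the stopword list and the sample sentence are illustrative, not from the project corpus):

```python
import re
from collections import Counter

# A tiny, illustrative Portuguese stopword list; NLTK ships real ones.
STOPWORDS = {"a", "o", "de", "e", "em"}


def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


freq = Counter(tokens("O recurso foi provido e o acórdão foi reformado."))
```

Frequency profiles like this one are only a starting point; the NLTK side of the project aims at tagging, chunking, and similarity measures over the same fragments.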
On the visualization side, Python shone again. We have barely scratched the surface of what is possible when constructing visualizations programmatically from Python. We have been using Matplotlib for all summary plots of the data, such as line plots and histograms. Location-specific data was visualized using the xml.dom.minidom module to generate KML files. A large portion of the visualizations also involved building graphs (networks) connecting the various pieces of information extracted from the texts. Due to the size and complexity of our dataset, graphs with hundreds of thousands of nodes (and sometimes many more) were common; they were handled without trouble by the NetworkX package and visualized interactively with Ubigraph, controlled from Python via XML-RPC.
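As a sketch of the KML-generation step with xml.dom.minidom (element names follow the KML 2.2 schema; the placemark data below is invented, not from the case reports):

```python
from xml.dom.minidom import Document


def make_kml(placemarks):
    """Build a minimal KML document from (name, lon, lat) tuples."""
    doc = Document()
    kml = doc.createElement("kml")
    kml.setAttribute("xmlns", "http://www.opengis.net/kml/2.2")
    doc.appendChild(kml)
    container = doc.createElement("Document")
    kml.appendChild(container)
    for name, lon, lat in placemarks:
        pm = doc.createElement("Placemark")
        nm = doc.createElement("name")
        nm.appendChild(doc.createTextNode(name))
        pm.appendChild(nm)
        point = doc.createElement("Point")
        coords = doc.createElement("coordinates")
        # KML coordinates are longitude,latitude,altitude.
        coords.appendChild(doc.createTextNode(f"{lon},{lat},0"))
        point.appendChild(coords)
        pm.appendChild(point)
        container.appendChild(pm)
    return doc.toprettyxml(indent="  ")


kml_text = make_kml([("Brasília", -47.9292, -15.7801)])
# The resulting string can be saved as a .kml file and opened in Google Earth.
```

Building the DOM explicitly, rather than concatenating strings, guarantees well-formed XML even when placemark names contain characters that need escaping.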
In this talk we intend to detail the main challenges we have overcome through the use of Python tools, discuss the most useful tools applied, show some of the resulting visualizations, and discuss future directions for the project, which aims to make use of an even larger array of Python tools for text mining, data analysis, and visualization.