The integration of genetics information with neuroimaging data promises to significantly improve our understanding of both normal and pathological variability of brain organization. It should lead to the development of biomarkers and in the future personalized medicine. Among other important steps, this endeavor requires the development of adapted statistical methods to extract significant correlations between variables provided by genotyping and brain imaging, and the development of the software components that will permit large computation to be done. In this work, we focus on this latest point, using scientific Python and MapReduce [dea08].
In current settings, neuroimaging-genetic datasets consist of a set of :
- genotyping measurements on genetic variables, such as Single Nucleotide Polymorphisms (SNPs), on the one hand, and
- quantitative level at given locations (voxels) in 3D images, that represent either the amount of functional activation in response to a certain task or an anatomical feature.
The standard approach for such studies is the Mass Univariate Linear Model, that considers each (SNP, voxel) pair independently and tests the significance of the correlation between these traits. With 50,000 voxels and 500,000 SNP, the number of comparisons raises to 25 billions. In the brain imaging domain, cluster-level analysis techniques have been proposed [hay03], where the statistical test deals with the size of supra-threshold regions. In the absence of accurate statistical model of the largest cluster size under the null hypothesis, this approach requires permutations (up to 104 ) to control the significance of the decision statistic. Traditional computational architectures that rely on popular analysis softwares (Plink or SPM) cannot manage this load in a reasonable amount of time. Working in a distributed context is necessary to deal with the memory and computational loads, and yields specific optimization strategies and choices :
- With permutation tests [and01] the most natural way of spliting the problem consists in distributing the computations according to permutations, but this causes inefficient access to data, because each sub-problem needs all the data. Furthermore, if we profile the execution of the Ordinary Least Square (OLS) regression, we show that taking voxels or SNPs one by one is inefficient (see Fig. 1).
- There is early sources of optimization with the statistical models, at linear algebra level for instance, like normalization and orthogonalization of the data matrices, that can simplify computation to matrix additions and products. And sometimes it is worth the cost to compute a time consuming intermediary result and to cache it, rather than doing many small computations.
- The same level of detail cannot be handled depending on whether few genes or the whole genome is considered. In our setting, keeping all the comparisons represents 2 PB in double precision. Correlations below a threshold can be discarded to report only the best in single precision.
- For our application, the MapReduce approach (see Fig. 2) provide a good solution in terms of scalability both for the calculation and for data management and can add some convenient features like resilience to failure and redundancy. And the most important point, is that the parallel aspects is hidden to lets us focus on performance of the sequential algorithms.
- Our Python code relies on proven softwares that deliver good performance, i.e. Numpy & Scipy, which use standard and optimized linear algebra libraries (Atlas or MKL) and are several order of magnitude faster than naive code and Joblib for efficient short-term persistence.
- We want to run computations on multiple infrastructures: clusters, grids and clouds. We choose Hadoop and batch on clusters and grids, and another team provides us TomusMapReduce + TomusBlobs [tud12] and expertise to run our MapReduce code in Python on Azure Cloud.
For the first time, we propose an efficient framework that can manage cluster-based inference in a brain-wide genome-wide association study. Compared to the only previous study [ste10], the map task achieve the computations thousands time faster. This framework already ran on clusters, grid'5000 and the Microsoft Azure cloud, therefore on Linux and Windows systems. With hindsight, trendy technologies are not always best suited. For instance, the difficulty to deploy and tune Hadoop is not justified by the gain in performance compared to batch jobs. And, data transfert and virtual machines deployment make usage of the cloud difficult and require additional knowledge.
Importantly, with a limited time budget to understand the problem, the data, the mathematical tools and to produce a efficient implementation, Python with NumPy provide a natural choice to achieve a good tradeoff between development time and computation time. Probably, with more time we could explore other optimization alleys like pyTables to only name one. But, here we have reach a satisfactory solution to the computational problem and can move on to the neuroimaging-genetic one. In particular, now that the problem is computationally tractable, we can explore various statistical methods.
|[and01]||M. J. Anderson and J. Robinson. Permutation tests for linear models, Australian and New Zealand Journal of Statistics, (43):75--88, 2001.|
|[ste10]||J. L. Stein et al. Voxelwise genome-wide association study (vgwas), Neuroimage, 53(3):1160--1174, Nov 2010.|
|[hay03]||S. Hayasaka and T. E. Nichols. Validating cluster size inference: random field and permutation methods, Neuroimage, 20(4):2343--2356, Dec 2003.|
|[dea08]||J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters, Commun. ACM, 51(1):107--113, Jan 2008.|
|[tud12]||R. Tudoran, A. Costan, G. Antoniu, H. Soncu TomusBlobs: Towards Communication-Efficient Storage for MapReduce Applications in Azure, 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'2012), 2012.|