Reflections on Big Data, Bioinformatics and the recent UCT/UWC workshop

Monday and Tuesday of this week were largely consumed by a focus on Big Data. First, Ton Engbersen from the IBM/ASTRON Centre for Exascale Technology presented a talk at UCT on microservers, data gravity and microclouds.

The microserver in question is the [DOME](microserver-freescale/), a system-on-chip based device that crams 128 computers into a 19″ 2U rack drawer. Each computer takes up 13 cm x 5 cm (and is 6 mm thick) and provides 12 PowerPC cores with up to 48 GB RAM, resulting in a 2U drawer with 1536 cores and over 6 TB of RAM. The whole thing is cooled with warm water, an IBM innovation that is currently in use on the SuperMUC supercomputer at the Leibniz Supercomputing Centre near Munich; you can read more about its benefits on their page.
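For the record, the headline figures follow directly from the per-board numbers. A quick back-of-the-envelope check (the per-board specs are the ones quoted above; the rest is arithmetic):

```python
# Sanity check of the DOME 2U drawer figures quoted above: 128 boards,
# each with 12 PowerPC cores and up to 48 GB of RAM.
BOARDS_PER_DRAWER = 128
CORES_PER_BOARD = 12
RAM_GB_PER_BOARD = 48

total_cores = BOARDS_PER_DRAWER * CORES_PER_BOARD            # 1536 cores
total_ram_tb = BOARDS_PER_DRAWER * RAM_GB_PER_BOARD / 1024   # ~6 TB

print(f"{total_cores} cores and {total_ram_tb:.1f} TB of RAM per 2U drawer")
```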

The DOME server is being developed to analyse data from the SKA, an exascale computing problem. The SKA is anticipated to generate between 300 and 1500 petabytes of data per year, putting it at the extreme end of scientific enterprises in terms of data volume. While big data is commonly associated with data volume, researchers at IBM identify four V’s of big data: volume, velocity, variety and veracity. Volume is straightforward. Velocity speaks to the rate at which new data appears. With the amount of sequence data available in GenBank growing at an exponential rate, both the volume and velocity of data threaten to outstrip the ability of bioinformatics centres to analyse it. When it comes to integrating data, however, my presentation on the big data of tuberculosis focussed more on the variety and veracity of available data. A survey of the data published alongside research articles in the field shows that much of the variety of data gleaned through bioinformatics experiments is lost or only retained in closed institutional databases (and thus effectively lost to the field). An overview of health data collected as part of the NIH-funded Center for Predictive Computational Phenotyping illustrates the problem of data veracity: electronic health records for patients are often incomplete and lack the vocabulary researchers require to identify disease presence or progression.
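To make the velocity point concrete, here is a purely illustrative sketch of how exponential growth outruns a fixed analysis capacity. The doubling time, archive size and throughput figures are assumptions chosen for the sketch, not measured GenBank or SANBI numbers.

```python
# Illustration only: exponential data growth vs. a fixed analysis capacity.
# All numbers below are assumptions for the sketch, not measured figures.
doubling_time_years = 1.5        # assumed doubling time for the archive
archive_tb = 100.0               # hypothetical archive size today (TB)
analysis_tb_per_year = 150.0     # hypothetical fixed analysis throughput

for year in range(0, 11, 2):
    size = archive_tb * 2 ** (year / doubling_time_years)
    analysed = analysis_tb_per_year * year
    backlog = max(0.0, size - analysed)
    print(f"year {year:2d}: archive {size:7.0f} TB, unanalysed backlog {backlog:7.0f} TB")
```

Under these assumptions the fixed capacity keeps pace for the first few years, after which the unanalysed backlog grows without bound.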

Managing the data collections necessary to study e.g. the global state of TB prevalence and treatment will require digital curation of multiple datasets drawn from experiments performed in a range of domains. As Ton Engbersen pointed out, the growing size of data means that “compared to the cost of moving bytes around, everything else is free” (originally a Jim Gray quote). Add to this the (much more tractable) fact that the skills required to build data stores and curate these datasets are unevenly distributed, and data collections are set to become “the new oil”. Engbersen proposes a solution: micro-clouds that offer the possibility of moving code to the data rather than the other way round. Such entities would require a sophisticated cross-institutional authentication framework – almost certainly built on digital certificates – to allow authorised software agents to interface with data. This immediately suggests a set of research priorities to add to SANBI’s existing research projects on data storage and data movement. Luckily this research overlaps with some research interests at UWC Computer Science.
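As a thought experiment, the sketch below shows what “moving code to the data” might look like from a researcher’s side: a containerised analysis job is submitted to a hypothetical micro-cloud API at the site hosting the data, authenticated with an institutional client certificate (mutual TLS). The endpoint, job format, container image and certificate paths are all invented for illustration; no such service exists yet.

```python
# Sketch: submit a containerised job to a (hypothetical) micro-cloud API that
# runs the code next to the data, authenticating with a client certificate.
import requests

MICROCLOUD_API = "https://data-site.example.org/api/v1/jobs"  # hypothetical endpoint

job_spec = {
    "image": "quay.io/example/tb-variant-calling:1.0",   # the code is shipped to the data
    "dataset": "tb-clinical-isolates-2015",              # the data stays where it is
    "command": ["snippy", "--ref", "H37Rv.fa", "--reads", "/data/*.fastq.gz"],
}

response = requests.post(
    MICROCLOUD_API,
    json=job_spec,
    cert=("/etc/ssl/researcher.crt", "/etc/ssl/researcher.key"),  # institutional client certificate
    verify="/etc/ssl/microcloud-ca.pem",                          # trust the data site's CA
    timeout=30,
)
response.raise_for_status()
print("job accepted:", response.json())
```

The point of the sketch is the shape of the interaction – certificate-based authorisation plus a small job description – rather than any particular API.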

The workshop concluded with some agreements between UCT and UWC to collaborate on big data, but the perspectives delivered show that there is much more at play than the SKA. The fact that both UWC and UCT have established bioinformatics expertise and sit on the SANReN backbone means that there’s an immediate opportunity to share knowledge and experiments on projects that tackle all four V’s of big data. Lots of ideas… the coming year will see how they can be put into practice.