Biotraceability in Food and Feed Chains - Course Descriptions

Learning from Data

Gary Barker (IFR), Judith Straver (RIVM) and Anders Madsen (HUGIN)
Data has become a dominant feature of modern microbiology: a single experiment now typically produces a very large and very complex dataset. Data analysis techniques have developed to cope with this data deluge, and simple modelling activities have been replaced by distributed searches, pattern discovery and computer learning.

Outline
The workshop will last 3 hours, including a half-hour break for refreshments, and will be centred on 4 half-hour worked examples of ‘learning from data’. Additionally there will be an introduction to terminology and a summary with indications of further approaches. The examples will closely follow the techniques employed in the BIOTRACER project to develop models from data supplied by domain experts. Subjects will include ‘turning numbers into beliefs’, ‘learning model parameters’ and ‘learning models as structures’. The presenters will include Judith Straver and Gary Barker (the leaders of WPs 1 and 5) and Anders Madsen from HUGIN.

Learning from data - a short introduction to mathematical methods relevant for biotracing
1 - Introduction
Understanding probability, interpreting data as beliefs and mining complex datasets
2 - Worked examples
Uncertainty associated with estimating a proportion, adapting a prior belief based on data, learning about patterns from a dataset and learning structures that represent causal beliefs. Each example will be presented as an actual calculation but, collectively, these will illustrate significant elements in a modelling process including parameter estimation, Bayesian statistics, algorithmics and model choices.
3 - Summary
A context for modelling in BIOTRACER and pointers to developing techniques

Overview
Integrated science projects like BIOTRACER generate huge amounts of data, in practice too much for simple quantification or interpretation. Modern data sets are generally complex, multivariate and uncertain on many levels. When multiple data sets from distinct sources are relevant, comparing and aligning them requires considerable effort and expertise. This data deluge has driven changes in mathematical modelling, so that the emphasis is now on extracting information and beliefs from large, relatively unstructured data sets rather than on describing or summarizing the statistical fluctuations in expected behaviours. In this workshop we will initially introduce concepts, such as the interpretation of a probability as a degree of belief and methods of addressing complexity, that contribute to the mathematical modelling used for biotracing.
The majority of the workshop will follow several worked examples, which might be relevant to biotracing, that illustrate some specific methods and techniques. We will explore the quantification of beliefs about a proportion, based on data from a simple sampling process, to illustrate Bayesian updating of prior information. We will extend this idea to illustrate the general adaptation of a probabilistic model from the observation of a set of cases. In many cases data sets from high-throughput experiments contain details that are beyond quantitative description, and we will illustrate the discovery of underlying patterns, which do not depend on details, using a simple example of classification. A final example will illustrate how, in complex systems, data alone can be used to establish plausible model structures, and appropriate relationships, without recourse to explicit modelling assumptions.
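As an indication of what Bayesian updating of a belief about a proportion can look like in practice, the following minimal sketch uses a Beta prior with binomial sampling data. It is not taken from the workshop material: the conjugate Beta-binomial model, the uniform prior, the sample counts and the function names are all assumptions chosen purely for illustration.

```python
from fractions import Fraction


def update_beta(alpha, beta, positives, negatives):
    """Return the Beta posterior parameters after observing binomial counts.

    With a Beta(alpha, beta) prior on a proportion and independent
    positive/negative observations, the posterior is
    Beta(alpha + positives, beta + negatives).
    """
    return alpha + positives, beta + negatives


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, as an exact fraction."""
    return Fraction(alpha, alpha + beta)


# Assumed prior: Beta(1, 1), i.e. uniform belief about the proportion.
prior = (1, 1)

# Hypothetical data: 3 positive samples among 20 tested.
posterior = update_beta(*prior, positives=3, negatives=17)
posterior_mean = beta_mean(*posterior)

# Updating in two batches (5 samples, then 15) gives the same belief,
# which is the sense in which a model adapts as cases are observed.
after_first_batch = update_beta(*prior, positives=1, negatives=4)
after_second_batch = update_beta(*after_first_batch, positives=2, negatives=13)

print("posterior parameters:", posterior)
print("posterior mean:", float(posterior_mean))
print("sequential update agrees:", after_second_batch == posterior)
```

The posterior mean lies between the prior mean (1/2) and the observed fraction (3/20), and moves towards the observed fraction as more samples are collected.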
Mathematical modelling in biology, and particularly in molecular biology, is developing very rapidly. The examples in this short course cover only a small part of this expansion, chosen for their particular relevance to the BIOTRACER project. In conclusion, the workshop will explore the context of ‘learning from data’ in the BIOTRACER project and will list some pointers for further consideration.

Pre-workshop Required Reading*
An elementary introduction to data, information and data analysis - Chapter 1 from "Intelligent Data Analysis" M. Berthold, D.J. Hand (Springer, Berlin, 1999)
An introduction to belief network methods used for learning from data - Chapter 8 from "Bayesian Networks and Influence Diagrams" U.B. Kjaerulff and A.L. Madsen (Springer, New York 2008)
An introduction to biotracing and appropriate quantifications - "An introduction to biotracing in food chain systems" G.C. Barker, N. Gomez, J. Smid Trends in Food Science and Technology http://dx.doi.org/10.1016/j.tifs.2009.03.002
*Required for PhD students.


Post-workshop Review Materials
"Machine learning and its applications to biology" - A.L. Tarca, V.J. Carey, X.W. Chen, R. Romero, S. Draghici PLoS Comput. Biol. 3(6): e116. doi:10.1371/journal.pcbi.0030116 (2007)
"Probabilistic modelling in Bioinformatics and Medical Informatics" - Eds. D. Husmeier, R. Dybowski, S. Roberts (Springer, London 2005)