Biological research is becoming increasingly data intensive. DNA sequencers are now capable of reading thousands of millions of bases in a single run, microarrays can measure the expression of tens of thousands of genes, marker systems can interrogate tens of thousands of polymorphisms and the latest automated phenotyping systems can screen thousands of plants with little human intervention.
The internet allows international groups to share this data openly, and the resulting data overload can be a significant headache for researchers. While bioinformatics cannot provide an easy answer to this problem, it can promote quality research by making data accessible for integration and interrogation.
The University of Melbourne Node
The Melbourne node develops software and techniques to support the Metabolomics and Proteomics research programmes and to integrate their results with Genomics and Transcriptomics datasets within the ACPFG. This work is strengthened by a close collaboration with Metabolomics Australia.
KNIME extensions for bioinformatics: we have more than ten software extensions for the KNIME data analysis platform. They support the analysis of protein and metabolite samples and streamline the processing and scaling of data analysis. The extensions integrate European Bioinformatics Institute (EBI) web services, Smith-Waterman alignment, BioJava and other tools into the platform. Development is ongoing.
Drought Comparator websites: Built using industry-standard R and Apache website technology, these sites provide visualisation of the Metabolomics and Proteomics drought results via the ACPFG intranet.
Virtualized Appliances: specialised tasks often need a separate computer to avoid conflicts with existing facilities. Virtualisation now makes this possible without purchasing an additional computer. Numerous virtual appliances have been created for specific research projects, among them:
1. Metabolomics Australia data analysis using multi-variate statistical analysis via Unscrambler, AnalyzerPro and Agilent GeneSpring
2. Molecular modelling, e.g. 3D visualisation
3. Conversion of Analyst® data files to open-format for use with spectral identification
4. Fully configured and operational Trans-Proteomic Pipeline (TPP) appliance
5. Cost-reduction using central network-based desktops for common software packages
IT Infrastructure: Proteomics has received a significant boost through four new instruments. For the first time, the data analysis can support targeted studies of proteins/peptides within a mixture. To maximise available quantitation data, we perform spectral identification using the TPP in addition to industry-standard sequence database searching using MatrixScience Mascot® and X!Tandem.
The University of Adelaide node
Bioinformatics software development
The various -omics platforms in common use in present-day biotechnology generate vast amounts of data. The analysis of this data is expedited by the construction of specialized software tools.
As part of the GrassKin project, we have been developing algorithms for clustering genes based on the similarity of their sequences. These algorithms are specifically designed for situations, commonly encountered in crop science, where only incomplete knowledge of gene sequences is available.
This year, these algorithms have proved useful outside their intended application: they helped to identify individual homoeologs on Chromosome 7 of wheat and, when combined with simultaneous clustering of gene expression data, helped to identify genes potentially involved in the biosynthesis of (1,3;1,4)-β-D-glucan, a polysaccharide found in plant cell walls.
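The idea of clustering genes by sequence similarity when some sequences are incomplete can be illustrated with a minimal sketch. This is not the GrassKin algorithm, whose details are not given here. It assumes a k-mer containment score (shared k-mers divided by the size of the smaller k-mer set, so that a short fragment can still match a full-length gene) and single-linkage grouping. The function names, the value of k and the threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def kmers(seq, k=4):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Shared k-mers divided by the smaller set size.

    Normalising by the smaller set means a partial sequence is not
    penalised for the parts of the gene that were never sequenced.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_sequences(seqs, k=4, threshold=0.5):
    """Single-linkage clustering of named sequences (illustrative only).

    seqs maps a sequence name to its nucleotide string. Two sequences
    are linked when their containment score reaches the threshold, and
    linked sequences are merged with a small union-find structure.
    """
    names = list(seqs)
    profiles = {name: kmers(seqs[name], k) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if containment(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```

Called with a full-length sequence, a fragment of it and an unrelated sequence, the function would be expected to place the fragment with its parent gene and leave the unrelated sequence alone. A production method would also need to handle sequencing errors and reverse-complement matches, which this sketch ignores.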
The University of Queensland node
Genome sequencing technology
Through activities at UQ, the ACPFG has access to all three next-generation sequencing technologies: the Illumina Solexa, Roche 454 and Applied Biosystems SOLiD. Under ARC linkage funding, we are developing sequence assembly methods which will provide inexpensive sequencing of plant genomes.
We are applying this sequencing technology to gene discovery, expression analysis and physical genome mapping in support of several ACPFG projects.
A database hosting a large quantity of plant short paired read sequence data produced by the Illumina GAII system is available at http://flora.acpfg.com.au/tagdb. This data can be searched over the web to identify novel genes and gene promoters in crop plants including barley and hexaploid bread wheat.
Molecular marker discovery
The discovery of molecular genetic markers from available sequence data is the most cost-effective approach for high-resolution genetic studies. We have pioneered methods for SSR and SNP discovery from sequence data and continue to develop pipelines, databases and web interfaces for the discovery and annotation of markers from the latest high-throughput sequence data.
An initial database of annotated barley SNPs (autoSNPdb) has been produced and a manuscript describing this database has recently been published in Nucleic Acids Research. Additional databases have been produced for Brassica and rice, while a wheat database is in preparation.
Software has also been developed for the identification of polymorphic SSR markers, and this has been assessed and validated in barley. An SSR discovery tool (SSRPrimerII) has been developed and is available at: http://flora.acpfg.com.au/ssrprimer2.
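As a rough illustration of what SSR discovery from sequence data involves, the sketch below scans a nucleotide string for perfect tandem repeats using a regular expression. It is not the method used by SSRPrimerII. The function name, the unit-length range and the minimum repeat count are assumptions chosen for the example.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Find perfect simple sequence repeats (illustrative sketch).

    Returns a list of (start, unit, repeat_count) tuples, where start
    is the 0-based position of the first repeat unit.
    """
    seq = seq.upper()
    # Group 1 is the repeat unit; the lazy quantifier prefers the
    # shortest unit, so ATATATAT is reported as (AT)4, not (ATAT)2.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs such as AAAAAAAA.
            continue
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    example = "GGCTC" + "AG" * 6 + "TTCGA" + "CAT" * 5 + "GG"
    for start, unit, count in find_ssrs(example):
        print(start, unit, count)
```

Identifying polymorphic SSRs, as described above, would additionally require comparing repeat counts at the same locus across several sequences or varieties, which is beyond this sketch.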
BAC and Gene Annotator
To complement activities in gene and genome sequencing and towards the development of an integrated sequence management system for the ACPFG, a web-based BAC and Gene Annotator tool has been developed. The tool is available at http://flora.acpfg.com.au/bga.
Australian Wheat Pedigrees Database
The database system (http://gwis.lafs.uq.edu.au) delivers comprehensive datasets with Australian wheat pedigrees and ancestry information. It has been maintained and updated with historical information on Australian triticale and durum wheat pedigrees obtained from the CIMMYT (International Maize and Wheat Improvement Center) library in Mexico.
A wheat phenome atlas
Analysis techniques are being applied to phenotypic, marker and pedigree data generated for part of the CIMMYT spring wheat breeding program, with the aim of detecting marker-trait associations. The phenotypic data consisted of over 420,000 observations on 1445 trials in 400 locations on 599 lines for seven traits collected over 25 years. A total of 1447 polymorphic DArT markers have been applied in this study, and marker-trait associations have been detected.
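The basic idea behind a marker-trait association test can be shown with a small sketch. It assumes each DArT marker is scored as present (1) or absent (0) in each line, and it compares the mean trait value of the two groups with Welch's t statistic. This is a simplification chosen for illustration, not the analysis used in the study: it ignores pedigree, trial and location structure, and all names and data layouts are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two samples, or None if undefined."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    if standard_error == 0:
        return None
    return (mean(group_a) - mean(group_b)) / standard_error


def scan_markers(genotypes, trait):
    """Compute a t statistic per marker (illustrative sketch).

    genotypes maps marker name -> {line name: 0 or 1}
    trait maps line name -> trait value (e.g. a line mean over trials)
    Markers with fewer than two lines in either class are skipped.
    """
    results = {}
    for marker, calls in genotypes.items():
        present = [trait[l] for l, g in calls.items() if g == 1 and l in trait]
        absent = [trait[l] for l, g in calls.items() if g == 0 and l in trait]
        if len(present) < 2 or len(absent) < 2:
            continue
        t_value = welch_t(present, absent)
        if t_value is not None:
            results[marker] = t_value
    return results
```

A large absolute t value suggests that lines carrying the marker differ in the trait from lines lacking it. With over a thousand markers, any real scan would also need a correction for multiple testing and a model that accounts for relatedness between lines.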
Establishing bioinformatics hardware for the ACPFG
Five high performance computers were recently purchased to support bioinformatics activities at Adelaide and UQ. Two machines have been installed in Adelaide and web services have been implemented including SSRPrimerII, BAC and Gene Annotator and TAGdb. Further tools and applications are under development and will be mirrored at UQ.
The University of South Australia node
The University of South Australia node is focused on developing capabilities in two areas of importance to the ACPFG: plant phenomics and plant computational modelling.
Our work in plant phenomics supports the Plant Accelerator facility and will ensure that the ACPFG gains the maximum benefit from this facility.
Our work is focused on developing computational methods and tools to allow the reliable extraction of biologically-relevant measurements from the high-throughput image data generated by the Plant Accelerator and to allow these measurements to be used to identify genotypes of interest.
In plant computational modelling, we seek to integrate data from the multiple -omic platforms of the ACPFG to generate novel scientific results and to support and advance the various scientific programs of the ACPFG.