ENBIS-11 in Coimbra

4 – 8 September 2011 Abstract submission: 1 January – 25 June 2011

Kernel Machines for Proteomics Data Analysis: Algorithms and Tools

6 September 2011, 15:25 – 15:45

Abstract

Submitted by
António Dourado
Authors
C. Pereira1,2, L. Morgado1, D. Correia2, P. Verissimo3, A. Dourado1
Affiliation
CISUC - Center for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Polo II University, 3030-290 Coimbra
Abstract
It is recognized that the field of Computational Biology still needs a better research framework, based on data mining and knowledge management, to guide its development. This need currently drives an active search both for new approaches and for innovative uses of existing techniques. Discriminative models such as kernel machines (KM), generative models such as Hidden Markov Models, traditional sequence similarity algorithms such as PSI-BLAST, and combinations of these methods have all been applied. Support vector machines (SVMs) in particular usually achieve good performance but require significant training time.

This paper gives a short review of kernel machines applied to proteomics, discussing relevant issues such as information and structure complexity, incremental learning and feature selection. It also presents the tools developed, including the most suitable machines and the algorithms for feature extraction and selection applied to enzyme detection.
This work is part of a larger project aimed at the development of efficient incremental kernel machines for biological data analysis, the BIOINK Project - Incremental Kernel Machines for Biological Data Analysis. This project has the following generic goals:
• Development of kernel algorithms specifically designed for proteomics, and comparison of the relative strengths of kernel and non-kernel methods on large-scale datasets;
• Development of efficient and adaptive feature selection methods;
• Fast implementation of kernel machines: improvement of the optimization algorithms and their distribution on a GRID or multi-core environment;
• Incorporation of text mining tools to retrieve clusters of biological relevance in response to a user query, and to present them in a hierarchical, logical manner;
• Sharing of the developed techniques through software applications.
One of the main goals of this project was to create a platform dedicated to enzyme analysis, from raw genome data analysis through to enzyme classification. Within that framework, a software application called PEPTILAB has been developed, including feature selection and SVM models specifically trained for peptidase detection and classification. A recursive feature elimination algorithm was applied and proved to be a useful technique for significantly reducing the number of features needed for an accurate result, thus allowing a faster and more efficient classifier. The PEPTILAB tool includes the following main functionalities: simple statistics (e.g. amino acid, nucleotide and codon frequencies), Open Reading Frame (ORF) search, conversion between nucleotide and protein representations, multiple alignment analysis, phylogenetic trees, estimation of protein physicochemical properties, peptidase detection and classification, protein 3D structure viewers, and access to online resources. A statistical validation facility was also integrated, making it possible to measure the performance of the prediction algorithms.
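The general idea of recursive feature elimination with a linear SVM can be illustrated with a minimal sketch. The abstract does not specify which variant PEPTILAB uses, so everything below is an assumption made for illustration: the function names, the toy data, and the hand-written linear SVM (trained by subgradient descent on the regularised hinge loss) are hypothetical and are not taken from the tool.

```python
def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.01):
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss.

    X is a list of feature rows, y a list of labels in {-1, +1}.
    Returns the weight vector and the bias.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            # Shrink the weights (gradient of the L2 penalty).
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Update only when the sample violates the margin.
            if yi * score < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly retrain and drop the feature with the smallest squared weight.

    Returns the indices (into the original columns) of the surviving features.
    """
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active


# Toy data (hypothetical): column 1 carries the class signal,
# column 0 is uncorrelated with the label, column 2 is constant.
X = [[0.5, 2.0, 0.0], [-0.5, 2.0, 0.0], [0.5, -2.0, 0.0], [-0.5, -2.0, 0.0]]
y = [1, 1, -1, -1]
selected = recursive_feature_elimination(X, y, n_keep=1)
print(selected)
```

Ranking by squared weight is the usual criterion for linear models; a production setting would more likely rely on an established SVM library and remove several features per iteration to save training time.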
Also within the framework of this project, the algorithms are accessible to the scientific community through a web platform. Registered users may submit multiple sequences in FASTA format for peptidase detection. The web site is currently being updated with new algorithms and functionalities for registered users, in particular the creation of literature search queries based on the sequence classification results, and text-mining-based algorithms for feature extraction.
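As an illustration of the kind of input handling such a submission involves, the sketch below parses multi-sequence FASTA text and computes amino acid frequencies, one of the simple statistics mentioned above. It is not the platform's code: the function names and the sample sequences are hypothetical, and the abstract does not say that amino acid composition is the feature set actually used for peptidase detection.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


# Hypothetical two-sequence submission.
sample = ">seq1 demo\nMKV\nLA\n>seq2\nAAAA\n"
for name, seq in parse_fasta(sample):
    print(name, seq, amino_acid_composition(seq))
```

Each sequence becomes a fixed-length vector of 20 frequencies, which is a convenient form for a classifier regardless of sequence length.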