Evaluation of Techniques for Classifying Biological Sequences.ppt
Evaluation of Techniques for Classifying Biological Sequences
Authors: Mukund Deshpande and George Karypis
Speaker: Sarah Chan
CSIS DB Seminar, May 31, 2002

Presentation Outline
  - Introduction
  - Traditional Approaches (kNN, Markov Models) to Sequence Classification
  - Feature Based Sequence Classification
  - Experimental Evaluation
  - Conclusions

Introduction
  - The number of biological sequences available in public databases is increasing exponentially
      - GenBank: 16 billion DNA base-pairs
      - PIR: over 230,000 protein sequences
  - Strong sequence similarity often translates to functional and structural relations
  - Classification algorithms applied to sequence data can be used to gain valuable insights into the functions and relations of sequences
      - e.g., to assign a protein sequence to a protein family

Introduction
  - K-nearest neighbor, Markov models and Hidden Markov models have been used extensively
      - They take into account the sequential constraints present in the datasets
  - Motivation: few attempts have been made to use traditional machine learning classification algorithms such as decision trees and support vector machines
      - They were thought to be unable to model the sequential nature of the datasets

Focus of This Paper
  - To evaluate some widely used sequence classification algorithms
      - K-nearest neighbor
      - Markov models
  - To develop a framework to model sequences such that traditional machine learning algorithms can be easily applied
      - Represent each sequence as a vector in a derived feature space, and then use SVMs to build a sequence classifier

Problem Definition: Sequence Classification
  - A sequence Sr = {x1, x2, x3, ..., xl} is an ordered list of symbols
  - The alphabet of symbols is known in advance and has a fixed size N
  - Each sequence Sr has a class label Cr
  - Assumption: two class labels only (C+, C-)
  - Goal: to correctly assign a class label to a test sequence

Approach 1: K Nearest Neighbor (KNN) Classifiers
  - To classify a test sequence Sr
      - Locate the K training sequences most similar to Sr
      - Assign to Sr the class label which oc
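
The kNN procedure outlined above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Two points are assumptions: the similarity function uses `difflib.SequenceMatcher` as a simple stand-in for a sequence similarity score, and the label is chosen by majority vote among the K neighbors (the standard kNN rule, assumed here because the slide text is cut off). The names `similarity` and `knn_classify` and the toy sequences are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Assumption: difflib's ratio is a stand-in similarity score in [0, 1];
    # it is not the measure used in the paper.
    return SequenceMatcher(None, a, b).ratio()


def knn_classify(test_seq: str, training: list, k: int = 3) -> str:
    """Classify test_seq given training as a list of (sequence, label) pairs."""
    # Rank training sequences from most to least similar to the test sequence.
    ranked = sorted(
        training,
        key=lambda pair: similarity(test_seq, pair[0]),
        reverse=True,
    )
    # Assumption: majority vote among the K most similar training sequences.
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy two-class training set (made-up sequences, labels "+" and "-").
training = [
    ("ACGTACGT", "+"),
    ("ACGTACGA", "+"),
    ("ACGTTCGT", "+"),
    ("TTTTGGGG", "-"),
    ("TTTGGGGG", "-"),
    ("TTGGGGTT", "-"),
]

if __name__ == "__main__":
    print(knn_classify("ACGTACGG", training, k=3))
```

With two class labels, an odd K avoids tied votes.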