User Constrained Multiple Subsequence Search For Proteins
Research Mentor(s)
Jagodzinski, Filip
Description
Proteins are long sequences of amino acids, which can be represented abstractly as words of single-letter amino acid abbreviations such as GAPPM. There are twenty naturally occurring amino acids. Certain regions of proteins are more functionally, structurally, or evolutionarily significant than other regions. Knowing how a protein sequence differs from other proteins provides information about the prevalence of often-occurring amino acid subsequences among large sets of proteins. A classical problem in bioinformatics is the following: given a query sequence such as CGMMY, output a list of known proteins, ranked from best match to worst, that contain the query subsequence among their amino acids. Although algorithms exist for identifying the proteins most similar to a single query sequence, there is no algorithm for multiple query subsequences of different lengths with user-provided priority weights. We are developing a protein search algorithm that scores similarity results based on a weight that the user associates with each subsequence. The algorithm finds and ranks the sequences that best satisfy the user's multiple queries, taking into account a variety of factors. The algorithm uses a combination of existing alignment tools and our own edit distance algorithm, which is based on the Levenshtein distance.
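The idea of weighted, multi-query ranking can be illustrated with a minimal Python sketch. It is not the poster's actual algorithm: the function names, the normalization of each distance by query length, and the weighted-average score are all assumptions made for illustration, and the existing alignment tools mentioned above are left out. The sketch uses the standard approximate-substring variant of the Levenshtein recurrence to find, for each query, the closest-matching region of a protein.

```python
from typing import Dict, List, Tuple


def min_substring_edit_distance(query: str, protein: str) -> int:
    """Smallest Levenshtein distance between `query` and any substring of `protein`."""
    # Row for the empty query prefix: it matches at any position at cost 0,
    # which is what lets the match start anywhere in the protein.
    prev = [0] * (len(protein) + 1)
    for i, q in enumerate(query, start=1):
        # Column 0: i query letters aligned against nothing cost i deletions.
        curr = [i] + [0] * len(protein)
        for j, p in enumerate(protein, start=1):
            curr[j] = min(
                prev[j] + 1,             # drop a query letter
                curr[j - 1] + 1,         # skip a protein letter inside the match
                prev[j - 1] + (q != p),  # match or substitute
            )
        prev = curr
    # The match may end anywhere, so take the best value in the last row.
    return min(prev)


def weighted_score(queries: Dict[str, float], protein: str) -> float:
    """Weighted average similarity in [0, 1] (hypothetical scoring rule).

    Assumes every query is non-empty and the weights sum to a positive value.
    """
    total_weight = sum(queries.values())
    score = 0.0
    for query, weight in queries.items():
        distance = min_substring_edit_distance(query, protein)
        similarity = 1.0 - distance / len(query)  # 1.0 means an exact occurrence
        score += weight * similarity
    return score / total_weight


def rank_proteins(
    queries: Dict[str, float], proteins: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from best match to worst."""
    scored = [(name, weighted_score(queries, seq)) for name, seq in proteins.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sequences and weights, for demonstration only.
    queries = {"CGMMY": 2.0, "GAPPM": 1.0}
    proteins = {"p1": "AACGMMYAAGAPPM", "p2": "AACGMAYAA", "p3": "WWWWWW"}
    for name, score in rank_proteins(queries, proteins):
        print(name, round(score, 3))
```

Dividing each distance by the query length keeps long and short queries on the same scale, so the user-provided weights, rather than query length, determine how much each subsequence contributes to the ranking.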
Document Type
Event
Start Date
17-5-2018 12:00 AM
End Date
17-5-2018 12:00 AM
Department
Computer Science
Genre/Form
student projects, posters
Subjects – Topical (LCSH)
Bioinformatics; Systems biology; Algorithms; Proteins
Type
Image
Rights
Copying of this document in whole or in part is allowable only for scholarly purposes. It is understood, however, that any copying or publication of this document for commercial purposes, or for financial gain, shall not be allowed without the author’s written permission.
Language
English
Format
application/pdf