Event Title

User Constrained Multiple Subsequence Search For Proteins

Research Mentor(s)

Filip Jagodzinski

Description

Proteins are long sequences of amino acids, which can be represented abstractly as words of single-letter amino acid abbreviations such as GAPPM. There are twenty naturally occurring amino acids. Certain regions of proteins are more functionally, structurally, or evolutionarily significant than other regions. Knowing how a protein sequence differs from other proteins provides information about the prevalence of frequently occurring amino acid subsequences among large sets of proteins. A classical problem in bioinformatics is the following: given a query sequence such as CGMMY, output a list of known proteins, ranked from best match to worst, that have the query subsequence among their amino acids. Although algorithms exist for identifying the proteins most similar to a single query sequence, there is no algorithm for multiple query subsequences of different lengths with user-provided priority weights. We are developing a protein search algorithm that scores similarity results based on the weight the user associates with each subsequence. The algorithm finds and ranks the sequences that best satisfy the user's multiple queries, taking into account a variety of factors. The algorithm uses a combination of existing alignment tools and our own edit distance algorithm, which is based on the Levenshtein distance.
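
The idea of weighted multi-query ranking can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names (levenshtein, best_window_similarity, rank_proteins), the sliding-window comparison, and the weighted-average scoring formula are all assumptions chosen for illustration, and the existing alignment tools mentioned above are not modeled. Only the use of the Levenshtein distance and of per-query user weights comes from the description.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_window_similarity(query: str, protein: str) -> float:
    """Similarity in [0, 1] between the query and its closest
    same-length window in the protein (assumed scoring scheme)."""
    n = len(query)
    if n == 0 or len(protein) < n:
        return 0.0
    best = min(levenshtein(query, protein[i:i + n])
               for i in range(len(protein) - n + 1))
    return 1.0 - best / n


def rank_proteins(proteins: Dict[str, str],
                  weighted_queries: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank proteins by the weighted average of per-query similarities.
    The weighted average is an illustrative assumption."""
    total_weight = sum(weighted_queries.values())
    scores = []
    for name, sequence in proteins.items():
        score = sum(weight * best_window_similarity(query, sequence)
                    for query, weight in weighted_queries.items()) / total_weight
        scores.append((name, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three short made-up sequences and two weighted queries.
proteins = {"P1": "AACGMMYGG", "P2": "AAGAPPMGG", "P3": "TTTTTTTT"}
queries = {"CGMMY": 2.0, "GAPPM": 1.0}
ranked = rank_proteins(proteins, queries)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy example the query CGMMY carries twice the weight of GAPPM, so a protein containing an exact CGMMY match ranks above one containing only an exact GAPPM match. A real search would also need to handle gaps, windows of differing length, and database-scale performance, none of which this sketch attempts.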

Document Type

Event

Start Date

May 2018

End Date

May 2018

Location

Computer Sciences

Rights

Copying of this document in whole or in part is allowable only for scholarly purposes. It is understood, however, that any copying or publication of this document for commercial purposes, or for financial gain, shall not be allowed without the author’s written permission.

Language

English

Format

application/pdf

This document is currently not available here.

May 17th, 9:00 AM May 17th, 12:00 PM
