Bioinformatics
Algorithms
Needleman–Wunsch
Needleman–Wunsch algorithm from the 70’s was used to align two protein or nucleotide sequences. Here is an iterative version of the algorithm I implemented as a homework exercise last year.
The algorithm uses a scoring system to find the optimal alignment of the two sequences. The most basic version of a scoring system is:
- +1 for a match
- -1 for a mismatch
- -2 for a gap ( indel )
Implementation
The algorithm works as follows:
First two empty matrices are created with the dimensions of sequence 1 and 2 (rows/columns). The first matrix mat
will hold the highest possible score, the other matrix pathmat
will hold the position from which field the highest score came from. We then iterate forward through both of the sequences, calculate the scores at each position, and save the highest score at each position to mat
, and where it came from to pathmat
. Last we traceback (i.e. iterate backwards from bottom right to top left) through the pathmat
according to where we came from (i.e. 0: match/mismatch -> go up&left; 1: insertion -> go left; 2: deletion -> go up.) and build our sequence alignment accordingly.
Scoring function
import numpy as np
def scoring(s1,s2,match,gap,mismatch):
= len(s1)+1
l1 = len(s2)+1
l2 = np.zeros((l1,l2),dtype= int) # for score
mat = np.zeros((l1,l2),dtype= int) # for traceback
pathmat 0] = [ i *gap for i in range(l1)]# fill first col
mat[:,0,:] = [ j *gap for j in range(l2)]# fill first row
mat[for i in range(1,l1):
for j in range(1,l2):
=[] # save scores for position
n-1,j-1] + (s1[i-1] == s2[j-1])*match +(s1[i-1] != s2[j-1])*mismatch )# match
n.append(mat[i-1] + gap) # insertion
n.append(mat[i,j-1,j] + gap) # deletion
n.append(mat[i= max(n) # keep maximal score
mat[i,j] = np.argmax(n) # keep position
pathmat[i,j] return(pathmat)
Traceback Function
In the next step we define a function that traces back through the pathmat
matrix and check the value in the field, checking from which field the highest score stems from.
def traceback(s1,s2,pathmat):
= len(s1)
i = len(s2)
j = [] # alignment for seq 1
ref = [] # alignment for seq 2
read while i and j != 0:
if pathmat[i, j] == 0:
-1])
ref.append(s1[i-1])
read.append(s2[j-= 1 #
i -= 1 # go to upper left
j elif pathmat[i, j] == 1:
'-')
ref.append(-1])
read.append(s2[j-= 1 # go left
j elif pathmat[i, j] == 2:
-1])
ref.append(s1[i'-')
read.append(-= 1 # go up
i return([ref,read])
This returns the objects ref
the reference sequence, aligned to the read
the read.
Print Alignment
So lets define a simple text output function to print the alignments
def PrintAlignment(ref,read):
= ["|" if ref[i] == read[i] else " " for i in range(len(ref))]
rmatch = "".join(ref)[::-1] # reference seq
ref = "".join(rmatch)[::-1]
mid = "".join(read)[::-1] # read seq
read for i in range(80,len(mid)+80,80):
= i-80
last = " "*(80-(len(str(i))+len(str(last))))
sp print(last+1,sp,i)
print(ref[last:i])
print(mid[last:i])
print(read[last:i])
Main function
def Needleman_Wunsch(s1,s2,match,gap,mismatch):
### Scoring:
= scoring(s1,s2,match,gap,mismatch)
pathmat ### Traceback :
= traceback(s1,s2,pathmat)
ref, read ### Print:
PrintAlignment(ref, read)
We can now test the function using two slightly different DNA sequences:
= 'atcgccggcaacactcaaaaaaatttagatcgatgggagagagttcgatacaggggggatcgcccatcgag'
seq1 = 'atcgctttcggcaacagtcaaaaaaatagatcgatgggggacccgagttcgatacagatcgcccatgcgt' seq2
Before the program runs we specify scores as follows:
= 2 # +1 for matching
match = -1 # -1 for mismatch
mismatch = -2 # -2 for gap
gap
Needleman_Wunsch(seq1,seq2,match,gap,mismatch)
1 80
atcgc---cggcaacactcaaaaaaatttagatcgatgggaga---gagttcgatacaggggggatcgcccat-cgag
||||| |||||||| ||||||||| |||||||||||| || |||||||||||| |||||||||| ||
atcgctttcggcaacagtcaaaaaaa--tagatcgatgggggacccgagttcgataca-----gatcgcccatgcg-t
Lets see what happens if we penalize the gaps harder:
- +1 for matching
- -1 for mismatch
- -2 for gap
= 2 # +1 for matching
match = -1 # -1 for mismatch
mismatch = -5 # -2 for gap
gap
Needleman_Wunsch(seq1,seq2,match,gap,mismatch)
1 80
atcgc---cggcaacactcaaaaaaatttagatcgatgggagagagttcgatacaggggggatcgcccatcgag
||||| |||||||| ||||||||| |||||||||||| | || | | ||||||||||
atcgctttcggcaacagtcaaaaaaa--tagatcgatggg-g-gacccgagttcgatacagatcgcccatgcgt
We get a new alignment, as one would expect and the second gap got removed istead we have a larger space with mismatches.
- +1 for matching
- -1 for mismatch
- -5 for gap
These sequences were very short, this would probably even work using an recursive version of NW.
Intron Exon
So to test this using a real world example we can use a DNA and the corresponding RNA and align those together.
The DNA sequence encodes the Lotus japonicus nin gene for nodule inception protein. It has the gene bank ID GenBank: AJ238956.1
and is about 3,4 kbp long.
The RNA sequence comes from the corresponding Lotus corniculatus var. japonicus mRNA for nodule inception protein (nin) GenBank: AJ239041.1
and is abput 2,9 kbp long.
I have both sequence downloaded to my local machine and read them using the read_text
function.
from pathlib import Path
= Path('AJ238956_1.fasta').read_text().splitlines()
f = ''.join(f[1:])
DNA = Path('AJ239041_1.fasta').read_text().splitlines()
f = ''.join(f[1:]) mRNA
= 2 # +1 for matching
match = -1 # -1 for mismatch
mismatch = -2 # -2 for gap
gap
Needleman_Wunsch(DNA,mRNA,match,gap,mismatch)
1 80
CCCAAGCACTGCTTATTAATTACAAACCACCTTCTCATCTCTTGCTCTTCAACTTAGCACTAGACAGAATCTAGCTTCAA
||||||||||||||||||||||||||||||||||||| || | | | | | || |||| |
CCCAAGCACTGCTTATTAATTACAAACCACCTTCTCA--------TC-T---C-T-----T-----G---CT--CTTC-A
81 160
ACCTAATTAAGGTGAGTGAGTCTTGTACTTGATCCACTCTTGATATTCATGCCCTCACTTTCTTTCTCACTTACTATGTG
| | | | || | || | | | | || | | | | |
A-C---------T---T-AG-C----AC----T--A------A-A--CA-G-----A-----------A--T-C------
161 240
TTTGTGGATATGTCTGAATTAATGATGATTCAACTCACTTAACTGATCATGTCAAATTACAGGTACTTAATTGGATCAGC
|| | | | ||| ||| | | ||||| ||||||||||||||||||||
--------TA-G-C---------------T----TCA---AAC----C---T--AATTA-AGGTACTTAATTGGATCAGC
241 320
TAGCATGGAATATGGTTCATTACTAGTGCAGCAGCAGCAGCAGCAGGATTGTAATTCTGCCTATGGGAGCTTGTCTAATT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TAGCATGGAATATGGTTCATTACTAGTGCAGCAGCAGCAGCAGCAGGATTGTAATTCTGCCTATGGGAGCTTGTCTAATT
321 400
TATCATCAGACTGTGGATCAGTGACAACAGCAGAAGCAGATCATCACATCATTGAGGAGCTGTTAGTACAAGGTTGCTGG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TATCATCAGACTGTGGATCAGTGACAACAGCAGAAGCAGATCATCACATCATTGAGGAGCTGTTAGTACAAGGTTGCTGG
401 480
GTGGAAGTTAGTGGTGTTGGTGTTAGGGAGGGTGAGCTCCAGCTCCAACAAGACGAATCAAGTTTTGTGGTGGGGAAGAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GTGGAAGTTAGTGGTGTTGGTGTTAGGGAGGGTGAGCTCCAGCTCCAACAAGACGAATCAAGTTTTGTGGTGGGGAAGAG
481 560
ATGGTGGATTGGACCAGCTGCAGCAGTAGCAGGATCATGTAATTCCTCTGTCAAAGAGAGACTAGTGATTGCTGTTGGGT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ATGGTGGATTGGACCAGCTGCAGCAGTAGCAGGATCATGTAATTCCTCTGTCAAAGAGAGACTAGTGATTGCTGTTGGGT
561 640
ACTTGAAAGACTACACAAGAAACTCCAACGTGCTCATTCAGATATGGGTGCCCTTACGAAGAGGCATCCTTCATGATCAT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ACTTGAAAGACTACACAAGAAACTCCAACGTGCTCATTCAGATATGGGTGCCCTTACGAAGAGGCATCCTTCATGATCAT
641 720
GATTACCACACAAATTATTTACTTTCCAATAATCCTCCTCCTCAACCAGAAGCAGCAGCAGATCATGAATCTGTTTCACT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GATTACCACACAAATTATTTACTTTCCAATAATCCTCCTCCTCAACCAGAAGCAGCAGCAGATCATGAATCTGTTTCACT
721 800
TGGTTTTCCCATGCCAGCAGCTCCCAACTCAAACTTATACTCAAACGTGCACGTCCGTTTTTTTAGAAGCCACGAGTACC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TGGTTTTCCCATGCCAGCAGCTCCCAACTCAAACTTATACTCAAACGTGCACGTCCGTTTTTTTAGAAGCCACGAGTACC
801 880
CGCGCGTTCAGGCTCAGCAGTATGGATCTTTAGCCCTTCCTGTCTTCGAAAGAGGCACCGGGACATGCCTTGGTGTCCTT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CGCGCGTTCAGGCTCAGCAGTATGGATCTTTAGCCCTTCCTGTCTTCGAAAGAGGCACCGGGACATGCCTTGGTGTCCTT
881 960
GAGATTGTAATCACCAACCAAACCACCATCAACTACAATGTCTCCAATGCTCTTGATCAGGTACCTAGCTACTTTACTTG
||||||| | || | | | |||| | | | | | | | || |
GAGATTG---------------------T-AA-T-C-A-----CCAA--C-C---A--A---A-C---C-AC----C---
961 1040
ATTACTTTTCTTCTATCTACTCCCAATATTTCAGTAAAATGTGGTACTGATTATCACTTCAATAACCGCTCACACTTACA
| || | | | || | || | || | | | |||| | | |
---A-------TC-A---A----C--TA---C-----AA--T-GT-C------T--C--CAAT----G------C-T---
1041 1120
AGCCTTTTGTATCCAATTAATGTTTTGCAGGCTGTTGATTTTAGAAGCAGCCAGAGCTTCATTCCACCTGCCATCAAGGT
| ||| | | |||||||||||||||||||||||||||||||||| | | | |
---C--TTG----------A-----T-CAGGCTGTTGATTTTAGAAGCAGCCAGAGCTTCA-T-----T--C--C-----
1121 1200
AGTGTGCCCTAATTATTTGTAAGGATGGATGTTCTTCATTATTTAAAGTTTAAGAAACAATAATTAATTAATAAATATTT
| || || | | | | |
A------CC--------------------TG--C--C---------A---T------C----------------------
1201 1280
ATGATTAATGATGATTAGGTATATGACGAGTTGTACCAAGCAGCAGTGAATGAGATCATAGAGGTGATGACTTCTGTTTG
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-------------A--AGGTATATGACGAGTTGTACCAAGCAGCAGTGAATGAGATCATAGAGGTGATGACTTCTGTTTG
1281 1360
CAAGACACACAATTTGCCATTAGCACTCACTTGGGCTCCTTGCATCCAACAAGGCAAGTGTGGTTGTGGGGTTTCATCTG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CAAGACACACAATTTGCCATTAGCACTCACTTGGGCTCCTTGCATCCAACAAGGCAAGTGTGGTTGTGGGGTTTCATCTG
1361 1440
AGAATTACATGTGGTGCGTCTCCACCGTGGATTCTGCGTGCTTTGTTGGGGATCTAGACATATTGGGATTCCAAGAAGCG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AGAATTACATGTGGTGCGTCTCCACCGTGGATTCTGCGTGCTTTGTTGGGGATCTAGACATATTGGGATTCCAAGAAGCG
1441 1520
TGCTCTGAGTATCACCTTTTCCGGGGACAAGGAATTGTTGGTACAGCCTTCACAACATCCAAACCGTGTTTCGCCATTGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TGCTCTGAGTATCACCTTTTCCGGGGACAAGGAATTGTTGGTACAGCCTTCACAACATCCAAACCGTGTTTCGCCATTGA
1521 1600
CATCACAGCCTTCAGCAAAGCTGAATACCCTCTTGCTCACCATGCAAACATGTTCGGATTGCATGCAGCTGTGGCCATTC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CATCACAGCCTTCAGCAAAGCTGAATACCCTCTTGCTCACCATGCAAACATGTTCGGATTGCATGCAGCTGTGGCCATTC
1601 1680
CACTCAGGAGTGTCTACACCGGTTCAGCTGCTGATTTTGTTTTGGAGTTTTTCCTTCCAAAAGATTGCCATGACAGTGAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CACTCAGGAGTGTCTACACCGGTTCAGCTGCTGATTTTGTTTTGGAGTTTTTCCTTCCAAAAGATTGCCATGACAGTGAA
1681 1760
GAGCAGAAGCAGTTGCTCAACTCATTGTCCATGGTGGTACAGCAAGCTTGCCGCAGCTTGCATGTGGTACTGGTGGAGGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GAGCAGAAGCAGTTGCTCAACTCATTGTCCATGGTGGTACAGCAAGCTTGCCGCAGCTTGCATGTGGTACTGGTGGAGGA
1761 1840
TGAATACACCTTGCCGATGCCATCACACACTAGTAAAGAGGAATTAGAAGAAGAAGAGATTACCATTACCAATAATCATG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TGAATACACCTTGCCGATGCCATCACACACTAGTAAAGAGGAATTAGAAGAAGAAGAGATTACCATTACCAATAATCATG
1841 1920
AGCAGAAATTATTTGTGTCTCCCTCCTCTCATGAATCAGAATGTTCAAAAGAATCATCATGGATTGCACACATGATGGAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AGCAGAAATTATTTGTGTCTCCCTCCTCTCATGAATCAGAATGTTCAAAAGAATCATCATGGATTGCACACATGATGGAG
1921 2000
GCTCAGCAGAAGGGGAAAGGTGTGTCTGTGTCTTTGGAGTACTTGGAAGAACCAAAGGAGGAGTTCAAAGTAACAACAAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GCTCAGCAGAAGGGGAAAGGTGTGTCTGTGTCTTTGGAGTACTTGGAAGAACCAAAGGAGGAGTTCAAAGTAACAACAAA
2001 2080
CTGGGACAGCAGCACTGACCATGATCAGCAGGCGCAGGTGTTTTCATCAGATTTTGGGCAAATGAGTTCAGGATTCAAAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CTGGGACAGCAGCACTGACCATGATCAGCAGGCGCAGGTGTTTTCATCAGATTTTGGGCAAATGAGTTCAGGATTCAAAG
2081 2160
CAAGTACTGTGGAGGGTGGGGATCAAGAGTCTTCTTATACCTTTGGAAGCCGCCGCTCGTCTTCTGGTGGAAGAAAGTCA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CAAGTACTGTGGAGGGTGGGGATCAAGAGTCTTCTTATACCTTTGGAAGCCGCCGCTCGTCTTCTGGTGGAAGAAAGTCA
2161 2240
GGCGAGAAGAGACGAACCAAGGCAGAGAAGACTATCAGCTTGCAAGTTCTAAGACAGTACTTTGCAGGAAGCCTAAAAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| || ||
GGCGAGAAGAGACGAACCAAGGCAGAGAAGACTATCAGCTTGCAAGTTCTAAGACAGTACTTTGC----AG-------GA
2241 2320
TGCAGCAAAGAGCATTGGTGGTGAGTGACATATCTTCTTGAATTCAACTCAACCATTCAATTAATTATTAAATGATAACA
||| | | | | | | | ||| |
---AGC------C--------T-A---A-A-A-------G-------------------------------ATG----C-
2321 2400
AAGTTAATTTAGTGGGTGACTTAATTAAGAAAGTGATCCTTGTATGGTTGTACAGTTAAATGATGGTGTTTCCTTTTGAT
| | | || | | | | | || || |
----------A------G-C----------AA---A-----G-A-----G--C---------AT--TG---------G--
2401 2480
TTCTACAGTATGTCCCACAACTCTGAAAAGGATATGCAGACAGCATGGCATCACCAGGTGGCCTTCAAGGAAAATCAAGA
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---T---GTATGTCCCACAACTCTGAAAAGGATATGCAGACAGCATGGCATCACCAGGTGGCCTTCAAGGAAAATCAAGA
2481 2560
AGGTTGGCCATTCTTTAAAAAAGCTTCAGCTGGTGATTGATTCAGTGCAGGGTGCTGAGGGTGCCATACAGATTGGCTCA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AGGTTGGCCATTCTTTAAAAAAGCTTCAGCTGGTGATTGATTCAGTGCAGGGTGCTGAGGGTGCCATACAGATTGGCTCA
2561 2640
TTCTATGCCAGTTTTCCAGAGTTGAGCTCTTCAGATTTCTCTGCCAGTTGCCGTTCTGATTCATCCAAGAAAATGCATAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTCTATGCCAGTTTTCCAGAGTTGAGCTCTTCAGATTTCTCTGCCAGTTGCCGTTCTGATTCATCCAAGAAAATGCATAA
2641 2720
TTATCCTGATCAGAACAACACCTTGTATGGCCATGGAGATCATGGAGGAGTTGTTACTAGTTTAAAATCTCCACCCTCTG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTATCCTGATCAGAACAACACCTTGTATGGCCATGGAGATCATGGAGGAGTTGTTACTAGTTTAAAATCTCCACCCTCTG
2721 2800
CTTGCAGCCAAACTTTTGCAGGAAACCAACCATGCACCATCATCAACAATGGAGATGTTCTTATGACAGAAAGTCCTCCA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CTTGCAGCCAAACTTTTGCAGGAAACCAACCATGCACCATCATCAACAATGGAGATGTTCTTATGACAGAAAGTCCTCCA
2801 2880
GTCCCTGAAGCATTGCTGAGTAGGAGAGATCACTGTGAGGAAGCAGAATTATTGAACAATGCATCTATCCAAGAGGATAC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GTCCCTGAAGCATTGCTGAGTAGGAGAGATCACTGTGAGGAAGCAGAATTATTGAACAATGCATCTATCCAAGAGGATAC
2881 2960
AAAGCGTTTCAGCAGACCCAAGAGCCAAACCCTACCTCCCTTGTCTGACAGCAGTGGCTGGAACTCACTGGAAACAGGTG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AAAGCGTTTCAGCAGACCCAAGAGCCAAACCCTACCTCCCTTGTCTGACAGCAGTGGCTGGAACTCACTGGAAACAGGTG
2961 3040
CTTTCAGAGTGAAAGCAACTTTTGCAGATGAGAAGATCCGCTTCAGCTTGCAACCCATTTGGGGGTTCAGTGATTTGCAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CTTTCAGAGTGAAAGCAACTTTTGCAGATGAGAAGATCCGCTTCAGCTTGCAACCCATTTGGGGGTTCAGTGATTTGCAG
3041 3120
CTAGAGATAGCAAGACGGTTCAATCTAAATGATGTGACCAACATCCTTCTCAAGTATCTGGATGATGATGGAGAGTGGGT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CTAGAGATAGCAAGACGGTTCAATCTAAATGATGTGACCAACATCCTTCTCAAGTATCTGGATGATGATGGAGAGTGGGT
3121 3200
AGTTTTAGCATGTGATGGTGATCTTGAGGAGTGCAAAGACATACACAGATCATCTCAGAGCCGGACAATTAGACTCTCCC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AGTTTTAGCATGTGATGGTGATCTTGAGGAGTGCAAAGACATACACAGATCATCTCAGAGCCGGACAATTAGACTCTCCC
3201 3280
TTTTTCAAGCTTCGCCTCTAAATCTAGCTAATACATTCCGCAATAGCAGCCCATCTTAACATAACACCAATTCTCTGCTT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTCAAGCTTCGCCTCTAAATCTAGCTAATACATTCCGCAATAGCAGCCCATCTTAACATAACACCAATTCTCTGCTT
3281 3360
TTCTTGAGCTTTAAGCCTCAAGAGAGAGGTGTGTTTGTTTGTGTATACTGTGTAGGATCCCAATCCCAATCCCATCCCAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTCTTGAGCTTTAAGCCTCAAGAGAGAGGTGTGTTTGTTTGTGTATACTGTGTAGGATCCCAATCCCAATCCCATCCCAA
3361 3440
GCAAACATATACATTTAATTACTGGTTTTACTTGTTTG-----------------A
|||||||||||||||||||||||||||||||||||||| |
GCAAACATATACATTTAATTACTGGTTTTACTTGTTTGAAAAAAAAAAAAAAAAAA
Homolog proteins
Next I download two homolog protein sequences by sending http
requests to the uniprot database and tried to align those to each other.
= Path('P51587.fasta').read_text().splitlines()
f = ''.join(f[1:])
P51587 = Path('G3SGJ0.fasta').read_text().splitlines()
f = ''.join(f[1:]) G3SGJ0
import urllib3
= "https://rest.uniprot.org/uniprotkb/P51587.fasta"
link1 = "https://rest.uniprot.org/uniprotkb/G3SGJ0.fasta"
link2 = urllib3.PoolManager()
http
= http.request("GET",link1)
f
= ''.join(str(http.request("GET",link1).data).split('\\n')[1:-1])
P51587 = ''.join(str(http.request("GET",link2).data).split('\\n')[1:-1]) G3SGJ0
But before we run the alignment of the proteins, we should change the current Needleman Wunsch implementation to use a substitution matrix instead of a generic scoring scheme. Below is the BLUSUM62 substituion matrix:
Code
= {
blosum62 'W', 'F'): 1, ('L', 'R'): -2, ('S', 'P'): -1, ('V', 'T'): 0,
('Q', 'Q'): 5, ('N', 'A'): -2, ('Z', 'Y'): -2, ('W', 'R'): -3,
('Q', 'A'): -1, ('S', 'D'): 0, ('H', 'H'): 8, ('S', 'H'): -1,
('H', 'D'): -1, ('L', 'N'): -3, ('W', 'A'): -3, ('Y', 'M'): -1,
('G', 'R'): -2, ('Y', 'I'): -1, ('Y', 'E'): -2, ('B', 'Y'): -3,
('Y', 'A'): -2, ('V', 'D'): -3, ('B', 'S'): 0, ('Y', 'Y'): 7,
('G', 'N'): 0, ('E', 'C'): -4, ('Y', 'Q'): -1, ('Z', 'Z'): 4,
('V', 'A'): 0, ('C', 'C'): 9, ('M', 'R'): -1, ('V', 'E'): -2,
('T', 'N'): 0, ('P', 'P'): 7, ('V', 'I'): 3, ('V', 'S'): -2,
('Z', 'P'): -1, ('V', 'M'): 1, ('T', 'F'): -2, ('V', 'Q'): -2,
('K', 'K'): 5, ('P', 'D'): -1, ('I', 'H'): -3, ('I', 'D'): -3,
('T', 'R'): -1, ('P', 'L'): -3, ('K', 'G'): -2, ('M', 'N'): -2,
('P', 'H'): -2, ('F', 'Q'): -3, ('Z', 'G'): -2, ('X', 'L'): -1,
('T', 'M'): -1, ('Z', 'C'): -3, ('X', 'H'): -1, ('D', 'R'): -2,
('B', 'W'): -4, ('X', 'D'): -1, ('Z', 'K'): 1, ('F', 'A'): -2,
('Z', 'W'): -3, ('F', 'E'): -3, ('D', 'N'): 1, ('B', 'K'): 0,
('X', 'X'): -1, ('F', 'I'): 0, ('B', 'G'): -1, ('X', 'T'): 0,
('F', 'M'): 0, ('B', 'C'): -3, ('Z', 'I'): -3, ('Z', 'V'): -2,
('S', 'S'): 4, ('L', 'Q'): -2, ('W', 'E'): -3, ('Q', 'R'): 1,
('N', 'N'): 6, ('W', 'M'): -1, ('Q', 'C'): -3, ('W', 'I'): -3,
('S', 'C'): -1, ('L', 'A'): -1, ('S', 'G'): 0, ('L', 'E'): -3,
('W', 'Q'): -2, ('H', 'G'): -2, ('S', 'K'): 0, ('Q', 'N'): 0,
('N', 'R'): 0, ('H', 'C'): -3, ('Y', 'N'): -2, ('G', 'Q'): -2,
('Y', 'F'): 3, ('C', 'A'): 0, ('V', 'L'): 1, ('G', 'E'): -2,
('G', 'A'): 0, ('K', 'R'): 2, ('E', 'D'): 2, ('Y', 'R'): -2,
('M', 'Q'): 0, ('T', 'I'): -1, ('C', 'D'): -3, ('V', 'F'): -1,
('T', 'A'): 0, ('T', 'P'): -1, ('B', 'P'): -2, ('T', 'E'): -1,
('V', 'N'): -3, ('P', 'G'): -2, ('M', 'A'): -1, ('K', 'H'): -1,
('V', 'R'): -3, ('P', 'C'): -3, ('M', 'E'): -2, ('K', 'L'): -2,
('V', 'V'): 4, ('M', 'I'): 1, ('T', 'Q'): -1, ('I', 'G'): -4,
('P', 'K'): -1, ('M', 'M'): 5, ('K', 'D'): -1, ('I', 'C'): -1,
('Z', 'D'): 1, ('F', 'R'): -3, ('X', 'K'): -1, ('Q', 'D'): 0,
('X', 'G'): -1, ('Z', 'L'): -3, ('X', 'C'): -2, ('Z', 'H'): 0,
('B', 'L'): -4, ('B', 'H'): 0, ('F', 'F'): 6, ('X', 'W'): -2,
('B', 'D'): 4, ('D', 'A'): -2, ('S', 'L'): -2, ('X', 'S'): 0,
('F', 'N'): -3, ('S', 'R'): -1, ('W', 'D'): -4, ('V', 'Y'): -1,
('W', 'L'): -2, ('H', 'R'): 0, ('W', 'H'): -2, ('H', 'N'): 1,
('W', 'T'): -2, ('T', 'T'): 5, ('S', 'F'): -2, ('W', 'P'): -4,
('L', 'D'): -4, ('B', 'I'): -3, ('L', 'H'): -3, ('S', 'N'): 1,
('B', 'T'): -1, ('L', 'L'): 4, ('Y', 'K'): -2, ('E', 'Q'): 2,
('Y', 'G'): -3, ('Z', 'S'): 0, ('Y', 'C'): -2, ('G', 'D'): -1,
('B', 'V'): -3, ('E', 'A'): -1, ('Y', 'W'): 2, ('E', 'E'): 5,
('Y', 'S'): -2, ('C', 'N'): -3, ('V', 'C'): -1, ('T', 'H'): -2,
('P', 'R'): -2, ('V', 'G'): -3, ('T', 'L'): -1, ('V', 'K'): -2,
('K', 'Q'): 1, ('R', 'A'): -1, ('I', 'R'): -3, ('T', 'D'): -1,
('P', 'F'): -4, ('I', 'N'): -3, ('K', 'I'): -3, ('M', 'D'): -3,
('V', 'W'): -3, ('W', 'W'): 11, ('M', 'H'): -2, ('P', 'N'): -2,
('K', 'A'): -1, ('M', 'L'): 2, ('K', 'E'): 1, ('Z', 'E'): 4,
('X', 'N'): -1, ('Z', 'A'): -1, ('Z', 'M'): -1, ('X', 'F'): -1,
('K', 'C'): -3, ('B', 'Q'): 0, ('X', 'B'): -1, ('B', 'M'): -3,
('F', 'C'): -2, ('Z', 'Q'): 3, ('X', 'Z'): -1, ('F', 'G'): -3,
('B', 'E'): 1, ('X', 'V'): -1, ('F', 'K'): -3, ('B', 'A'): -2,
('X', 'R'): -1, ('D', 'D'): 6, ('W', 'G'): -2, ('Z', 'F'): -3,
('S', 'Q'): 0, ('W', 'C'): -2, ('W', 'K'): -3, ('H', 'Q'): 0,
('L', 'C'): -1, ('W', 'N'): -4, ('S', 'A'): 1, ('L', 'G'): -4,
('W', 'S'): -3, ('S', 'E'): 0, ('H', 'E'): 0, ('S', 'I'): -2,
('H', 'A'): -2, ('S', 'M'): -1, ('Y', 'L'): -1, ('Y', 'H'): 2,
('Y', 'D'): -3, ('E', 'R'): 0, ('X', 'P'): -2, ('G', 'G'): 6,
('G', 'C'): -3, ('E', 'N'): 0, ('Y', 'T'): -2, ('Y', 'P'): -3,
('T', 'K'): -1, ('A', 'A'): 4, ('P', 'Q'): -1, ('T', 'C'): -1,
('V', 'H'): -3, ('T', 'G'): -2, ('I', 'Q'): -3, ('Z', 'T'): -1,
('C', 'R'): -3, ('V', 'P'): -2, ('P', 'E'): -1, ('M', 'C'): -1,
('K', 'N'): 0, ('I', 'I'): 4, ('P', 'A'): -1, ('M', 'G'): -3,
('T', 'S'): 1, ('I', 'E'): -3, ('P', 'M'): -2, ('M', 'K'): -1,
('I', 'A'): -1, ('P', 'I'): -3, ('R', 'R'): 5, ('X', 'M'): -1,
('L', 'I'): 2, ('X', 'I'): -1, ('Z', 'B'): 1, ('X', 'E'): -1,
('Z', 'N'): 0, ('X', 'A'): 0, ('B', 'R'): -1, ('B', 'N'): 3,
('F', 'D'): -3, ('X', 'Y'): -1, ('Z', 'R'): 0, ('F', 'H'): -1,
('B', 'F'): -3, ('F', 'L'): 0, ('X', 'Q'): -1, ('B', 'B'): 4
( }
def scoring_submat(s1,s2,gap,submatrix):
= len(s1)+1
l1 = len(s2)+1
l2 = np.zeros((l1,l2),dtype= int) # score matrix
mat = np.zeros((l1,l2),dtype= int) # path matrix
pathmat 0]=[ i *gap for i in range(l1)]# fill first col
mat[:,0,:]=[ j *gap for j in range(l2)]# fill first row
mat[for i in range(1,l1):
for j in range(1,l2):
=[]
n#### get substitution score
try:
= submatrix[(s1[i-1],s2[j-1])]
subval except KeyError:
try:
= submatrix[(s2[j-1],s1[i-1])]
subval except KeyError:
print("ERROR: 'AA not in substitution matrix'")
return
#### end submatrix check
-1,j-1] + subval) # match
n.append(mat[i-1] + gap) #top insertion
n.append(mat[i,j-1,j] + gap) #left deletion
n.append(mat[i
=max(n) # keep maximal score
mat[i,j]= np.argmax(n)
pathmat[i,j] return(pathmat)
def Needleman_Wunsch_submat(s1,s2,gap,submatrix):
### Scoring:
= scoring_submat(s1,s2,gap,submatrix)
pathmat ### Traceback :
= traceback(s1,s2,pathmat)
ref, read ### Print:
PrintAlignment(ref, read)
NW alignment of P51587 with G3SGJ0 using the blosum64 submatrix
Needleman_Wunsch_submat(P51587,G3SGJ0,gap,blosum62)
1 80
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPII
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPII
81 160
FKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVV
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVV
161 240
CGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESL
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESL
241 320
KKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNL
||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
KKNDRFIASVTDSENTNQREATSHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNL
321 400
QKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVANQKPFESGSDKISKEVVPSLACEWSQLTL
|||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
QKVRTSKTRKKISHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVANQKPFESGSDKISKEVVPSLACEWSQLTL
401 480
SGLNGAQMEKIPLLHISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDC
|||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SGLNGAQMEKIPLLNISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDC
481 560
ILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDN
|||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||
ILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFGASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDN
561 640
GSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLH
||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GSWPATTTQTSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLH
641 720
SSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCEND
|||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||
SSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNDTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCEND
721 800
PKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLSNLVMISRGKESYKMSDKLKG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLSNLVMISRGKESYKMSDKLKG
801 880
NNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYMRVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSE
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYMRVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSE
881 960
ELFSDNENNFVFQVANERNNLALGNTKELHETDLTCVNEPIFKNSTMVLYGDTGDKQATQVSIKKDLVYVLAEENKNSVK
|||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||
ELFSDNENNFVFQVANERNNLALGNTKELHETDLTCVNEPIFKNPTMVLYGDTGDKQATQVSIKKDLVYVLAEENKNSVK
961 1040
QHIKMTLGQDLKSDISLNIDKIPEKNNDYMNKWAGLLGPISNHSFGGSFRTASNKEIKLSEHNIKKSKMFFKDIEEQYPT
||||||||||||||||||||||| |||||| |||||||||||||||||||||||||||||||||||||||||||||||||
QHIKMTLGQDLKSDISLNIDKIPDKNNDYMDKWAGLLGPISNHSFGGSFRTASNKEIKLSEHNIKKSKMFFKDIEEQYPT
1041 1120
SLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSDCKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEE
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSDCKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEE
1121 1200
SGSQFEFTQFRKPSYILQKSTFEVPENQMTILKTTSEECRDADLHVIMNAPSIGQVDSSKQFEGTVEIKRKFAGLLKNDC
||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||
SGSQFEFTQFRKPSYILQKSTFEVPENQMTILKTTSEECRDADLHVIMNAPSIGQLDSSKQFEGTVEIKRKFAGLLKNDC
1201 1280
NKSASGYLTDENEVGFRGFYSAHGTKLNVSTEALQKAVKLFSDIENISEETSAEVHPISLSSSKCHDSVVSMFKIENHND
|||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||
NKSASGYLTDENEVGFRGFYSAHGAKLNVSTEALQKAVKLFSDIENISEETSAEVHPISLSSSKCHDSVVSMFKIENHND
1281 1360
KTVSEKNNKCQLILQNNIEMTTGTFVEEITENYKRNTENEDNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTD
|||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
KTVSEK-NKCQLILQNNIEMTTGTFVEEITENYKRNTENEDNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTD
1361 1440
QHNICLKLSGQFMKEGNTQIKEDLSDLTFLEVAKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAK
|||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
QHNICLKLSGQFMKEGNTQIKEDLSDLTFLEVVKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAK
1441 1520
ESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTGNQLVTFQGQPERDEKIKEPT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTGNQLVTFQGQPERDEKIKEPT
1521 1600
LLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFSHQWAKTLKYREACKDLELACETIEITAAPKCKEMQNSLNN
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||
LLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFSHQWAKTLKYREACKDLELACETIEITTAPKCKEMQNSLNN
1601 1680
DKNLVSIETVVPPKLLSDNLCRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTS
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
DKNLVSIETVVPPKLLSDNLCRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTS
1681 1760
VSQTSLLEAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYLSNSSMSNSYSYHSDEVYNDS
|||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
VSQTSLREAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYLSNSSMSNSYSYHSDEVYNDS
1761 1840
GYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAYPQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAYPQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVG
1841 1920
PPAFRIASGKIVCVSHETIKKVKDIFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKV
|||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||
PPAFRIASGKIVCVSHETIKKVKDMFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKV
1921 2000
FADIQSEEILQHNQNMSGLEKVSKISPCDVSLETSDICKCSIGKLHKSVSSANTCGIFSTASGKSVQVSDASLQNARQVF
||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||
FADIQSEEILQHNQNMSGLEKVSKISPCDVSLETSDICKCSIGKLHKSVSSTNTCGIFSTASGKSVQVSDASLQNARQVF
2001 2080
SEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLISQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVL
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLISQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVL
2081 2160
EEFDLIRTEHSLHYSPTSRQNVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQ
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
EEFDLIRTEHSLHYSPTSRQNVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQ
2161 2240
DKQQLVLGTKVSLVENIHVLGKEQASPKNVKMEIGKTETFSDVPVKTNIEVCSTYSKDSENYFETEAVEIAKAFMEDDEL
||||||||||||||||||||||||||| |||||||||| |||||||||||||||||||||||||||||||||||||||||
DKQQLVLGTKVSLVENIHVLGKEQASPENVKMEIGKTEAFSDVPVKTNIEVCSTYSKDSENYFETEAVEIAKAFMEDDEL
2241 2320
TDSKLPSHATHSLFTCPENEEMVLSNSRIGKRRGEPLILVGEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRL
||| |||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
TDSELPSHATHSLFTCPENEEMVLSNSRTGKRRGEPLILVGEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRL
2321 2400
FMHHVSLEPITCVPFRTTKERQEIQNPNFTAPGQEFLSKSHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTG
|||||||||||||||||||| ||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||
FMHHVSLEPITCVPFRTTKEHQEIQNPNFTAPGQEFLSESHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTG
2401 2480
RPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKNNSNQAVAVTFTKCEEEPLDL
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||
RPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKNNSNQAAAVTFTKCEEEPLDL
2481 2560
ITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPRISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESF
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPRISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESF
2561 2640
QFHTEDYFGKESLWTGKGIQLADGGWLIPSNDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPK
|||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||
QFHTEDYFGKESLWTGKGIQLADGGWLIPSSDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPK
2641 2720
EFANRCLSPERVLLQLKYRYDTEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANISETSSNKTSSADTQKVAIIE
||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
EFANRCLSPERVLLQLKYRYDMEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANISETSSNKTSSADTQKVAIIE
2721 2800
LTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVGSPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRP
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVGSPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRP
2801 2880
FPLPLSSLFSDGGNVGCVDVIIQRAYPIQWMEKTSSGLYIFRNEREEEKEAAKYVEAQQKRLEALFTKIQEEFEEHEENT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FPLPLSSLFSDGGNVGCVDVIIQRAYPIQWMEKTSSGLYIFRNEREEEKEAAKYVEAQQKRLEALFTKIQEEFEEHEENT
2881 2960
TKPYLPSRALTRQQVRALQDGAELYEAVKNAADPAYLEGYFSEEQLRALNNHRQMLNDKKQAQIQLEIRKAMESAEQKEQ
||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||| |||||||||||
TKPYLPSRALTRQQVRALQDGAELYEAVKNAADPANLEGYFSEEQLRALNNHRQMLNDKKQAQIQLEVRKAMESAEQKEK
2961 3040
GLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLLTEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPV
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLLTEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPV
3041 3120
SDEILFQIYQPREPLHFSKFLDPDFQPSCSEVDLIGFVVSVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLI
||||||| |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||
SDEILFQVYQPREPLHFSKFLDPDFQPSCSEVDLIGFVISVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLI
3121 3200
AASNLQWRPESKSGLLTLFAGDFSVFSASPKEGHFQETFNKMKNTVENIDILCNEAENKLMHILHANDPKWSTPTKDCTS
||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AASNLQWRPESKSGLPTLFAGDFSVFSASPKEGHFQETFNKMKNTVENIDILCNEAENKLMHILHANDPKWSTPTKDCTS
3201 3280
GPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSVSTPVSAQMTSKSCKGEKEIDDQKNCKKRRALDFLSRLPLP
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSVSTPVSAQMTSKSCKGEKEIDDQKNCKKRRALDFLSRLPLP
3281 3360
PPVSPICTFVSPAAQKAFQPPRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEK
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PPVSPICTFVSPAAQKAFQPPRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEK
3361 3440
QFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQDTITTKKYI
|||||||||||||||||||||||||| |||||||||||||||||||||||||||||||
QFISVSESTRTAPTSSEDYLRLKRRC-TSLIKEQESSQASTEECEKNKQDTITTKKYI
Nussinov RNA Secondary Structure
The Nussinov algorithm predicts RNA secondary sturcture based on a global base-pairing score of a RNA sequence \(S\) with bases:
\[ S_i \in \{A,U,G,C\} \]
Parings can occure between bases panning the subsequence \(S_{i,j}\) of a defined minimal length $|S(i,j)|_{min} \ge L; L $ where L is the minmal loopsize which is at least one (base).
The optimal pairing score \(S\) between \(i,j\) is evaluated by following optimisation:
\[ S(i,j) = max \begin{cases} S(i+1,j-1) + w(i,j) \\ S(i+1,j) \\ S(i,j-1) \\ Smax_{i<k<j}S(i,k)+S(k+1,j) \end{cases} \]
and stored in a Matrix \(M\) of dimensions \(n, m = |S|\) , with a zero diagonal as well as zero values at diagonals storing impossible pairings due loopsize.
the \(w\) function checks if the basepairs are complement
\[ w(i,j)= 0; (i,j)\in \{(G,C),(A,U),((U,G))\} \implies 1 \]
is implemented in following function complementary
def complementary(i,j, allowGU=False):
= (i,j)
pair = [("A","U"),("U","A"),("G","C"),("C","G")]
comp if allowGU:
"G","U"))
comp.append(("U","G"))
comp.append((if pair in comp:
return True
else:
return False
def OP(M,seq,loopsize):
for k in range(1,len(seq)):
for i in range(len(seq)-k):
= i+k
j if i >= j-loopsize:
= 0
M[i,j] else:
= M[i+1,j-1] + complementary(seq[i],seq[j])
ij_paired = M[i+1,j] # Si unpaired
i_unpaired = M[i,j-1] # Sj unpaired
j_unpaired = max([M[i,k] + M[k+1,j] for k in range(i,j)])
decomp = max(j_unpaired,i_unpaired,ij_paired,decomp)
M[i,j]return M
To determine the structure of the folded RNA by traceback, we first create an empty list of pairs P P. We initialize with i = 1 , j = n {i=1,j=n}. Then, we follow one of three scenarios.
If \(j \le i\), the procedure stops. If \(M (i,j) = M (i ,j − 1) M(i,j)=M(i,j-1)\), then set \(i=i , j =j-1\) and continue.
Otherwise, for all \(k : i ≤ k \< j k:i k\<j\), if \(S_k\) and \(S_j\) are complementary and \(M ( i , j ) = M ( i , k − 1 ) + M ( k + 1 , j − 1 ) + 1 M(i,j)=M(i,k-1)+M(k+1,j-1)+1\), append \(( k , j ) (k,j)} to P\), then traceback both with \(i = i , j = k − 1 i=i,j=k-1\) and \(i = k + 1 , j = j − 1 i=k+1,j=j-1\).
When the traceback finishes, P P contains all of the paired bases.
def traceback(i,j ,mat,seq,struc):
if j<i: return # : traceback finished!
elif mat[i,j] == mat[i+1,j]: # Rule 1
+1,j,mat,seq,struc)
traceback(ielif mat[i,j] == mat[i,j-1]: #Rule 2
-1,mat,seq,struc)
traceback(i,jelif mat[i,j] == mat[i+1,j-1] + complementary(seq[i],seq[j]): #Rule3
struc.append((i,j))+1,j-1,mat,seq,struc)
traceback(ielse:
for k in range(i,j):
if mat[i,j]== mat[i,k]+mat[k+1,j]:
traceback(i,k,mat,seq,struc)+1,j,mat,seq,struc)
traceback(kbreak
Main nussinov function:
import numpy as np
def Nussinov(sequence, loopsize):
= str()
headerif sequence.startswith(">"):
= ''.join(sequence.splitlines()[0])
header = ''.join(sequence.splitlines()[1:])
sequence
= len(sequence)
length =[] # empty list to store structure
structure = np.zeros((length,length),dtype=int) # initialize matrix
M # fill score matrix
OP(M,sequence,loopsize) 0,length-1,M,sequence,structure) # tb through matrix
traceback(= ["."]*length # create dots notiation
dots for i in structure:
min(i)]="("
dots[max(i)]=")"
dots[return header+ '\n' + sequence + '\n'+''.join(dots)
for i in range(1,4):
print( Nussinov("CCAGGAAAAUUAGG",loopsize=i))
CCAGGAAAAUUAGG
((.)).((.))...
CCAGGAAAAUUAGG
((..)((..)).).
CCAGGAAAAUUAGG
(((..(...)).))
=""">Human_tRNA_38.chr6 (Ile, AAT)
htrnaGGCTGGTTAGTTCAGTTGGTTAGAGCGTGGTGCTAATAACGCCAAGGTCGTGGGTTCGATCCCCATATCGGCC"""
#tRNA = ''.join(fasta.splitlines()[1:])
=htrna.replace("T","U")
tRNA= Nussinov(tRNA,loopsize=6)
struct print(struct)
>Human_tRNA_38.chr6 (Ile, AAU)
GGCUGGUUAGUUCAGUUGGUUAGAGCGUGGUGCUAAUAACGCCAAGGUCGUGGGUUCGAUCCCCAUAUCGGCC
(((..(..(..(.(..(((((.(.((((..(......))))))))(((((......))).)))))))))).))
(((((((..((((………)))).(((((…….)))))…..(((((…….))))))))))))
import matplotlib.pyplot as plt
import forgi.visual.mplotlib as fvm
import forgi
import forgi.graph.bulge_graph as fgb