Difference between revisions of "Sumeet Gupta"
Line 7: | Line 7: | ||
Email: sgupta at wi dot mit dot edu <br> | Email: sgupta at wi dot mit dot edu <br> | ||
|} | |} | ||
+ | |||
+ | = Frequently Asked Questions = | ||
+ | == Solexa == | ||
+ | === What is the format of the quality scores files (fasta files)? === | ||
+ | |||
+ | @4:1:518:715 <br> | ||
+ | GATACCATAAAAGCTGGATCCTTCTTCAAGCATAA <br> | ||
+ | +4:1:518:715 <br> | ||
+ | hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdhhP <br> | ||
+ | |||
+ | @ID <br> | ||
+ | Sequence <br> | ||
+ | +ID <br> | ||
+ | Quality Scores for each base (String of characters) <br> | ||
+ | |||
+ | === What do the quality scores for each base mean? === | ||
+ | |||
+ | P(error) - Probability of the base call being incorrect. <br> | ||
+ | |||
+ | {| cellpadding="2" style="border:1px solid darkgray;" | ||
+ | |- | ||
+ | |Char||ASCII||Char-64||P(error) | ||
+ | |- | ||
+ | |;||59||-5||0.7597 | ||
+ | |- | ||
+ | |<||60||-4||0.7153 | ||
+ | |- | ||
+ | |=||61||-3||0.6661 | ||
+ | |- | ||
+ | |>||62||-2||0.6131 | ||
+ | |- | ||
+ | |?||63||-1||0.5573 | ||
+ | |- | ||
+ | |@||64||0||0.5 | ||
+ | |- | ||
+ | |A||65||1||0.4427 | ||
+ | |- | ||
+ | |B||66||2||0.3869 | ||
+ | |- | ||
+ | |C||67||3||0.3339 | ||
+ | |- | ||
+ | |D||68||4||0.2847 | ||
+ | |- | ||
+ | |E||69||5||0.2403 | ||
+ | |- | ||
+ | |F||70||6||0.2008 | ||
+ | |- | ||
+ | |G||71||7||0.1663 | ||
+ | |- | ||
+ | |H||72||8||0.1368 | ||
+ | |- | ||
+ | |I||73||9||0.1118 | ||
+ | |- | ||
+ | |J||74||10||0.0909 | ||
+ | |- | ||
+ | |K||75||11||0.0736 | ||
+ | |- | ||
+ | |L||76||12||0.0594 | ||
+ | |- | ||
+ | |M||77||13||0.0477 | ||
+ | |- | ||
+ | |N||78||14||0.0383 | ||
+ | |- | ||
+ | |O||79||15||0.0307 | ||
+ | |- | ||
+ | |P||80||16||0.0245 | ||
+ | |- | ||
+ | |Q||81||17||0.0196 | ||
+ | |- | ||
+ | |R||82||18||0.0156 | ||
+ | |- | ||
+ | |S||83||19||0.0124 | ||
+ | |- | ||
+ | |T||84||20||0.0099 | ||
+ | |- | ||
+ | |U||85||21||0.0079 | ||
+ | |- | ||
+ | |V||86||22||0.0063 | ||
+ | |- | ||
+ | |W||87||23||0.005 | ||
+ | |- | ||
+ | |X||88||24||0.004 | ||
+ | |- | ||
+ | |Y||89||25||0.0032 | ||
+ | |- | ||
+ | |Z||90||26||0.0025 | ||
+ | |- | ||
+ | |[||91||27||0.002 | ||
+ | |- | ||
+ | |\||92||28||0.0016 | ||
+ | |- | ||
+ | |]||93||29||0.0013 | ||
+ | |- | ||
+ | |^||94||30||0.001 | ||
+ | |- | ||
+ | |_||95||31||0.0008 | ||
+ | |- | ||
+ | |`||96||32||0.0006 | ||
+ | |- | ||
+ | |a||97||33||0.0005 | ||
+ | |- | ||
+ | |b||98||34||0.0004 | ||
+ | |- | ||
+ | |c||99||35||0.0003 | ||
+ | |- | ||
+ | |d||100||36||0.0003 | ||
+ | |- | ||
+ | |e||101||37||0.0002 | ||
+ | |- | ||
+ | |f||102||38||0.0002 | ||
+ | |- | ||
+ | |g||103||39||0.0001 | ||
+ | |- | ||
+ | |h||104||40||0.0001 | ||
+ | |- | ||
+ | |} | ||
+ | |||
+ | === What is Eland? === | ||
+ | ELAND stands for E fficient L arge-Scale A lignment of N ucleotide D atabases. ELAND is a alignment tool developed by Illumina/Solexa which searches a set of large DNA files for a large number of short DNA reads allowing up to 2 errors per match. | ||
+ | |||
+ | === How can Eland be faster relative to some other alignment programs? === | ||
+ | Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be "error free", so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.). This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for. This can be bound even further because the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free. | ||
+ | |||
+ | Combining these subsequences into 2 subsets rather than searching for 4 independent entries in their database speeds up the process as long as one set matches (matching criteria on the other set can be relaxed, ie, allowing for mismatches). That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB}) |
Revision as of 11:28, 16 April 2009
Sumeet Gupta |
Contents
Frequently Asked Questions
Solexa
What is the format of the quality scores files (fasta files)?
@4:1:518:715
GATACCATAAAAGCTGGATCCTTCTTCAAGCATAA
+4:1:518:715
hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdhhP
@ID
Sequence
+ID
Quality Scores for each base (String of characters)
What do the quality scores for each base mean?
P(error) - Probability of the base call being incorrect.
Char | ASCII | Char-64 | P(error) |
; | 59 | -5 | 0.7597 |
< | 60 | -4 | 0.7153 |
= | 61 | -3 | 0.6661 |
> | 62 | -2 | 0.6131 |
? | 63 | -1 | 0.5573 |
@ | 64 | 0 | 0.5 |
A | 65 | 1 | 0.4427 |
B | 66 | 2 | 0.3869 |
C | 67 | 3 | 0.3339 |
D | 68 | 4 | 0.2847 |
E | 69 | 5 | 0.2403 |
F | 70 | 6 | 0.2008 |
G | 71 | 7 | 0.1663 |
H | 72 | 8 | 0.1368 |
I | 73 | 9 | 0.1118 |
J | 74 | 10 | 0.0909 |
K | 75 | 11 | 0.0736 |
L | 76 | 12 | 0.0594 |
M | 77 | 13 | 0.0477 |
N | 78 | 14 | 0.0383 |
O | 79 | 15 | 0.0307 |
P | 80 | 16 | 0.0245 |
Q | 81 | 17 | 0.0196 |
R | 82 | 18 | 0.0156 |
S | 83 | 19 | 0.0124 |
T | 84 | 20 | 0.0099 |
U | 85 | 21 | 0.0079 |
V | 86 | 22 | 0.0063 |
W | 87 | 23 | 0.005 |
X | 88 | 24 | 0.004 |
Y | 89 | 25 | 0.0032 |
Z | 90 | 26 | 0.0025 |
[ | 91 | 27 | 0.002 |
\ | 92 | 28 | 0.0016 |
] | 93 | 29 | 0.0013 |
^ | 94 | 30 | 0.001 |
_ | 95 | 31 | 0.0008 |
` | 96 | 32 | 0.0006 |
a | 97 | 33 | 0.0005 |
b | 98 | 34 | 0.0004 |
c | 99 | 35 | 0.0003 |
d | 100 | 36 | 0.0003 |
e | 101 | 37 | 0.0002 |
f | 102 | 38 | 0.0002 |
g | 103 | 39 | 0.0001 |
h | 104 | 40 | 0.0001 |
What is Eland?
ELAND stands for E fficient L arge-Scale A lignment of N ucleotide D atabases. ELAND is a alignment tool developed by Illumina/Solexa which searches a set of large DNA files for a large number of short DNA reads allowing up to 2 errors per match.
How can Eland be faster relative to some other alignment programs?
Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be "error free", so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.). This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for. This can be bound even further because the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.
Combining these subsequences into 2 subsets rather than searching for 4 independent entries in their database speeds up the process as long as one set matches (matching criteria on the other set can be relaxed, ie, allowing for mismatches). That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB})