Difference between revisions of "SequencingFormats"
Revision as of 14:06, 11 January 2011
We typically provide the following files and formats as part of a data package for both HiSeq and Genome Analyzer sequencing data.
FASTQ Files (Quality Score Files)
This file format is used frequently at the Sanger Institute to bundle a sequence and its quality data.
An Illumina FASTQ file normally uses four lines per sequence:
Line 1 - begins with a '@' character and is followed by a sequence identifier
Line 2 - the sequence
Line 3 - begins with a '+' character and is optionally followed by the same sequence identifier
Line 4 - the quality values for each base in the sequence in Line 2
An Example:
@WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
CTCTGCTGTCTCTTGTGTAAGAAGANANNNCNTTCT
+WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
fffffffRcfad[ddffa]faaaaBBBBBBBBBBBB
The sequence identifier in the above example is "WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1". Please note that the last two characters (";1") may be found only in datasets sequenced by us: we do not remove filtered reads but mark each read with a 1 (not filtered, i.e. good) or a 0 (filtered, i.e. bad). Please refer to the FAQs for information on the filtering criteria and what they mean.
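The four-line layout and the trailing filter flag described above can be read with a short script. The following is only a minimal sketch, not part of our pipeline: the names `FastqRecord`, `split_filter_flag` and `read_fastq` are hypothetical, and it assumes every record occupies exactly four lines with no wrapped sequence or quality lines.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class FastqRecord(NamedTuple):
    """One four-line FASTQ record (hypothetical helper type)."""
    identifier: str                 # Line 1 without '@' and without the filter flag
    sequence: str                   # Line 2
    quality: str                    # Line 4
    passed_filter: Optional[bool]   # True for ';1', False for ';0', None if absent


def split_filter_flag(identifier: str) -> Tuple[str, Optional[bool]]:
    """Split a trailing ';1' or ';0' off an identifier, if one is present."""
    base, sep, flag = identifier.rpartition(";")
    if sep and flag in ("0", "1"):
        return base, flag == "1"
    return identifier, None


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of lines, assuming four lines per record."""
    # Strip line endings and skip blank lines (assumption: no empty reads).
    cleaned = (s for s in (line.rstrip("\r\n") for line in lines) if s)
    # Pull four consecutive lines at a time from the same iterator.
    for header, seq, plus, qual in zip_longest(cleaned, cleaned, cleaned, cleaned):
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record at end of input")
        if not header.startswith("@"):
            raise ValueError("Line 1 must begin with '@': %r" % header)
        if not plus.startswith("+"):
            raise ValueError("Line 3 must begin with '+': %r" % plus)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: %r" % header)
        identifier, passed = split_filter_flag(header[1:])
        yield FastqRecord(identifier, seq, qual, passed)


example = [
    "@WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1\n",
    "CTCTGCTGTCTCTTGTGTAAGAAGANANNNCNTTCT\n",
    "+WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1\n",
    "fffffffRcfad[ddffa]faaaaBBBBBBBBBBBB\n",
]
records = list(read_fastq(example))
# Keep only reads that were not flagged as filtered.
good_reads = [r for r in records if r.passed_filter is not False]
```

To read from disk, an open file handle can be passed directly, for example `read_fastq(open(path))`, where `path` stands for one of the delivered FASTQ files. Reads without a flag are kept here because files from other sources normally do not carry one.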
| Identifier part | Meaning |
|---|---|
| WICMT-SOLEXA | Instrument name |
| 0043 | A unique random string for the whole run (pretty much meaningless) |
| 7 | flowcell lane |
| 1 | tile number within the flowcell lane |
| 1500 | 'x'-coordinate of the cluster within the tile |
| 1199 | 'y'-coordinate of the cluster within the tile |
| #0 | index number for a multiplexed sample (0 for no indexing) |
| /1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
| ;1 | Read was not filtered. This flag is typically not present in FASTQ files created by the Illumina pipeline. |
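
The fields in the table can be pulled out of an identifier with a regular expression. This is a rough sketch under stated assumptions: the pattern `ID_PATTERN` and the function `parse_identifier` are hypothetical names, the pattern is modelled only on the example identifier above, and it treats the pair member and the filter flag as optional.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical pattern modelled on the example identifier:
#   WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
ID_PATTERN = re.compile(
    r"^(?P<instrument>.+)_(?P<run>[^_:]+)"                   # instrument name and run string
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"    # lane, tile, x, y
    r"#(?P<index>[^/;]+)"                                    # index for multiplexed samples
    r"(?:/(?P<pair>[12]))?"                                  # member of a pair (optional)
    r"(?:;(?P<filter>[01]))?$"                               # filter flag (optional)
)

Field = Union[str, int, bool, None]


def parse_identifier(identifier: str) -> Dict[str, Field]:
    """Split a sequence identifier into the fields listed in the table."""
    if identifier.startswith("@"):
        identifier = identifier[1:]  # tolerate the leading '@' of Line 1
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("unrecognised identifier: %r" % identifier)
    pair: Optional[str] = match.group("pair")
    flag: Optional[str] = match.group("filter")
    return {
        "instrument": match.group("instrument"),
        "run": match.group("run"),        # kept as text to preserve leading zeros
        "lane": int(match.group("lane")),
        "tile": int(match.group("tile")),
        "x": int(match.group("x")),
        "y": int(match.group("y")),
        "index": match.group("index"),
        "pair": int(pair) if pair is not None else None,
        "passed_filter": (flag == "1") if flag is not None else None,
    }


fields = parse_identifier("@WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1")
```

Identifiers written by other pipeline versions may use a different layout, in which case the pattern would need adjusting.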