Difference between revisions of "SequencingFormats"

From Genome Technology Core (GTC) wiki - Sequencing and Microarray

Revision as of 15:06, 11 January 2011

A typical data package for HiSeq or Genome Analyzer sequencing data includes the following files and formats.

FASTQ Files (Quality Score Files)

This file format is used frequently at the Sanger Institute to bundle a sequence and its quality data.

An Illumina FASTQ file normally uses four lines per sequence:

Line 1 - begins with a '@' character and is followed by a sequence identifier
Line 2 - the sequence
Line 3 - begins with a '+' character and is optionally followed by the same sequence identifier
Line 4 - the quality values for each base in the sequence in Line 2


An Example:
@WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
CTCTGCTGTCTCTTGTGTAAGAAGANANNNCNTTCT
+WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
fffffffRcfad[ddffa]faaaaBBBBBBBBBBBB
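
The four-line layout above can be read with a few lines of Python. This is a minimal sketch, not part of our pipeline: the names `FastqRecord` and `read_fastq` are hypothetical, and the reader assumes strictly four lines per record (no wrapped sequence or quality lines).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # Line 1 without the leading '@'
    sequence: str    # Line 2
    quality: str     # Line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"Line 1 must begin with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"Line 3 must begin with '+': {plus!r}")
        # The identifier on Line 3 is optional, but must match Line 1 if present.
        if len(plus) > 1 and plus[1:] != header[1:]:
            raise ValueError("Line 3 identifier does not match Line 1")
        if len(quality) != len(sequence):
            raise ValueError("Line 4 must have one quality value per base")
        yield FastqRecord(header[1:], sequence, quality)


# The example record from this page.
example = (
    "@WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1\n"
    "CTCTGCTGTCTCTTGTGTAAGAAGANANNNCNTTCT\n"
    "+WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1\n"
    "fffffffRcfad[ddffa]faaaaBBBBBBBBBBBB\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `with open(path) as handle: ...`, where `path` is whichever FASTQ file you received.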


The sequence identifier in the above example is "WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1". Note that the last two characters, ";1", may be found only in datasets sequenced by us: we do not remove filtered reads but flag each read with a 1 (not filtered, i.e. good) or a 0 (filtered, i.e. bad). Please refer to the FAQ for the filtering criteria and what they mean. The parts of the identifier are:

WICMT-SOLEXA - instrument name
0043 - a unique random string for the whole run (pretty much meaningless)
7 - flowcell lane
1 - tile number within the flowcell lane
1500 - 'x'-coordinate of the cluster within the tile
1199 - 'y'-coordinate of the cluster within the tile
#0 - index number for a multiplexed sample (0 for no indexing)
/1 - the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
;1 - read was not filtered (this bit of information is typically not present in FASTQ files created by the Illumina pipeline)
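
The breakdown above can be turned into a small parser. The following is a sketch under stated assumptions: the regular expression is inferred from the single example identifier on this page (including the "_" between the instrument name and the run string), and the names `IDENTIFIER_RE`, `ReadId`, and `parse_identifier` are hypothetical. Identifiers from other pipelines or instruments may not match it.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one example on this page; treat it as an assumption.
IDENTIFIER_RE = re.compile(
    r"^(?P<instrument>[^_:]+)_(?P<run>[^:]+)"               # WICMT-SOLEXA_0043
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"   # :7:1:1500:1199
    r"#(?P<index>[^/;]+)"                                   # #0
    r"(?:/(?P<pair>[12]))?"                                 # /1 or /2, optional
    r"(?:;(?P<flag>[01]))?$"                                # ;1 or ;0, optional
)


class ReadId(NamedTuple):
    instrument: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: Optional[int]            # None for single-end reads
    passed_filter: Optional[bool]  # None when the ';0' / ';1' suffix is absent


def parse_identifier(identifier: str) -> ReadId:
    """Split a sequence identifier (without the leading '@') into its fields."""
    match = IDENTIFIER_RE.match(identifier)
    if match is None:
        raise ValueError(f"Unrecognised identifier: {identifier!r}")
    g = match.groupdict()
    return ReadId(
        instrument=g["instrument"],
        run=g["run"],
        lane=int(g["lane"]),
        tile=int(g["tile"]),
        x=int(g["x"]),
        y=int(g["y"]),
        index=g["index"],
        pair=int(g["pair"]) if g["pair"] is not None else None,
        # 1 means the read was not filtered (good); 0 means it was filtered (bad).
        passed_filter=(g["flag"] == "1") if g["flag"] is not None else None,
    )


read_id = parse_identifier("WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1")
print(read_id.lane, read_id.tile, read_id.passed_filter)
```

Keeping `passed_filter` as `None` when the suffix is missing lets the same function handle FASTQ files from the standard Illumina pipeline, which typically omit this flag, without mistaking those reads for filtered ones.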