Difference between revisions of "SequencingFormats"
Revision as of 14:33, 11 January 2011
We typically provide the following files/formats as part of a standard data package for HiSeq and Genome Analyzer sequencing data.
FASTQ Files (Quality Score Files)
This file format is used frequently at the Sanger Institute to bundle a sequence and its quality data. You will find these files in the "QualityScore" folder of the dataset.
An Illumina FASTQ file normally uses four lines per sequence:
Line 1 - begins with a '@' character and is followed by a sequence identifier
Line 2 - the sequence
Line 3 - begins with a '+' character and is optionally followed by the same sequence identifier
Line 4 - the quality values for each base in the sequence in Line 2
An example:
@WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
CTCTGCTGTCTCTTGTGTAAGAAGANANNNCNTTCT
+WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1
fffffffRcfad[ddffa]faaaaBBBBBBBBBBBB
The sequence identifier in the above example is "WICMT-SOLEXA_0043:7:1:1500:1199#0/1;1". Please note that the last two characters, ";1", may be found only in datasets sequenced by us: we do not remove the filtered reads but flag them with a 1 (not filtered, i.e. good) or a 0 (filtered, i.e. bad). Please refer to the FAQs for the filtering criteria and what they mean.
Field | Meaning |
---|---|
WICMT-SOLEXA | Instrument name |
0043 | A unique random string for the whole run (pretty much meaningless) |
7 | Flowcell lane |
1 | Tile number within the flowcell lane |
1500 | 'x'-coordinate of the cluster within the tile |
1199 | 'y'-coordinate of the cluster within the tile |
#0 | Index number for a multiplexed sample (0 for no indexing) |
/1 | The member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
;1 | Read was not filtered. This flag is typically not present in FASTQ files created by the Illumina pipeline |
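The four-line record layout and the identifier fields described above can be read with a few lines of Python. This is only a minimal sketch, not a tool we distribute: the names `ReadId`, `parse_read_id` and `read_fastq` are hypothetical, and the pattern assumes identifiers shaped exactly like the example, with the "/1" pair suffix and the ";1" filter flag treated as optional.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

# Assumed identifier shape: INSTRUMENT_RUN:LANE:TILE:X:Y#INDEX[/PAIR][;FLAG]
_ID_RE = re.compile(
    r"^(?P<instrument>[^_:]+)_(?P<run>[^:]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/;]+)"
    r"(?:/(?P<pair>[12]))?"
    r"(?:;(?P<flag>[01]))?$"
)


class ReadId(NamedTuple):
    instrument: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: Optional[int]            # None for single-end reads
    passed_filter: Optional[bool]  # None when the ";0"/";1" flag is absent


def parse_read_id(identifier: str) -> ReadId:
    """Split a sequence identifier (with or without the leading '@')."""
    m = _ID_RE.match(identifier.strip().lstrip("@"))
    if m is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    pair, flag = m.group("pair"), m.group("flag")
    return ReadId(
        instrument=m.group("instrument"),
        run=m.group("run"),
        lane=int(m.group("lane")),
        tile=int(m.group("tile")),
        x=int(m.group("x")),
        y=int(m.group("y")),
        index=m.group("index"),
        pair=int(pair) if pair is not None else None,
        passed_filter=(flag == "1") if flag is not None else None,
    )


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between or after records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' line, got: {header!r}")
        sequence = next(it, "").strip()
        next(it, "")  # the '+' line; its optional identifier is ignored
        quality = next(it, "").strip()
        yield header[1:], sequence, quality
```

To keep only the reads that were not filtered, one could open a file from the "QualityScore" folder, pass the file object to `read_fastq`, and test `parse_read_id(identifier).passed_filter` for each record.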
Technical-Analysis Files
This file is a summary of the alignment results. Please refer to the QC section for more information about this file.
You will only receive this file if a genome alignment was requested.
QC_Report Files
This file summarizes the quality scores (overall as well as per cycle), the number of reads and how many were filtered, and the number of sequences with adaptors along with the position/cycle at which the adaptor starts. It also gives the base composition for the entire lane on a cycle-by-cycle basis for the filtered and unfiltered sets.
Please refer to the QC section for details.
Tag Count Files
These files are created only on request. They have 4 columns:
Column 1 - Sequence
Column 2 - Number of occurrences of the sequence in column 1 BEFORE filtering
Column 3 - Does the sequence have the 3' adaptor [1 = yes, 0 = No]
Column 4 - Number of occurrences of the sequence in column 1 AFTER filtering
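As an illustration of the four columns above, the following sketch reads a tag count file into typed records. The column delimiter is not stated on this page, so the code assumes whitespace-separated columns (tab or space); the names `TagCount` and `read_tag_counts` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class TagCount(NamedTuple):
    sequence: str       # column 1
    count_before: int   # column 2: occurrences BEFORE filtering
    has_adaptor: bool   # column 3: 1 = yes, 0 = no
    count_after: int    # column 4: occurrences AFTER filtering


def read_tag_counts(lines: Iterable[str]) -> Iterator[TagCount]:
    """Parse tag count lines; assumes whitespace-delimited columns."""
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 columns, got {len(fields)}"
            )
        sequence, before, adaptor, after = fields
        yield TagCount(sequence, int(before), adaptor == "1", int(after))
```

For example, summing `count_after` over all records gives the number of reads that survived filtering, under the same delimiter assumption.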
ELAND Alignment Files
These files are created if alignments are requested. They can be found in the "Eland_Results_Extended" folder.
The current file naming convention is ss_<LANE>_eland_extended_<GENOME>_<SEEDLENGTH>-<READLENGTH>.txt.
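The naming convention above can be taken apart with a regular expression, for instance to group alignment files by lane. This is a sketch under assumptions: the seed and read lengths are taken to be integers, the lane is taken to contain no underscore, and the file name used in the comment is a made-up example rather than a real deliverable. `parse_eland_filename` is a hypothetical helper.

```python
import re
from typing import Dict, Union

# ss_<LANE>_eland_extended_<GENOME>_<SEEDLENGTH>-<READLENGTH>.txt
_ELAND_NAME_RE = re.compile(
    r"^ss_(?P<lane>[^_]+)_eland_extended_"
    r"(?P<genome>.+)_(?P<seed_length>\d+)-(?P<read_length>\d+)\.txt$"
)


def parse_eland_filename(name: str) -> Dict[str, Union[str, int]]:
    """Return lane, genome, seed length and read length from a file name."""
    m = _ELAND_NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not an ELAND extended file name: {name!r}")
    return {
        "lane": m.group("lane"),
        "genome": m.group("genome"),  # may itself contain underscores
        "seed_length": int(m.group("seed_length")),
        "read_length": int(m.group("read_length")),
    }


# Hypothetical example name, for illustration only:
# parse_eland_filename("ss_7_eland_extended_hg18_25-36.txt")
```

The genome group is matched greedily so that genome labels containing underscores are still captured whole, with the trailing `<SEEDLENGTH>-<READLENGTH>` pair anchoring the end of the name.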
Since the alignment program has been evolving, we have had different formats at different time points. Please refer to this page for the different formats handed out by GTC - http://jura.wi.mit.edu/genomecorewiki/index.php/AlignmentFormat
YLF (Young Lab Format) Files
These files are created on request only. They can be found in the "Eland_Results_Extended" folder along with the alignment results.
UCSC WIG Format Files
These are WIG-format files that can be uploaded directly to the UCSC browser for data visualization. They are created only on request and can be found in the "Analysis" folder.
Other Files
Depending on the level of collaboration and analysis performed, all other file formats can be found in the "Analysis" folder. These can include normalized counts for each mRNA, peak calls from MACS, miRNA counts, SAM/BAM files, etc.