Sequence file format notes

Platforms:

SeqVerter automatically detects and reads Mac and UNIX files without any modification. On export, you may select the file type using the OS Target option in the Save as dialog box or the Export single dialog box.
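The following is a minimal sketch of how mixed platform files can be handled, assuming the platform difference in question is the line-ending convention (CR for classic Mac, LF for UNIX, CRLF for DOS/Windows). The function names and the "dos"/"unix"/"mac" target labels are hypothetical and are not SeqVerter options.

```python
def split_any_newlines(raw: bytes) -> list[str]:
    """Split file bytes into lines, accepting CRLF, LF, or lone CR endings."""
    # latin-1 maps every byte to one character, so decoding never fails.
    text = raw.decode("latin-1")
    # Normalise DOS (CRLF) first, then classic Mac (CR), to UNIX (LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.split("\n")


def join_for_target(lines: list[str], os_target: str) -> bytes:
    """Join lines using the newline of a hypothetical OS target label."""
    newline = {"dos": "\r\n", "unix": "\n", "mac": "\r"}[os_target]
    return newline.join(lines).encode("latin-1")
```

Normalising on input and choosing the newline only on output keeps every format reader independent of the platform the file came from.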

Format categories:
Sequence file formats can be divided into two primary categories: single sequence and multiple sequence. Single sequence files support only one sequence per file, while multiple sequence files support one or more sequences per file.

Multiple sequence files can be further divided into two secondary categories: sequential and interleaved. In sequential formats, each sequence entry is written out completely before the next entry starts. In interleaved formats, a line of one sequence is followed by one line of the next sequence and so on. Sequential formats are typical of sequence libraries, while interleaved formats are typical of alignment files, such as those produced by programs like Clustal and PHYLIP.
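The difference between the two layouts can be sketched in a few lines of Python. This is an illustration only: it ignores names, headers, and spacing, which every real interleaved format adds in its own way, and the function names and the default width of 60 are assumptions.

```python
def interleave(sequences: list[str], width: int = 60) -> list[list[str]]:
    """Cut each sequence into chunks; block k holds chunk k of every sequence."""
    longest = max((len(s) for s in sequences), default=0)
    return [[s[start:start + width] for s in sequences]
            for start in range(0, longest, width)]


def deinterleave(blocks: list[list[str]]) -> list[str]:
    """Rebuild one complete sequence per entry from interleaved blocks."""
    if not blocks:
        return []
    count = len(blocks[0])
    sequences = [""] * count
    for block in blocks:
        if len(block) != count:
            raise ValueError("every block must hold one line per sequence")
        for i, chunk in enumerate(block):
            sequences[i] += chunk
    return sequences
```

Reading an interleaved file amounts to the second function; writing one amounts to the first.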

Supported formats:

ABI (read only)
Clustal
DCSE (read only)
DNASIS
DNAStar/DNA*/Lasergene
DNAStrider
EMBL (read only)
FASTA
FASTA (Strict)
FASTA (Sequin)
GDE (read only)
GenBank
IBI/Pustell
MACAW (read only)
MSF
NEXUS
PHYLIP Interleaved
PIR/NBRF (read only)
Plain/Line/Text
SCF (read only)
SWISS-PROT (read only)
TreeCon

Other:
Unrecognized formats

Format notes: ABI
The ABI format (*.abi and *.ab1) is used for Applied Biosystems sequencer trace files. The format is single sequence and is read only. SeqVerter can read native ABI sequencer files and extract the raw sequence data. SeqVerter can automatically trim the 5' and 3' ends of ABI sequence data on import. To set up automatic trimming, you must set trimming parameters using the "Trace files" tab of the "Options" dialog box.
SeqVerter does not include a trace file viewer/editor. However, the free SeqVerter component of the GeneStudio molecular biology software suite includes an excellent viewer of trace files. In addition, the Contig editor component includes the ability to align and edit trace files.
A complete description of the ABI file format may be found in:
Raw Data File Formats, and the Digital and Analog Raw Data Streams of the ABI PRISM 377 DNA Sequencer (Clark Tibbetts, 1995, unpublished). http://www.cs.cmu.edu/afs/cs/project/genome/WWW/Papers/clark.html

Format notes: Clustal
SeqVerter can read and write Clustal files. All of the sequences exported to this format must be the same length. On export, SeqVerter opens the Aligned format options dialog box so that you can choose how the sequence lengths are adjusted and which columns are exported.
For more information on the Clustal format, please see: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
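One simple way to bring sequences to equal length is to pad the shorter ones on the right with a gap symbol. The sketch below assumes that strategy and a '-' gap character. The options that SeqVerter actually offers are set in its dialog box and are not described here, and the function name is hypothetical.

```python
def make_flush(sequences: dict[str, str], gap: str = "-") -> dict[str, str]:
    """Right-pad every sequence with the gap symbol up to the longest length."""
    if len(gap) != 1:
        raise ValueError("the gap symbol must be a single character")
    longest = max((len(seq) for seq in sequences.values()), default=0)
    return {name: seq.ljust(longest, gap) for name, seq in sequences.items()}
```

For example, make_flush({"a": "ACGT", "b": "AC"}) leaves "a" unchanged and pads "b" to "AC--".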

Format notes: DCSE
DCSE (Dedicated Comparative Sequence Editor) is a multiple sequence alignment editor. SeqVerter reads DCSE alignment files. 
For more information on DCSE, see: http://rrna.uia.ac.be/dcse/

Format notes: DNASIS
DNASIS is a single sequence format used by the DNASIS sequence analysis package. SeqVerter can read and write DNASIS files.
More information: http://www.miraibio.com/products/cat_bioinformatics/view_dnasismax/index.html

Format notes: DNAStar/DNA*/Lasergene
DNAStar/DNA*/Lasergene is a single sequence format (http://www.dnastar.com).
DNAStar exports sequences in a variety of formats, some of which may be derived from other formats such as GenBank and EMBL. The DNAStar formats do not conform to GenBank and EMBL standards, however, and are therefore not readable by sequence software designed for those formats. SeqVerter can read all forms of DNAStar files including EMBL, GenBank and contig files, and writes the GenBank-derived DNAStar format.

Format notes: DNAStrider
DNAStrider is a multiple sequence format. SeqVerter reads and writes the DNAStrider format. Each sequence entry has three header lines, starting with a semi-colon ';' in position 1 of the line. The second header line includes the sequence type, name, and size. The sequence data follows the third header line, and is terminated with a double forward slash, '//'. SeqVerter will also read the “binary” variation of the DNAStrider format.
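A reader for the text layout described above might look like the following sketch. It relies only on the details given here (header lines start with ';', the entry ends with '//'). It does not interpret the second header line, because the exact field layout is not specified in these notes, and it does not handle the binary variation. The function name is hypothetical.

```python
def read_strider_entries(lines):
    """Return one dict per entry with its raw header lines and its sequence."""
    entries = []
    headers, chunks = [], []
    for line in lines:
        line = line.rstrip()
        if line.startswith(";"):
            # Header line: keep the text after the semi-colon, uninterpreted.
            headers.append(line[1:].strip())
        elif line == "//":
            # Terminator: close the current entry and start a new one.
            entries.append({"headers": headers, "sequence": "".join(chunks)})
            headers, chunks = [], []
        elif line:
            # Sequence data: drop any internal whitespace.
            chunks.append("".join(line.split()))
    return entries
```

The function accepts any iterable of text lines, such as an open text file.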

Format notes: EMBL
EMBL is a multiple sequence format. SeqVerter reads EMBL sequence files, but does not write them. EMBL files are made up of individual sequence entries, each of which consists of lines containing different kinds of information identified by the first two letters of the line. 
SeqVerter reads the following lines from EMBL files:
ID - Sequence name (locus) and type.
DE - Sequence definition (description).
AC - Sequence accession number.
SQ - Sequence data.
More information: http://ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
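The line-code scheme lends itself to a small dispatch loop. The sketch below reads only the four line types listed above and assumes the usual EMBL flat-file conventions: the sequence lines follow the SQ line, and each entry ends with a '//' line. It is not SeqVerter's own reader, and the function name is hypothetical.

```python
def read_embl_entries(lines):
    """Collect the ID, DE, AC, and sequence data of each entry."""
    entries, current, in_sequence = [], None, False
    for line in lines:
        line = line.rstrip("\r\n")
        code, rest = line[:2], line[2:].strip()
        if code == "ID":
            current = {"id": rest, "definition": "", "accession": "", "sequence": ""}
            in_sequence = False
        elif current is None:
            continue  # ignore anything before the first ID line
        elif code == "//":
            entries.append(current)
            current, in_sequence = None, False
        elif code == "DE":
            # DE and AC may span several lines; join them with a space.
            current["definition"] = (current["definition"] + " " + rest).strip()
        elif code == "AC":
            current["accession"] = (current["accession"] + " " + rest).strip()
        elif code == "SQ":
            in_sequence = True
        elif in_sequence:
            # Sequence lines carry spacing and a trailing position count.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return entries
```

The same loop would also serve for SWISS-PROT files, which use the same line codes.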

Format notes: FASTA
FASTA is a common format that serves as both a single sequence format and a sequential multiple sequence format. SeqVerter reads and writes FASTA files. There are a number of FASTA variations, but the essential format is quite simple. Each sequence entry has a sequence information line starting with a "greater than" sign (>), followed by the locus (sequence) name and, optionally, additional information on the same line. Any subsequent lines are assumed to be the sequence. A given sequence entry ends when a new sequence information line is found. Many implementations of the FASTA format limit the length of the locus (sequence) name to 8 characters and the sequence data line length to 80 characters.
SeqVerter will correctly read several FASTA variants, including entries with more than one sequence information line per entry, NCBI's gi format, and others. SeqVerter may not read sequences downloaded in FASTA format from NCBI's Entrez, because of incorrect definition line formatting: the definition line is wrapped at 80 characters, causing the definition line to be mixed in with the sequence data.
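The essential FASTA layout can be read with a few lines of Python. The sketch below covers only the basic form (one information line per entry) and is not SeqVerter's reader; the function name is hypothetical.

```python
def read_fasta(lines):
    """Return a list of (name, description, sequence) tuples."""
    entries = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # The first word is the name; the rest of the line is the description.
            name, _, description = line[1:].strip().partition(" ")
            entries.append({"name": name, "description": description.strip(),
                            "chunks": []})
        elif line and entries:
            # Every other non-empty line belongs to the current sequence.
            entries[-1]["chunks"].append("".join(line.split()))
    return [(e["name"], e["description"], "".join(e["chunks"])) for e in entries]
```

Note that a definition line wrapped onto a second line, as described above, would be read as sequence data by this simple loop.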
The FASTA format is the same as the FASTA (Strict) format, except that it allows longer sequence names.
SeqVerter writes three variants of FASTA:
FASTA (Strict) - The information line contains only an eight-character locus (sequence) name. This restriction is designed to permit maximum portability.
FASTA - This format allows longer locus (sequence) names (up to 255 characters), mainly for use in GenBank BLAST searches. Please note that many programs may crash when opening such a non-standard FASTA file.
FASTA (Sequin) - For use with the Sequin submission tool.
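As an illustration of the strict variant, the sketch below writes an entry with the name cut to eight characters and the sequence wrapped at 80 characters per line, the limits mentioned earlier in this section. How SeqVerter itself shortens a long name is not documented here, so simple truncation is an assumption, as is the function name.

```python
def format_fasta_strict(name, sequence, name_limit=8, line_width=80):
    """Format one entry with a short name and fixed-width sequence lines."""
    # Assumption: remove whitespace from the name, then truncate it.
    short_name = "".join(name.split())[:name_limit]
    lines = [">" + short_name]
    lines += [sequence[i:i + line_width]
              for i in range(0, len(sequence), line_width)]
    return "\n".join(lines) + "\n"
```

Truncation can make two different names identical, so a real exporter would need to check for collisions.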

Format notes: GDE
The GDE format is a multiple sequence format of the Genetic Data Environment program. SeqVerter reads GDE files.

Format notes: GenBank
GenBank is both a single sequence format and a sequential multiple sequence format. SeqVerter reads and writes GenBank files. The GenBank format consists of fields delimited by keywords. SeqVerter supports all of the fields except FEATURES and COMMENT. A complete description of the current format is available at: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt. The following is a brief description of each entry field, taken from the GenBank documentation.

  • LOCUS A short mnemonic name for the entry, chosen to suggest the sequence's definition. Mandatory keyword, exactly one record. GenBank entries may have the ACCESSION number as the LOCUS name. 
  • DEFINITION A concise description of the sequence. Mandatory keyword, one or more records.
  • ACCESSION The primary accession number is a unique, unchanging code assigned to each entry. (Please use this code when citing information from GenBank.) Mandatory keyword, one or more records.
  • NID The unique nucleic acid identifier that has been assigned to the current version of the sequence data that are associated with the GenBank entry identified by a given primary accession number.
  • KEYWORDS Short phrases describing gene products and other information about an entry. Mandatory keyword in all annotated entries, one or more records.
  • SEGMENT Information on the order in which this entry appears in a series of discontinuous sequences from the same molecule. Optional keyword (only in segmented entries), exactly one record.
  • SOURCE Common name of the organism or the name most frequently used in the literature. Mandatory keyword in all annotated entries, one or more records, includes one sub-keyword.
  • ORGANISM Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines). Mandatory sub-keyword in all annotated entries, two or more records.
  • REFERENCE Citations for all articles containing data reported in this entry. Includes four sub-keywords and may repeat. Mandatory keyword, one or more records.
  • AUTHORS Lists the authors of the citation. Mandatory sub-keyword, one or more records.
  • TITLE Full title of citation. Optional sub-keyword (present in all but unpublished citations), one or more records.
  • JOURNAL Lists the journal name, volume, year, and page numbers of the citation. Mandatory sub-keyword, one or more records.
  • MEDLINE Provides the Medline unique identifier for a citation. Optional sub-keyword, one record.
  • REMARK Specifies the relevance of a citation to an entry. Optional sub-keyword, one or more records.
  • COMMENT Cross-references to other sequence entries, comparisons to other collections, notes of changes in LOCUS names, and other remarks. Optional keyword, one or more records, may include blank records.
  • FEATURES Table containing information on portions of the sequence that code for proteins and RNA molecules and information on experimentally determined sites of biological significance. Optional keyword, one or more records.
  • BASE COUNT Summary of the number of occurrences of each base code in the sequence. Mandatory keyword, exactly one record.
  • ORIGIN Specification of how the first base of the reported sequence is operationally located within the genome. Where possible, this includes its location within a larger genetic map. Mandatory keyword, exactly one record. The ORIGIN line is followed by sequence data (multiple records).
  • // Entry termination symbol. Mandatory at the end of an entry, exactly one record.
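The keyword structure described in the list above can be walked with a simple loop. The sketch below extracts only LOCUS, DEFINITION, ACCESSION, and the sequence data after ORIGIN. It assumes that keywords start in the first column and that indented lines continue the preceding keyword. It is an illustration, not SeqVerter's reader, and the function name is hypothetical.

```python
def read_genbank_entries(lines):
    """Collect locus, definition, accession, and sequence for each entry."""
    entries, current, keyword = [], None, None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("//"):
            # Entry termination symbol: store the finished entry.
            if current is not None:
                entries.append(current)
            current, keyword = None, None
        elif not line.strip():
            continue
        elif not line[0].isspace():
            # A keyword starts in the first column; its value follows it.
            keyword, _, value = line.partition(" ")
            value = value.strip()
            if keyword == "LOCUS":
                current = {"locus": value.split()[0] if value else "",
                           "definition": "", "accession": "", "sequence": ""}
            elif current is not None and keyword == "DEFINITION":
                current["definition"] = value
            elif current is not None and keyword == "ACCESSION":
                current["accession"] = value
        elif current is not None:
            # Indented lines continue the most recent keyword.
            if keyword == "DEFINITION":
                current["definition"] += " " + line.strip()
            elif keyword == "ORIGIN":
                # Sequence records start with a position number; keep letters.
                current["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return entries
```

Because the loop resets at every '//' line, the same function handles both single sequence and multiple sequence GenBank files.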


Format notes: IBI/Pustell
IBI/Pustell is a single sequence file format derived from the pre-1990 GenBank standard, and is only available for export using the Export single button. Support for the IBI/Pustell program was discontinued in the early 1990s. SeqVerter can read and write IBI/Pustell files.
The IBI/Pustell format is similar to the GenBank format. The major difference is in the file names. IBI/Pustell file names are limited to the DOS 8+3 format. In addition, all nucleotide file names must start with the letter 'C', and all amino acid file names must start with the letter 'A'. Furthermore, the locus must be the same as the file name, without the 'C' or 'A'.
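The naming rules above can be checked mechanically. The sketch below validates a file name against the 8+3 limit and the 'C'/'A' prefix rule, and returns the locus implied by the name. The function name is hypothetical, and the notes do not say which file extension IBI/Pustell expects, so the extension is only length-checked.

```python
import re


def pustell_locus(file_name):
    """Return (sequence kind, locus) for a valid IBI/Pustell-style file name."""
    base, dot, ext = file_name.partition(".")
    # DOS 8+3: up to 8 characters, then optionally a dot and up to 3 more.
    if not re.fullmatch(r"[^. ]{1,8}", base):
        raise ValueError("base name must be 1 to 8 characters")
    if dot and not re.fullmatch(r"[^. ]{1,3}", ext):
        raise ValueError("extension must be 1 to 3 characters")
    kinds = {"C": "nucleotide", "A": "amino acid"}
    prefix = base[0].upper()
    if prefix not in kinds:
        raise ValueError("file name must start with 'C' or 'A'")
    # The locus is the file name without the leading 'C' or 'A'.
    return kinds[prefix], base[1:]
```

One consequence of the rules is that the locus can be at most seven characters long, since the prefix letter uses one of the eight available positions.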
When converting GenBank files to the IBI/Pustell format, SeqVerter removes Feature table entries that are incompatible with IBI/Pustell.

Format notes: MACAW
MACAW is a multiple sequence alignment tool that used to be distributed by NCBI. SeqVerter can read aligned sequences from MACAW files and export them to another format, such as PHYLIP, for phylogenetic analysis.

Format notes: MSF
MSF is an interleaved multiple sequence alignment file format. This format is used by the formerly popular GCG program suite (http://www.accelrys.com/about/gcg.html). SeqVerter can read and write MSF files. All of the sequences exported to this format must be the same length. 
The Alignment processing dialog box is displayed on export to MSF files to allow you to determine how the sequences are made flush, and how gaps and non-standard symbols are exported.

Format notes: NEXUS
NEXUS is a multiple sequence alignment file format. SeqVerter can read interleaved and sequential NEXUS files. SeqVerter can write NEXUS sequential files. All of the sequences exported to this format must be the same length.
The Alignment processing dialog box is displayed on export to NEXUS files to allow you to determine how the sequences are made flush, and how gaps and non-standard symbols are exported.

Format notes: PHYLIP Interleaved
SeqVerter can read and write PHYLIP interleaved files. All of the sequences exported to this format must be the same length.
More information: http://evolution.genetics.washington.edu/phylip/faq.html#format
The Alignment processing dialog box is displayed on export to PHYLIP files to allow you to determine how the sequences are made flush, and how gaps and non-standard symbols are exported.
Note: SeqVerter cannot read PHYLIP sequential files. If you attempt to read PHYLIP sequential files, the results will be undefined.

Format notes: PIR/NBRF
The PIR format is a multiple sequence format. SeqVerter reads PIR files.

Format notes: Plain/Line/Text
Sequences in plain text files are not recognized by SeqVerter because of the potential for confusion with other file types. If you wish to import a plain text sequence file into SeqVerter, you must first use a text editor (e.g., Windows Notepad) to add a FASTA header line. The FASTA header line consists of a "greater than" sign (>) followed by the sequence name.

For example, a plain text file consisting of the following:
ATATCCGTAGCATGCTAGCTAGCTGATGCATGCTAGCCAGTACTGACCGATCG
CGATATGATGCTAGCTAGCTGATCGATGCTAGCTAGCTAGTAGTAGCTAGCTA
CGATCATCAGCTAGCTAGTAGCTAGCTCA

Would become:
>MySeq
ATATCCGTAGCATGCTAGCTAGCTGATGCATGCTAGCCAGTACTGACCGATCG
CGATATGATGCTAGCTAGCTGATCGATGCTAGCTAGCTAGTAGTAGCTAGCTA
CGATCATCAGCTAGCTAGTAGCTAGCTCA

After you add the header and save the file, SeqVerter will import it as a FASTA file.
Note: The line length does not matter, and the lines do not have to be the same length. The sequence name must be all one word. Any characters in the name that come after a space will be imported as the sequence definition.
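If you have many plain text files, the same edit can be scripted. The following is a minimal sketch in Python, not a SeqVerter feature. The function name is hypothetical and "MySeq" is simply the example name used above.

```python
def add_fasta_header(plain_text: str, name: str = "MySeq") -> str:
    """Return the text with a FASTA header line placed before the sequence."""
    # The name must be one word; text after a space would become the definition.
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("the sequence name must be a single word")
    body = plain_text.strip("\r\n")
    return f">{name}\n{body}\n"
```

The sequence lines are left exactly as they are, since their length does not matter.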

Format notes: SCF
SCF format is an automated sequencer file format. The format is single sequence, and is read only. SeqVerter can read native SCF 2.x and 3.x sequencer files and extract the raw sequence data. SeqVerter can automatically trim the 5' and 3' ends of SCF sequence data on import. To set up automatic trimming, you must set trimming parameters using the "Trace files" tab of the Options dialog box.
SeqVerter does not include a trace file viewer/editor. However, the sequence analysis suite, GeneStudio Pro, includes the ability to align and edit multiple trace files.
SCF format is a binary format. See: http://staden.sourceforge.net/manual/formats_unix_2.html for a full description. 

Format notes: SWISS-PROT
SWISS-PROT is a multiple sequence format. SeqVerter reads SWISS-PROT sequence files, but does not currently write them. SWISS-PROT files are very similar to EMBL files, but consist of amino acid sequences only. Files are made up of individual sequence entries, each of which consists of lines containing different kinds of information identified by the first two letters of the line. 
SeqVerter reads the following lines from SWISS-PROT files:

  • ID - Sequence name (locus) and type.
  • DE - Sequence definition (description).
  • AC - Sequence accession number.
  • SQ - Sequence data.


Format notes: TreeCon
TreeCon is a sequential multiple sequence alignment file format. This format is the input format of the TreeCon program (http://bioinformatics.psb.ugent.be/psb/current_projects_soft.htm). SeqVerter can read and write TreeCon files. 
The Alignment processing dialog box is displayed on export to TreeCon files to allow you to determine how the sequences are made flush, and how gaps and non-standard symbols are exported.

Unrecognized formats
When you try to import a file whose format is not recognized by SeqVerter, a message box warns you that the file cannot be opened. Occasionally, however, SeqVerter will misidentify the file format as FASTA or another format. In this case, you may see strange characters in the sequence names and garbled-looking sequence data in the "View Sequence" dialog box. Examining such a file with a text editor will usually reveal the nature of the problem.