Biopython's built-in Bio.SeqIO module provides functions to read sequences from a file and to write sequences to a file. Bio.SeqIO supports nearly all file formats used in bioinformatics. Whatever the input format, Biopython presents the parsed sequence data to the user in one uniform way: as SeqRecord objects.
SeqRecord
The SeqRecord object, provided by the Bio.SeqRecord module, holds a sequence together with its metadata. Its main attributes are listed below:
Attribute | Description |
---|---|
seq | The sequence itself, typically a Seq object. |
id | Primary identifier of the sequence; a string by default. |
name | Name of the sequence; a string by default. |
description | Human-readable description of the sequence. |
annotations | Dictionary of additional information about the sequence. |
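To make the attributes listed above concrete, here is a minimal sketch that builds a SeqRecord by hand and reads each field back. The sequence, identifier, name and description are made-up illustration values, not data from any real database entry.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# All values below are hypothetical and chosen only for illustration.
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGC"),
    id="example_1",
    name="example",
    description="hypothetical demo sequence",
)

# annotations starts out as an empty dictionary and can be filled freely.
record.annotations["molecule_type"] = "DNA"

print(record.id)
print(record.name)
print(record.description)
print(len(record.seq))
print(record.annotations)
```

Records returned by Bio.SeqIO expose the same attributes, so code written against a hand-built record should also work on parsed ones.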
Reading Sequence:
Biopython's Bio.SeqIO module provides a read() function, which takes a sequence file and turns it into a single SeqRecord according to the file format. It can only parse files that contain exactly one record; if the file has no records or more than one record, an exception (ValueError) is raised. The syntax and arguments of read() are given below:
Bio.SeqIO.read(handle, format, alphabet=None)
Arguments | Description |
---|---|
handle | File handle, or a filename given as a string (older versions accept only a handle). |
format | File format as a lowercase string. |
alphabet | Optional parameter, used when the sequence type cannot be inferred from the file (e.g. format = "fasta"). Biopython 1.78 and later no longer support this argument, so leave it as None. |
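The exactly-one-record rule described above can be checked without any files on disk. The sketch below feeds small, made-up FASTA strings to read() through io.StringIO; the helper try_read is a hypothetical name introduced only for this illustration.

```python
from io import StringIO

from Bio import SeqIO

# Tiny made-up FASTA inputs, used only for illustration.
one_record = ">seq1 demo\nACGT\n"
two_records = ">seq1 demo\nACGT\n>seq2 demo\nGGCC\n"

# Exactly one record: read() returns a single SeqRecord.
record = SeqIO.read(StringIO(one_record), "fasta")
print(record.id, len(record))


def try_read(text):
    """Return the SeqRecord, or the ValueError raised by SeqIO.read()."""
    try:
        return SeqIO.read(StringIO(text), "fasta")
    except ValueError as err:
        return err


# More than one record, or none at all: read() raises ValueError.
print(repr(try_read(two_records)))
print(repr(try_read("")))
```

When a file may hold any number of records, parse() is the safer choice.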
Python3

    # Import libraries
    from Bio import SeqIO

    # Reading file
    record = SeqIO.read("sequence.gb", "genbank")

    # Showing records
    print("ID: %s" % record.id)
    print("Sequence length: %i" % len(record))
    print("Sequence description: %s" % record.description)
Parsing Sequence:
The parse() function provided by the Bio.SeqIO module is used to read multiple records from a handle. It turns the sequence file into an iterator that returns SeqRecord objects. If the data is held in a string rather than in a file, it must first be wrapped in a handle (for example with io.StringIO) before it can be parsed. For file formats where the alphabet cannot be determined (e.g. FASTA), Biopython versions before 1.78 allowed it to be specified explicitly. The syntax and arguments of parse() are given below:
Bio.SeqIO.parse(handle, format, alphabet=None)
Arguments | Description |
---|---|
handle | File handle, or a filename given as a string (older versions accept only a handle). |
format | File format as a lowercase string. |
alphabet | Optional parameter, used when the sequence type cannot be inferred from the file (e.g. format = "fasta"). Biopython 1.78 and later no longer support this argument, so leave it as None. |
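As a minimal sketch of the string-to-handle step mentioned above, the example below wraps a made-up two-record FASTA string in io.StringIO and passes it to parse(). It also shows that the result is a lazy iterator: records are produced one at a time as they are requested.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text held in a string instead of a file.
fasta_text = (
    ">alpha first demo record\nACGTACGT\n"
    ">beta second demo record\nGGCCTT\n"
)

# Wrap the string in a handle so that parse() can read from it.
records = SeqIO.parse(StringIO(fasta_text), "fasta")

# Pull the first record on demand.
first = next(records)
print(first.id, len(first))

# Whatever has not been consumed yet can still be collected into a list.
remaining = list(records)
print([r.id for r in remaining])
```

Because the iterator is consumed as it is read, call list() on it up front if the records need to be traversed more than once.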
Python3

    # Import libraries
    from Bio import SeqIO

    # Parsing file
    filename = "sequence.fasta"
    for record in SeqIO.parse(filename, "fasta"):
        # Showing records
        print("ID: %s" % record.id)
        print("Sequence length: %i" % len(record))
        print("Sequence description: %s" % record.description)
Writing Sequence:
To write to a file, the Bio.SeqIO module provides a write() function, which writes a set of sequences to the file and returns an integer giving the number of records written. If you pass a handle rather than a filename, be sure to close it after calling write(); otherwise the data may not be flushed to disk. The syntax and arguments of write() are given below:
Bio.SeqIO.write(sequences, handle, format)
Arguments | Description |
---|---|
sequences | List or iterator of SeqRecord objects (or a single SeqRecord in Biopython 1.54 or later). |
handle | File handle, or a filename given as a string (older versions accept only a handle). |
format | File format to write as a lowercase string |
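The return value of write() can be inspected with a short sketch like the one below. It writes two made-up records to an in-memory io.StringIO handle instead of a real file, so nothing is left on disk; the record ids and descriptions are hypothetical.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two small made-up records, used only for illustration.
records = [
    SeqRecord(Seq("ACGT"), id="demo1", description="hypothetical record one"),
    SeqRecord(Seq("GGCCTT"), id="demo2", description="hypothetical record two"),
]

# An in-memory handle stands in for an open file.
buffer = StringIO()
count = SeqIO.write(records, buffer, "fasta")

print(count)              # number of records written
print(buffer.getvalue())  # the FASTA text that was produced

# Round trip: parse the written text back into records.
round_trip = list(SeqIO.parse(StringIO(buffer.getvalue()), "fasta"))
print([r.id for r in round_trip])
```

Comparing the returned count against the number of records you expected is a cheap sanity check after writing.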
Python3

    # Import libraries
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    rec1 = SeqRecord(
        Seq(
            "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
            + "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
            + "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        ),
        id="gi|14150838|gb|AAK54648.1|AF376133_1",
        description="chalcone synthase [Cucumis sativus]",
    )

    rec2 = SeqRecord(
        Seq(
            "MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
            + "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
            + "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
            + "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
            + "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
            + "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
            + "TGEGLEWGVLFGFGPGLTVETVVLHSVAT"
        ),
        id="gi|13925890|gb|AAK49457.1|",
        description="chalcone synthase [Nicotiana tabacum]",
    )

    sequences = [rec1, rec2]

    # Writing to file
    with open("example.fasta", "w") as output_handle:
        SeqIO.write(sequences, output_handle, "fasta")

    for record in SeqIO.parse("example.fasta", "fasta"):
        print("ID %s" % record.id)
        print("Sequence length %i" % len(record))