A database is an organized collection of data. For instance, a list with some of the movies that we like would be a movie database:
|movie1||The player||1992||Robert Altman|
|movie2||Cookie’s fortune||1999||Robert Altman|
|movie 3||The man who shot Liberty Valance||1962||John Ford|
- Entities: The kind of things that we want to store in a database. E.g.: Genes, DNA sequences, bibliographical references.
- Records: The particular things stored in the database. E.g.: The gene BRCA1
- Identifiers or key: The unique name that identifies a record
- Fields: The properties that an entity has. E.g.: The name, sequence and mutations of the gene
In the previous movie examples the entities stored were movies, the records stored were: The player, Cookie’s fortune and The man who shot Liberty Valance. The unique idenfiers were: movie1, movie2 and movie3.
So if we think on the database as a table, the table would store information about one entity, the fields would be the column headers and the records would be the table rows.
It is quite common to store different entities in a database. For instance we could store movies, actors and directors or genes, sequences and mutations. In that case, the different entities could be stored in different tables and the records on those tables would be related by their unique identifiers. That structure would comprise a relational database.
The databases usually provide mechanisms to store, search, retrieve and modify the data.
The data repositories more relevant to the biological sciences include:
- nucleotide and protein sequences
- protein structures
- genetic expression
Main sequence databases:
Main protein databases:
Some genome databases:
A sequence database is a collection of DNA or protein sequences with some extra relevant information. The main sequence databases are Genbank and EMBL. Originally they were just sequence collections, but they have grown to store different biological databases heavily interconnected and they provide powerful interfaces to search and browse the stored information.
The sequences submitted to any of those databases are shared between them, so any sequence could be retrieved in the european or the american database. But they differ in the tools to search and browse the data and in some databases that provide extra information to the raw sequences like: mutations, coded proteins, bibliographical references, etc.
These databases are growing at an ever increasing fast pace. In June of 2007 there were 73 million sequences in Genbank and in August of 2015 there were 187 millions.
The sequences are split in these databases in different sections to ease the search. Among others, there are sections for mRNAs, publised nucleotide sequences, genomes, and genes.
- It is a public repository, any one can send sequences to it.
- There are sequences of different qualities, anything submitted is stored.
- There could be multiple sequences for the same gene or for the same mRNA
- A sequence can have several versions that represent the modifications done by the authors.
Due to the huge amount of sequences stored to ease the search the databases are split in different divisions. These divisions follow two criteria: the species and type of sequence. Among the taxonomical divisions you can find: primate, rodent, other mammalian, invertebrate an others. The other divisions are related to the kind of sequences like: EST, WGS, HTGS, and many others. If you are looking for reads comming from the Next Generation Sequencing Technologies they are stored in a special division called SRA.
You can find more information about GenBank in its handbook.
In RefSeq there are only well annotated and good quality sequences. It stores genomic, transcript and protein sequences and links the sequences that belong to a gene. It just has one representative sequence for each mRNA in a particular organism and, thus, it will have as many sequences as different transcripts and proteins coded for a particular gene in a particular organism.
It is not the aim of ReqSeq to have any sequence, but just to have a collection of well curated sequences. It is a secondary database. Since RefSeq requires extra curation work it is not available for all organisms, but only for those with good quality sequences. As of July of 2016 it has 65M proteins and 15M transcripts for 60K organisms.
UniProt is a protein database that includes information divided in two sections: Swiss-Prot and TrEMBL. UniProt aims to store sequence and functional information for the proteins.
TrEMBL is automatically annotated while Swiss-Prot is reviewed manually by humans that add information by reviewing the literature. Due to this effort Swiss-Prot has information of a higher quality, but it has less sequences than TrEMBL.
UniProt also hosts Uniref. This database aims to store one representative sequence for each protein without taking into account the species of origin. It clusters all the similar proteins and picks one for every cluster as a representative. There are clusters created at 100%, 90% and 50% identities.
PubMed is a bibliographical database that comprises biomedical literature (MEDLINE), life science journals and on-line books. It is a good collection of publications related to biochemistry, cellular biology and medicine. As of 2016 PubMed stores 26 million citations.
For each record it stores:
There is a related database named PubMed Cental (PMC) that only includes citations of Free Access Journals. These citations include the complete text for the papers stored.
PDB, Protein Data Bank
PDB stores 3D structures for proteins and nucleic acids.
Access to the information in Genbank
Every database provides one or more methods to search and query the data. It is quite common to provide a web interface in which to do text searches with some keyword, author, ID or any other text. Genbank has a powerful query web interface.
Each database shows the results in one or several formats. For instance, the Genbank sequences can be obtained in several formats.
LOCUS EC750390 558 bp mRNA linear EST 03-JUL-2006 DEFINITION POE00005652 PL(light) Polytomella parva cDNA similar to frataxin protein -related, mRNA sequence. ACCESSION EC750390 VERSION EC750390.1 GI:110064507 KEYWORDS EST. SOURCE Polytomella parva ORGANISM Polytomella parva Eukaryota; Viridiplantae; Chlorophyta; Chlorophyceae;Chlamydomonadales; Chlamydomonadaceae; Polytomella. REFERENCE 1 (bases 1 to 558) AUTHORS Lee,R.W. and Borza,T. TITLE The colorless plastid of the green alga Polytomella parva: a repertoire of its functions JOURNAL Unpublished (2006) COMMENT Contact: TBestDB Departement de Biochimie, Universite de Montreal Montreal, Canada Email: firstname.lastname@example.org Plate: 4065. FEATURES Location/Qualifiers source 1..558 /organism="Polytomella parva" /mol_type="mRNA" /db_xref="taxon:51329" /clone_lib="PL(light)" ORIGIN 1 gcggccgctt tttttttttt tttttttttt ttttcgtccg ttatttcttt tttaagaatg 61 cagtcatctg tacatcgtca agtattcgga gtgttatctc gttttgtggg aaacaaagcg 121 ggtattttta caaagcataa tcatggtgtc tcaaggttgt cttcatgcac ttcgtcatgc 181 gtaaagatgt atactagcaa caaggccccc gaggatcttc aaacgttcca ccggcaagca 241 gacgaaactc tagagcaagt cactgaagcc cttgaaaact atgtagatga gcatgaagtg 301 gaaggcagcg acattgagca tacgcaagga gtgcttacta ttaagcttgg aactcttgga 361 agttatgtaa ttaataaaca gactcctaat aagcagatat ggttatcctc tcccgtcagt 421 ggacccttcc gatatgatct taaagaaggt gcctgggttt atgaacgggc tggcgaggct 481 cggcgcgagc ttatttctca attagaaaca gaaatttcgg atttagttgg tgtcgaatta 541 aagataagta actgaacg
ID EC750390; SV 1; linear; mRNA; EST; PLN; 558 BP. XX AC EC750390; XX DT 04-JUL-2006 (Rel. 88, Created) DT 04-JUL-2006 (Rel. 88, Last updated, Version 1) XX DE POE00005652 PL(light) Polytomella parva cDNA similar to frataxin DE protein-related, mRNA sequence. XX KW EST. XX OS Polytomella parva OC Eukaryota; Viridiplantae; Chlorophyta; Chlorophyceae; Chlamydomonadales; OC Chlamydomonadaceae; Polytomella. XX RN  RP 1-558 RA Lee R.W., Borza T.; RT "The colorless plastid of the green alga Polytomella parva: a repertoire of RT its functions"; RL Unpublished. XX DR UNILIB; 42732; 19932. XX CC Contact: TBestDB CC Departement de Biochimie, Universite de Montrealhttp://es.wikipedia.org/wiki/Base_de_datos CC Montreal, Canada CC Email: email@example.com CC Plate: 4065. XX FH Key Location/Qualifiers FH FT source 1..558 FT /organism="Polytomella parva" FT /mol_type="mRNA" FT /clone_lib="PL(light)" FT /db_xref="taxon:51329" FT /db_xref="UNILIB:42732" XX SQ Sequence 558 BP; 153 A; 105 C; 127 G; 173 T; 0 other; gcggccgctt tttttttttt tttttttttt ttttcgtccg ttatttcttt tttaagaatg 60 cagtcatctg tacatcgtca agtattcgga gtgttatctc gttttgtggg aaacaaagcg 120 ggtattttta caaagcataa tcatggtgtc tcaaggttgt cttcatgcac ttcgtcatgc 180 gtaaagatgt atactagcaa caaggccccc gaggatcttc aaacgttcca ccggcaagca 240 gacgaaactc tagagcaagt cactgaagcc cttgaaaact atgtagatga gcatgaagtg 300 gaaggcagcg acattgagca tacgcaagga gtgcttacta ttaagcttgg aactcttgga 360 agttatgtaa ttaataaaca gactcctaat aagcagatat ggttatcctc tcccgtcagt 420 ggacccttcc gatatgatct taaagaaggt gcctgggttt atgaacgggc tggcgaggct 480 cggcgcgagc ttatttctca attagaaaca gaaatttcgg atttagttgg tgtcgaatta 540 aagataagta actgaacg 558
Main fields in the Genbank format.
|Field||Description||Search in Entrez|
|Locus name||Unique sequence name||[ACCN]|
|Sequence length||Sequence Length||[SLEN]|
|Molecule Type||DNA, genomic, mRNA, etc.||[PROP]|
|Genbank Division||Division for the sequence||[PROP]|
|Modification Date||Date for the last edit||[MDAT]|
|Accession||Unique accession ID. It does not changes with modifications||[ACCN]|
|Version||Version number of the sequence||All fields|
|Keywords||keywords that describe the sequence||[KYWD]|
|Source||Common name for the source species||[ORGN]|
|Organism||Oficial name for the source species||[ORGN]|
|Reference||Related publications||[TITL] [AUTH] [JOUR]|
|Features||Regions of interest||[FKEY]|
The Accession is the unique identifier for a sequence record. An accession number applies to the complete record and is usually a combination of a letter(s) and numbers, such as a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456).
The records in GenBank can be updated by an author request, accession numbers do not change, even if information in the record is changed. So, a sequence can have several versions in GenBank. Version is an unique identifier that represents a single, specific sequence in the GenBank database. If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U12345.1 → U12345.2, but the accession portion will remain stable.
Features holds information about genes and gene products, as well as regions of biological significance reported in the sequence. These can include regions of the sequence that code for proteins and RNA molecules.
FEATURES Location/Qualifiers source 1..12401 /organism="Homo sapiens" /mol_type="genomic DNA" /db_xref="taxon:9606" /chromosome="17" /map="17q24.3" gene complement(<1..4774) /gene="SOX9-AS1" /note="SOX9 antisense RNA 1" /db_xref="GeneID:400618" /db_xref="HGNC:HGNC:49321" ncRNA complement(<4744..4774) /ncRNA_class="lncRNA" /gene="SOX9-AS1" /product="SOX9 antisense RNA 1, transcript variant 2" /inference="similar to RNA sequence (same species):RefSeq:NR_103737.1" /exception="annotated by transcript or proteomic data" /transcript_id="NR_103737.1" /db_xref="GeneID:400618" /db_xref="HGNC:HGNC:49321" gene 5001..10401 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /note="SRY-box 9" /db_xref="GeneID:6662" /db_xref="HGNC:HGNC:11204" /db_xref="MIM:608160" mRNA join(5001..5803,6700..6953,7524..10401) /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /product="SRY-box 9" /transcript_id="NM_000346.3" /db_xref="GeneID:6662" /db_xref="HGNC:HGNC:11204" /db_xref="MIM:608160" exon 5001..5803 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /inference="alignment:Splign:2.1.0" /number=1 CDS join(5373..5803,6700..6953,7524..8368) /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /note="SRY (sex-determining region Y)-box 9 protein; SRY-related HMG-box, gene 9; SRY (sex determining region Y)-box9" /codon_start=1 /product="transcription factor SOX-9" /protein_id="NP_000337.1" /db_xref="CCDS:CCDS11689.1" /db_xref="GeneID:6662" /db_xref="HGNC:HGNC:11204" /db_xref="MIM:608160" /translation="MNLLDPFMKMTDEQEKGLSGAPSPTMSEDSAGSPCPSGSGSDTE NTRPQENTFPKGEPDLKKESEEDKFPVCIREAVSQVLKGYDWTLVPMPVRVNGSSKNK PHVKRPMNAFMVWAQAARRKLADQYPHLHNAELSKTLGKLWRLLNESEKRPFVEEAER LRVQHKKDHPDYKYQPRRRKSVKNGQAEAEEATEQTHISPNAIFKALQADSPHSSSGM SEVHSPGEHSGQSQGPPTPPTTPKTDVQPGKADLKREGRPLPEGGRQPPIDFRDVDIG ELSSDVISNIETFDVNEFDQYLPPNGHPGVPATHGQVTYTGSYGISSTAATPASAGHV WMSKQQAPPPPPQQPPQAPPAPQAPPQPQAAPPQQPAAPPQQPQAHTLTTLSSEPGQS QRTHIKTEQLSPSHYSEQQQHSPQQIAYSPFNLPHYSPSYPPITRSQYDYTDHQNSSS YYSHAAGQGTGLYSTFTYMNPAQRPMYTPIADTSGVPSIPQTHSPQHWEQPVYTQLTR P" STS 5374..5670 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /standard_name="PMC34415P1" /db_xref="UniSTS:273201" exon 6700..6953 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /inference="alignment:Splign:2.1.0" /number=2 STS 6765..7647 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /standard_name="PMC351321P1" /db_xref="UniSTS:273242" STS 6771..7605 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /standard_name="MARC_71570-71571:1249593890:3" /db_xref="UniSTS:528379" STS 6876..8104 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /standard_name="Sox9" /db_xref="UniSTS:502689" exon 7524..10401 /gene="SOX9" /gene_synonym="CMD1; CMPD1; SRA1; SRXX2; SRXY10" /inference="alignment:Splign:2.1.0" /number=3
Microsoft Word files are not text files, they are binary files that happen to represent documents. These documents can include text among many other things like images, charts or formats.
Sequences in plain files
We could store the sequence in a text file by just writing the sequence. This files would had to include only IUPAC characters.
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTAC GGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGT AGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCC ATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAA
This kind of file is seldom used because it lacks any metadata to identify the sequence.
The Fasta file includes a name for the sequence and, optionally, some description. The sequence should be preceded by a line that starts with the symbol >. The name will be written after that symbol. Spaces are not allowed in the sequence name. If there is a description it will be found after a space in the same line.
Several sequences can be included in the same file.
It is one of the most common formats.
>sequence1_name description ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGC >sequence2_name description ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGA
If we want to include more information we could use the GenBank or EMBL formats. It is also very common in the sequences that come directly from a sequencing machine to include the quality information, for that purpose the most common format is FASTQ.
Wikipedia defines ontology as:
an ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse.
An ontology is a way of structure the knowledge by dividing it in the entities relevant to a particular field. For instance, we have been talking about sequences, so a term in our ontology could be sequence.
It is also quite common to create hierarchical ontologies. For instance, nucleotide sequence and protein sequence could be subterms of sequence. In an ontology the terms are precisely defined and, usually, there are no synonyms.
Defining the terms relevant to a field is very useful, specially if those terms are discussed and adopted by the whole community. Standard ontologies became powerful tools that enable automatic analyses and searches. Imagine that we want to look for all enzymes related to lipid metabolism in a database. Which term should we use for the search? Lipid metabolism? Fat degradation? Lipid catabolism?
There are different biological ontologies, but the main ones are maintained by the Gene Ontology Consortium. The GO terms are used to define gene functions. Each GO term has an unique ID and a definition. There are three aspects covered by three hierarchical ontologies:
- molecular function: molecular activities of gene products
- cellular component: where gene products are active
- biological process: pathways and larger processes made up of the activities of multiple gene products.
You can browse the GO hierarchies at a GO browser
A gene can have several GO terms associated with each ontology. A gene can be annotated with terms from different levels of the hierarchy.
The GO ontologies ease the search for information and allow complex automated analyses.
Plant related ontologies
The Planteome project curates some plant related ontologies.
The Plant Ontology deals with Plant Anatomical Entities and Plant Developmental Stages.
The [Plant Trait Ontology] curates terms related to measurable traits, and the [Plant Experimental Condition] deals with experimental conditions.