Database Identifier Mapping use cases
Links between database identifiers are required for four specific use cases. Most of these identifiers refer to molecules (proteins, small molecules). Identifiers for complexes, interactions, pathways, molecular states, etc. are also required, but are less important in the short to mid term (e.g. the next 3-5 years).
Use cases:
- Unification during dataset merging: During a merge operation e.g. of two protein-protein interaction datasets from independently created databases, it is vital to recognize that two protein objects, one from each data source, represent the same protein molecule, even if the protein objects don’t share any database accession numbers. Unification requires knowledge of record type e.g. you cannot reliably use a gene ID to unify proteins (mostly because splice variants exist). 
- Link out to related references: When presenting information about a protein to a user on a web page, it is useful to display links to related information about the protein, such as further information about the protein sequence and sequence feature annotations (e.g. in UniProt), Gene Ontology annotations, domain annotations (InterPro), etc. 
- Identifier translation: Some analysis methods require specific translations from one set of identifiers to another. For instance, our ‘activity centers’ analysis requires translation from protein or gene identifiers in a pathway database to Affymetrix probe set identifiers or other gene expression array platform identifiers. 
- Searching for a favorite gene name: Preferred gene names used for querying a pathway database should return all genes/proteins with that name, if they exist in the database. Unlike database accession numbers, gene names are not guaranteed unique, thus cannot reliably be used for the other use cases. 
Links are available from many sources, but not every source addresses each use case (and none address all use cases). All services that allow all data to be downloaded can conceivably be used for all use cases with the help of a separate software system that can store different link types (e.g. unification links, link out links), although this also requires recognition of record type.
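The "separate software system" described above could be as small as a typed link table. The following is a minimal sketch in Python, not a description of any existing tool: the class names, field names and the gene accession are all hypothetical. It stores each link with its type and refuses unification links between records of different types.

```python
from dataclasses import dataclass
from enum import Enum


class LinkType(Enum):
    UNIFICATION = "unification"  # both records describe the same molecule
    LINKOUT = "linkout"          # related information, different identity


@dataclass(frozen=True)
class Ref:
    database: str     # e.g. "UniProt"
    accession: str    # e.g. "P12345"
    record_type: str  # e.g. "protein" or "gene"


class LinkStore:
    """In-memory store of typed links between database records."""

    def __init__(self):
        self._links = set()

    def add(self, source, target, link_type):
        # Unification is only meaningful between records of the same type,
        # e.g. a gene ID cannot reliably unify proteins.
        if (link_type is LinkType.UNIFICATION
                and source.record_type != target.record_type):
            raise ValueError("unification requires matching record types")
        self._links.add((source, target, link_type))

    def targets(self, source, link_type):
        """Return all records linked from `source` with the given link type."""
        return sorted(
            (t for s, t, lt in self._links if s == source and lt is link_type),
            key=lambda ref: (ref.database, ref.accession),
        )


store = LinkStore()
protein = Ref("UniProt", "P12345", "protein")
gene = Ref("Entrez Gene", "0000", "gene")  # made-up accession, for illustration
store.add(protein, gene, LinkType.LINKOUT)
```

Asking the store for `LinkType.UNIFICATION` targets of the protein returns an empty list, while the link-out query returns the gene record.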
Mapping services
AliasServer
http://cbi.labri.fr/outils/alias/
Tool for identifier translation using a CRC64 hash of the protein sequence as the primary key. Provides unification, linkout and translation services for a handful (~35) of species, for proteins only. Supports use cases 2, 3.
Freely available for download. Regularly updated.

MD Anderson GeneLink
http://bioinformatics.mdanderson.org/GeneLink.html
ID translation and search service for human IDs (10 ID types). Supports use cases 2, 3.

EnsMart
http://www.ensembl.org/Multi/martview
ID translation services for Ensembl genomes. Supports use case 2.

MatchMiner
http://discover.nci.nih.gov/matchminer/html/index.jsp
ID translation service for mouse and human. Supports use cases 2, 3.

Ariadne Genomics ID Mapping Service
http://www.ariadnegenomics.com/services/idmap.html
Tool for identifier translation. Supports 7 species and maps between 12 different ID types, mainly for proteins and genes.
Commercial service, not available for download.
Supports use case 3.

GeneLynx
http://www.genelynx.org/
Provides linkout services for human, mouse and rat. Supports use case 2.

NetAffx
http://www.affymetrix.com/products/software/specific/netaffx.affx
Provides ID translation services for Affy probe set IDs. Supports use case 3.

PIR ID mapping
http://pir.georgetown.edu/pirwww/search/idmapping.shtml
Provides ID translation services for many different ID types. Supports use case 3.
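AliasServer, listed above, keys proteins on a CRC64 hash of the sequence. As a rough illustration of that idea, the sketch below computes a table-driven CRC-64 with the ISO 3309 polynomial and groups accessions that share a checksum. Whether AliasServer uses exactly this CRC variant is an assumption, and the accessions and sequences are invented.

```python
from collections import defaultdict

# Reflected form of the ISO 3309 polynomial x^64 + x^4 + x^3 + x + 1.
# Assumption: this variant is used only for illustration.
_POLY = 0xD800000000000000


def _build_table():
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ _POLY if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = _build_table()


def crc64(sequence):
    """Return the CRC-64 of an amino-acid sequence as 16 hex digits."""
    crc = 0
    for byte in sequence.upper().encode("ascii"):
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return f"{crc:016X}"


def group_by_sequence(records):
    """Group (accession, sequence) pairs by sequence checksum."""
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[crc64(sequence)].append(accession)
    return dict(groups)


# Invented records: two sources holding the same sequence under different IDs.
records = [
    ("db1:A", "MKTAYIAK"),
    ("db2:X", "mktayiak"),
    ("db2:Y", "MKTAYIAR"),
]
groups = group_by_sequence(records)
```

Because distinct sequences can in principle share a checksum, a production system would likely compare the full sequences before declaring two records identical.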
Databases
International Protein Index
http://www.ebi.ac.uk/IPI/IPIhelp.html
A cross-reference database for proteins in higher eukaryotic organisms (5 species). Provides protein and gene cross-references. Supports use case 1.

Entrez Gene
Provides detailed information on genes from multiple organisms, including gene aliases and links to related NCBI resources. Supports use case 4 (and 2 to some degree).

UniProt (SwissProt, PIR, TrEMBL) provides some information on links to related resources and protein names.
http://www.genmapp.org has identifier mapping databases for many model organisms.
http://ensembl.org provides species-specific identifier mappings and these are accessible via the BioMart web service.
(Originally from a document written for cPath (MSKCC Pathway Database) by Gary Bader, Wednesday March 9, 2005)
Recent comments:
Nicolas Le Novere - May.10.2006 on biopax discuss mailing list
Related to the issue of unique but non-resolvable URIs, I would like to introduce the MIRIAM annotation framework. The annotation of models is an important part of the MIRIAM standard of curation. MIRIAM is described in: Le Novère N., Finney A., Hucka M., Bhalla U., Campagne F., Collado-Vides J., Crampin E., Halstead M., Klipp E., Mendes P., Nielsen P., Sauro H., Shapiro B., Snoep J.L., Spence H.D., Wanner B.L. (2005) Minimum Information Requested In the Annotation of biochemical Models (MIRIAM). Nature Biotechnology, 23: 1509-1515.

In order to have homogeneous annotation, we came up with a system using URIs and accessions to uniquely describe a piece of data, while leaving the choice to the user when it comes to using this information. We are now developing a unique resource that will provide access to a unified set of URIs, plus additional information such as regexps to validate the IDs, names and documentation about the data resources, etc.

The URIs are different from the physical locations, and a given URI can be resolved into several alternative locations.

http://www.ec-code.org/#1.1.1.1 can be transformed into:
http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.1.1.1
http://www.genome.jp/dbget-bin/www_bget?ec:1.1.1.1
http://us.expasy.org/cgi-bin/nicezyme.pl?1.1.1.1

http://www.taxonomy.org/#9606 can be transformed into:
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606
http://www.ebi.ac.uk/newt/display?search=9606
etc.

We also set up a system of deprecation, so that if several resources are merged (something that is fortunately more and more frequent), we will always retrieve the data based on the old URIs. The database (MySQL) is now ready, although not public yet (you can find a snapshot of the URIs at: http://sbml.org/wiki/MIRIAM_URI_Set). As a real-life test, the new version of BioModels Database, planned to be released this month, will use it to resolve all the model annotation. We have started to develop web services so that URIs can be resolved into URLs etc.
Gary Bader: May.10.2006
Hi Nicolas,
    So are you guys going to maintain a list of general "linkout" URLs?  Or are you just proposing a standard way of representing these and people can maintain their own lists?  Either would be great!  A lot of groups are maintaining internal lists.
For instance, the cPath list is stored in a SQL table and currently contains 24 entries in the form e.g.
'UniProt', 'http://www.pir.uniprot.org/cgi-bin/upEntry?id=%ID%', 'UniProt (Universal Protein Resource) is the world\'s most comprehensive catalog of information on proteins.'
Note that we have %ID% as a symbol to replace with the actual ID. Occasionally a resource will require an extra string at the end of the URL (e.g. ".html") - this was the case for HPRD until they recently changed their URL scheme.
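The %ID% convention can be sketched in a few lines of Python. This is an illustration only, not cPath's actual code: the function name is hypothetical, percent-encoding the identifier is an added assumption, and the example.org template is invented to show a trailing ".html".

```python
from urllib.parse import quote

PLACEHOLDER = "%ID%"


def build_linkout(template, identifier):
    """Substitute an identifier into a link-out URL template."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no %ID% placeholder")
    # Percent-encode the identifier so it cannot break the URL (assumption:
    # the real systems may insert it verbatim).
    return template.replace(PLACEHOLDER, quote(identifier, safe=""))


uniprot_url = build_linkout(
    "http://www.pir.uniprot.org/cgi-bin/upEntry?id=%ID%", "P12345"
)
# A resource that needs an extra string after the ID simply carries it
# in the template (invented URL).
suffixed_url = build_linkout("http://example.org/entry/%ID%.html", "00123")
```

Keeping any required suffix inside the template means the substitution code needs no per-resource special cases.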
Cytoscape also has a linkout plugin that reads a list of links from a file to populate a linkout menu e.g.
url.yeast.BioGRID=http://www.thebiogrid.org/search.php?keywords=%ID%&searchbutton=GO&organismid=4932
url.google=http://www.google.com/search?hl=en&q=%ID%&btnG=Google+Search
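A file in this form can be read with a short parser. The sketch below is a guess at the minimum needed, not the Cytoscape plugin's implementation: it assumes one `key=template` pair per line, splits on the first "=" only (the URLs themselves contain "="), and strips the `url.` prefix. The function name is hypothetical.

```python
def parse_linkout_properties(text):
    """Parse 'url.<menu path>=<URL template>' lines into a dict."""
    links = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first '=' only; the template may contain more.
        key, sep, template = line.partition("=")
        if not sep or not key.startswith("url."):
            continue
        links[key[len("url."):]] = template
    return links


SAMPLE = """\
url.yeast.BioGRID=http://www.thebiogrid.org/search.php?keywords=%ID%&searchbutton=GO&organismid=4932
url.google=http://www.google.com/search?hl=en&q=%ID%&btnG=Google+Search
"""

links = parse_linkout_properties(SAMPLE)
```

The remaining key (for example `yeast.BioGRID`) could then be split on "." to build nested menu entries.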
Also, I'm pretty sure that IntAct is maintaining a list of linkouts in addition to the OBO file of official database names made available as part of the PSI-MI controlled vocabularies. E.g.
http://cvs.sourceforge.net/viewcvs.py/intact/intactCore/data/controlledVocab/CvDatabase.def?rev=1.19&view=auto
(I'm not sure where the linkout list is, if it exists)
And Chris Lemer and the Northbears team (aMAZE group) created an OWL ontology describing databases for the KEGG2BioPAX converter.  It has around 200 databases in it (all the ones referenced by KEGG).  A few example entries are below. The full file is available from the Northbears CVS server at biopax.data/data/extra/db-alias.owl (access instructions are here: http://www.northbears.org/biopax/quick_guide.shtml).
It would be great if we could create a standard format for all this information that we could use for 2 use cases:
1. Validating external references - both that the database name is recognized and that the IDs follow the correct pattern
2. Using to create hyperlinks to other resources for more information.
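The first use case, validating an external reference, might look like the following Python sketch. The registry contents are assumptions: the two regular expressions are deliberately simplified stand-ins, not the official patterns of any resource, and the function name is hypothetical.

```python
import re

# Illustrative, simplified patterns keyed by lower-cased database name.
ID_PATTERNS = {
    "taxonomy": re.compile(r"\d+"),
    "ec-code": re.compile(r"\d+\.\d+\.\d+\.\d+"),
}


def validate_xref(database, identifier):
    """Check that the database is known and the ID matches its pattern."""
    pattern = ID_PATTERNS.get(database.lower())
    if pattern is None:
        return False, f"unknown database: {database}"
    if not pattern.fullmatch(identifier):
        return False, f"malformed {database} identifier: {identifier}"
    return True, "ok"
```

For example, `validate_xref("taxonomy", "9606")` passes, while `validate_xref("taxonomy", "96x6")` fails on the pattern and `validate_xref("NoSuchDB", "1")` fails on the name.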
If you guys are willing to push this, I will definitely contribute as a reviewer, by inputting actual URLs, by supporting it in cPath and Cytoscape, and by proposing it for use in validating BioPAX.  This would be immediately useful at least to the projects mentioned above.
If the format is complex, I would say that an additional requirement is to have a simple tab delimited version of whatever the format is (e.g. if it is in OWL) to encourage people to use it.
Thanks,
Gary
Example db-alias.owl entries:
  <RelationshipDatabaseReference rdf:ID="BioPAX-KEGG-PATHWAY-Ref">
    <referenceName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >KEGG PATHWAY</referenceName>
    <referencedDatabase>
      <Database rdf:ID="KEGG-PATHWAY-database">
        <authority>
          <Organization rdf:ID="KEGG-organization">
            <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
            >KEGG</name>
          </Organization>
        </authority>
        <parentDatabase>
          <Database rdf:ID="KEGG-database">
            <authority rdf:resource="#KEGG-organization"/>
            <reference>
              <UnificationDatabaseReference rdf:ID="KEGG-itself">
                <referenceName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
                >KEGG</referenceName>
                <referencedDatabase rdf:resource="#KEGG-database"/>
                <referenceName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
                >KEGG Database</referenceName>
              </UnificationDatabaseReference>
            </reference>
            <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
            >KEGG Database</name>
          </Database>
        </parentDatabase>
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >KEGG PATHWAY</name>
        <reference>
          <UnificationDatabaseReference rdf:ID="KEGG-PATHWAY-itself">
            <referenceName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
            >KEGG PATHWAY</referenceName>
            <referencedDatabase rdf:resource="#KEGG-PATHWAY-database"/>
          </UnificationDatabaseReference>
        </reference>
      </Database>
    </referencedDatabase>
  </RelationshipDatabaseReference>
-----
  <Organization rdf:ID="TIGR-organization">
    <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The Institute for Genomic Research</name>
  </Organization>
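Entries like these can be read with Python's standard XML parser. The sketch below is an illustration under assumptions: it wraps the TIGR entry in an `rdf:RDF` root and assumes the default namespace is `http://www.northbears.org/db-alias.owl#`; the real db-alias.owl file layout may differ. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DB_NS = "http://www.northbears.org/db-alias.owl#"  # assumed default namespace

# The TIGR entry, wrapped in an rdf:RDF root element (the wrapper is assumed).
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns="{DB_NS}">
  <Organization rdf:ID="TIGR-organization">
    <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The Institute for Genomic Research</name>
  </Organization>
</rdf:RDF>"""


def organization_names(xml_text):
    """Map each Organization's rdf:ID to its name."""
    root = ET.fromstring(xml_text)
    names = {}
    # iter() also finds Organization elements nested inside other entries.
    for org in root.iter(f"{{{DB_NS}}}Organization"):
        name = org.find(f"{{{DB_NS}}}name")
        if name is not None and name.text:
            names[org.get(f"{{{RDF_NS}}}ID")] = name.text.strip()
    return names


organizations = organization_names(SAMPLE)
```

The same loop over `Database` elements would yield the database names needed for a flat, tab-delimited export.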
---
and the ontology is:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns="http://www.northbears.org/db-alias.owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
  xml:base="http://www.northbears.org/db-alias.owl">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Database">
    <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:cardinality>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:ID="name"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:cardinality>
        <owl:onProperty>
          <owl:ObjectProperty rdf:ID="authority"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >A Database represents a structured collection of data. It is governed by one and only one authority which is an Organization.</rdfs:comment>
  </owl:Class>
  <owl:Class rdf:ID="Organization">
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >An Organization is a governing authority over a Database.</rdfs:comment>
    <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:about="#name"/>
        </owl:onProperty>
        <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:cardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Class rdf:ID="RelationshipDatabaseReference">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="DatabaseReference"/>
    </rdfs:subClassOf>
    <owl:disjointWith>
      <owl:Class rdf:ID="UnificationDatabaseReference"/>
    </owl:disjointWith>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >A RelationshipDatabaseReference is a reference made by one Database to another which does not share the same identity but has a semantic link with the referring instance.</rdfs:comment>
  </owl:Class>
  <owl:Class rdf:about="#DatabaseReference">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:ID="referenceName"/>
        </owl:onProperty>
        <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:minCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >A DatabaseReference is a reference made by one instance of Database to one instance of another Database. This reference can be one of 2 types: RelationshipDatabaseReference or UnificationDatabaseReference. A Reference can have several names pointing to the same Database.</rdfs:comment>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty>
          <owl:ObjectProperty rdf:ID="referencedDatabase"/>
        </owl:onProperty>
        <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:cardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
  </owl:Class>
  <owl:Class rdf:about="#UnificationDatabaseReference">
    <owl:disjointWith rdf:resource="#RelationshipDatabaseReference"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >A UnificationDatabaseReference is a reference made by one Database to another which shares the same identity. An entity of a certain type in the referencing database, references an entity of the same type in the referenced database.</rdfs:comment>
    <rdfs:subClassOf rdf:resource="#DatabaseReference"/>
  </owl:Class>
  <owl:ObjectProperty rdf:about="#authority">
    <rdfs:domain rdf:resource="#Database"/>
    <rdfs:range rdf:resource="#Organization"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The Organization that governs the Database.</rdfs:comment>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:about="#referencedDatabase">
    <rdfs:domain rdf:resource="#DatabaseReference"/>
    <rdfs:range rdf:resource="#Database"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The Database referenced by a DatabaseReferenced.</rdfs:comment>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="reference">
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The Database referenced by the current Database.</rdfs:comment>
    <rdfs:domain rdf:resource="#Database"/>
    <rdfs:range>
      <owl:Class>
        <owl:unionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#RelationshipDatabaseReference"/>
          <owl:Class rdf:about="#UnificationDatabaseReference"/>
        </owl:unionOf>
      </owl:Class>
    </rdfs:range>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="parentDatabase">
    <rdfs:domain rdf:resource="#Database"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The Database embedding the current Database.</rdfs:comment>
    <rdfs:range rdf:resource="#Database"/>
  </owl:ObjectProperty>
  <owl:DatatypeProperty rdf:about="#name">
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <rdfs:domain>
      <owl:Class>
        <owl:unionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#Database"/>
          <owl:Class rdf:about="#Organization"/>
        </owl:unionOf>
      </owl:Class>
    </rdfs:domain>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The name given by the entity to itself.</rdfs:comment>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:about="#referenceName">
    <rdfs:domain rdf:resource="#DatabaseReference"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The name given by a Database to a referenced Database.</rdfs:comment>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:ID="relationshipType">
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >The type of relationship of a RelationshipDatabaseReference.</rdfs:comment>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <rdfs:domain rdf:resource="#RelationshipDatabaseReference"/>
  </owl:DatatypeProperty>
</rdf:RDF>
Nicolas Le Novere: May.10.2006
We will indeed maintain a general list of stable, maintained data
resources, whether databases or ontologies. The original goal was to help
modellers to generate consistent annotation and resolve existing
annotation. Of course anyone will be able to propose a new type of data or
data resource. Each data type is very similar to the cPath list you
mentioned:
names (1 official plus synonyms), e.g. "GO" "Gene Ontology"
URIs (1 official plus deprecated), e.g. http://www.ec-code.org/
Locations: For each one, the URL of the resource, and the base URL used to
build a hyperlink to a specific piece of data.
In addition, there is a regexp for each datatype, and documentation for
the resources and the locations.
We are now developing webservices, so one can do something like:
getURI("UniProt","P12345"), and get back
http://www.uniprot.org/#P12345
or getURLs("http://www.uniprot.org/#P12345") and get back:
http://www.ebi.uniprot.org/entry/P12345
http://us.expasy.org/uniprot/P12345
http://www.pir.uniprot.org/cgi-bin/upEntry?id=P12345
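To make the two calls concrete, here is a small in-memory Python sketch of the behaviour described. It is not the MIRIAM web service: the record layout and function names are assumptions, the deprecated URI is invented, and only the UniProt URI and base URLs come from the example above.

```python
DATA_TYPES = [
    {
        "names": ["UniProt"],                # official name plus synonyms
        "uri": "http://www.uniprot.org/",    # official URI
        "deprecated_uris": ["http://www.example.org/old-uniprot/"],  # invented
        "base_urls": [
            "http://www.ebi.uniprot.org/entry/",
            "http://us.expasy.org/uniprot/",
            "http://www.pir.uniprot.org/cgi-bin/upEntry?id=",
        ],
    },
]


def get_uri(name, accession):
    """Build the official URI for an accession in a named data type."""
    wanted = name.lower()
    for data_type in DATA_TYPES:
        if wanted in (n.lower() for n in data_type["names"]):
            return f"{data_type['uri']}#{accession}"
    raise KeyError(f"unknown data type: {name}")


def get_urls(uri):
    """Resolve a URI (official or deprecated) into physical URLs."""
    root, sep, accession = uri.partition("#")
    if not sep:
        raise ValueError("expected a URI of the form <root>#<accession>")
    for data_type in DATA_TYPES:
        if root == data_type["uri"] or root in data_type["deprecated_uris"]:
            return [base + accession for base in data_type["base_urls"]]
    raise KeyError(f"unknown URI root: {root}")


uri = get_uri("UniProt", "P12345")
urls = get_urls(uri)
```

Because deprecated roots are checked alongside the official one, annotations written against an old URI keep resolving after resources are merged.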
> > A lot of groups are maintaining internal lists.
I see that. This is awesome. We will be able to shamelessly steal all their
work. Thanks for the info. We should definitely link to all those
efforts.
> > It would be great if we could create a standard format for all this
> > information that we could use for 2 use cases:
> > 1. Validating external references - both that the database name is
> > recognized and that the IDs follow the correct pattern
> > 2. Using to create hyperlinks to other resources for more information.
That is definitely our aim. Our current XML export is quite (too)
simple, but we wanted something easy to parse.
> > If you guys are willing to push this, I will definitely contribute as a
> > reviewer, by inputing actual URLs and by supporting it in cPath,
> > Cytoscape, and to propose it for use in validating BioPAX.  This would
> > be immediately useful at least to the projects mentioned above.
That would be an invaluable help indeed. I heard you are coming to the EBI
at the end of June. Unfortunately I will be in Oslo those days, but you
should meet Camille Laibe, currently in charge of the development of
the MIRIAM database and web services.
> > If the format is complex, I would say that an additional requirement is
> > to have a simple tab delimited version of whatever the format is (e.g.
> > if it is in OWL) to encourage people to use it.
The master data is in MySQL tables. I don't think it is a huge problem to
generate various formats at all.
I have a feeling that our need is actually a very common one, and there is
maybe room for a more ambitious project, on "data integration for Systems
Biology", and maybe a grant application together with other partners ...
I was asked to write a review for Nature Cell Biology with Henning
Hermjakob. I could start to instill the idea in the community.
Henning Hermjakob: May 12.2006
As you know, we have exactly the same problem, and I'd be interested in a general solution. What we can offer so far is the PSI MI xrefs (which you know), available in OBO format from http://psidev.sourceforge.net/mi/rel25/data/psi-mi25.obo (the sourceforge CVS is still buggy at the moment) and interactively and as a web service from http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI&termId=MI%3A0444&termName=database%20citation
Nicolas Le Novere: May 12.2006
Henning, we can certainly use your XRefs to feed the MIRIAM database. I think our resources (MIRIAM and cPath) present certain advantages for general use by the community (in particular, there is a clear difference between a datatype and its instantiation in a database; each datatype is defined by a perennial URI, independent of the various URLs). However, we can definitely cross-fertilize, not only with the content, but also the structure. For instance, I noticed that you had definitions on each entry. Quite astonishingly, we did not think about that!! (Camille, one extra task for you). Let's keep each other posted.
Alan Ruttenberg - Jun.19.2006 on biopax-discuss mailing list - post related to defining a controlled vocabulary of database names, to be used for xrefs to help match identical xrefs from different sources.
From Nicolas Le Novere - Jun.19.2006 on biopax-discuss mailing list
Le Novère N., Finney A., Hucka M., Bhalla U., Campagne F., Collado-Vides J., Crampin E., Halstead M., Klipp E., Mendes P., Nielsen P., Sauro H., Shapiro B., Snoep J.L., Spence H.D., Wanner B.L. (2005) Minimum Information Requested In the Annotation of biochemical Models (MIRIAM). Nature Biotechnology, 23: 1509-1515. http://dx.doi.org/10.1038/nbt1156

The new version of SBML has a formal way of using those URIs: http://sbml.org/wiki/sbml-level-2-version-2.pdf (see section 6).

Since then, we have been developing a database to store those standard URIs and all the related information (name, synonyms, URLs, regular expression to parse accessions, associated documentation etc.). We developed web services to access the database. Therefore, when a tool parses an SBML file, it will be able to transform the URIs into physical URLs etc. Conversely, a tool will be able to generate the correct URI using the name of a resource and an accession. The database is not publicly available yet, but we already use it for BioModels Database. Our plan is to release the whole lot by the end of August, but we could put a demo online before that.

My suggestion is for BioPAX to use MIRIAM URIs. I see two great advantages to that: 1) our aims being exactly the same, it would eliminate some redundant work. 2) SBML and BioPAX constituents would be annotated using the same URIs, making conversion much easier.
