Enrichment Map Gene Sets
Summary
- Enrichment Map Gene Sets are a collection of gene set files in GMT format (compatible with GSEA), updated monthly from the original source locations and available with three types of identifier:
  - Entrez gene ids
  - UniProt accessions
  - Gene symbols
- The GMT file format contains one gene set per line. Each line contains:
  - Name (tab) Description (tab) Gene (tab) Gene (tab) ...
- In our format:
  - Name = Gene Set Name | Gene Set Source | Gene Set Source identifier
    - Example --> ATP-dependent protein binding|GO|GO:0043008 OR arginine biosynthesis IV|HUMANCYC|ARGININE-SYN4-PWY
  - Description = Gene Set Name
    - Example --> ATP-dependent protein binding OR arginine biosynthesis IV
  - Gene = identified by one of the three possible identifiers (Entrez gene id, UniProt accession or gene symbol)
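The line layout described above can be read with a few lines of Python. This is only a minimal sketch: the `GeneSet` type and `parse_gmt_line` function are hypothetical names introduced for illustration, and the gene identifiers in the example line (`GENE_A`, `GENE_B`) are placeholders rather than real members of the gene set.

```python
from typing import List, NamedTuple


class GeneSet(NamedTuple):
    name: str          # Gene Set Name
    source: str        # Gene Set Source, e.g. GO or HUMANCYC
    source_id: str     # Gene Set Source identifier, e.g. GO:0043008
    description: str   # second GMT column
    genes: List[str]   # remaining GMT columns


def parse_gmt_line(line: str) -> GeneSet:
    """Parse one GMT line: Name (tab) Description (tab) Gene (tab) Gene ..."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a GMT line needs at least a name and a description")
    # Split from the right so that a '|' inside the gene set name itself
    # does not break the Name|Source|Identifier convention.
    name, source, source_id = fields[0].rsplit("|", 2)
    genes = [g for g in fields[2:] if g]
    return GeneSet(name, source, source_id, fields[1], genes)


# Placeholder gene identifiers, for illustration only.
example = (
    "ATP-dependent protein binding|GO|GO:0043008\t"
    "ATP-dependent protein binding\tGENE_A\tGENE_B"
)
gs = parse_gmt_line(example)
print(gs.source, gs.source_id, gs.genes)
```

A name column that does not contain two `|` separators raises a `ValueError` in this sketch, which makes files that do not follow the naming convention easy to spot.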
 
Sources
- Human 
| Source | File Origin | File Type | ID extracted | Frequency source is updated | Number of pathways |
| KEGG | KEGG ftp site (July 2011) | GMT | Symbol | static as of July 1, 2011 | 236 |
| Msigdb - c2 (other + Biocarta) | manual download from Msigdb | GMT | Entrez gene | sporadically | Biocarta - 217, Other - 47 |
| NCI | scripted download of zipped release from website | BioPAX | Entrez gene | sporadically | 219 pathways |
| IOB | received directly from IOB - static (July 2011) | BioPAX | Entrez gene | sporadically | 35 pathways - 10 are the same as CellMap, 1 is the same as NetPath |
| NetPath | scripted download of files numbered 1-25 | BioPAX | Entrez gene | static | 25 pathways - 12 are cancer pathways (10 are CellMap), 13 are immunity pathways |
| HumanCyc | scripted download of zipped release from password protected website | BioPAX | UniProt | updated periodically | 249 pathways |
| Reactome | scripted download of zipped release from website | BioPAX | UniProt | updated release | 1117 pathways (release 37) |
| GO | scripted download from EBI ftp site (human) | GAF | UniProt | released once a month | 13,034 no GO IEA, 15,181 with GO IEA |
| Msigdb - c3 Specialty GMTs (miRs, transcription factors) | manual download from Msigdb | GMT | Entrez gene | sporadically | 221 miRs, 616 TFs |
- Mouse 
| Source | File Origin | File Type | ID extracted | Frequency source is updated | Number of pathways |
| Reactome | scripted download of zipped release from website | BioPAX | UniProt | updated release | 946 pathways (release 37) |
| GO | scripted download from MGI ftp site (mouse) | GAF | MGI | released once a month | 14,563 no GO IEA, 15,041 with GO IEA |
| KEGG | translated from Human using Homologene | GMT | Entrez gene | static as of July 1, 2011 | 236 |
| Msigdb - c2 (other + Biocarta) | translated from Human using Homologene | GMT | Entrez gene | sporadically | total 880: Kegg - 186, Reactome - 430, Biocarta - 217, Other - 47 |
| NCI | translated from Human using Homologene | GMT | Entrez gene | sporadically | 219 pathways |
| IOB | translated from Human using Homologene | GMT | Entrez gene | sporadically | 35 pathways - 10 are the same as CellMap, 1 is the same as NetPath |
| NetPath | translated from Human using Homologene | GMT | Entrez gene | static | 25 pathways - 12 are cancer pathways (10 are CellMap), 13 are immunity pathways |
| HumanCyc | translated from Human using Homologene | GMT | Entrez gene | updated periodically | 249 pathways |
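Several of the mouse sets are described as translated from Human using Homologene. The sketch below shows one way such a translation could work; it is not the script used to build the files. It assumes homology records are available as (group id, taxonomy id, gene id) triples, uses the NCBI taxonomy ids 9606 (human) and 10090 (mouse), and every group id and gene id in the example rows is made up for illustration.

```python
from collections import defaultdict

HUMAN_TAXID = "9606"    # NCBI taxonomy id for Homo sapiens
MOUSE_TAXID = "10090"   # NCBI taxonomy id for Mus musculus


def build_human_to_mouse(rows):
    """Map each human gene id to the mouse gene ids in the same homology group.

    rows is an iterable of (group_id, taxonomy_id, gene_id) string triples.
    """
    by_group = defaultdict(lambda: {HUMAN_TAXID: [], MOUSE_TAXID: []})
    for group_id, tax_id, gene_id in rows:
        if tax_id in (HUMAN_TAXID, MOUSE_TAXID):
            by_group[group_id][tax_id].append(gene_id)
    mapping = {}
    for members in by_group.values():
        if members[MOUSE_TAXID]:
            for human_gene in members[HUMAN_TAXID]:
                mapping[human_gene] = list(members[MOUSE_TAXID])
    return mapping


def translate_gene_set(genes, mapping):
    """Replace human ids by mouse ids; genes without a homolog are dropped."""
    translated = []
    for gene in genes:
        for mouse_gene in mapping.get(gene, []):
            if mouse_gene not in translated:
                translated.append(mouse_gene)
    return translated


# Hypothetical homology records: gene 1002 has no mouse homolog.
rows = [
    ("1", HUMAN_TAXID, "1001"),
    ("1", MOUSE_TAXID, "2001"),
    ("2", HUMAN_TAXID, "1002"),
]
mapping = build_human_to_mouse(rows)
mouse_genes = translate_gene_set(["1001", "1002"], mapping)
print(mouse_genes)
```

Dropping genes that have no homolog is a design choice of this sketch. It means a translated gene set can be smaller than its human original.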
File Structure
< > denotes a directory
- <Release> - directory is named according to the date the sets were updated.
  - <Species>
    - <Identifier> - (either Entrez gene, UniProt, Gene symbol)
      - <GO>
        - BP = biological process
        - MF = molecular function
        - CC = cellular component
        - All = BP + MF + CC
        - no_GO_IEA - indicates that the file excludes GO annotations with evidence codes 'IEA' (inferred from electronic annotation), 'ND' (no biological data available), 'RCA' (inferred from reviewed computational analysis)
        - with_GO_IEA - indicates that the file includes GO annotations with evidence codes 'IEA' (inferred from electronic annotation), 'ND' (no biological data available), 'RCA' (inferred from reviewed computational analysis)
      - <Pathways>
      - <miRs>
      - <TF>
      - <Disease phenotypes>
- In each <Identifier> directory there are amalgamated gene set files:
  - AllPathways - contains all pathway sources in the Pathways directory
  - GOPathways - contains all GO (MF, BP, CC) and all pathway sources in the Pathways directory.
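The difference between the no_GO_IEA and with_GO_IEA files comes down to a filter on the evidence code of each annotation. The following is a minimal sketch of such a filter, not the pipeline's actual code. It assumes the input is a tab-separated GAF file with the evidence code in the seventh column (as in the GAF 2.x layout), and the sample annotation rows are invented for illustration.

```python
EXCLUDED_EVIDENCE = {"IEA", "ND", "RCA"}
EVIDENCE_COLUMN = 6  # zero-based index of the seventh GAF column


def filter_gaf_lines(lines, excluded=EXCLUDED_EVIDENCE):
    """Yield GAF annotation lines whose evidence code is not excluded."""
    for line in lines:
        if line.startswith("!"):
            # Lines starting with '!' are GAF header or comment lines.
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > EVIDENCE_COLUMN and cols[EVIDENCE_COLUMN] not in excluded:
            yield line


# Invented annotation rows (database, object id, symbol, qualifier,
# GO id, reference, evidence code), for illustration only.
sample = [
    "!gaf-version: 2.0\n",
    "UniProtKB\tP00001\tGENE_A\t\tGO:0043008\tPMID:0\tIDA\n",
    "UniProtKB\tP00002\tGENE_B\t\tGO:0043008\tPMID:0\tIEA\n",
]
kept = list(filter_gaf_lines(sample))
print(len(kept))
```

Passing an empty set as `excluded` keeps every annotation, which corresponds to the with_GO_IEA variant.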
 
Creating customized Gene Sets
- Download the gene set files you would like to include in your customized set (for example, Human_IOB_Entrezgene.gmt and Human_NetPath_Entrezgene.gmt), then concatenate them into a single file:
cat Human_IOB_Entrezgene.gmt Human_NetPath_Entrezgene.gmt > MyCustomizedSet.gmt
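Plain concatenation is usually all that is needed. Where more care is wanted, a small script can do the same job while skipping gene sets whose name was already written (for instance when the same file is listed twice) and coping with files that lack a final newline. The sketch below is one possible approach under those assumptions; `merge_gmt` is a hypothetical helper, and the demo files and their contents are placeholders created in a temporary directory.

```python
import os
import tempfile


def merge_gmt(paths, out_path):
    """Concatenate GMT files, keeping the first gene set seen for each name."""
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    name = line.split("\t", 1)[0]
                    if name in seen:
                        continue  # gene set name already written
                    seen.add(name)
                    out.write(line + "\n")
                    written += 1
    return written


# Demo with placeholder content; the first file has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.gmt")
    second = os.path.join(tmp, "second.gmt")
    out_file = os.path.join(tmp, "MyCustomizedSet.gmt")
    with open(first, "w") as fh:
        fh.write("SetA|IOB|1\tSetA\tG1\tG2")
    with open(second, "w") as fh:
        fh.write("SetA|IOB|1\tSetA\tG1\tG2\nSetB|NETPATH|2\tSetB\tG3\n")
    count = merge_gmt([first, second], out_file)
    with open(out_file) as fh:
        merged = fh.read().splitlines()
print(count, merged)
```

Because the gene set name includes the source, this check only removes exact repeats of the same set. Pathways that are shared between sources under different names are kept.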
