 Universität Stuttgart 
 Institut für Maschinelle Sprachverarbeitung 
 Corpora and Terminological resources for BioNLP by Jasmin Saric 
 
 

Some BioNLP Resources

(Please drop me a line if you would like more references to be added!)
Biology related corpora:
  • Unannotated:
  • Annotated:
    • Fetchprot: The corpus consists of 190 full-text journal articles, of which 140 describe experimental evidence for tyrosine kinase activity in at least one protein. In total, wild types and 85 different mutants of 77 proteins are subject to experimental validation in 613 experiments.
    • Yapex Corpus: The corpus consists of a reference collection of 99 abstracts containing 1745 annotated protein names. A test collection is also offered; it comprises 101 abstracts containing 1966 annotated protein names.
    • PennBioIE: UPenn Biomedical Information Extraction Project. It contains 2257 PubMed abstracts that are annotated for paragraphs, sentences, tokens, parts of speech, entities, and treebank structure.
    • Genia (University of Tokyo)
      • 2000 abstracts from Medline (POS-tagged)
      • manual annotations for biological terms
      • articles with MeSH terms: human, blood cell and transcription factor
      • Beta version of tree-bank
      • etc.
    • PASTA Corpora (University of Sheffield)
    • Three annotated data sets for IE (by Mark Craven), with the following information annotated:
      • subcellular-localization(PROTEIN, LOCATION)
      • disease-association(GENE, DISEASE)
      • protein-interaction(PROTEIN, PROTEIN)
    • Medstract Corpus (Brandeis University).
      It can be used mainly for two applications: acronym identification and anaphora resolution.
    • Genic Interaction Corpora from Genic Interaction Extraction Challenge
    • BioCreAtIvE Corpus (2004): Critical Assessment of Information Extraction systems in Biology.
    • A Coreference Corpus from the MEDCo Project at the Institute for Infocomm Research, Singapore.
 
Terminological resources:
  • Protein/Gene names across Species -- Semi-automatically compiled from various databases such as SwissProt or SGD.
  • AcroMed is a database of biomedical acronyms and their associated long forms (Brandeis University)
  • Ontologies and Dictionaries of Tissues, Signal Transduction, Kinase Family classification, Genes and their products, etc. (University of Tokyo).
  • LocusLink: genome and proteome databases, normalized gene names, and Gene Ontology annotations (used by SemGen to obtain gene names - BioLink2004)
  • HUGO: human gene nomenclature, with 15,000 currently approved human gene names and symbols
  • Swiss-Prot: the UniProt/Swiss-Prot Protein Knowledgebase is an annotated protein sequence database
  • GeneCards is a database of human genes, their products and their involvement in diseases
 
Last update: 2001-08-07 (www-admin@ims.uni-stuttgart.de)