Some BioNLP Resources
(Please drop me a line if you would like more references to be added!)
Biology-related corpora:
- Unannotated:
- Annotated:
- Fetchprot: The corpus consists of 190 full text journal articles of which 140 describe experimental evidence for tyrosine kinase activity in at least one protein. In total, wild types and 85 different mutants of 77 proteins are subject to experimental validation in 613 experiments.
- Yapex Corpus: The corpus consists of a reference collection with 99 abstracts containing 1745 annotated protein names. In addition, a test collection is offered, comprising 101 abstracts containing 1966 annotated protein names.
- PennBioIE:
UPenn Biomedical Information Extraction Project. It contains 2257
PubMed abstracts that are annotated for paragraphs, sentences, tokens,
parts of speech, entities, and treebank structure.
- Genia (University of Tokyo)
- 2000 abstracts from Medline (POS-tagged)
- manual annotations for biological terms
- articles with MeSH terms: human, blood cell and transcription factor
- Beta version of tree-bank
- etc.
- PASTA Corpora (University of Sheffield)
- Annotated corpus for baseline evaluation (gzipped): 52 abstracts
- Annotated corpus for blind evaluation (gzipped): 61 abstracts.
- Three annotated data sets for IE (by Mark Craven), with the following information annotated:
- subcellular-localization(PROTEIN, LOCATION)
- disease-association(GENE, DISEASE)
- protein-interaction(PROTEIN, PROTEIN)
- Medstract Corpus (Brandeis University): can be used mainly for two applications, acronym identification and anaphora resolution.
- Genic Interaction Corpora from the Genic Interaction Extraction Challenge
- BioCreAtIve Corpus (2004): Critical Assessment of Information Extraction systems in Biology.
- A Coreference Corpus from the MEDCo Project at the Institute for Infocomm Research, Singapore.
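For anyone loading the three Craven relation data sets listed above, the annotated relation types can be written down as a small schema. This is only a minimal sketch: the relation names and argument types are taken from the list above, while `RELATION_SCHEMAS`, `RelationInstance`, `make_instance` and the placeholder argument strings are hypothetical names made up for illustration, not part of any distributed corpus format.

```python
from dataclasses import dataclass

# Argument types per relation, copied from the list of annotated relations.
RELATION_SCHEMAS = {
    "subcellular-localization": ("PROTEIN", "LOCATION"),
    "disease-association": ("GENE", "DISEASE"),
    "protein-interaction": ("PROTEIN", "PROTEIN"),
}


@dataclass(frozen=True)
class RelationInstance:
    """One annotated binary relation (hypothetical in-memory form)."""
    relation: str
    arg1: str
    arg2: str

    def typed_args(self):
        # Pair each argument string with the entity type its slot expects.
        type1, type2 = RELATION_SCHEMAS[self.relation]
        return ((type1, self.arg1), (type2, self.arg2))


def make_instance(relation, arg1, arg2):
    """Build an instance, rejecting relation names outside the schema."""
    if relation not in RELATION_SCHEMAS:
        raise ValueError(f"unknown relation: {relation}")
    return RelationInstance(relation, arg1, arg2)


# Placeholder strings, not real corpus annotations.
example = make_instance("subcellular-localization", "PROTEIN_A", "LOCATION_X")
print(example.typed_args())
```

Keeping the argument types in one table makes it easy to check extracted tuples against the expected entity types before evaluation.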
Terminological resources:
- Protein/Gene names across Species: semi-automatically compiled from various databases such as SwissProt or SGD.
- AcroMed: a database of biomedical acronyms and their associated long forms (Brandeis University)
- Ontologies and Dictionaries of Tissues, Signal Transduction, Kinase Family classification, Genes and their products, etc. (University of Tokyo).
- LocusLink: genome and proteome databases, normalized gene names and Gene Ontology annotations (used by SemGen to obtain gene names, BioLink2004)
- HUGO: human gene nomenclature, with 15000 currently approved human gene names and symbols.
- Swiss-Prot: the UniProt/Swiss-Prot Protein Knowledgebase is an annotated protein sequence database
- GeneCards is a database of human genes, their products and their involvement in diseases