
JACIII Vol.19 No.6 pp. 843-851
doi: 10.20965/jaciii.2015.p0843


Protein Entity Name Recognition Using Orthographic, Morphological and Proteinhood Features

Sagara Sumathipala, Koichi Yamada, Muneyuki Unehara, and Izumi Suzuki

Graduate School of Engineering, Nagaoka University of Technology
1603-1 Kamitomioka-machi, Nagaoka, Niigata 940-2188, Japan

Received: February 27, 2015
Accepted: August 20, 2015
Published: November 20, 2015
Keywords: biomedical text mining, biomedical named entity, protein named entity, random forest
Identifying protein names in text is an important and challenging first step in biomedical information processing. For example, the accuracy of protein name identification affects the extraction of protein-protein interactions from the biomedical literature. In this paper, we present an efficient protein name identification technique based on a rich set of features: orthographic and morphological features, together with Proteinhood features, which are newly introduced in this study. The method was evaluated on the GENIA corpus using several machine learning algorithms. Random Forest achieved the highest scores, with a precision of 92.1%, a recall of 86.5%, and an F-measure of 89.2%, while also significantly reducing training and testing time. We also examined the impact of the Proteinhood feature on protein identification and the effect of tuning the parameters of the machine learning algorithm.
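
As a quick consistency check on the reported scores, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the paper's F-measure is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the abstract for Random Forest
precision, recall = 0.921, 0.865
print(f"F-measure = {f_measure(precision, recall):.3f}")  # prints "F-measure = 0.892"
```

Under that assumption, the result rounds to 89.2%, matching the figure reported in the abstract.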
