JACIII Vol.19 No.6 pp. 843-851
doi: 10.20965/jaciii.2015.p0843


Protein Entity Name Recognition Using Orthographic, Morphological and Proteinhood Features

Sagara Sumathipala, Koichi Yamada, Muneyuki Unehara, and Izumi Suzuki

Graduate School of Engineering, Nagaoka University of Technology
1603-1 Kamitomioka-machi, Nagaoka, Niigata 940-2188, Japan

Received: February 27, 2015
Accepted: August 20, 2015
Online released: November 20, 2015
Published: November 20, 2015
Keywords: biomedical text mining, biomedical named entity, protein named entity, random forest

Identifying protein names in text is an important and challenging prerequisite for biomedical information processing. For example, the accuracy of protein name identification affects the extraction of protein-protein interactions from the biomedical literature. In this paper, we present an efficient protein name identification technique based on a rich set of features: orthographic and morphological features, together with Proteinhood features, which are newly introduced in this study. The method was evaluated on the GENIA corpus using several machine learning algorithms. Random Forest achieved the highest precision (92.1%), recall (86.5%), and F-measure (89.2%), while significantly reducing training and testing time. We also examine the impact of the Proteinhood feature on protein identification and the effect of tuning the parameters of the machine learning algorithm.
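
As an illustration only, the kind of token-level orthographic and morphological cues mentioned in the abstract might be sketched as follows. The feature names, the Greek-letter pattern, and the three-character affix length are assumptions made for this sketch, not the feature set used in the paper, and the Proteinhood feature is omitted because the abstract does not define it. The last function checks that the reported F-measure is consistent with the reported precision and recall, assuming the usual balanced (harmonic mean) definition.

```python
import re


def orthographic_features(token: str) -> dict:
    """Surface-form cues of a token (hypothetical feature names)."""
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": token.isalpha() and token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        # A lowercase letter plus an uppercase letter after the first character.
        "mixed_case": any(c.islower() for c in token)
        and any(c.isupper() for c in token[1:]),
        # Spelled-out Greek letters; this short list is an assumption.
        "has_greek": bool(re.search(r"alpha|beta|gamma|kappa", token.lower())),
    }


def morphological_features(token: str, n: int = 3) -> dict:
    """Lowercased prefix and suffix of length n (n = 3 is an assumption)."""
    lower = token.lower()
    return {"prefix": lower[:n], "suffix": lower[-n:]}


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(orthographic_features("NF-kappaB"))
    print(morphological_features("kinase"))
    # Precision and recall as reported in the abstract.
    print(round(f_measure(0.921, 0.865), 3))
```

Feature dictionaries of this shape could be vectorized and passed to any standard classifier, including a random forest implementation; that step is left out here because the abstract does not specify the configuration.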

