
JACIII Vol.28 No.6 pp. 1299-1312 (2024)
doi: 10.20965/jaciii.2024.p1299

Research Paper:

Improving Domain-Specific NER in the Indonesian Language Through Domain Transfer and Data Augmentation

Siti Oryza Khairunnisa*,†, Zhousi Chen**, and Mamoru Komachi**

*Tokyo Metropolitan University
6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan

†Corresponding author

**Hitotsubashi University
2-1 Naka, Kunitachi, Tokyo 186-8601, Japan

Received: April 15, 2024
Accepted: August 22, 2024
Published: November 20, 2024
Keywords: dataset creation, domain transfer learning, data augmentation, named entity recognition, Indonesian language
Abstract

Named entity recognition (NER) research usually focuses on general domains; specific domains, particularly outside English, remain largely unexplored. For Indonesian NER, the resources available for specific domains are scarce and small in scale. Building a large dataset is time-consuming and costly, whereas a small dataset is practical. Motivated by this circumstance, we contribute to specific-domain Indonesian NER by providing a small-scale specific-domain NER dataset, IDCrossNER, created semi-automatically via automatic translation and projection from English, with manual correction for realistic Indonesian localization. The dataset enables the following analyses: (1) cross-domain transfer learning from general domains and specific-domain data augmentation using GPT models to improve performance on small-scale datasets, and (2) an evaluation of supervised approaches (i.e., in- and cross-domain learning) against GPT-4o on IDCrossNER. Our findings are as follows. (1) Cross-domain transfer learning is effective; however, on the general-domain side, performance is more sensitive to the size of the pretrained language model (PLM) than to the size and quality of the general-domain source dataset, while on the specific-domain side, the improvement from GPT-based data augmentation becomes significant when only limited source data and a small PLM are available. (2) The evaluation of GPT-4o on IDCrossNER demonstrates that it is a powerful tool for specific-domain Indonesian NER in a few-shot setting, although it underperforms in a zero-shot setting. Our dataset is publicly available at https://github.com/khairunnisaor/idcrossner.
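The semi-automatic creation described in the abstract translates English specific-domain sentences into Indonesian and projects the entity labels through a word alignment before manual correction. As a minimal, hypothetical sketch of that projection step (all names here are illustrative assumptions, not the paper's actual implementation), BIO tags could be carried across an alignment like this:

```python
# Hypothetical sketch of the translate-and-project idea: given BIO tags on an
# English sentence and a word alignment to its Indonesian translation (e.g.,
# produced by an aligner such as SimAlign), copy each entity label to the
# aligned target tokens and repair the BIO prefixes.

def project_labels(src_tags, alignment, tgt_len):
    """Project BIO tags from source tokens to target tokens.

    src_tags:  one BIO tag per source token, e.g. ["B-PER", "I-PER", "O"]
    alignment: (src_idx, tgt_idx) pairs of aligned tokens
    tgt_len:   number of target tokens
    """
    # First pass: transfer the entity *type* of each aligned source token.
    tgt_types = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        tag = src_tags[src_idx]
        if tag != "O":
            tgt_types[tgt_idx] = tag.split("-", 1)[1]

    # Second pass: re-impose BIO so each contiguous typed run starts with B-.
    tgt_tags, prev_type = [], None
    for t in tgt_types:
        if t == "O":
            tgt_tags.append("O")
            prev_type = None
        else:
            tgt_tags.append(("I-" if t == prev_type else "B-") + t)
            prev_type = t
    return tgt_tags
```

Output produced this way would still need the manual correction step the paper describes, since alignment errors and word-order changes in translation can split or drop entity spans.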

Cite this article as:
S. Khairunnisa, Z. Chen, and M. Komachi, “Improving Domain-Specific NER in the Indonesian Language Through Domain Transfer and Data Augmentation,” J. Adv. Comput. Intell. Intell. Inform., Vol.28 No.6, pp. 1299-1312, 2024.

