JACIII Vol.27 No.3 pp. 394-403
doi: 10.20965/jaciii.2023.p0394

Research Paper:

Understanding Cultural Similarities of Archaeological Sites from Excavation Reports Using Natural Language Processing Technique

Fumihiro Sakahira*,†, Yuji Yamaguchi**, and Takao Terano***

*Faculty of Information Science and Technology, Osaka Institute of Technology
1-79-1 Kitayama, Hirakata-City, Osaka 573-0196, Japan

†Corresponding author

**Research Institute for the Dynamics of Civilizations, Okayama University
3-1-1 Tsushimanaka, Kita-Ku, Okayama-City, Okayama 700-8530, Japan

***Platform for Arts and Science, Chiba University of Commerce
1-3-1 Konodai, Ichikawa-City, Chiba 272-8512, Japan

Received: November 18, 2022
Accepted: January 13, 2023
Published: May 20, 2023
Keywords: archaeological sites, cultural similarity, excavation reports, natural language processing, sentence embedding

In this study, we applied natural language processing (NLP) techniques to the texts of excavation reports on buried cultural properties, calculating the degree of similarity between reports in order to identify archaeological sites that are highly similar to one another. Specifically, we validated whether the similarity of sentence embeddings of the excavation reports is consistent with an existing classification of the sites. The validation used four archaeological sites that have been classified in existing archaeological research papers, together with 128 excavation reports from those sites; sentence embeddings were obtained using Doc2Vec. We obtained the following results: 1) When applying NLP to excavation reports to determine the similarity of archaeological sites, merging the texts for each site into a single document before processing was preferable to processing each volume of the excavation reports separately. 2) Similarity based on Doc2Vec sentence embeddings of the excavation reports was more consistent with the classification of the characteristics of the archaeological sites than similarity based on term frequency–inverse document frequency (TF-IDF). 3) When targeting a specific period, sentence embeddings computed exclusively from the text relating to that period are consistent with the classification of site characteristics derived from the artifacts and structural remains of that period. 4) When a specific period is targeted, period-specific sentence embeddings obtained through the additive compositionality of sentence embeddings can be used to classify the characteristics of archaeological sites based on the artifacts and structural remains of that period. Consequently, text similarity computed with NLP can reflect the similarity of archaeological sites. This holds true even for excavation reports that contain spelling inconsistencies, optical character recognition (OCR) errors, and garbled words.
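As a rough illustration of the pipeline summarized above, the following minimal sketch merges report volumes per site and compares sites by cosine similarity. It is not the authors' implementation: it uses a simple smoothed TF-IDF weighting (the baseline named in the abstract) so that it runs with the Python standard library alone, and the site names, tokens, and function names are all hypothetical. In the study itself the vectors would come from Doc2Vec sentence embeddings rather than from TF-IDF.

```python
import math
from collections import Counter


def merge_by_site(reports):
    """Concatenate the tokens of all report volumes belonging to one site."""
    merged = {}
    for site, tokens in reports:
        merged.setdefault(site, []).extend(tokens)
    return merged


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: weight} TF-IDF vector."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[name] = {
            # relative term frequency times a smoothed inverse document frequency
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: (site, tokens of one report volume).
reports = [
    ("site_a", ["pit", "dwelling", "pottery", "moat"]),
    ("site_a", ["moat", "pottery", "bronze"]),
    ("site_b", ["moat", "pottery", "dwelling", "bronze"]),
    ("site_c", ["kiln", "tile", "temple"]),
    ("site_c", ["temple", "tile", "roof"]),
]

site_docs = merge_by_site(reports)   # one merged document per site
vectors = tfidf_vectors(site_docs)
sim_ab = cosine(vectors["site_a"], vectors["site_b"])
sim_ac = cosine(vectors["site_a"], vectors["site_c"])
```

In this toy data, site_a and site_b share vocabulary and site_a and site_c share none, so `sim_ab` exceeds `sim_ac`. Real excavation reports in Japanese would additionally need tokenization before either TF-IDF or Doc2Vec could be applied.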

Cite this article as:
F. Sakahira, Y. Yamaguchi, and T. Terano, “Understanding Cultural Similarities of Archaeological Sites from Excavation Reports Using Natural Language Processing Technique,” J. Adv. Comput. Intell. Intell. Inform., Vol.27 No.3, pp. 394-403, 2023.
