JACIII Vol.14 No.5 pp. 531-539
doi: 10.20965/jaciii.2010.p0531


A Signal-Representation-Based Parser to Extract Text-Based Information from the Web

Mu-Chun Su*1, Shao-Jui Wang*2, Chen-Ko Huang*3,
Pa-Chun Wang*4,*5, Fu-Hau Hsu*1, Shih-Chieh Lin*1,
and Yi-Zeng Hsieh*1

*1Department of Computer Science & Information Engineering, National Central University, Taiwan

*2Chunghwa Telecom Co., Ltd., Taiwan


*4Quality Management Center, Cathay General Hospital, Taiwan, R.O.C.

*5School of Medicine, Fu Jen Catholic University, Taiwan

Received: October 3, 2009
Accepted: April 27, 2010
Published: July 20, 2010
Keywords: information extraction, wrapper, parser, Web, template matching
Most of the rapidly growing body of information on the World Wide Web is published as HTML and formatted for human browsing rather than for software programs. This situation calls for a tool that automatically extracts information from semistructured Web sources, making value-added Web services more useful. We present a signal-representation-based parser (SIRAP) that breaks Web pages up into logically coherent groups, for example groups of information related to a single entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm, and records on a Web page are then detected efficiently by matching them against the generated templates. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.
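The abstract does not spell out how HBCC compares tag structures, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that a page fragment can be summarized as a histogram of its HTML tags and that two fragments are compared with a Pearson correlation coefficient over those histograms. The names `tag_histogram` and `correlation` and the sample fragments are hypothetical.

```python
import math
import re
from collections import Counter

# Matches the name of an opening or closing tag, e.g. "<td", "</a".
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_histogram(html):
    """Count tags in a fragment; closing tags are keyed with a leading '/'."""
    return Counter(slash + name.lower() for slash, name in TAG_RE.findall(html))


def correlation(h1, h2):
    """Pearson correlation of two tag histograms over their combined tag set.

    Assumption: this is a stand-in similarity measure, not the paper's HBCC.
    Returns 0.0 when the correlation is undefined (constant histogram).
    """
    vocab = sorted(set(h1) | set(h2))
    if len(vocab) < 2:
        return 0.0
    x = [h1.get(tag, 0) for tag in vocab]
    y = [h2.get(tag, 0) for tag in vocab]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical fragments: a record template, a candidate record, and noise.
template = '<tr><td><a href="x">Title</a></td><td>snippet</td></tr>'
candidate = '<tr><td><a href="y">Other</a></td><td>more text</td></tr>'
noise = "<p><b>Ad</b><br><br><br></p>"

same_structure = correlation(tag_histogram(template), tag_histogram(candidate))
different_structure = correlation(tag_histogram(template), tag_histogram(noise))
```

In this sketch a candidate fragment would be accepted as a record when its correlation with a stored template exceeds some threshold, and a new template would be created otherwise; the threshold and the update rule are design choices the abstract leaves open.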
Cite this article as:
M. Su, S. Wang, C. Huang, P. Wang, F. Hsu, S. Lin, and Y. Hsieh, “A Signal-Representation-Based Parser to Extract Text-Based Information from the Web,” J. Adv. Comput. Intell. Intell. Inform., Vol.14 No.5, pp. 531-539, 2010.
