Applying genetic algorithms to query optimization in document retrieval

Aus de_evolutionary_art_org
Wechseln zu: Navigation, Suche

Referenz

J. Horng, C. Yeh: Applying genetic algorithms to query optimization in document retrieval. Information Processing and Management, 36 (2000), pp. 737–759

DOI

http://dx.doi.org/10.1016/S0306-4573(00)00008-X

Abstract

This paper proposes a novel approach to automatically retrieve keywords and then uses genetic algorithms to adapt the keyword weights. One of the contributions of the paper is to combine the Bigram (Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. 1997. Chinese text retrieval without using a dictionary, ACM SIGIR’97, Philadelphia, PA, USA, pp. 42–49; Yang, Y.-Y., Chang, J.-S., & Chen, K.-J. 1993), Document automatic classification and ranking, Master thesis, Department of Computer Science, National Tsing Hua University) model and PAT-tree structure (Chien, L.-F., Huang, T.-I., & Chien, M.-C. 1997 Pat-tree-based keyword extraction for Chinese information retrieval, ACM SIGIR’97, Philadelphia, PA, US, pp. 50–59) to retrieve keywords. The approach extracts bigrams from documents and uses the bigrams to construct a PAT-tree to retrieve keywords. The proposed approach can retrieve any type of keywords such as technical keywords and a person’s name. Effectiveness of the proposed approach is demonstrated by comparing how effective are the keywords found by both this approach and the PAT-tree based approach. This comparison reveals that our keyword retrieval approach is as accurate as the PAT-tree based approach, yet our approach is faster and uses less memory. The study then applies genetic algorithms to tune the weight of retrieved keywords. Moreover, several documents obtained from web sites are tested and experimental results are compared with those of other approaches, indicating that the proposed approach is highly promising for applications.

Extended Abstract

Bibtex

@article{Horng2000737,
title = "Applying genetic algorithms to query optimization in document retrieval ",
journal = "Information Processing & Management ",
volume = "36",
number = "5",
pages = "737 - 759",
year = "2000",
note = "",
issn = "0306-4573",
doi = "http://dx.doi.org/10.1016/S0306-4573(00)00008-X",
url = "http://www.sciencedirect.com/science/article/pii/S030645730000008X http://de.evo-art.org/index.php?title=Applying_genetic_algorithms_to_query_optimization_in_document_retrieval",
author = "Jorng-Tzong Horng and Ching-Chang Yeh",
keywords = "Information retrieval",
keywords = "Keyword",
keywords = "PAT-tree "
}

Used References

Allan, J. (1996). Incremental relevance feedback for information ®ltering. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland (pp. 270±278).

Baeza-Yates, R. A. (1992). Text retrieval: theory and practice. In International Federation for Information Processing Congress, Vol. 1, Madrid, Spain (pp. 465±476).

Buckley, C., Salton, G., Allan, J., & Singhal, A. (1994). Automatic query expansion using SMART: TREC 3. In Proceedings of the Third Text REtrieval Conference, Gaithersburg, Maryland (pp. 69±80).

Buckley, C., & Salton, G. (1995). Optimization of relevance feedback weights. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA (pp. 351±357).

Buckley, C., Singhal, A., & Mitra, M. (1995). New retrieval approaches using SMART: TREC 4. In Proceedings of the Fourth Text REtrieval Conference, Gaithersburg, Maryland (pp. 25±48).

Chang, C.-H., & Hsu, C.-C. (1997). Information searching and exploring agent applying clustering and genetic algorithm. In the 1st Agent Technology Workshop, Taipei, Taiwan.

Chang, C.-H., & Hsu, C.-C. (1999). The design of an information system for hypertext retrieval and automatic discovery on WWW. Ph.D. thesis, Department of CSIE, National Taiwan University.

Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. (1997). Chinese text retrieval without using a dictionary. ACM SIGIR'97, Philadelphia, PA, USA, pp. 42±49.

Chien, L.-F., Huang, T.-I., & Chien, M.-C. (1997). Pat-tree-based keyword extraction for Chinese information retrieval. ACM SIGIR'97, Philadelphia, PA, USA pp. 50±59.

Fung, P., & Wu, D. (1994). Statistical augmentation of a Chinese machine-readable dictionary. In Second Annual Workshop on Very Large Corpora, pp. 33±56.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison- Wesley.

Gonnet, G. H., Baeza-Yates, R. A., & Snider, T. (1992). New indices for text PAT trees and PAT arrays. In Information retrieval data structures and algorithms (pp. 66±82). Prentice Hall.

Gordon, M. D. (1988). Probabilistic and genetic algorithms in document retrieval. Communications of the ACM, 31, 1208±1218.

Harman, D. (1992). Relevance feedback revisited. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark (pp. 1±10).

Huang, M.-H. (1997). The development of indexing system evaluation Ð theory and practice. Chinese Library Society Report, 59, 109±126.

He, J., Xu, J., Chen, A., Meggs, J., & Gey, F. C. (1996). Berkeley Chinese information retrieval at TREC-5: technical report. In Proceeding of the Fifth Text REtrieval Conference, Gaithersburg, Maryland (pp. 191±198).

Jones, K. S. (1972). A statistical interpretation of term speci®city and its application in retrieval. J. Documentation, 28(1), 11±20.

Kwok, K. L. (1997). Comparing representations in Chinese information retrieval. ACM SIGIR'97, Philadelphia, PA, USA, pp. 34±41.

Lewis, D. D., Schapire, R. E., Callan, J. P., & Papka, R. (1996). Training Algorithms for Linear Text Classi®ers. ACM SIGIR'96, Zurich, Switzerland, pp. 298±306.

Liddy, E. D., Paik, W., & Yu, E. S. (1994). Text categorization for multiple users based on semantic features from a machine-readable dictionary. ACM Transaction on Information Systems, 12(3), 278±295.

Liu, Y. (1987). New advances in computers and natural language processing in China. Information Science, 8, 64± 70.

Lochbaum, K. E., & Streeter, L. A. (1989). Comparing and combining the e€ectiveness of latent semantic indexing and the ordinary vector space model for information retrieval. Journal of Information Science, 6, 25±33.

Lundquist, D., Grossman, D. A., & Frieder, O. (1997). Improving relevance feedback in the vector space model. In Proceedings of the Sixth International Conference on Information and Knowledge Management, Las Vegas, Nevada (pp. 16±23).

Manber, U., & Baeza-Yates, R. (1991). An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 37, 133±136.

Nie, J.-Y., Brisebois, M., & Ren, X. (1996). On Chinese text retrieval. ACM SIGIR'96, Zurich, Switzerland, pp. 225±233.

Rocchio, J. J. (1971). Relevance feedback in information retrieval. In G. Salton, The SMART retrieval system: Experiments in automatic document processing (pp. 313±323). Englewood Cli€s, NJ: Prentice-Hall.

Salton, G., & MacGill, M. (1983). Introduciton to modern information retrieval. New York: McGraw-Hill.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513±523.

Tate, D. M., & Smith, A. E. (1995). A genetic approach to the quardatic assignment problem. Computers and Operations Research, 22(1), 73±83.

Yang, Y., & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 252±277.

Yang, J.-J., & Korfhage, R. R. (1993). Query optimization in information retrieval using genetic algorithms. In Proceedings of the Fifth International Conference on Genetic Algorithms, Urbana, IL (pp. 603±611).

Yang, Y.-Y., Chang, J.-S., & Chen, K.-J. (1993). Document automatic classi®cation and ranking. Master thesis, Department of Computer Science, National Tsing Hua University.

Zhai, C., Tong, X., Milic-Frayling, N., & Evans, D. A. (1996). Experiments on Chinese text indexing Ð CLARIT TREC-5 Chinese track report. In Proceedings of the Fifth Text Retrieval Conference, Gaithersburg, Maryland (pp. 335±340).

Links

Full Text

internal file


Sonstige Links