Home | University of Tehran | Faculty of Engineering | ECE Department | CIPCE | Contact Us
DataBase Research Group - University of Tehran


Last Updated on: 21/07/07 
Data Mining
Title: XML Mining by frequent tree patterns

Work by: Mostafa Haghir Chehrehghani
Supervisor: Dr. Masoud Rahgozar
Advisor: Dr. Caro Lucas
Team Members: None
Problem Definition, goal and Importance: [persian]

Most of data do not have any specific form and not a fixed form at all. We call the banks which contain such data semi-structured informative banks. There exists no fixed schema for these information and data is stored in a graph like structure which has information about the data and data itself. Like xml.

Xml is a marking language derived from the general SGML language which uses hierarchical structure to store data. Users can define elements and their properties as they need them and customize the data format.
A large amount of data existing in the web is xml and is used widely in all the fields of web programming. Ability of extracting knowledge from xml data has become more important with the growth of usage of xml [6].

Because that in informative banks all the data have same structure, the data structure play no role in data mining. But in xml informative banks structure is a big part of the concept. So investigating the models in data structures like frequent sub-trees, plays an important role in data mining. We can imagine this like finding association rules in the general case [9].

Frequent trees, give us a big picture of data existing in the bank and is a way for summarizing the information in the bank. We can use these structures for formulating the questions and answers, indexes, clustering, and etc. Finding frequent structures because of the uncompleted structure of xml is not easy. Because xml files can have same data with different structures even the time the DTD is used.

Approach:

In this research we try to distinguish the points of weakness in existing algorithms and recommend solutions for them. At last implement it and compare the experimental results with other algorithms.

Research Prerequisites:

In [1] it uses tree edit distance for measuring the similarity between xml files without mentioning the term frequent trees. Their algorithm is based on the minimum operations needed for transforming one tree to another.
[3] Recommends that xml file be presented with a tag tree pattern and then try to find these patterns as much as possible. This way of presentations takes a lot of space for large xml files. In [4] lee defined the similarity as the common paths, but this definitions is not precise. Although [5] has a strong base of theory, but it seems impossible to implement. In [6] Mr. Ling presents a model for finding the association rules of xml. In [7] Mr. Zaki has presented an algorithm named tree miner which finds the frequent patterns in a data structure called scope-list. This one is one of the most effective algorithms for finding frequent patterns semi-structured data which can be expanded to trees. In [8] a way is recommended for solving the problem of Apriori. And recently [9] showed that when there are a lot of association rules to be found, XAR-Miner method works better that Mine Rule.

[1] C.H. Moh, E.P. Lim and W.K. Ng, DTD-Miner: A tool for mining DTD from XML documents, Proceedings of the 2nd Int. Workshop on Advance Issues of E-Commerce and Web-Based Information Systems, Milpitas, California, pp.144-151, June, 2000.

[2] J.W. Lee, K. Lee and W. Kim, Preparations for semantics-based XML mining, Proceedings of the 2001 IEEE International Conference on Data Mining, pp.345-352, San Jose, California, December, 2001.

[3] T. Miyahara, T. Shoudai, T. Uchida, K. Takahashi and H. Ueda, Discovery of frequent tree structured patterns in semistructuredWeb documents, Proceedings of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hong Kong, China, pp.47-52, April 2001.

[4] S. Nestorov, S. Abiteboul and R. Motwani, Extracting schema from semi-structured data, Proceedings of ACM SIGMOD International Conference on Management of Data, Seattle, Washington, pp.295-306, June 1998.

[5] Feng, L., Dillon, T. S., Weigand, H., Chang, E., An XML-Enabled Association Rule Framework. In Proceedings of DEXA 2003, pp 88-97, Prague, Czech Republic, 2003.

[6] Feng, L. & Dillon, T. S., Mining XML-Enabled Association Rule with Templates. In Proceedings of KDID 04, 2004.

[7] Zaki, M. J., Efficient Mining of Trees in the Forest. SIGKDD '02, Edmonton, Alberta, Canada, ACM, 2002.

[8] Park, J. S., Chen, M.-S., et al., Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering 9(5): 813-825, 1997.

[9] Zhang, J., et al. On Efficient and Effective Association Rule Mining from XML Data. In DEXA 2004. 2004. Zaragosa, Spain: LNCS.

Reports and Seminars:

1. presentation 1 [Persian]

2. presentation 2 [English]

3. presentation 3 [Persian]

Related Conferences:

VLDB

IEEE International Conference on Data Mining (ICDM)

Knowledge Discovery and Data Mining (KDD)

Related Links:

1. Education in Data Mining and Knowledge Discovery

2. XML Mining

Publications:

Incremental Mining of Frequent XML Query Patterns

Efficiently Mining Frequent Trees in a Forest

Fast mining of frequent tree structures by hashing and indexing

Frequent Free Tree Discovery in Graph Data

Mining XML-Enabled Association Rules with templates

X3-Miner: mining patterns from XML database


Copyright 2007 DBRG-UT. All rights reserved. Designed by Aresh Dadlani, Mohamad Hasan Ahmadi