Introduction
Text collections are essential for research in fields such as information retrieval, computational linguistics, and natural language processing. The Hamshahri collection is a standard, reliable Persian text collection that was used at the Cross-Language Evaluation Forum (CLEF) in 2008 and 2009 to evaluate Persian information retrieval systems. For more information, please visit Persian@CLEF2008 and Persian@CLEF2009.
We have prepared this collection to meet the following specifications:
- It is large enough to be a reliable representative of Persian text and to be used in a range of experiments.
- It contains queries and their relevance judgments, so that different systems can be evaluated.
Versions
The documents of the collection were prepared by crawling, preprocessing, and tagging the website of the Hamshahri newspaper. Version 1 of the collection was used at CLEF 2008 and 2009 for the evaluation of ad hoc information retrieval systems. Version 2 of the collection is about twice the size of version 1.
- Version 2 of the collection contains a topic set of 50 queries and their relevance judgments, created by 25 different users during summer 2009. The topics of this version of Hamshahri were created using the University of Tehran Information Retrieval Evaluation system (UTIRE).
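As a rough illustration of how a topic set with relevance judgments is typically used, the following sketch computes average precision for a single query. The document IDs, the ranking, and the function name are invented for the example; it makes no assumption about the actual file format of the Hamshahri topics or judgments, and it is not the evaluation procedure used by UTIRE or CLEF.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical ranking returned by a retrieval system for one topic.
ranking = ["d3", "d7", "d1", "d9", "d4"]
# Hypothetical relevance judgments for the same topic.
judged_relevant = {"d3", "d1", "d8"}

ap = average_precision(ranking, judged_relevant)
print(round(ap, 4))
```

Averaging this value over all 50 topics would give mean average precision, one common way to compare retrieval systems on the same collection.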
Specifications
The following table summarizes the specifications of the two versions of the Hamshahri collection:
| Criteria | Version 1 | Version 2 |
|---|---|---|
| Size (Unicode CLEF XML Format) | 700 MB | 1400 MB |
| Number of Documents | 160,000 | 318,000 |
| Documents Time Span: From | 1996/4/23 | 1996/4/23 |
| Documents Time Span: To | 2003/2/11 | 2007/5/13 |
| Documents Category | Yes | Yes |
| Link to Images | No | Yes |
| Link to Original Webpages | No | Yes |
| Query + Relevance Judgments | Yes | Yes |
Comparison of Versions 1 and 2
- Version 2 is more structured.
- Version 2 is about twice the size of version 1, both in megabytes and in number of documents.
- Version 2 contains links to the original webpages (ORIGINALFILE XML tag). This feature lets you download the original webpages and process them however you like.
- Version 2 contains the images that were used in the original webpages. All 148,639 images can be downloaded as a 1.9 GB package named 'Ham2-IMG'. This feature makes version 2 of the collection suitable for additional tasks such as image retrieval.
Please note that only version 1 of the collection was used at the CLEF campaign.
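As a minimal sketch of how the ORIGINALFILE links in version 2 might be collected, the code below reads them from a small XML fragment. Only the ORIGINALFILE tag name comes from the description above; the COLLECTION, DOC, and DOCID elements, the URL, and the function name are placeholders, and the real files may be laid out differently (for example, without a single root element), in which case the parsing step would need to be adapted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the ORIGINALFILE tag name is taken from the
# collection description; the other element names and the URL are placeholders.
SAMPLE = """<COLLECTION>
  <DOC>
    <DOCID>doc-1</DOCID>
    <ORIGINALFILE>http://example.org/news/1.html</ORIGINALFILE>
  </DOC>
  <DOC>
    <DOCID>doc-2</DOCID>
  </DOC>
</COLLECTION>"""


def original_links(xml_text):
    """Return the non-empty text of every ORIGINALFILE element, in document order."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter("ORIGINALFILE"):
        text = (element.text or "").strip()
        if text:
            links.append(text)
    return links


links = original_links(SAMPLE)
print(links)
```

Documents without an ORIGINALFILE element are simply skipped, so the same function would return an empty list on data that lacks these links.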
Applications
The Hamshahri collection can be used for purposes such as:
- Studying different components of information retrieval systems, such as indexers and retrieval models.
- Analyzing the Persian language and its features.
- Persian clustering and classification: every document in the collection has a 'Cat' tag that specifies its category. (There are 9 main categories and 36 subcategories in total, for example 'Economic.Bourse'.)
- Other algorithms such as Persian stemming: algorithms of this type are important and are used in fields such as information retrieval, spell checking, and machine translation. The documents of the collection are taken from articles of the Hamshahri newspaper and contain no spelling errors, which makes the collection suitable for applications such as stemming and analysis of the Persian language. For example, one can use it to create a statistical Persian stemmer.
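To illustrate how the 'Cat' tag could feed a classification experiment, the sketch below splits category values into a main category and an optional subcategory and counts the main categories. It assumes that the dot in a value such as 'Economic.Bourse' separates the two levels; apart from that one value, the category names and the function name are placeholders rather than actual labels from the collection.

```python
from collections import Counter


def split_category(cat_value):
    """Split a category value into (main, sub); sub is None when there is no dot."""
    main, _, sub = cat_value.strip().partition(".")
    return main, (sub or None)


# 'Economic.Bourse' is the example given in the text; the other values are placeholders.
cat_values = ["Economic.Bourse", "MainA.SubX", "MainA", "Economic.Bourse"]

# Number of documents per main category, usable as class labels for a classifier.
main_counts = Counter(split_category(value)[0] for value in cat_values)
print(main_counts.most_common())
```

Using only the main category gives a coarse labeling task, while the full value gives a finer-grained one.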
Copyright
The Hamshahri corpus was created at the DBRG Lab of the ECE Department, University of Tehran. All rights to the news content of the corpus are reserved by the Hamshahri newspaper. All rights to the corpus data and to the tools included on this website are reserved by the University of Tehran Database Research Group. Use of this package for any research or non-commercial purposes is free, on the condition that you cite paper [1] of the Publications section.