Note:
Two different versions of the Hamshahri collection are provided:
- Version 1 of Hamshahri has two subversions:
  - The CLEF version of Version 1 was prepared in 2008-2009, and many systems were evaluated on it in the CLEF 2008 and 2009 campaigns. It contains 100 topics with relevance judgments and 160,000+ documents, all in XML format. If you want to use version 1 of Hamshahri, we recommend this version. This version is also downloadable here.
  - The DBRG version of Version 1 was prepared in 2007. Its documents are exactly the same as those of the CLEF version but are stored in TREC format. It contains 65 topics with relevance judgments; these topics differ from the CLEF version topics. This version of Hamshahri is not available online; if you need it, please contact us.
- Version 2 of the Hamshahri collection was prepared in 2009 using the UTIRE system at the Database Research Group of the University of Tehran, based on CLEF standards. This version of the collection contains 320,000+ documents and 50 topics with their relevance judgments. Version 2 is also downloadable here.
Hamshahri Version 1 - CLEF 2008 & 2009 version
Note:
To use Hamshahri version 1, you need passwords. Please fill out this copyright form and mail it to a.aleahmad(at)ece.ut.ac.ir; we will then provide you with the passwords. We will also keep you updated about the latest changes to this version.
| Item | Size | Description | Download |
|---|---|---|---|
| Documents in XML + DTD | 160 MB | This file contains all Hamshahri 1 documents in XML format. Each XML document is tagged with a category and contains news articles of one day from the Hamshahri newspaper. File names follow the pattern "HAM1-YYMMDD", where YYMMDD gives the year, month and day of the documents in the file. The Jalali date of the documents is tagged in the XML file itself. | |
| Sample | 34 KB | This is a sample of the Hamshahri 1 documents. | |
| Categories | 15.2 KB | This file contains all categories that the collection documents are tagged with. | |
| Topic Set1 | 176 KB | This file contains 50 topics created for Hamshahri 1 at CLEF 2008. | |
| Topic Set2 | 156 KB | This file contains 50 topics created for Hamshahri 1 at CLEF 2009. | |
| Persian Stopwords | 9.6 KB | This is a list of 800+ stopwords extracted from the Hamshahri 1 corpus. | |
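The file naming pattern described above can be split into its parts with a short script. The following is a minimal sketch, not part of the collection's tooling: the names `HamFileName` and `parse_ham_filename` are hypothetical, and the `.xml` extension in the usage example is an assumption. The two-digit year is returned unchanged, since the century is not encoded in the file name.

```python
import re
from typing import NamedTuple, Optional


class HamFileName(NamedTuple):
    """Parts of a name such as "HAM1-YYMMDD" (hypothetical helper type)."""
    version: int  # digit after "HAM", e.g. 1 or 2
    yy: int       # two-digit year, left unexpanded
    mm: int       # month
    dd: int       # day


# "HAM", one version digit, a dash, then six digits read as YY, MM, DD.
# Anything after the six digits (such as an extension) is ignored.
_HAM_PATTERN = re.compile(r"^HAM(\d)-(\d{2})(\d{2})(\d{2})")


def parse_ham_filename(name: str) -> Optional[HamFileName]:
    """Return the parsed parts, or None if the name does not match."""
    match = _HAM_PATTERN.match(name)
    if match is None:
        return None
    version, yy, mm, dd = (int(group) for group in match.groups())
    return HamFileName(version, yy, mm, dd)
```

For example, `parse_ham_filename("HAM1-960621.xml")` yields version 1 with `yy=96`, `mm=6`, `dd=21`, while a name that does not follow the pattern yields `None`.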
Hamshahri Version 2
Note:
To use Hamshahri version 2, you need passwords. Please fill out this copyright form and mail it to a.aleahmad(at)ece.ut.ac.ir; we will then provide you with the passwords.
| Item | Size | Description | Download |
|---|---|---|---|
| Documents in XML+DTD | 399 MB | Contains the documents of Hamshahri 2 in CLEF format. Each XML document is tagged with a category and contains news articles of one day from the Hamshahri newspaper. File names follow the pattern "HAM2-YYMMDD", where YYMMDD gives the year, month and day of the documents in the file. The Jalali date of the documents is tagged in the XML file itself. | |
| HAM2-IMG Package | 1.93 GB | This package contains the images used within the collection documents. Image paths are tagged in the collection documents, so if you need the images, you should download the HAM2-IMG package. | |
| Help | 221 KB | User manual of the collection. It describes the XML tags used and the document categories. | |
| A Sample Document | 139 KB | A sample document of the collection in XML format. The DTD of the collection can also be viewed here. | |
| Document Categories | 15.2 KB | This file contains all categories that Hamshahri 2 documents are tagged with. | |
| Topics | 7.71 KB | Contains 50 queries that were created using UTIRE. All topics are presented in both English and Persian. | |
| Relevance Judgment | 485 KB | Contains the relevance judgments for the 50 topics above. | |
| Persian Words | 1.43 MB | List of all Persian words in the collection with their frequencies. | |
| Documents in Text Format | 295 MB | Contains Hamshahri 2 documents in plain text format, stored in Windows-1256 encoding. If you need any other encoding, please use the codepage converter software, which is also downloadable here. | |
| XML to Text Converter | 3.53 KB | This software converts Hamshahri 2 XML documents to plain text encoded in Unicode, Windows-1256 or UTF-8. In addition to the Hamshahri 2 collection, you need .NET Framework 3.5 or higher installed to run the program. | |
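As an alternative to the converter software mentioned above, text stored in Windows-1256 can be re-encoded with a few lines of Python, whose standard `cp1256` codec corresponds to that code page. This is a minimal sketch and not the tool distributed with the collection; the function names `cp1256_to_utf8` and `convert_file` are hypothetical, and the file paths are whatever you choose.

```python
from pathlib import Path


def cp1256_to_utf8(data: bytes) -> bytes:
    """Decode Windows-1256 bytes and re-encode them as UTF-8."""
    return data.decode("cp1256").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Read a Windows-1256 text file and write a UTF-8 copy to dst."""
    dst.write_bytes(cp1256_to_utf8(src.read_bytes()))
```

Working on raw bytes avoids any newline translation, so the converted copy differs from the original only in its character encoding.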