--------------------------------------------------------
README for the Arabic Online Commentary Dataset v1.1
--------------------------------------------------------

1. Introduction
-----------------

The AOC dataset was created by crawling the websites of three Arabic
newspapers, and extracting online articles and readers' comments. The readers'
comments are arguably more "interesting", which is why we call this the
*commentary* dataset, but the articles themselves are also included.

2. Sources
------------

The extraction crawled webpages corresponding to a roughly-6-month period,
covering early April 2010 to early October 2010. The three newspapers are:

1) Al-Ghad (الغد), a Jordanian newspaper (www.alghad.com)
   Crawled URLs:
      http://www.alghad.com/?news=500000
   through
      http://www.alghad.com/?news=534000
   (each page contains both the article and the corresponding comments)

2) Al-Riyadh (الرياض), a Saudi newspaper (www.alriyadh.com)
   Crawled URLs:
      http://www.alriyadh.com/article520000.html
   through
      http://www.alriyadh.com/article565000.html
   for the articles, and
      http://www.alriyadh.com/newspaper/comments/520000/1
   through
      http://www.alriyadh.com/newspaper/comments/565000/1
   for the corresponding comments

3) Al-Youm Al-Sabe' (اليوم السابع), an Egyptian newspaper (www.youm7.com)
   Crawled URLs:
      http://www.youm7.com/News.asp?NewsID=210000
   through
      http://www.youm7.com/News.asp?NewsID=285000
   for the articles, and
      http://www.youm7.com/Includes/NewsComments.asp?NewsID=210000&page=[1|2|...]
   through
      http://www.youm7.com/Includes/NewsComments.asp?NewsID=285000&page=[1|2|...]
   for the corresponding comments

For alriyadh.com, around 7% of the article URLs redirect to the "Net" edition
of Al-Riyadh.
For example,
   http://www.alriyadh.com/article520116.html
...redirects to:
   http://www.alriyadh.com/net/article/520116
Those articles seem to be published only online, and not in the paper edition,
as they do not have an issue number (which the other articles do), and they
have a timestamp (which the other articles do not).

2. Organization
----------------

The extracted comments were split into segments based on hard returns entered
by the author, as indicated by the HTML <br> tag. The extracted articles were
split into segments based on hard returns indicated by <br> and paragraph
breaks indicated by <p> (or </p>). No further punctuation-based segmentation
was performed, though it is perfectly reasonable for you to do so.

The dataset contains six XML files, two per newspaper (one for the comments and
one for the articles). The XML files contain the segments themselves, in
addition to some other relevant information for each comment or article, stored
as XML fields explained in the next section.

3. XML Fields
--------------

In the comment XML files (AOC_*_comments.xml), there are 4 fields that apply to
every comment:

(*) articleURL: The URL of the newspaper article (NOT the comments page).
(*) date: The date on which the comment was posted, formatted dd/mm/yyyy.
(*) time: The time at which the comment was posted, formatted hh:mm (or
    hh:mm:ss for alghad.com), following a 24-hour format (i.e. hh is between
    00 and 23).
(*) author: The author "ID" associated with that comment, as entered by the
    author.

...and there are 3 fields that apply to some but not all comments:

(*) subtitle: A header entered by the author for their comment.
    (Only for alghad.com and youm7.com comments.)
(*) authorEmail: An e-mail address entered by the author.
    (Only for alghad.com comments.)
(*) authorLocation: A location entered by the author.
    (Only for alghad.com comments.)

In the article XML files (AOC_*_articles.xml), there are 3 fields that apply to
every article:

(*) articleURL: Same as above.
(*) date: The date the article was published, formatted dd/mm/yyyy.
(*) htmlTitle: The string enclosed by the HTML <title> tag.

...and there are 2 fields that apply to some but not all articles:

(*) time: The time at which the article was published, formatted hh:mm (or
    hh:mm:ss for the alriyadh.com "Net" articles), following a 24-hour format
    (i.e. hh is between 00 and 23).
    (Only for alghad.com, youm7.com, and the alriyadh.com "Net" articles.)
(*) issue: The issue number where the article was published.
    (Only for the alriyadh.com non-"Net" articles.)

4. The Datasets
----------------

The webpages were downloaded by supplying a URL list to the wget command. Not
every URL has an article on it (though 97% of them do), and not every article
has reader comments. Here is a breakdown of those quantities:

-----------------------------------------------------------------------------
Source           |   Al-Ghad    Al-Riyadh  Al-Youm Al-Sabe'    ALL
-----------------------------------------------------------------------------
# URLs crawled   |    34,001       45,001            75,001    154.0K files
# URLs w/article |    32,223       43,506            73,798    149.5K files
# URLs w/comment |     6,299       34,163            45,667     86.1K files
-----------------------------------------------------------------------------
% URLs w/article |     94.8%        96.7%             98.4%     97.1%
% arts w/comment |     19.5%        78.5%             61.9%     57.6%
-----------------------------------------------------------------------------

Note that the release does not include the downloaded HTML files themselves,
because the total size of these files is quite large (about 1 GB compressed).
But if for some reason you are interested in the raw files, let me know.

The commentary data consists of 3.1M segments, corresponding to 52.1M words
(word: longest sequence of non-space characters).
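
As a rough illustration of the word definition above, here is a minimal Python
sketch. It assumes that "non-space" can be read as "non-whitespace", which is
what str.split() with no arguments gives you; the release does not say whether
tabs or other whitespace were treated differently. The function name
count_words and the sample strings are made up for this example and are not
part of the release.

```python
def count_words(segment: str) -> int:
    """Count words as maximal runs of non-whitespace characters.

    Assumption: "non-space" is read as "non-whitespace", so tabs and
    newlines also separate words.
    """
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading/trailing whitespace.
    return len(segment.split())


if __name__ == "__main__":
    # Made-up sample segments, not taken from the dataset.
    samples = ["one two  three", "  padded\tsegment ", ""]
    print([count_words(s) for s in samples])  # [3, 2, 0]

    # Sanity check using the Al-Ghad commentary figures from this README:
    # 1,235,300 words over 63,304 segments.
    print(round(1_235_300 / 63_304, 2))  # 19.51
```

Under this reading, punctuation attached to a word counts as part of that word,
so counts from a tokenizer that splits off punctuation would come out higher.
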
Here is the breakdown of the 86.1K articles across the three sources:

------------------------------------------------------------------------------------------
Source           |   Al-Ghad    Al-Riyadh  Al-Youm Al-Sabe'    TOTAL
------------------------------------------------------------------------------------------
# arts w/comment |     6,299       34,163            45,667     86.1K articles
# comments       |    26,648      804,968           564,853      1.4M comments
# segments       |    63,304    1,685,533         1,383,952      3.1M segments
# words          | 1,235,300   18,782,395        32,132,157     52.1M words
# characters     | 6,878,512  104,231,502       177,604,767    288.7M characters
XML file size    |   19.7 MB     340.0 MB          446.3 MB    806.0 MB (195.0 MB zipped)
------------------------------------------------------------------------------------------
comments/article |      4.23        23.56             12.37     16.21
segments/comment |      2.38         2.09              2.45      2.24
words/segment    |     19.51        11.14             23.22     16.65
characters/word  |      5.57         5.55              5.53      5.54
------------------------------------------------------------------------------------------

The article data consists of 1.4M segments, corresponding to 42.5M words. Note
that a "segment" here is often a full paragraph, and further punctuation-based
segmentation would result in a higher number of sentences (or true "segments"
if you prefer to call it that).
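
If you do want to try punctuation-based segmentation, the following Python
sketch shows one simple way to go about it. It is only an illustration, not
something that was run on the released data: the function name split_sentences
is hypothetical, and the set of sentence-final marks ('.', '!', '?', and the
Arabic question mark U+061F) is an assumption that you would likely want to
tune.

```python
import re
from typing import List

# Assumed sentence-final marks: '.', '!', '?', and the Arabic question
# mark (U+061F). A split happens only where such a mark is followed by
# whitespace, so "3.5" or "www.youm7.com" stay in one piece.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?\u061F])\s+")


def split_sentences(segment: str) -> List[str]:
    """Split one segment into sentence-like pieces, keeping the punctuation."""
    pieces = _SENTENCE_BREAK.split(segment.strip())
    # An empty segment yields [""], so drop empty pieces.
    return [p for p in pieces if p]


if __name__ == "__main__":
    # Made-up sample text, not taken from the dataset.
    for piece in split_sentences("Is this right? Yes. Thanks"):
        print(piece)
```

A rule this simple will mis-split after abbreviations and will not split where
the author left out the space after a full stop, which is not unusual in
informal comments.
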
Here is the breakdown of the 149.5K articles across the three sources:

-------------------------------------------------------------------------------------------
Source           |    Al-Ghad    Al-Riyadh  Al-Youm Al-Sabe'    TOTAL
-------------------------------------------------------------------------------------------
# articles       |     32,223       43,506            73,798    149.5K articles
# segments       |    367,324      383,651           615,168      1.4M segments
# words          | 11,424,287   12,676,090        18,413,850     42.5M words
# characters     | 68,393,539   75,503,507       109,007,044    252.9M characters
XML file size    |   134.1 MB     150.9 MB          223.6 MB    508.6 MB (130.2 MB zipped)
-------------------------------------------------------------------------------------------
segments/article |      11.40         8.82              8.34      9.14
words/segment    |      31.10        33.04             29.93     31.12
characters/word  |       5.99         5.96              5.92      5.95
-------------------------------------------------------------------------------------------

Keep in mind that many of the articles do not have any comments associated with
them, since no reader comments were found when they were crawled.

5. History
-----------

v1.1 (Nov. 29, 2010)
   added article data
   expanded readme
   updated sample XML file
   (no change at all to commentary data)

v1.0 (Nov. 1, 2010)
   initial release of commentary data alone

6. Contact Me
--------------

If you have any questions about the data, please contact me:
   ozaidan@cs.jhu.edu

--O.Z.