Data
I provide data and code so that others can reproduce or compare against my results. Email me if you are looking for something you don’t see below.
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations
This dataset contains tweets related to COVID-19. It provides tweet IDs, which you can use to download the original data directly from Twitter. We also include the date, keywords related to COVID-19, and the inferred geolocation. See http://twitterdata.covid19dataresources.org/index for details.
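Because the dataset distributes tweet IDs rather than tweet text, a typical first step is to load the IDs and group them into batches for whichever lookup tool you use. The following is a minimal sketch, assuming the IDs arrive one per line; the function names and the batch size of 100 are illustrative assumptions and are not part of the dataset's documentation.

```python
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Collect tweet IDs from lines, skipping blank lines and duplicates.

    IDs are kept as strings because they are large integers that
    some tools round or truncate when parsed as numbers.
    """
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batch(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs (size is an assumed value)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Usage with an in-memory list; a real run would pass an open file instead.
example_ids = read_tweet_ids(["1\n", "\n", "2\n", "1\n", "3\n"])
for group in batch(example_ids, size=2):
    print(group)
```

Each batch can then be handed to your download tool of choice; tweets that have been deleted or made private since collection will not be returned.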
Demographics Race/Ethnicity Training Data
Our paper, “Using Noisy Self-Reports to Predict Twitter User Demographics,” produced a dataset of Twitter users whose profile descriptions may self-report their race or ethnicity. We used this dataset to train classifiers for these demographic labels, and showed that models trained on the collected data perform better on gold standard survey data than models trained only on crowd-sourced data. We distribute the trained models.
Civil Unrest on Twitter (CUT)
Tweets labeled for content related to protests, riots, and civil unrest. Based on: Justin Sech, Alexandra DeLucia, Anna L Buczak, Mark Dredze. Civil Unrest on Twitter (CUT): A Dataset of Tweets to Support Research on Civil Unrest. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2020.
Named Entity Recognition for Chinese Social Media (Weibo)
This dataset contains messages selected from Weibo and annotated according to the DEFT ERE annotation guidelines. Annotations include both name and nominal mentions. The corpus contains 1,890 messages sampled from Weibo between November 2013 and December 2014.
Vaccine Related English Tweets from the United States: 2016 to 2018
A list of tweet ids sampled from the 1% feed that contain vaccine-related keywords. All tweets are in English (according to the Twitter-provided metadata) and have been geolocated to the United States (using Carmen).
Annotated Gun Control/Rights Tweets
This dataset contains 50k (automatically) annotated tweets about gun control and gun rights. It was used in our paper: Adrian Benton, Mark Dredze. Using Author Embeddings to Improve Tweet Stance Classification. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2018.
Annotations for "Weaponized Health Communication"
This dataset contains 10k tweet annotations for our paper: Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate.
CLPsych Shared Task
The Computational Linguistics and Clinical Psychology (CLPsych) workshop has hosted shared and unshared tasks for several years. In 2015 the shared task used data from Twitter users who state a diagnosis of depression or post-traumatic stress disorder (PTSD) along with demographically matched community controls. The shared task provided an apples-to-apples comparison of various approaches to modeling language relevant to mental health from social media. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD.
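The three experiments can be viewed as three pairwise subsets of one labeled user set. The following is a minimal sketch, assuming each user carries one of the hypothetical label strings `depression`, `ptsd`, or `control`; the shared task's actual file format and label names are not described here, so these are placeholders.

```python
from typing import Dict, List, Tuple

# Each task is (positive label, negative label), mirroring the three experiments.
# The label strings are hypothetical placeholders.
TASKS: List[Tuple[str, str]] = [
    ("depression", "control"),
    ("ptsd", "control"),
    ("depression", "ptsd"),
]


def binary_task(users: Dict[str, str], positive: str, negative: str) -> Dict[str, int]:
    """Keep only users with one of the two labels; map positive to 1, negative to 0."""
    return {
        user: int(label == positive)
        for user, label in users.items()
        if label in (positive, negative)
    }


# Toy user-to-label mapping, invented for illustration only.
users = {"u1": "depression", "u2": "ptsd", "u3": "control", "u4": "control"}
for positive, negative in TASKS:
    print(positive, "vs", negative, binary_task(users, positive, negative))
```

Users outside the two labels of a given task are dropped, so each experiment trains and evaluates on a different subset of the same population.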
RateMD Dataset
This dataset contains the reviews from RateMD, with aspect ratings, that were used in this paper:
Byron C. Wallace, Michael J. Paul, Urmimala Sarkar, Thomas A. Trikalinos, Mark Dredze. A Large-Scale Quantitative Analysis of Latent Factors and Sentiment in Online Doctor Reviews. Journal of the American Medical Informatics Association (JAMIA), 2014;21(6):1098--1103.
Zika Conspiracy Tweets
This dataset contains annotations for whether a tweet about Zika contains pseudo-scientific information. Analysis of this dataset was published in:
Mark Dredze, David A Broniatowski, Karen M Hilyard. Zika Vaccine Misconceptions: A social media analysis. Vaccine, 2016.
Vaccination Sentiment and Relevance Tweets
This dataset contains annotations for whether a tweet is relevant to the topic of vaccinations, and whether the author expresses a positive or negative view about vaccines. Analysis of this dataset was published in:
Michael Smith, David A. Broniatowski, Mark Dredze. Using Twitter to Examine Social Rationales for Vaccine Refusal. International Engineering Systems Symposium (CESUN), 2016.
Mark Dredze, David A. Broniatowski, Michael Smith, Karen M. Hilyard. Understanding Vaccine Refusal: Why We Need Social Media Now. American Journal of Preventive Medicine, 2015.
Flu Vaccination Tweets
This dataset contains annotations for whether a tweet is relevant to the topic of flu vaccination, and whether the author intends to receive a flu vaccine. Analysis of this dataset was published in:
Xiaolei Huang, Michael C. Smith, Michael Paul, Dmytro Ryzhkov, Sandra Quinn, David Broniatowski, Mark Dredze. Examining Patterns of Influenza Vaccination in Social Media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), 2017.
Named Entity Recognition and Entity Linking for Speech
This corpus contains broadcast news transcripts annotated for named entities and entity linking against the TAC KBP 2009 corpus. This was used in our NAACL 2015 paper "Entity Linking for Spoken Language" and in our 2011 Interspeech paper:
Carolina Parada, Mark Dredze, Frederick Jelinek. OOV Sensitive Named-Entity Recognition in Speech. International Speech Communication Association (INTERSPEECH), 2011.
Twitter Grammy XDoc Corpus: Entity Linking and Disambiguation
This corpus contains tweets about the Grammy Award ceremony annotated for entity linking and cross-document coreference resolution (entity disambiguation). The corpus is described in our paper:
Mark Dredze, Nicholas Andrews, Jay DeYoung. Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation. EMNLP Workshop on Natural Language Processing for Social Media, 2016.
Twitter Health Keywords
These files contain the keywords we use to collect and identify health related tweets.
Health Twitter Annotations
These annotations were created for the paper:
Michael J. Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report -, Johns Hopkins University, 2011.
The annotations label tweets as they relate to health. The annotations are described on page 2 of the paper. The file includes tweet ids which you can use to download the data.
Influenza Twitter Annotations
These annotations were created for the paper:
Alex Lamb, Michael J. Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
The annotations label whether tweets are related to influenza, whether they express awareness or infection, and whether the tweet is about the author or someone else. The files include tweet ids which you can use to download the data.
Twitter Named Entity Recognition Dataset
A collection of tweets tagged for named entities. These were created as described in this paper. Thanks to Dirk Hovy for preparing the data as part of his LREC 2014 paper, which contains a larger collection of Twitter NER data.
Twitter Hurricane Sandy Dataset
A collection of tweets from areas hit by Hurricane Sandy (2012) in the United States. This dataset is meant for research in social media disaster response.
Haoyu Wang, Eduard Hovy, Mark Dredze. The Hurricane Sandy Twitter Corpus. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015.
Twitter First Name, Last Name, and Location Clusters
A set of clusters extracted from Twitter that contain first names, last names, and locations. We used this in our NAACL 2013 paper:
Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
Enron Attachment Prediction Email
Enron emails annotated with attachment information and cleaned of numerous artifacts inserted by email programs. Unfortunately, I don't have a copy of the attachments. Very few groups had that data, and I am not aware of anyone who currently has a copy. Email me for the data.
Multi-Domain Sentiment Dataset
Product reviews from several different product types taken from Amazon.com. This dataset is from:
John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007.
Image Spam Dataset
A collection of ham and spam images taken from real user email. This dataset is from:
Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach. Learning Fast Classifiers for Image Spam. Conference on Email and Anti-Spam (CEAS), 2007.
TAC 2009 Entity Linking
A collection of manually linked training examples to supplement those provided in the TAC 2009 KBP task. These are described in my Coling 2010 paper on entity linking. If you use this data, please cite:
Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, Tim Finin. Entity Disambiguation for Knowledge Base Population. Conference on Computational Linguistics (Coling), 2010.