EDNA-Covid Tweets Dataset

We have released EDNA-Covid Tweets, a dataset of coronavirus-related tweets, described in our paper “Challenges and Opportunities in Rapid Epidemic Information Propagation with Live Knowledge Aggregation from Social Media”, to appear in IEEE CogMi 2020.

We have provided the full list of Tweet IDs at this Shared Directory, as well as in this GitHub Repository. The dataset contains data starting from January 2020, with the following statistics for the five most common languages:

Language      Number of Tweets    % of Total
English          1,027,630,312        66.67%
Spanish            202,004,495        13.11%
Portuguese          70,372,055         4.57%
Indonesian          53,438,177         3.47%
French              50,354,530         3.27%

A sample of hydrated tweets from January and February 2020 is provided in our public repository.

The dataset contains a collection of Tweet IDs collected through a live Twitter streamer. We have described our streaming architecture in our paper “EDNA-COVID: A Large-Scale Covid-19 Dataset Collected with the EDNA Streaming Toolkit”, archived here. EDNA began as a test-bed for studying concept drift detection and recovery in multimedia datasets. It has since grown into a toolkit for COVID-19 stream analytics.

The Covid-19 pandemic has fundamentally altered many facets of our lives. With nationwide lockdowns and stay-at-home advisories, conversations about the pandemic have naturally moved to social networks such as Twitter. The resulting high-volume, high-velocity, high-noise Covid-19 Twitter feed affords unprecedented insight into the evolution of social discourse in the presence of a long-running destabilizing factor such as a pandemic. However, real-time information extraction from such a data stream requires a fault-tolerant streaming infrastructure to perform the non-trivial integration of heterogeneous data sources from news organizations, social feeds, and authoritative medical organizations like the CDC. To address this, we present (i) the EDNA streaming toolkit for consuming and processing streaming data, and (ii) EDNA-Covid, a multilingual, large-scale dataset of COVID-19 tweets collected with EDNA since January 25, 2020. EDNA-Covid includes, at the time of this publication, over 1.5B tweets from around the world in over 10 languages. The Tweet-ID dataset is ~32 GB; the hydrated dataset derived from it exceeds 2 TB.

The EDNA streaming toolkit is provided at this GitHub repository.


Data Collection Process

Our EDNA architecture is described in our EDNA-Covid paper archived here. We provide here a short description of the dynamic data collection process.

With EDNA-Covid, we present a dataset that exhibits concept drift. The online discourse on the Covid-19 pandemic has taken root in a dizzying array of online communities, such as sports, academia, and politics. This gives us a firsthand look at a real-world example of concept drift, as online conversations change over time to accommodate new actors, knowledge, and communities. This yields a high-volume, high-velocity data stream with noise and drift as the underlying conversations about the pandemic transition from confusion to information to misinformation and disinformation on vaccines, masks, and the virus spread.

To handle this dynamic stream and its changing keywords, EDNA’s data collection needs to address concept drift, both detection and adaptation. To this end, we describe the following key blocks in the data collection architecture:

Metadata extractor: This block extracts the tweet object from the streaming record and performs some data cleaning by discarding malformed, empty, or irrelevant tweets. Tweets are kept if they contain coronavirus-related keywords: coronavirus, covid-19, ncov-19, pandemic, and mask. To capture Chinese social data, we also include these keywords in Mandarin. We initially included the keyword “china” during data collection in January and February, but decided to omit it since it introduced significant noise, and any tweets containing it that were relevant to coronavirus already included the keywords above.
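The keyword filter above can be sketched as follows. This is a minimal illustration, not the production extractor: it assumes streaming records are dicts with a `text` (or `full_text`) field, matches keywords as case-insensitive substrings, and omits the Mandarin terms used during actual collection.

```python
import re

# English keywords listed above; the Mandarin equivalents used during
# collection are omitted in this sketch.
COVID_KEYWORDS = ("coronavirus", "covid-19", "ncov-19", "pandemic", "mask")
KEYWORD_RE = re.compile(
    "|".join(re.escape(k) for k in COVID_KEYWORDS), re.IGNORECASE
)

def keep_tweet(record):
    """Return True if the record is a well-formed tweet whose text
    mentions at least one coronavirus-related keyword."""
    if not isinstance(record, dict):
        return False  # malformed streaming record
    text = record.get("text") or record.get("full_text") or ""
    if not text.strip():
        return False  # empty tweet
    return bool(KEYWORD_RE.search(text))
```

A stream operator would apply `keep_tweet` to each incoming record and drop the ones that return False.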

Misinformation Keywords: We obtain a collection of misinformation keywords from Wikipedia and from CoAID. The keywords from the former are obtained with a Wikipedia plugin that reads the misinformation article each day; the keywords from the latter are provided directly to EDNA, since they are not updated and do not need to be retrieved repeatedly.
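This two-source design can be sketched as a keyword store that merges a static list (standing in for CoAID) with a periodically refreshed dynamic source (standing in for the daily Wikipedia scrape). The `fetch_dynamic` callable is a hypothetical stand-in for the Wikipedia plugin; the refresh interval defaults to one day.

```python
import time

class KeywordSource:
    """Merge a static keyword list (e.g. CoAID) with a dynamic source
    that is re-fetched at most once per refresh interval (e.g. a daily
    scrape of the Wikipedia misinformation article)."""

    def __init__(self, static_keywords, fetch_dynamic, refresh_seconds=86400):
        self.static = frozenset(k.lower() for k in static_keywords)
        self.fetch_dynamic = fetch_dynamic  # hypothetical callable
        self.refresh_seconds = refresh_seconds
        self._dynamic = frozenset()
        self._last_refresh = 0.0

    def keywords(self):
        """Return the merged keyword set, refreshing the dynamic
        portion if the refresh interval has elapsed."""
        now = time.monotonic()
        if now - self._last_refresh >= self.refresh_seconds:
            self._dynamic = frozenset(
                k.lower() for k in self.fetch_dynamic()
            )
            self._last_refresh = now
        return self.static | self._dynamic
```

Downstream filters read `keywords()` on each batch, so updates from the dynamic source flow through without a restart.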

Misinformation Filtering: We check whether the grouped tweet objects contain any of the identified misinformation keywords. We continuously update the misinformation keyword set to ensure the filtering remains up-to-date.
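The filtering step can be sketched as below. This is an illustrative simplification: it assumes tweets arrive as dicts with a `text` field, takes the current keyword set as a plain iterable (so the latest keywords can be passed in on every batch), and matches keywords as case-insensitive substrings.

```python
def flag_misinformation(tweets, keywords):
    """Return the tweets in a grouped batch that contain any of the
    given misinformation keywords, paired with the keywords they hit.
    Re-reading `keywords` per batch keeps the filtering up-to-date."""
    keywords = [k.lower() for k in keywords]
    flagged = []
    for tweet in tweets:
        text = (tweet.get("text") or "").lower()
        hits = [k for k in keywords if k in text]
        if hits:
            flagged.append((tweet, hits))
    return flagged
```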