Special topic: Historical Datasets for Disaster Reduction Research · Zone I paper (under review) · Version EN3
A social media-based dataset of typhoon disasters, 2017
: 2017 - 12 - 14
: 2018 - 01 - 22
Abstract & Keywords
Abstract: Typhoons strike every year and cause major loss of life and property in the Northwestern Pacific region. During typhoon events, social media serve as an effective channel for transmitting and acquiring disaster information in real time. Texts and photos from social media can be used as a form of crowdsourcing to extract disaster loss information, analyze human behavior, and formulate responses. The dataset presented here consists of social media data collected from "Sina-Weibo" microblogs, "WeChat" articles, and "Baidu" news about typhoon events in 2017, covering Typhoons "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar". We mainly collected text data from these platforms and websites and then cleaned the data to remove redundant and irrelevant items. The dataset can be used for deeper mining of disaster information about these typhoon events.
Keywords: typhoon; social media; disaster reduction; data mining
Dataset Profile
Chinese title: 2017年台风灾害社交媒体数据集
English title: A social media-based dataset of typhoon disasters, 2017
Corresponding author: Xie Jibo (xiejb@radi.ac.cn)
Data authors: Yang Tengfei, Xie Jibo, Li Guoqing
Time range: 2017
Geographical scope: 15°N – 30°N, 101°E – 132°E; specific areas include southeast China and surrounding area
Data volume: 1.70 GB (9749 texts from "Baidu" news and "WeChat" Subscription; 9601 records from "Sina-Weibo")
Data format: .html, .xls, .sql
Data service system: <http://www.sciencedb.cn/dataSet/handle/547>
Sources of funding: National Key R&D Program of China (2016YFE0122600)
Dataset composition: This dataset consists of two compressed (ZIP) files, "Data.zip" and "Classification example.zip". "Data.zip" contains eight subfolders: "Haitang", "Hato", "Khanun", "Mawar", "Merbok", "Nesat", "Pakhar" and "Roke". Social media data are stored in these subfolders in three formats: ".html", ".xls" and ".sql". "Classification example.zip" contains seven subfolders, one for each of the seven large categories of disaster loss. Each of these contains further subfolders for the small categories under the corresponding large category. These data are saved in XLS format.
Data.zip:
● XLS file: Texts from social media are stored in XLS format in a structured form.
● SQL file: Users can execute the SQL file in their own MySQL database to import the data, which consist of structured texts from social media.
● HTML file: It is used to store original web pages retrieved from "Baidu" news and "WeChat" Subscription.
Classification example.zip:
● XLS file: It stores disaster loss data. Each file corresponds to a specific category of disaster loss.
1.   Introduction
Typhoons cause major losses of life and property each year in the Northwestern Pacific region. How to collect information quickly and respond appropriately is an urgent problem for disaster relief departments. Crowdsourcing and citizen observation have become effective methods of obtaining disaster information, and social media in particular provide near real-time information during a disaster. By making full use of the dynamic information collected from social media, disaster relief departments can obtain timely information about disaster events and people's responses to them. Previous research has addressed the mining of disaster information from social media data. Evidence shows that people's behavior is greatly influenced by social media when disasters occur [1]. A study commissioned by the American Red Cross [2] found that more than half of the respondents believed that government agencies should monitor social media to acquire timely and effective disaster information. As for how to mine valuable disaster information from social media data, Chae et al. [3] used Twitter data for hurricane disaster analysis, and the results supported government decision-making. Other studies [4,5] built disaster event classifiers based on microblog data, which detect disasters from the perspective of public observation.
Collecting useful information about disaster events from social media is time-consuming and complicated because the content is unstructured. Although some social media platforms provide an API (Application Program Interface) for public information access, they also restrict the information that can be obtained. For example, the API cannot be used to retrieve information related to a specific location and time period. This increases the workload of later data processing. In our research project, we therefore developed a toolkit to automatically harvest and process social media-based disaster information, and we used it to generate a dataset of typhoon disasters in 2017 from several social media platforms. The dataset is mainly composed of text data from "Sina-Weibo" microblogs, "WeChat" Subscription and "Baidu" news. Figure 1 illustrates how social media transmit typhoon disaster information in near real time.


Figure 1   Disaster information from "Sina-Weibo" microblogs and "WeChat" Subscription
2.   Data collection and processing
2.1   Overview
The dataset records information on the following eight typhoon events: "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar" (Table 1).
Table 1   The list of typhoons in 2017
No.   Name      Landfall time
1     Merbok    2017/6/12
2     Roke      2017/7/23
3     Nesat     2017/7/29
4     Haitang   2017/7/30
5     Hato      2017/8/23
6     Pakhar    2017/8/27
7     Mawar     2017/9/3
8     Khanun    2017/10/16
Keyword search was used to retrieve data from "WeChat" Subscription and "Baidu" news. For example, when "Typhoon Hato 2017" was entered into "Baidu" news, the "Baidu" search engine returned news related to Typhoon "Hato" in 2017. A web crawler was used to run the searches and to retrieve the relevant content automatically. We then parsed and cleaned these texts and stored them in the database in a structured form. The same method was used to obtain data from "WeChat" Subscription. For "Sina-Weibo", we used the advanced search function of the platform to obtain data related to the typhoon events. Based on the tracks of the typhoon events (Figure 2), we used the name of each typhoon plus the characters "台风" (Typhoon) as the keywords for setting retrieval conditions.
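The keyword scheme described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the function name `build_keywords` is hypothetical, the news query follows the form "Typhoon Hato 2017" quoted in the text, and the order in which the typhoon name and "台风" are joined for "Sina-Weibo" is an assumption. In practice the Chinese name of each typhoon would presumably be used for the microblog search.

```python
# Typhoon names and landfall dates as listed in Table 1.
TYPHOONS_2017 = [
    ("Merbok", "2017/6/12"),
    ("Roke", "2017/7/23"),
    ("Nesat", "2017/7/29"),
    ("Haitang", "2017/7/30"),
    ("Hato", "2017/8/23"),
    ("Pakhar", "2017/8/27"),
    ("Mawar", "2017/9/3"),
    ("Khanun", "2017/10/16"),
]


def build_keywords(name, year=2017):
    """Return one search string per platform for a typhoon name (hypothetical helper)."""
    news_query = f"Typhoon {name} {year}"  # form quoted in the text for "Baidu" news
    return {
        "baidu": news_query,
        "wechat": news_query,  # the text says the same method was used for "WeChat"
        # Typhoon name plus the characters "台风"; the joining order is an assumption.
        "weibo": f"{name}台风",
    }


if __name__ == "__main__":
    for name, landfall in TYPHOONS_2017:
        print(landfall, build_keywords(name))
```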


Figure 2   Tracks of the typhoon events in 2017
Source: "Tianditu" (http://map.tianditu.com/).
2.2   Data collection process
We developed a social media data harvesting system with functions for data collection, parsing, cleaning, and management, as shown in Figure 3. The data collection module was implemented with web crawler technology, and the collected data were then parsed into a structured form. The HTML pages from "WeChat" Subscription and "Baidu" news were also stored in their original HTML format. Data cleaning included removing duplicated information, converting traditional Chinese into simplified Chinese, converting full-width characters into half-width characters, etc. Finally, the cleaned data were stored in structured form.
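Two of the cleaning steps named above, width conversion and duplicate removal, can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation; the function names are hypothetical, and exact-match deduplication is an assumption (the paper does not say how duplicates were detected). Conversion from traditional to simplified Chinese is left out because it needs a character mapping table, for example from a library such as OpenCC.

```python
def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width forms of '!' through '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_texts(texts):
    """Normalize width, strip whitespace, and drop empty or duplicate texts.

    The first occurrence of each text is kept, so the input order is preserved.
    """
    seen = set()
    cleaned = []
    for raw in texts:
        text = to_half_width(raw).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Note that the range above also covers full-width punctuation such as the Chinese comma, which becomes an ASCII comma after conversion.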
The structure of the data is shown in Table 2.


Figure 3   Flowchart of the social media data harvesting system
Table 2   Structure of the data
File (.zip)   Folder   Folder                                                       File (.xls, .sql, .html)
Data.zip      baidu    Haitang, Hato, Khanun, Mawar, Merbok, Nesat, Pakhar, Roke    .xls, .sql, .html
              wechat   (same eight typhoon folders)                                 .xls, .sql, .html
              weibo    (same eight typhoon folders)                                 .xls, .sql
Notes:
.html: Users can parse the pages themselves according to their research needs.
.sql: Users can execute the SQL file in their own MySQL database to import the data.
.xls: Users can use the data directly through the XLS file.
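For readers who unpack "Data.zip", a short script can report which file types are present for each platform and typhoon. The sketch below assumes the two-level folder nesting of Table 2 (platform folder, then typhoon folder); the function name `index_dataset` and the path `Data` are hypothetical, and the actual file names inside the archive are not specified in this paper.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Map (platform, typhoon) to the sorted file extensions found in that folder.

    Assumes files sit exactly two levels below root: root/<platform>/<typhoon>/<file>.
    """
    index = defaultdict(set)
    for path in Path(root).glob("*/*/*"):
        if path.is_file():
            platform, typhoon = path.parts[-3], path.parts[-2]
            index[(platform, typhoon)].add(path.suffix.lower())
    return {key: sorted(exts) for key, exts in index.items()}


if __name__ == "__main__":
    # "Data" is a placeholder for wherever Data.zip was extracted.
    for (platform, typhoon), exts in sorted(index_dataset("Data").items()):
        print(platform, typhoon, ", ".join(exts))
```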
2.3   Data classification
Social media data contain a large amount of disaster loss information. Our dataset includes a classification example organized by type of disaster loss. The raw data in this example all come from "Sina-Weibo" microblogs related to the Typhoon "Hato" event in Zhuhai. Users can classify the remaining data in the dataset by following this example or according to their own research needs. The seven large categories are social effects, forestry, fisheries, traffic, electric power, communication, and infrastructure damage. Each large category contains one or more small categories, as shown in Figure 4. For example, the category of social effects contains casualty, water shortage, building damage, and market shutdown. The structure of the data classification is shown in Table 3.


Figure 4   Category of disaster loss
Table 3   The structure of data classification
File (.zip): Classification example.zip
Folder (large category)   Folder (small categories)                                         File (.xls)
social effects            casualty; water shortage; building damage; market shutdown        .xls
forestry                  destruction of trees and plants                                   .xls
fisheries                 loss of fishing ground; damage of fishing boats                   .xls
traffic                   traffic congestion; vehicle damage                                .xls
electric power            electric power cutoff; damage of electric power equipment         .xls
communication             interruption of networks and signals                              .xls
infrastructure damage     damage of street lamps, billboards, bridges, roads, and so on     .xls
Notes: The XLS file contains all the information of the related category, such as id, keyword, province, city, content, picture, location, release time, platform, number of forwards, comments, number of likes.
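Users who want to sort the remaining records into the seven large categories could start from simple cue-word matching. The sketch below is a hypothetical illustration only: the paper does not describe how the classification example was produced, the cue words are invented for this example, and they are in English to match the translated sample text, whereas the microblog content in the dataset is Chinese.

```python
import re

# Hypothetical cue words per large category; these are NOT the authors' rules.
CATEGORY_CUES = {
    "social effects": ["casualt", "water shortage", "building", "market"],
    "forestry": ["tree", "plant"],
    "fisheries": ["fishing"],
    "traffic": ["traffic", "vehicle", "car"],
    "electric power": ["electric", "power", "blackout"],
    "communication": ["network", "signal"],
    "infrastructure damage": ["street lamp", "billboard", "bridge", "road"],
}


def classify(text):
    """Return every large category with a cue word starting at a word boundary in the text."""
    lowered = text.lower()
    matched = []
    for category, cues in CATEGORY_CUES.items():
        # \b before the cue lets "car" match "cars" but not "scarce".
        if any(re.search(r"\b" + re.escape(cue), lowered) for cue in cues):
            matched.append(category)
    return matched
```

A record may fall into several categories or none; matching of this kind is crude (for instance, "car" also matches "carpet"), so it is best treated as a first pass before manual checking.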
3.   Sample description
Data fields for "Sina-Weibo" include id, keyword, province, city, content, picture, location, release time, platform, number of forwards, comments, and number of likes, as shown in Table 4. Each column has a limit of no more than 140 characters. The topics of the dataset include property loss, traffic impact, casualties, power supply, communication impact, rescue arrangements, response measures, public attitudes to typhoon disasters, etc.
Table 4   Data from "Sina-Weibo"
id: 210
keyword: typhoon
province: Guangdong Province
city: Zhuhai City
content: After the typhoon, Mr. Liu asked me out for a walk to experience the post-disaster Zhuhai. Almost no restaurant was open. Having looked for a long time, finally we found a restaurant which was open. We saw so many cars smashed, trees blown down, and yachts blown ashore. My little white car was scratched by the branches. How can I go to work tomorrow, since Hengqin is so far away? The last picture, as a tribute to our soldiers!
picture: http://ww2.sinaimg.cn/square/005WuHsBgy1fiu0v3h5b8j30qo0zkdvg.jpg ;
    http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0ul1m9aj30qo0zktks.jpg ;
    http://ww3.sinaimg.cn/square/005WuHsBgy1fiu0wqzd5nj30qo0zk4ap.jpg ;
    http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0y2re2bj30qo0z
location: Zhuhai
release time: 2017-08-23 22:25
platform: iPhone 7
number of forwards:
comments:
number of likes: 1
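A record such as the one in Table 4 can be turned into a typed object for analysis. The sketch below is illustrative: the class `WeiboRecord` and the function `parse_row` are hypothetical, the dictionary keys follow the field names used in Table 4 (the column headers in the actual XLS files may differ), and treating an empty count cell as missing rather than zero is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class WeiboRecord:
    id: int
    keyword: str
    province: str
    city: str
    content: str
    pictures: List[str]
    location: str
    release_time: datetime
    platform: str
    forwards: Optional[int]
    comments: Optional[int]
    likes: Optional[int]


def _optional_int(value):
    """Convert a count cell to int, or None when the cell is empty or absent."""
    text = "" if value is None else str(value).strip()
    return int(text) if text else None


def parse_row(row):
    """Build a WeiboRecord from a dict keyed by the field names of Table 4."""
    picture_cell = str(row.get("picture") or "")
    return WeiboRecord(
        id=int(row["id"]),
        keyword=row["keyword"],
        province=row["province"],
        city=row["city"],
        content=row["content"],
        # Picture URLs are separated by semicolons in the sample record.
        pictures=[url.strip() for url in picture_cell.split(";") if url.strip()],
        location=row["location"],
        release_time=datetime.strptime(row["release time"], "%Y-%m-%d %H:%M"),
        platform=row["platform"],
        forwards=_optional_int(row.get("number of forwards")),
        comments=_optional_int(row.get("comments")),
        likes=_optional_int(row.get("number of likes")),
    )
```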
Data fields for "Baidu" news include id, title, link, source, release time, and keyword, as shown in Table 5. The fields for "WeChat" Subscription include id, title, content, source, release time, and keyword, as shown in Table 6. The themes of the data include typhoon tracks, disaster loss statistics, government announcements, emergency measures, etc.
Table 5   Data from "Baidu News"
id: 51
title: 95 thousand people in Fujian to be relocated under Typhoon "Nesat" and "Haitang"
link: http://www.huaxia.com/xw/dlxw/2017/07/5415198.html
source: huaxia.com
release time: 2017-07-31, 15:11
keyword: Typhoon Haitang 2017
Table 6   Data from "WeChat" Subscription
id: 31
title: Typhoon "Haitang" has come! Weihai has become a sea!
content: Typhoon "Haitang" has come! Weihai has become a sea!
source: Neurologist [神经科专家] (the name of a WeChat account)
release time: 2017-08-04
keyword: Typhoon Haitang 2017
4.   Quality control and assessment
Keywords were diversified and optimized based on the language used on social media to retrieve as much related typhoon information as possible. After data collection was completed, we cleaned the data and removed records in which no typhoon event was specified. We also manually checked the dataset to remove unrelated information.
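The automatic part of this check, dropping records that name none of the eight typhoon events, can be sketched as below. This is an assumption about how such a filter might look, not the authors' procedure: the function names are hypothetical, and the English names from Table 1 are used for illustration, whereas the Chinese texts in the dataset would require the Chinese typhoon names.

```python
import re

# Event names from Table 1 (English forms, used here for illustration).
TYPHOON_NAMES = ("Merbok", "Roke", "Nesat", "Haitang", "Hato", "Pakhar", "Mawar", "Khanun")

# Whole-word, case-insensitive match, so that "broke" does not count as "Roke".
_NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in TYPHOON_NAMES) + r")\b",
    re.IGNORECASE,
)


def mentions_typhoon(text):
    """Return True if the text names at least one of the eight typhoon events."""
    return _NAME_PATTERN.search(text) is not None


def filter_relevant(records, field="content"):
    """Keep only records whose text field names a typhoon event."""
    return [record for record in records if mentions_typhoon(record.get(field, ""))]
```

Records that pass this filter would still need the manual check described above, since naming a typhoon does not guarantee that a text carries disaster information.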
5.   Value and significance
The data in our dataset can be further analyzed to meet different needs in disaster research. For example, the data can be re-classified according to different criteria and used as a supplementary source for disaster loss evaluation. The dataset can also be used for further analysis of typhoon disasters, such as sentiment analysis and hot word extraction. We have also used the texts in this dataset as a training corpus for the automatic identification of typhoon disaster information.
Acknowledgments
This work is supported by the National Key R&D Program of China (2016YFE0122600). We thank Edward T.-H. Chu, Associate Professor at National Yunlin University of Science and Technology, Taiwan, China for his advice on data collection. We thank Li Zhenyu from Shandong University of Science and Technology and Dr. Tian Chuanzhao from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences for their careful examination of our dataset.
References
1. National Research Council (U.S.). Public Response to Alerts and Warnings Using Social Media: Report of a Workshop on Current Knowledge and Research Gaps. Washington, DC: The National Academies Press, 2013.
2. American Red Cross. Social media in disasters and emergencies. Available at: <http://i.dell.com/sites/content/shared-content/campaigns/en/Documents/red-cross-survey-social-media-in-disasters-aug-2010.pdf> (Accessed December 11, 2017).
3. Chae J, Thom D, Yun J et al. Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51 – 60.
4. Zhou Y, Yang L, Walle BVD et al. Classification of microblogs for support[ing] emergency responses: Case Study [of] Yushu Earthquake in China, 2014. Proceedings of the 47th Hawaii International Conference on System Sciences, 2013: 1553 – 1562.
5. Qu Y, Huang C, Zhang P et al. Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake. Proceedings of ACM Conference on Computer Supported Cooperative Work, 2011: 25 – 34.
Data citation
1. Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. Science Data Bank. DOI: 10.11922/sciencedb.547
Article and author information
How to cite this article
Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. China Scientific Data. DOI: 10.11922/scdata.2017.0014.en (under review).
Yang Tengfei
social media data collection and analysis, writing.
PhD, research area: natural language processing, disaster information mining.
Xie Jibo
motivation of the research, writing.
xiejb@radi.ac.cn
geospatial data infrastructure, remote sensing, geo-computation.
Li Guoqing
advice on dataset design and data check, writing.
PhD, Professor, research area: geospatial data infrastructure, remote sensing, big data.
National Key R&D Program of China (2016YFE0122600)
Publication history
Zone I release date: January 29, 2018 (version EN3)
Zone II publication date: May 14, 2018 (version EN4)
Last updated: May 14, 2018 (version EN5)