Health & Environmental Research Online (HERO)
Citation
HERO ID
7702320
Reference Type
Journal Article
Title
Corpus-based Topic Derivation and Timestamp-based Popular Hashtag Prediction in Twitter
Author(s)
Kumar, SBR; Wang, K; Shen, S
Year
2019
Volume
35
Issue
3
Page Numbers
675-696
Language
English
DOI
10.6688/JISE.201905_35(3).0011
Web of Science Id
WOS:000467782400012
URL
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85065663473&doi=10.6688%2fJISE.201905_35%283%29.0011&partnerID=40&md5=c7c74798f5d713c4dc054599f7a7f6b1
Abstract
With the use of the Internet, mobile platforms, online commerce, and social media services, the footprints of human behavior can be easily recorded in the digital world, which generates data on an extremely large scale. Twitter, as a big data social network, has become one of the most important sources for capturing up-to-date events happening in the world. Deriving topics from Twitter is important for various applications, such as situation awareness, market analysis, content filtering, and recommendations. However, topic derivation with high purity in Twitter is hard to achieve because tweets are limited to 140 characters, and previous works on topic derivation in Twitter suffer from low purity. In this paper, we propose a corpus-based topic derivation (CTD) approach that combines a Twitter corpus with LF-LDA, a text processing model, to identify topics and clusters of similar hashtags. We use asymmetric topic LF-LDA to obtain better purity of topics. Compared to intJNMF, a representative related work, the purity (F-measure) of our proposed CTD increases from 5.26% (27.81%) to 11.32% (34.28%) for 20 to 100 topics. We also propose a timestamp-based popular hashtag prediction (TPHP) approach that uses the timestamps in tweets to create trending hashtag lists (THLs), which are lists of hashtags used by many users. We use the edit distance to measure the difference between consecutive THLs. This difference is then used to calculate volatility, which indicates how people react to real-world events. Compared to Hybrid+, a representative related work, the mean average precision (MAP) of our TPHP increases by 19.45% (week-day), 15.08% (week-week) and 16.95% (month-week). © 2019 Institute of Information Science. All rights reserved.
Keywords
corpus; popular hashtag prediction; timestamp; topic derivation; twitter
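Illustrative sketch
The abstract states that the edit distance between consecutive THLs is used to calculate volatility, but it does not give the formula. The Python sketch below shows one way this could look, under stated assumptions: each THL is treated as a ranked list in which a hashtag counts as a single symbol, the distance is the standard Levenshtein distance, and volatility is taken to be that distance divided by the length of the longer list. The function names edit_distance and volatility, the normalization, and the sample hashtags are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two lists, one hashtag = one symbol."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty list
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(
                prev[j] + 1,         # delete tag_a
                curr[j - 1] + 1,     # insert tag_b
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def volatility(thls: Sequence[Sequence[str]]) -> List[float]:
    """Assumed definition: normalized edit distance between consecutive THLs."""
    scores = []
    for earlier, later in zip(thls, thls[1:]):
        longest = max(len(earlier), len(later))
        # Two empty lists are identical, so their volatility is zero.
        scores.append(edit_distance(earlier, later) / longest if longest else 0.0)
    return scores


# Hypothetical THLs for three consecutive time windows.
sample_thls = [
    ["#news", "#sports", "#music"],
    ["#news", "#music", "#election"],
    ["#news", "#music", "#election"],
]
print(volatility(sample_thls))
```

Under this assumed normalization, a score of 0 means the list did not change between two windows, and a score of 1 means every position changed.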