Health & Environmental Research Online (HERO)
Citation
HERO ID
7204235
Reference Type
Journal Article
Title
Incorporating Dialectal Variability for Socially Equitable Language Identification
Author(s)
Jurgens, D; Tsvetkov, Y; Jurafsky, Dan
Year
2017
Publisher
ASSOC COMPUTATIONAL LINGUISTICS-ACL
Location
STROUDSBURG
Page Numbers
51-57
DOI
10.18653/v1/P17-2009
Web of Science Id
WOS:000493992300009
Abstract
Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves state-of-the-art performance on multiple LID benchmarks. Furthermore, in a case study using Twitter for health tracking, our method substantially increases the availability of texts written by underrepresented populations, enabling the development of "socially inclusive" NLP tools.
Editor(s)
Barzilay, R; Kan, MY
Conference Name
55th Annual Meeting of the Association for Computational Linguistics (ACL)
Conference Location
Vancouver, Canada