Exploiting Geo-tagged Tweets to Understand Localized Language Diversity


Thanaa Ghanem


Social media services have been the fastest-growing online communities of the last few years. Among them, Twitter has become the de facto standard for microblogging, with millions of tweets posted every day. In this paper, we present an analytical study of localized language usage and diversity in Twitter data, using half a billion geotagged tweets. We first identify local Twitter communities at the country level.

For the identified communities, we examine: (1) the language diversity, (2) the language dominance within the community and how it differs between the local and global views, (3) how well the demographics of tweets represent real population demographics, and (4) the spatial distribution of different cultural groups within each country.

To this end, we group the tweets on two levels. First, we group tweets by country to identify the local communities. Second, we group the tweets within each local community by tweet language. Our study yields useful insights into language usage on Twitter, which provide important information for language-based applications built on Twitter data, e.g., linguistic analysis and disaster management. In addition, we present an interactive tool for exploring the spatial distribution of cultural groups, which provides low-effort, high-precision localization of different cultural groups within a given country.
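
The two-level grouping can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the sample records, the (country, language) tuple layout, and the helper names `group_by_country_then_language` and `dominant_language` are hypothetical, and the share of the most frequent language is only one simple way to express dominance.

```python
from collections import Counter, defaultdict

# Hypothetical sample: one (country_code, language_code) pair per geotagged tweet.
tweets = [
    ("US", "en"), ("US", "en"), ("US", "es"),
    ("CH", "de"), ("CH", "fr"), ("CH", "de"), ("CH", "it"),
]


def group_by_country_then_language(records):
    """Level 1: group tweets by country. Level 2: count tweets per language."""
    communities = defaultdict(Counter)
    for country, language in records:
        communities[country][language] += 1
    return communities


def dominant_language(language_counts):
    """Return (language, share) for the most frequent language in one community."""
    language, count = language_counts.most_common(1)[0]
    return language, count / sum(language_counts.values())


communities = group_by_country_then_language(tweets)
for country, counts in sorted(communities.items()):
    language, share = dominant_language(counts)
    print(country, dict(counts), "dominant:", language, round(share, 2))
```

Summing the per-country counters into a single counter would give the corresponding global view, against which each local dominance figure can be compared.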