Most of us are aware that the websites we visit can reveal quite a lot about ourselves – and that these revelations are highly sought after by advertisers. Nearly all companies today are eager to access data about our online behavior so they can show us more relevant content (those Facebook ads can sometimes be scarily accurate).
However, there’s a major hole in the data on online behavior: mobile usage. To a large extent, the information available on the Web today is accessed through apps rather than through a browser, so automatically predicting user traits from mobile app usage data is of great importance. (This holds true not only for advertisers but also for individual users concerned about their privacy, since any Android app can access the list of other installed applications without a specific permission.)
I’ve been lucky to have the chance to work as a part-time data scientist at Verto Analytics while getting my PhD; this has given me the opportunity to analyze unique datasets related specifically to mobile usage. We recently studied mobile usage with Ingmar Weber and published our results and analysis in a paper (“You Are What Apps You Use: Demographic Prediction Based on User’s Apps”), which was accepted to the ICWSM-16 conference. The dataset we used for this study covered 3760 users: for each of them, the list of apps they had used at least once and their demographics (the users were compensated for providing their data).
Essentially, we learned that it’s indeed possible to predict demographic information about users based solely on the kinds of apps they’re using. (The Washington Post actually made a quiz based on our paper, “Quiz: Can we guess your age and income, based solely on the apps on your phone?”, but note that we only reported a few numbers in the paper, so the results of the quiz can’t be expected to be too accurate. In any case, the idea is cool! Big Think also reported on the results of our study.)
To study the predictability of different demographic attributes, we binarized each of them and solved the resulting classification problems employing logistic regression. The results are shown in the table below.
Gender is the easiest attribute to predict, whereas household income is the most challenging. To understand which apps contribute to the predictions, we listed the apps with the largest regression coefficients. A selection of these apps is shown in the next figure (another perk of working at a company is that you don’t have to MS Paint all the infographics yourself; you have access to a professional designer).
Many of the predictor apps are ones you would expect to see. However, there are some less obvious differences, e.g., high-income households use LinkedIn, whereas low-income households prefer an app called Job Search. Studying the predictor apps for low-income households also reveals an app called ScreenPay, which has quite an innovative concept: instead of asking for your money, the app pays you based on how many ads you’ve watched! 😛
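To make the approach a bit more concrete, here is a minimal sketch of the pipeline described above: binarize an attribute, fit a logistic regression on "has used this app" features, and rank the apps by their coefficients. This is not the code from the paper. The app names, the income figures, the cut-off of 50, and the training settings are all made up for illustration, and the model is trained with plain gradient descent so that the example needs nothing beyond the standard library.

```python
import math

# Toy records, invented for illustration: (apps used, household income in $1000s).
RECORDS = [
    ({"app_a", "app_b"}, 85),
    ({"app_a", "app_c"}, 70),
    ({"app_a"}, 95),
    ({"app_b", "app_d"}, 30),
    ({"app_c", "app_d"}, 25),
    ({"app_d"}, 40),
]
INCOME_THRESHOLD = 50  # assumed cut-off for binarizing; not taken from the paper


def binarize(value, threshold):
    """Turn a numeric attribute into a 0/1 class label."""
    return 1 if value >= threshold else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(rows, labels, vocab, epochs=500, lr=0.5, l2=0.01):
    """Batch gradient descent on L2-regularized log loss over binary app features."""
    weights = {app: 0.0 for app in vocab}
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = {app: 0.0 for app in vocab}
        grad_b = 0.0
        for apps, y in zip(rows, labels):
            # A feature is 1 if the user has used the app, so the linear
            # score is just the bias plus the weights of the apps used.
            err = sigmoid(bias + sum(weights[a] for a in apps)) - y
            grad_b += err
            for a in apps:
                grad_w[a] += err
        bias -= lr * grad_b / n
        for app in vocab:
            weights[app] -= lr * (grad_w[app] / n + l2 * weights[app])
    return weights, bias


rows = [apps for apps, _ in RECORDS]
labels = [binarize(income, INCOME_THRESHOLD) for _, income in RECORDS]
vocab = sorted(set().union(*rows))
weights, bias = fit_logistic_regression(rows, labels, vocab)

# Apps with the largest positive or negative coefficients are the strongest predictors.
ranked = sorted(vocab, key=lambda app: weights[app], reverse=True)
print("strongest predictor of class 1:", ranked[0])
print("strongest predictor of class 0:", ranked[-1])
```

On this toy data, `app_a` (used only by the "high" class) gets the largest positive coefficient and `app_d` (used only by the "low" class) the most negative one, while apps used equally by both classes end up near zero. With real data you would want a proper library and a held-out test set rather than this hand-rolled loop.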
We also studied how the number of apps affects predictability. Naturally, based on only a handful of apps, it’s hard to say much about a person but, to our surprise, we discovered that more apps are not always better when it comes to predicting demographics; people with more than 150 apps are harder to predict than people with 50-150 apps.
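One simple way to look at this kind of effect is to group users by how many apps they have and compute the accuracy separately per group. The sketch below does just that; the prediction records are invented for illustration, and the bucket edges are only an assumption based on the 50 and 150 app boundaries mentioned above.

```python
from collections import defaultdict

# Toy evaluation records, invented for illustration:
# (number of apps the user has used, true label, predicted label).
PREDICTIONS = [
    (12, 1, 0), (30, 0, 0), (45, 1, 0),
    (60, 1, 1), (90, 0, 0), (140, 1, 1),
    (160, 0, 1), (200, 1, 1), (320, 0, 1),
]


def bucket(n_apps):
    """Assign a user to a bucket by app count (edges are an assumption)."""
    if n_apps < 50:
        return "<50"
    if n_apps <= 150:
        return "50-150"
    return ">150"


def accuracy_by_bucket(predictions):
    """Return the fraction of correct predictions within each app-count bucket."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for n_apps, truth, predicted in predictions:
        key = bucket(n_apps)
        totals[key] += 1
        hits[key] += int(truth == predicted)
    return {key: hits[key] / totals[key] for key in totals}


result = accuracy_by_bucket(PREDICTIONS)
for key in ("<50", "50-150", ">150"):
    print(key, round(result[key], 2))
```

The numbers this prints say nothing about real users; the point is only the grouping step, which lets you see whether accuracy rises and then falls as the app count grows.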
To wrap up, some demographics (gender, age, race, and marital status) can be predicted with up to 82% accuracy based on the list of apps you have used, whereas other demographics (number of children and household income) are less predictable. And just in case, you should do as my wife suggested (in her not-so-serious tone) and make sure that the next time you go to a salary negotiation, you have the LinkedIn app installed. 😉