For Hackweek XLIV, our data science team explored how LiveRamp’s demographic data and identity resolution technologies could be used by sociologists to study human behavior. To that end, we built a new tool for exploring correlations in demographic data and used it to discover several interesting facts about human behavior, such as the correlation between education level and political affiliation. This post explains how we studied demographic correlation and shares what we learned about people.
First, a quick bit of background on the data and technologies we used. LiveRamp has a lot of demographic data on people, including thousands of different features such as income, education level, and political affiliation. The raw demographic data that we purchase is associated with people through a variety of directly identifiable personal data, such as name and postal address, email, and phone number. Our identity resolution technologies map each of these directly identifiable touchpoints to a person-based ID called a RampID (IDL). This allows us to group together data on different identifiers for the same person. Further, for privacy reasons we ensure that an IDL can never be associated back to the directly identifiable personal data for that person. We simply take the raw demographic data and use identity resolution to replace the directly identifiable personal data with IDLs, which gives us pseudonymized demographic data.
In our current products, this pseudonymized demographic data is used to target consumers with certain attributes in digital marketing campaigns. In this Hackweek project, we experimented with how this same data could be used to explore human behavior by looking at the correlations between pairs of demographic segments.
To quantify the correlation between values for a pair of segments, we first compute the co-observation counts for each pair of values. These are computed for a small sample of IDLs using MapReduce. For example, the co-observation counts for education and political affiliation are as follows:
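The counting step can be illustrated with a minimal single-machine sketch in Python. It is not our MapReduce job: the record layout `(id, segment, value)`, the function name `co_observation_counts`, and the toy records are all assumptions made for illustration, and the sketch assumes each ID carries at most one value per segment.

```python
from collections import Counter, defaultdict

def co_observation_counts(records, segment_a, segment_b):
    """Count how often each (value_a, value_b) pair is observed on the same ID."""
    # "Map" step: group every ID's segment values together.
    by_id = defaultdict(dict)
    for idl, segment, value in records:
        by_id[idl][segment] = value
    # "Reduce" step: count value pairs for IDs that have both segments.
    counts = Counter()
    for segments in by_id.values():
        if segment_a in segments and segment_b in segments:
            counts[(segments[segment_a], segments[segment_b])] += 1
    return counts

# Made-up toy records, not real data: (id, segment, value).
records = [
    ("id1", "education", "college"), ("id1", "party", "A"),
    ("id2", "education", "college"), ("id2", "party", "B"),
    ("id3", "education", "high_school"), ("id3", "party", "A"),
    ("id4", "education", "college"), ("id4", "party", "A"),
    ("id5", "education", "high_school"),  # no party value, so it is skipped
]
counts = co_observation_counts(records, "education", "party")
```

IDs that are missing either segment contribute nothing, so the counts describe only the people for whom both values are known.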
We next determine whether a pair of segments has a statistically significant correlation between their values using a chi-squared test. In this example of education and political affiliation, there is a statistically significant correlation. In fact, we found many segments to be significantly correlated with political affiliation, as shown in the following table. The “p” column shows the p-value from the chi-squared test, which is the probability of seeing a correlation at least this strong by random chance if the two segments were actually independent. The smaller the p-value, the more significant the correlation.
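For readers who want to see the mechanics, here is a small sketch of a chi-squared test of independence on a table of co-observation counts, written with only the Python standard library. The function names and the 2x2 table are hypothetical, the numbers are made up, and the sketch assumes every row and column total is nonzero. In practice a library routine such as SciPy's `chi2_contingency` would normally be used instead.

```python
import math

def chi2_sf(x, dof):
    """P(X >= x) for a chi-squared variable with integer dof >= 1."""
    y = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < dof / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_squared_test(table):
    """Return (statistic, degrees of freedom, p-value) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two segments were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, chi2_sf(stat, dof)

# Made-up 2x2 table of co-observation counts, not real data.
stat, dof, p_value = chi_squared_test([[30, 10], [10, 30]])
```

Every expected count in this toy table is 20, so the statistic is 4 * (10^2 / 20) = 20 with one degree of freedom, which gives a very small p-value.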
We use normalized pointwise mutual information (PMI) to quantify the correlations between the different values for a pair of segments. The normalized PMI between values for the example is:
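As a rough illustration, the sketch below computes normalized PMI from co-observation counts. It assumes the common definition, in which the PMI of a value pair, log(p(a, b) / (p(a) p(b))), is divided by -log p(a, b) so that the result lies between -1 and 1. The exact normalization our tool uses is not spelled out here, and the function name and toy counts are hypothetical.

```python
import math
from collections import Counter

def normalized_pmi(counts):
    """Map each (value_a, value_b) pair to its normalized PMI."""
    total = sum(counts.values())
    totals_a, totals_b = Counter(), Counter()
    for (a, b), n in counts.items():
        totals_a[a] += n
        totals_b[b] += n
    result = {}
    for (a, b), n in counts.items():
        if n == 0 or n == total:
            continue  # normalized PMI is undefined for these pairs
        p_ab = n / total
        pmi = math.log(p_ab / ((totals_a[a] / total) * (totals_b[b] / total)))
        result[(a, b)] = pmi / -math.log(p_ab)
    return result

# Made-up toy counts, not real data.
toy_counts = {("x", "a"): 30, ("x", "b"): 10, ("y", "a"): 10, ("y", "b"): 30}
npmi = normalized_pmi(toy_counts)
```

Positive values mean two values co-occur more often than independence would predict, negative values mean less often, and zero means no association.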
We can visualize these correlations in a graph where the edge widths scale with PMI.
We created a web app to explore the correlations between pairs of segments. We built the app with R Shiny and hosted it on AWS using Kubernetes.
We used the app to explore the correlation between many different pairs of segments. Here are a few that we found interesting.
Sports Interests and Political Affiliation
Political Affiliation and Coffee Brand
Diet and Cellphone Brand
These are just a sample of the interesting facts we learned about human behavior. Further, we’re hoping other people at LiveRamp will use this tool to discover additional interesting correlations. Using established associations, we can also make inferences about segments that are missing for a person based on the segments that exist. For example, based on the results here, if we know someone is Republican, then even without any information about their sports interests, we can infer that they are more likely than the average individual to be interested in snow skiing.
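One simple way to express this kind of inference is as a lift: the probability of a value given a known segment value, divided by that value's overall probability. The sketch below is only an illustration under that assumption; the function name, labels, and counts are made up and do not come from our data.

```python
def lift(counts, known_value, target_value):
    """P(target | known) / P(target), from (known, target) co-observation counts."""
    total = sum(counts.values())
    known_total = sum(n for (k, _), n in counts.items() if k == known_value)
    target_total = sum(n for (_, t), n in counts.items() if t == target_value)
    p_target_given_known = counts.get((known_value, target_value), 0) / known_total
    p_target = target_total / total
    return p_target_given_known / p_target

# Made-up toy counts, not real data: (known segment value, target segment value).
toy_counts = {
    ("group_1", "interest"): 30, ("group_1", "no_interest"): 10,
    ("group_2", "interest"): 10, ("group_2", "no_interest"): 30,
}
group_1_lift = lift(toy_counts, "group_1", "interest")
```

A lift above 1 means that knowing the first value makes the second more likely than it is for the average individual, and a lift below 1 means less likely.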