LiveRamp’s core products rely on procuring and processing links among diverse identifiers that relate to the same person. We make a strong distinction between PII (Personally Identifiable Information – anything which could trace back to a real-world person) and anonymous data. On the PII side of things, we have treated links among identifiers as edges in a graph for a long time, as described here. On the anonymous data side, we initially only had one type of identifier, the cookie, and thus a simple tuple store (i.e. identifier pairs stored within Hadoop’s Distributed File System) was enough to represent all links.
As our product was expanded with new use cases, new types of anonymous identifiers (e.g. mobile device id) and links entered the picture, and the business rules governing what types of links can be used for each product also became more complicated. We started by creating a separate tuple store for each new identifier type, but this meant that any consumer would have to be rewritten with type-specific logic to be able to read from the new store. Eventually this approach became unsustainable and we needed to find a way to make the system much more flexible. The solution we came up with is to represent all links among anonymous identifiers as a single identity graph.
We employ the Property Graph Model, in which vertices are identifiers, edges are pairs of vertices with direction (since we need undirected edges, every edge is represented as two edges), and both vertices and edges can have properties associated with them. Our problem can be described as follows: given a source identifier, find all allowed paths that lead to identifiers of a certain type, using rules that dictate which types of vertices and edges can be used along the way.
Consider the example graph and rules in the following figures.
Applying rule 1 to the graph would result in two possible paths: 3-4-5-9 and 3-4-8-9, as shown below.
Similarly, applying rule 2 to the graph would result in the path 2-6-7-10-11.
Notice that an algorithm for computing connected components (which assigns a common id to all vertices that belong to the same connected component) is of little use here since the vertices reachable by each vertex depend on the specific rules that apply to the type of that vertex. Generally speaking, the connected component a vertex belongs to is much larger than the group of all the vertices reachable from that vertex, given the restrictions imposed by the rules. In the example, even though vertices 2 and 3 belong to the same connected component, they are not reachable from one another when following the rules.
Our first implementation is based on map-reduce. The general strategy is as follows:
- For each vertex, we precompute and store the subgraph containing all reachable vertices.
- A series of map-reduce jobs builds paths to all reachable vertices, one out-degree at a time. During this process, all potentially useful paths (i.e. paths admitted by some rule) are included.
- A final job aggregates all paths that start with a common vertex and creates a list of all edges involved.
- Consumers use the precomputed subgraphs to avoid having to process the graph directly themselves. For a given source identifier, rules specific to a particular product are used to reconstruct the permissible path(s) to desired target identifiers using the precomputed list of relevant edges.
If we use the example graph above as input, the process would work as follows. First, an initial job joins the input edges with themselves to create a set of partial paths. Any paths not admissible by a rule are filtered out.
Then, a series of intermediate jobs iteratively extend the partial paths by joining them with the input edges. Only paths that are incomplete and need to be extended are passed on to the next iteration.
Once no paths can be extended, a final job groups all complete paths by source identifier and creates objects that contain all the relevant edges.
This process requires one map-reduce job for each edge traversed on the path from source to target. As a consequence, the overall computation will need as many jobs as the number of edges in the longest allowed path present in the data (4 in the example).
The map-reduce model is not very efficient for processing graphs. Step 1 above needs 5 jobs and takes 15 hrs on our 6k mapper / 3k reducer Hadoop cluster to process a graph with approximately 2 billion vertices and ten of billions of edges. Moreover, by storing the entire list of relevant edges for each vertex we incur considerable storage overhead, since many edges are duplicated in the store.
In the long term we plan to keep using the same data model, but move away from the map-reduce implementation. We are exploring two alternatives:
- Using a Pregel-like graph computation engine to replace map-reduce
- Load edges into a distributed random-access store and compute paths on the fly
Stay tuned for future posts exploring these alternatives.