I vividly remember the moment I discovered the power of graphs. It was during my university years, while studying optimization algorithms. I was fascinated by how complex problems and constraints could be modeled with simple nodes and connections. Later, in a bioinformatics project on protein functional annotation, I used sequence similarity networks to observe how similar proteins were in their composition and how this correlated with their similarity in enzymatic activity. I was amazed by how they revealed hidden patterns in the data and how you could literally see complex relationships between members of a protein family. Adjusting the resolution of how "far" or "close" you wanted to see the similarity distance allowed for observing complex connections that matched evolutionary patterns, functional sites, and structural features shared among the clusters generated by the graph representation.
That's when I truly understood how versatile graphs can be and how much you gain by abstracting your data into a more flexible representation. Seeing your data from another angle—adjusting the resolution to see it closer or farther, establishing new connections, or simply transforming a table into a graph—opens the door to discovering patterns you never imagined. Sometimes, you just need to look at them differently to start asking truly interesting questions for your exploration.
What are Graphs?
A graph is a data structure that models relationships between objects. It consists of nodes (or vertices), which represent entities, and edges (or links), which represent connections or relationships between those entities. Both nodes and relationships can have properties or attributes, such as primary keys and temporality. The beauty of graphs lies in their ability to represent complex information structures in an intuitive and visual way.
To illustrate, let's think of a social network. Each person is a node, and the friendship relationships between them are the edges. By analyzing this network, we can identify communities, detect influencers, and predict how news spreads. Another more technical example is transportation networks: airports can be represented as nodes and flights as edges, allowing for a network representation that can be used to optimize routes and minimize wait times, for example.
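The transportation example above can be sketched in a few lines of plain Python: airports as nodes in an adjacency list, direct flights as edges, and a breadth-first search to find the route with the fewest stops. The airport codes and routes are made up for illustration.

```python
from collections import deque

# Toy flight network: airports are nodes, direct flights are edges
# (an adjacency list; the routes are invented for this example).
flights = {
    "LIM": ["BOG", "SCL"],
    "BOG": ["LIM", "MEX", "JFK"],
    "SCL": ["LIM", "EZE"],
    "EZE": ["SCL", "JFK"],
    "MEX": ["BOG", "JFK"],
    "JFK": ["BOG", "MEX", "EZE"],
}

def shortest_route(graph, start, goal):
    """Breadth-first search: returns the route with the fewest stops."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_route(flights, "LIM", "JFK"))  # ['LIM', 'BOG', 'JFK']
```

In a relational table of flights, answering the same question would require a recursive self-join; on the graph it is a textbook traversal.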
Applications of Graph Data Science
Graph Data Science (GDS) uses the power of graphs to solve complex problems more efficiently than traditional table-based approaches. Strictly speaking, tabular databases contain the same information as a graph: entities (rows) and their relationships (through foreign keys or joins). However, extracting connectivity patterns or identifying complex structures in a relational table can be extremely difficult and computationally expensive. The main limitations of tabular representations include:
- Expensive and unintuitive queries: In a relational database, answering questions like "How close are two customers within a transaction network?" or "What is the most influential community in a social network?" requires multiple joins and complex calculations. In contrast, in a graph, these analyses are direct and more efficient using algorithms like shortest path or community detection.
- Loss of structural information: Although tabular data can model relationships, it does not naturally capture the interconnectivity of the data. For example, in a recommendation system based on user-product relationships, a table only reflects explicit interactions, while a graph allows inferring new connections based on structural similarity that emerges from the graph's topology.
- Limited scalability in complex relationship analysis: In datasets with many relationships between entities, the number of joins and possible combinations grows exponentially, making queries prohibitively slow. Graphs, on the other hand, are designed to scale in high-order relationship analysis without significant performance degradation.
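To make the recommendation point concrete: the structural signal a graph exposes can be as simple as counting shared neighbors in a user-product network. The sketch below (toy, invented data) infers a new product for a user from the overlap between baskets — a connection no single row in a table states explicitly.

```python
# Toy user -> product interactions (hypothetical data).
bought = {
    "ana":  {"pizza", "garlic_bread"},
    "luis": {"pizza", "garlic_bread", "salad"},
    "eva":  {"ice_cream"},
}

def suggest(user, interactions):
    """Recommend products bought by users with overlapping baskets,
    scored by how many purchases they share with us."""
    mine = interactions[user]
    scores = {}
    for other, basket in interactions.items():
        if other == user or not (mine & basket):
            continue  # no shared purchases -> no structural signal
        for product in basket - mine:
            scores[product] = scores.get(product, 0) + len(mine & basket)
    return sorted(scores, key=scores.get, reverse=True)

print(suggest("ana", bought))  # ['salad'] — inferred via shared neighbors
```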
Therefore, in problems where relationships play a key role—such as fraud detection, recommendation engines, or social network analysis—graphs allow for representing information more intuitively. Leading companies have integrated this technology in multiple areas with surprising results:
- Recommendation systems at Uber Eats: They implemented Graph Neural Networks (GNNs) to improve personalization in restaurant and dish recommendations. During testing, they observed an improvement of over 20% in key metrics such as Mean Reciprocal Rank (MRR), Precision@K, and NDCG, compared to the previous production model [Uber Blog, 2019].
- Recommendation systems at Netflix: The platform has integrated Graph Neural Network (GNN) models to improve the accuracy of its recommendations. A recent study highlights that, by combining co-engagement signals and semantic links through GNNs, Netflix achieved up to a 35% improvement in performance in content similarity evaluation tasks [Huang, Zijie, et al., 2023].
- Fraud detection in financial institutions: The adoption of machine learning models that incorporate graph analysis has proven highly effective in detecting fraudulent activities. For example, a study on optimizing anti-money laundering alerts using machine learning with graphs showed that it is possible to reduce the number of false positives by 80% while detecting more than 90% of truly positive cases [Eddin, Ahmad Naser, et al., 2021].
Graph Neural Networks (GNNs)
Remember that a graph is simply another way to represent data. And if your application requires finding more complex patterns, using deep learning sounds like an excellent option. Graph Neural Networks (GNNs), in simple terms, apply deep learning techniques using the graph's topology as a propagation pattern. This means that the graph's relationships or edges will be responsible for transferring information through the neural layers, in a process known as message passing.
How GNNs Work
GNNs operate by exchanging information between connected nodes to update their representations based on the graph's structure and the characteristics of neighboring nodes.
Message Passing Stages
- Initialization: Each node starts with a feature vector that represents its attributes. For example, in a social network, a node representing a user could include data such as age, interests, and activity level.
- Message aggregation: In this phase, each node collects information from its neighbors using various methods, such as:
  - Sum (pooling): The messages from neighbors are summed.
  - Average: The average of the messages is taken.
  - Maximum: The highest value among the messages is selected.
- Node update: Once the information from the neighbors is aggregated, each node updates its feature vector based on its previous state and the new messages received. This is done with neural network layers, such as fully connected layers or recurrent networks.
- Layer stacking: By adding multiple message passing layers, the target node receives information from its direct neighbors and also from the neighbors of its neighbors. This allows for capturing more complex relationships in the graph.
- Final representation: After several iterations, the nodes have enriched representations that can be used for tasks such as classification, link prediction, or clustering.
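The stages above can be sketched with plain Python lists — no deep learning framework needed. One round of message passing on a three-node toy graph: each node averages its neighbors' feature vectors (aggregation), then combines the result with its own state through a ReLU (a stand-in for a learned layer; the weights here are fixed, not trained).

```python
# One round of message passing on a toy graph (illustrative numbers only).
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [2.0, 1.0]}

def mean_aggregate(node):
    """Collect the neighbors' messages and average them (aggregation step)."""
    msgs = [features[n] for n in graph[node]]
    return [sum(col) / len(msgs) for col in zip(*msgs)]

def update(node, w_self=0.5, w_neigh=0.5):
    """Combine the node's own state with the aggregated message, then
    apply a ReLU non-linearity (a stand-in for a learned layer)."""
    agg = mean_aggregate(node)
    return [max(0.0, w_self * s + w_neigh * m)
            for s, m in zip(features[node], agg)]

# Synchronous update: every node reads the OLD features of its neighbors.
new_features = {v: update(v) for v in graph}
print(new_features["A"])  # [1.0, 0.5]
```

Stacking this step k times is exactly "layer stacking": after k rounds, each node has seen information from its k-hop neighborhood.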
If you want more details on how GNNs work, I recommend this article from distill.pub with interactive explanations.
We will illustrate with an example from Pinterest, which uses PinSAGE, a GNN-based model, to improve image (pin) recommendations. Imagine a user saves images of cooking recipes on their board. Each image is connected to other images through "saved by other users" interactions, common tags, or visual similarity.
Example:
- Pin A: "Homemade Pizza"
- Pin B: "Garlic Bread"
- Pin C: "Caprese Salad"
- Pin D: "Chocolate Ice Cream"
Here, nodes A, B, and C are more connected because they are savory recipes, while D is more isolated because it is a dessert. PinSAGE then selects each pin's most relevant neighbors by running short random walks over the graph and prioritizing the pins visited most often.
Example (Neighbors of A):
- B and C are neighbors of A because they have been saved on similar boards.
- D is a distant neighbor (less relevant).
Then, in the message aggregation step, node A (Homemade Pizza) updates its representation by combining information from its neighbors.
Example: Each neighbor node sends its representation to A:
- B (Garlic Bread) sends information about dough and baking.
- C (Caprese Salad) sends information about Italian cuisine.
- Pin A updates its representation by combining the neighbor information with weighted adjustments, using concatenation operations.
The process is repeated with the neighbors of the neighbors. Thus, A also receives information from more distant nodes, such as D (Chocolate Ice Cream), but with less weight.
Example:
- The model learns that A is more related to savory Italian dishes and not desserts.
- The recommendation will avoid suggesting Chocolate Ice Cream (D) to someone exploring savory recipes.
After several rounds of message passing, the final embeddings are used to recommend content.
Recommendation Example:
- If a user saved "Homemade Pizza," the model recommends "Garlic Bread" and "Caprese Salad" instead of a dessert.
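Once the final embeddings exist, the recommendation step is just nearest-neighbor search in embedding space. A minimal sketch with invented embedding values (real PinSAGE embeddings have thousands of dimensions; three suffice to show the idea):

```python
import math

# Hypothetical final embeddings after message passing (made-up values).
embeddings = {
    "homemade_pizza":      [0.9, 0.8, 0.1],
    "garlic_bread":        [0.8, 0.9, 0.2],
    "caprese_salad":       [0.7, 0.7, 0.3],
    "chocolate_ice_cream": [0.1, 0.2, 0.9],
}

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def recommend(pin, k=2):
    """Rank the other pins by cosine similarity of their embeddings."""
    others = [p for p in embeddings if p != pin]
    return sorted(others,
                  key=lambda p: cosine(embeddings[pin], embeddings[p]),
                  reverse=True)[:k]

print(recommend("homemade_pizza"))  # ['garlic_bread', 'caprese_salad']
```

The dessert lands far away in embedding space, so it never makes the top-k — exactly the behavior described above.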
GNN Architectures for Recommendation Systems
If you're already intrigued and want to learn more about which GNN architectures exist and which is most suitable for each task, I recommend this paper on the challenges, methods, and future directions of GNNs for recommendation systems. In the paper's highlights, it mentions that although all GNNs work on graphs, they don't all do it the same way. There are two main types:
- Spectral models: They rely on the graph Fourier transform (derived from the graph Laplacian) to analyze the graph's structure and apply convolutions in the spectral domain. They are useful when the graph has a well-defined, fixed structure. Example: Graph Convolutional Networks (GCN).
- Spatial models: Instead of transforming the graph, these models propagate information directly between nodes, capturing patterns in connectivity. Example: GraphSAGE, Graph Attention Networks (GAT).
Types of GNNs and Their Applications
1. Graph Convolutional Networks (GCN)
Imagine that each node in a graph is a student in a class and each connection represents a friendship. If we want to estimate a student's performance based on their environment, a GCN would take the grades of their friends and calculate a new combined representation.
Where is it used?
- Recommendation systems: Models the user-product relationship on platforms like Netflix and Uber Eats.
- Community detection: Identifies groups in social networks, such as Twitter, with common interests.
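The classroom analogy maps directly onto the GCN propagation rule, h' = ReLU(D^-1/2 (A + I) D^-1/2 · h · w). A sketch on three students with a scalar feature (their grade); the weight is fixed rather than learned, and all numbers are illustrative:

```python
import math

# Toy classroom: 3 students (nodes), scalar feature = current grade,
# edges = friendships. Student 0 is friends with students 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
H = [0.9, 0.5, 0.7]   # grades

def gcn_layer(A, H, w=1.0):
    """GCN propagation rule on scalar features:
    h_i' = ReLU(w * sum_j A_hat[i][j] / sqrt(d_i * d_j) * h_j),
    where A_hat = A + I (self-loops) and d_i is A_hat's row sum."""
    n = len(A)
    A_hat = [[A[i][j] + (i == j) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [max(0.0, w * sum(A_hat[i][j] / math.sqrt(deg[i] * deg[j]) * H[j]
                             for j in range(n)))
            for i in range(n)]

print(gcn_layer(A, H))  # each grade blended with the friends' grades
```

Note that every friend contributes with the same (degree-normalized) weight — the limitation GAT addresses next.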
2. Graph Attention Networks (GAT)
Unlike GCN, which treats all neighbors equally, GAT is more selective. It weights the importance of each connection before updating the nodes, which is very useful in scenarios where some relationships are more relevant than others.
Where is it used?
- Personalized recommendation: On Spotify, where the relationship between users and songs is not homogeneous.
- Fraud detection: In banks and e-commerce, to identify suspicious transactions with specific patterns in the network.
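The selectivity of GAT comes from a softmax over per-edge attention scores. In a real GAT those scores are learned from the node features; here they are fixed by hand to show how the weighting changes the aggregation:

```python
import math

# Toy attention over node A's neighbors. The raw scores would normally
# be produced by a learned attention mechanism; here they are fixed.
scores   = {"B": 2.0, "C": 0.5}           # raw attention scores for A's edges
features = {"B": [0.0, 1.0], "C": [2.0, 1.0]}

def attention_weights(scores):
    """Softmax over edge scores: important neighbors get larger weights."""
    exp = {n: math.exp(s) for n, s in scores.items()}
    total = sum(exp.values())
    return {n: e / total for n, e in exp.items()}

alpha = attention_weights(scores)
# Weighted aggregation instead of a plain average:
h_A = [sum(alpha[n] * features[n][f] for n in alpha) for f in range(2)]
print(alpha, h_A)  # B dominates the message because its score is higher
```

With equal scores this collapses back to the plain average a GCN would compute; the attention only matters when some edges deserve more trust than others.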
3. GraphSAGE
Processing an entire graph is often impractical due to its size. GraphSAGE addresses this challenge by sampling neighbors instead of analyzing the entire structure, making it significantly more scalable. For example, Pinterest uses PinSAGE to manage 3 billion nodes and 18 billion relationships in its recommendation system.
Where is it used?
- Logistics optimization: Companies like FedEx and Uber use this type of GNNs to improve delivery planning.
- Real-time recommendations: Ideal for marketplaces that must adapt to constant changes in supply and demand.
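The core GraphSAGE trick fits in a few lines: instead of aggregating all neighbors, sample a fixed-size subset, aggregate those, and concatenate with the node's own representation. A toy sketch with a node that has 100 neighbors (invented features; the sample is seeded for reproducibility):

```python
import random

# Toy graph: node 0 has 100 neighbors. Instead of aggregating all of
# them, sample a fixed-size subset — the GraphSAGE idea.
graph = {0: list(range(1, 101))}
features = {n: [float(n % 5)] for n in range(101)}

def sage_step(node, sample_size=10, seed=42):
    """Sample neighbors, mean-aggregate them, concatenate with self."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sampled = rng.sample(graph[node], sample_size)
    agg = [sum(features[n][0] for n in sampled) / sample_size]
    return features[node] + agg   # [own feature, aggregated feature]

print(sage_step(0))
```

Because the cost per node is bounded by `sample_size` rather than by the true degree, the same step works whether a node has ten neighbors or ten million — which is what makes the 3-billion-node Pinterest graph tractable.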
4. Hypergraph Neural Networks (HGNN)
Sometimes, relationships are not just between pairs of nodes. For example, in a movie recommendation system, a user may be simultaneously linked to several movies within the same category. Hypergraph Neural Networks (HGNN) extend the traditional graph concept by allowing a single connection (hyperedge) to link multiple nodes at once, modeling group interactions better.
Where is it used?
- Bioinformatics: To analyze protein interaction networks and discover patterns in genetic sequences.
- Multi-item recommendation systems: To understand how users consume combinations of products, such as subscription packages or video game bundles.
The choice of which type of GNN depends on the problem we want to solve. If you are looking for simplicity and stability, GCN is a good starting point. For personalization, GAT allows you to highlight relevant connections. If you work with large graphs (for example, with more than 10 million nodes and 100 million relationships), GraphSAGE optimizes processing. And if you need to model complex relationships, HGNNs are the solution.
Advantages and Limitations of GNNs
One of the main advantages of GNNs is their ability to leverage the graph's topology through message passing, which allows them to capture complex patterns and relationships between nodes more effectively than other models. These networks also learn rich representations in the weights of their connections, which improves the quality of the predictions. Another notable aspect is that many GNNs (GraphSAGE being the canonical example) are inductive: they can be trained on one graph and then applied to make predictions on another graph with similar characteristics, without retraining. Because they learn relational patterns that emerge from the topology rather than memorizing specific nodes, a GNN can capture the same relationships and dynamics in a structurally similar graph built from entirely different nodes. The properties of a node depend not only on its individual attributes but also on its position in the network and its connections to other nodes. Unlike tabular models, which treat each instance independently, GNNs exploit this connectivity, extracting relational information more efficiently and generalizing better.
However, they also present challenges. They require high memory consumption, as the propagation of information through the graph can be expensive, especially in dense structures. In addition, even with specialized hardware such as GPUs, GNNs can be slow to compute due to the complexity of their operations. Another point to consider is the difficulty of interpretation: understanding how a GNN arrives at a certain prediction is not trivial, which makes its explainability difficult. Finally, its implementation is more complex compared to traditional neural networks, as it requires advanced knowledge of the graph's structure and its efficient processing.
Tips for "Cooking" GNNs
In my experience applying GNNs, I have encountered several surprises and difficulties along the way. These tips I would give to someone who is starting in this world:
- Graph schema design is an iterative process: Avoid the temptation to load all your data into the graph at once, only to end up with gigabytes of an inefficient design. Instead, maintain flexibility to refine the schema based on results and business needs. Before proceeding, validate your structure with queries that address key questions.
- Think like a machine would, in numerical properties: If you are going to use machine learning algorithms, you will probably need to make transformations in the data:
  - Encode strings into numerical values.
  - Create categorical variables.
  - Handle temporality in formats such as timestamps or epoch.
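The three transformations above can be done with the standard library alone — no pandas required for a first pass. A sketch over invented user rows: label-encode a string column, one-hot a categorical, and convert ISO timestamps to epoch seconds (assumed to be UTC):

```python
from datetime import datetime, timezone

# Hypothetical raw node properties before any ML step.
rows = [
    {"country": "PE", "plan": "free", "signup": "2023-05-01T10:00:00"},
    {"country": "CL", "plan": "pro",  "signup": "2023-06-15T08:30:00"},
    {"country": "PE", "plan": "pro",  "signup": "2024-01-20T12:00:00"},
]

# 1. Encode strings into numerical values (simple label encoding).
countries = sorted({r["country"] for r in rows})
country_id = {c: i for i, c in enumerate(countries)}

# 2. One-hot encode a categorical variable.
plans = sorted({r["plan"] for r in rows})
def one_hot(value):
    return [1 if value == p else 0 for p in plans]

# 3. Convert timestamps to epoch seconds (assuming the strings are UTC).
def to_epoch(ts):
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc).timestamp()

encoded = [[country_id[r["country"]], *one_hot(r["plan"]), to_epoch(r["signup"])]
           for r in rows]
print(encoded[0])
```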
- Consider embeddings to represent nodes and edges. Some node embedding options include:
  - Node2Vec: Based on random walks; it trains a shallow skip-gram model to predict which nodes co-occur on those walks.
  - HashGNN: Requires no training or model; it relies on random hash functions and works only with binary features.
  - Fast Random Projections (FastRP): A probabilistic technique that generates sparse representations of the graph. It achieves embeddings of comparable quality much faster than random-walk or neural methods such as Node2Vec or GraphSAGE.
- Also preprocess the edges: Nodes aren't the only important elements. You can apply aggregations or filters to the relationships to reduce redundancy and assign weights that prioritize certain types of connections.
- You will most likely need to create a graph projection: This involves generating a variable—such as in Python or Neo4j—that represents a temporary and/or derived version of the nodes, relationships, and properties of the original graph. This projection is optimized for applying graph data science algorithms efficiently.
- Work with a sampling strategy: Graphs can grow exponentially, which makes it impossible to process them in their entirety. If you can generate a representative subgraph, the patterns learned by the GNN can be generalized to the entire graph.
- Explore before adding complexity: Before launching a GNN, experiment with more classic algorithms such as centrality, similarity and grouping metrics. Many times, a good exploration of the graph will allow you to build interesting hypotheses and reduce the need for an excessively sophisticated model.
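Those classic first steps are cheap to run. A sketch of two of them on a toy social graph (invented names): degree centrality to spot hubs, and a depth-first search to find connected components, i.e. the coarsest possible "communities":

```python
# Classic exploration before any GNN: degree centrality and connected
# components on a toy friendship graph (adjacency list, invented data).
graph = {
    "ana": ["luis", "eva"], "luis": ["ana", "eva"], "eva": ["ana", "luis"],
    "omar": ["zoe"], "zoe": ["omar"],
}

# Degree centrality: how many direct connections each node has.
degree = {n: len(nbrs) for n, nbrs in graph.items()}

def components(graph):
    """Find connected components with a simple depth-first search."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node])
        seen |= comp
        comps.append(comp)
    return comps

print(max(degree, key=degree.get))  # most connected node (ties by order)
print(components(graph))            # two groups: the trio and the pair
```

If these metrics already separate your classes or surface your fraud ring, you may not need a GNN at all — and if they don't, they still tell you what signal the GNN will have to find.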
If you want to start experimenting with GNNs, an excellent way to do it is through libraries like PyTorch Geometric (PyG) or Deep Graph Library (DGL), which simplify the implementation of models on graphs. A good practice is to begin with unsupervised tasks, such as clustering (which parts of the graph are strongly connected?) or association (which nodes are more similar?), before tackling more complex problems such as node classification (to which category does this node belong?) or link prediction (what connections could be formed in the future?).
Additionally, exploring public graph datasets such as Cora allows you to test different architectures without building a graph from scratch. The key is to iterate, explore and experiment: before diving into advanced models, take the time to understand your graph's structure, apply classic metrics and validate which information truly adds value to your problem. And most importantly, have fun and let yourself be surprised!