The application, built with Streamlit, provides a multi-faceted interface for interacting with the knowledge graph.
Execute a curated list of complex, pre-written Cypher queries. Examples include:
- Finding "Super-Subs": Identify players who scored a goal after entering a match as a substitute.
- Team Logistics: List all stadiums and cities where a specific team played.
- Performance Analytics: Find players who received a card but never scored a goal.
- Manager Insights: Discover the manager of the tournament-winning team and see which awards their players have won.
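As an illustration, a "Super-Subs" query could look roughly like the sketch below. The relationship types `SUBSTITUTE_IN` and `SCORED_IN` are hypothetical placeholders, not the project's actual schema, and the helper function simply wraps the official `neo4j` Python driver:

```python
# Hypothetical "Super-Subs" query: players who scored after coming on as a
# substitute. Relationship names here are assumptions for illustration only.
SUPER_SUBS_QUERY = """
MATCH (p:Player)-[:SUBSTITUTE_IN]->(m:Match),
      (p)-[:SCORED_IN]->(m)
RETURN p.name AS player, m.date AS match_date
"""

def run_super_subs(uri, user, password, database="neo4j"):
    """Execute the query against a live Neo4j instance (requires the `neo4j` driver)."""
    from neo4j import GraphDatabase  # imported lazily; needs a running DBMS
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(SUPER_SUBS_QUERY, database_=database)
        return [(r["player"], r["match_date"]) for r in records]
```

The other curated queries follow the same pattern: a fixed Cypher string plus a thin driver call.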
Beyond simple data retrieval, you can find entities that are semantically similar based on their roles and relationships within the graph. This feature lets you compare three different similarity algorithms:
- Jaccard Similarity (Structural): Finds entities that share the most common neighbors in the graph. Calculated directly in Neo4j using the Graph Data Science (GDS) library.
- KGE - TransE (Distributional): Uses embeddings from a trained TransE model to find entities that are close in the learned vector space, based on translational relationships.
- KGE - ComplEx (Distributional): Leverages the more expressive ComplEx embedding model, which can capture more nuanced relationship patterns (symmetric, asymmetric).
- Shortest Path Visualization: Automatically computes and displays the shortest path between your query entity and the top-ranked similar entity, revealing the hidden connections that link them.
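To make the first two options concrete, here is a minimal pure-Python sketch of what each one computes (the real app delegates Jaccard to GDS and TransE scoring to the trained model; the toy graph and entity names below are invented for illustration):

```python
import math

def jaccard(neighbors_a: set, neighbors_b: set) -> float:
    """|A ∩ B| / |A ∪ B| — 1.0 for identical neighborhoods, 0.0 for disjoint ones."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)

def transe_distance(h, r, t):
    """TransE models h + r ≈ t; a smaller L2 distance means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy graph: each player's neighbors (teams, matches, awards) as a set.
graph = {
    "PlayerA": {"Argentina", "Final", "GoldenBall"},
    "PlayerB": {"Argentina", "Final"},
    "PlayerC": {"France", "Final", "GoldenBoot"},
}

print(jaccard(graph["PlayerA"], graph["PlayerB"]))  # 2 shared / 3 total ≈ 0.667
```

Structural similarity only sees shared neighbors; the embedding-based scores can also rank entities that are related without sharing any direct neighbor.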
Interact with the knowledge graph using natural language. This feature employs a Retrieval-Augmented Generation (RAG) pipeline:
- Text-to-Cypher: A Large Language Model (Google's Gemini) translates your plain English question into a precise Cypher query.
- Graph Retrieval: The generated query is executed against the Neo4j database to retrieve relevant data.
- Natural Language Answer: The retrieved data is passed back to the LLM, which synthesizes it into a clear, concise answer to your original question.
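The three stages can be sketched as one function with pluggable callables. The stubs below stand in for Gemini and Neo4j so the flow is runnable; all names are illustrative, not the app's actual API:

```python
def answer_question(question, text_to_cypher, run_cypher, summarize):
    """RAG over a graph: LLM -> Cypher -> Neo4j -> LLM."""
    cypher = text_to_cypher(question)   # 1. Text-to-Cypher
    rows = run_cypher(cypher)           # 2. Graph retrieval
    return summarize(question, rows)    # 3. Natural-language answer

# Stub implementations so the pipeline runs without Gemini or a live database.
def fake_llm_to_cypher(q):
    return "MATCH (p:Player) RETURN p.name LIMIT 2"

def fake_run_cypher(query):
    return [{"p.name": "Messi"}, {"p.name": "Mbappe"}]

def fake_summarize(q, rows):
    names = ", ".join(r["p.name"] for r in rows)
    return f"Players found: {names}"

print(answer_question("Who played?", fake_llm_to_cypher, fake_run_cypher, fake_summarize))
# -> Players found: Messi, Mbappe
```

Keeping each stage a separate callable also makes the pipeline easy to test with mocked components, as above.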
Find players who look similar using an image-based search:
- Upload an Image: Provide a photo of a player.
- Embedding Extraction: The application calculates a vector embedding of the uploaded image using a pre-trained EfficientNet-B0 model.
- Similarity Search: It then queries the Neo4j database to find the top 5 players whose pre-calculated image embeddings have the highest Cosine Similarity to the query image's embedding, powered by the GDS library.
- Visual Results: The results are displayed showing the similar players' photos, names, and similarity scores.
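The ranking step boils down to cosine similarity over stored vectors. A pure-Python sketch of the same computation (the tiny 3-d vectors stand in for the real EfficientNet-B0 embeddings, and the player names are placeholders):

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, player_embeddings, k=5):
    """Rank stored player embeddings by cosine similarity to the query image's embedding."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in player_embeddings.items()]
    return sorted(scored, key=lambda nv: nv[1], reverse=True)[:k]

# Tiny stand-ins for the pre-calculated embeddings stored on Player nodes.
embeddings = {"A": [1.0, 0.0, 0.0], "B": [0.9, 0.1, 0.0], "C": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.0, 0.0], embeddings, k=2))
```

In the app itself this ranking runs inside Neo4j via the GDS similarity functions rather than in Python.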
Follow these steps to set up the project locally.
First, create and activate a Python virtual environment:
# Create the virtual environment
python -m venv venv
# Activate it (on Windows)
.\venv\Scripts\activate
# Activate it (on macOS/Linux)
source venv/bin/activate
Then, install the required dependencies:
pip install -r requirements.txt
You'll need Neo4j Desktop with an active DBMS instance (Enterprise Edition is recommended to use GDS).
1. Create a local DBMS.
2. Configure Settings:
   - Open the settings for your DBMS.
   - Add the following line to enable Neosemantics (n10s) RDF procedures:
     dbms.unmanaged_extension_classes=n10s.endpoint=/rdf
   - Add the following line to grant unrestricted access to GDS, APOC, and n10s procedures:
     dbms.security.procedures.unrestricted=jwt.security.*,n10s.*,apoc.*,gds.*
3. Install Plugins:
   - In your DBMS view, click Open folder -> Plugins.
   - Download the JAR files for APOC, Neosemantics (n10s), and Graph Data Science (GDS) compatible with your Neo4j version.
   - Place the downloaded .jar files into this plugins folder.
4. Configure APOC:
   - Go back and click Open folder -> Configuration.
   - Create a file named apoc.conf (if it doesn't exist).
   - Add the following lines to the file to enable file import/export capabilities:
     apoc.import.file.enabled=true
     apoc.export.file.enabled=true
5. Restart the DBMS for the changes to take effect.
Before running the similarity algorithms, you need to project your graph into an in-memory format optimized for GDS. This only needs to be done once.
Open the Neo4j Browser and run the following Cypher query:
CALL gds.graph.project(
  'fifa_graph',  // The name we give to the in-memory graph
  '*',           // Use all node labels
  '*'            // Use all relationship types
)
YIELD graphName, nodeCount, relationshipCount
This command creates an in-memory graph named fifa_graph, which will be used by the Jaccard similarity functions.
Create a .env file in the root of the project directory and add your credentials:
NEO4J_URI="bolt://localhost:7687"
NEO4J_USER="neo4j"
NEO4J_PASSWORD="your_password"
NEO4J_DATABASE="your_database_name"
GOOGLE_API_KEY="your_google_api_key"
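In application code these values are typically read with `python-dotenv` plus `os.environ`. A minimal sketch (the function name and defaults are illustrative, not the app's actual code; the variable names match the `.env` above):

```python
import os
# from dotenv import load_dotenv   # pip install python-dotenv
# load_dotenv()                    # reads the .env file into the environment

def load_neo4j_config(env=os.environ):
    """Read connection settings, falling back to sensible local defaults."""
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
    }

cfg = load_neo4j_config({"NEO4J_PASSWORD": "secret"})
print(cfg["uri"], cfg["user"])  # bolt://localhost:7687 neo4j
```

Keeping credentials in `.env` (and out of version control) means the same code runs unchanged against any local or remote instance.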
Once the setup is complete, launch the Streamlit app with the following command:
streamlit run app.py
├── 📂 data_preprocessing/ # Notebooks for preprocessing and merging datasets
│ ├── data_preprocessing_competition_stats.ipynb
│ ├── data_preprocessing_player_stats.ipynb
│ ├── data_preprocessing.ipynb
│ └── 📂 data/
│ ├── competition_stats.csv
│ ├── player_stats.csv
│ └── 📂 world_cup/
│ ├── world_cup_complete.csv
│ └── 📂 processed/
│ ├── class_mappings.csv
│ └── object_properties_mappings.csv
│
├── 📂 evaluation/
│ ├── 📂 images/ # Contains 5 images for each of the 32 players for evaluation
│ └── 📂 models/ # Assets for KGE model training and evaluation
│ ├── create_ground_truth.py # Script to create the ground truth for semantic sim. evaluation
│ ├── fifa_triplets.tsv # Triplets used for KGE model training
│ ├── ground_truth_images.csv # Ground truth for image search evaluation
│ ├── image_sim_evaluation.py # Evaluation script for image semantic similarity
│ ├── semantic_evaluation.py # Evaluation script for entity semantic similarity
│ └── semantic_ground_truth.csv # Ground truth for entity semantic similarity
│
├── 📂 models/
│ ├── complex_fifa.pt # Trained ComplEx model artifact
│ └── transe_fifa.pt # Trained TransE model artifact
│
├── 🖼️ player_images/ # Contains 10 images for players from each team in every group
│
├── app.py # Main Streamlit application file
├── fifa_triplets.tsv # Triplets for KGE model training (used by app.py)
├── fifa_wc_ontology.ttl # Domain ontology created with Protégé
├── image_embedder.py # Script to calculate image embeddings
├── image_embedding_loader.py # Script to load embeddings as properties for Player nodes
├── image_embeddings_avg.json # Calculated average player embeddings
├── kg_loader.py # Script for Ontology-aware KG creation in Neo4j
├── requirements.txt # Python dependencies for the virtual environment
└── train_kge.py # Script for training the KGE models

