The application, built with Streamlit, provides a multi-faceted interface for interacting with the knowledge graph.
Execute a curated list of complex, pre-written Cypher queries. Examples include:
- Finding "Super-Subs": Identify players who scored a goal after entering a match as a substitute.
- Team Logistics: List all stadiums and cities where a specific team played.
- Performance Analytics: Find players who received a card but never scored a goal.
- Manager Insights: Discover the manager of the tournament-winning team and see which awards their players have won.
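As an illustration, a "Super-Subs" query could look roughly like the sketch below. The relationship types `SUBSTITUTE_IN` and `SCORED_IN` are hypothetical placeholders, not the project's actual schema, and the helper function simply wraps the official `neo4j` Python driver:

```python
# Hypothetical "Super-Subs" query: players who scored after coming on as a
# substitute. Relationship names here are assumptions for illustration only.
SUPER_SUBS_QUERY = """
MATCH (p:Player)-[:SUBSTITUTE_IN]->(m:Match),
      (p)-[:SCORED_IN]->(m)
RETURN p.name AS player, m.date AS match_date
"""

def run_super_subs(uri, user, password, database="neo4j"):
    """Execute the query against a live Neo4j instance (requires the `neo4j` driver)."""
    from neo4j import GraphDatabase  # imported lazily; needs a running DBMS
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(SUPER_SUBS_QUERY, database_=database)
        return [(r["player"], r["match_date"]) for r in records]
```

The other curated queries follow the same pattern: a fixed Cypher string plus a thin driver call.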
Beyond simple data retrieval, you can find entities that are semantically similar based on their roles and relationships within the graph. This feature lets you compare three different similarity algorithms:
- Jaccard Similarity (Structural): Finds entities that share the most common neighbors in the graph. Calculated directly in Neo4j using the Graph Data Science (GDS) library.
- KGE - TransE (Distributional): Uses embeddings from a trained TransE model to find entities that are close in the learned vector space, based on translational relationships.
- KGE - ComplEx (Distributional): Leverages the more expressive ComplEx embedding model, which can capture more nuanced relationship patterns (symmetric, asymmetric).
- Shortest Path Visualization: Automatically computes and displays the shortest path between your query entity and the top-ranked similar entity, revealing the hidden connections that link them.
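To make the first two options concrete, here is a minimal pure-Python sketch of what each one computes (the real app delegates Jaccard to GDS and TransE scoring to the trained model; the toy graph and entity names below are invented for illustration):

```python
import math

def jaccard(neighbors_a: set, neighbors_b: set) -> float:
    """|A ∩ B| / |A ∪ B| — 1.0 for identical neighborhoods, 0.0 for disjoint ones."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)

def transe_distance(h, r, t):
    """TransE models h + r ≈ t; a smaller L2 distance means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy graph: each player's neighbors (teams, matches, awards) as a set.
graph = {
    "PlayerA": {"Argentina", "Final", "GoldenBall"},
    "PlayerB": {"Argentina", "Final"},
    "PlayerC": {"France", "Final", "GoldenBoot"},
}

print(jaccard(graph["PlayerA"], graph["PlayerB"]))  # 2 shared / 3 total ≈ 0.667
```

Structural similarity only sees shared neighbors; the embedding-based scores can also rank entities that are related without sharing any direct neighbor.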
Interact with the knowledge graph using natural language. This feature employs a Retrieval-Augmented Generation (RAG) pipeline:
- Text-to-Cypher: A Large Language Model (Google's Gemini) translates your plain English question into a precise Cypher query.
- Graph Retrieval: The generated query is executed against the Neo4j database to retrieve relevant data.
- Natural Language Answer: The retrieved data is passed back to the LLM, which synthesizes it into a clear, concise answer to your original question.
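The three stages can be sketched as one function with pluggable callables. The stubs below stand in for Gemini and Neo4j so the flow is runnable; all names are illustrative, not the app's actual API:

```python
def answer_question(question, text_to_cypher, run_cypher, summarize):
    """RAG over a graph: LLM -> Cypher -> Neo4j -> LLM."""
    cypher = text_to_cypher(question)   # 1. Text-to-Cypher
    rows = run_cypher(cypher)           # 2. Graph retrieval
    return summarize(question, rows)    # 3. Natural-language answer

# Stub implementations so the pipeline runs without Gemini or a live database.
def fake_llm_to_cypher(q):
    return "MATCH (p:Player) RETURN p.name LIMIT 2"

def fake_run_cypher(query):
    return [{"p.name": "Messi"}, {"p.name": "Mbappe"}]

def fake_summarize(q, rows):
    names = ", ".join(r["p.name"] for r in rows)
    return f"Players found: {names}"

print(answer_question("Who played?", fake_llm_to_cypher, fake_run_cypher, fake_summarize))
# -> Players found: Messi, Mbappe
```

Keeping each stage a separate callable also makes the pipeline easy to test with mocked components, as above.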
Find players who look similar using an image-based search:
- Upload an Image: Provide a photo of a player.
- Embedding Extraction: The application calculates a vector embedding of the uploaded image using a pre-trained EfficientNet-B0 model.
- Similarity Search: It then queries the Neo4j database to find the top 5 players whose pre-calculated image embeddings have the highest Cosine Similarity to the query image's embedding, powered by the GDS library.
- Visual Results: The results are displayed showing the similar players' photos, names, and similarity scores.
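The ranking step boils down to cosine similarity over stored vectors. A pure-Python sketch of the same computation (the tiny 3-d vectors stand in for the real EfficientNet-B0 embeddings, and the player names are placeholders):

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, player_embeddings, k=5):
    """Rank stored player embeddings by cosine similarity to the query image's embedding."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in player_embeddings.items()]
    return sorted(scored, key=lambda nv: nv[1], reverse=True)[:k]

# Tiny stand-ins for the pre-calculated embeddings stored on Player nodes.
embeddings = {"A": [1.0, 0.0, 0.0], "B": [0.9, 0.1, 0.0], "C": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.0, 0.0], embeddings, k=2))
```

In the app itself this ranking runs inside Neo4j via the GDS similarity functions rather than in Python.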
Follow these steps to set up the project locally.
First, create and activate a Python virtual environment:
# Create the virtual environment
python -m venv venv
# Activate it (on Windows)
.\venv\Scripts\activate
# Activate it (on macOS/Linux)
source venv/bin/activate
Then, install the required dependencies:
pip install -r requirements.txt
You'll need Neo4j Desktop with an active DBMS instance (Enterprise Edition is recommended to use GDS).
1. Create a local DBMS.
2. Configure Settings:
   - Open the settings for your DBMS.
   - Add the following line to enable Neosemantics (n10s) RDF procedures:
     dbms.unmanaged_extension_classes=n10s.endpoint=/rdf
   - Add the following line to grant unrestricted access to GDS, APOC, and n10s procedures:
     dbms.security.procedures.unrestricted=jwt.security.*,n10s.*,apoc.*,gds.*
3. Install Plugins:
   - In your DBMS view, click Open folder -> Plugins.
   - Download the JAR files for APOC, Neosemantics (n10s), and Graph Data Science (GDS) compatible with your Neo4j version.
   - Place the downloaded .jar files into this plugins folder.
4. Configure APOC:
   - Go back and click Open folder -> Configuration.
   - Create a file named apoc.conf (if it doesn't exist).
   - Add the following lines to the file to enable file import/export capabilities:
     apoc.import.file.enabled=true
     apoc.export.file.enabled=true
5. Restart the DBMS for the changes to take effect.
Before running the similarity algorithms, you need to project your graph into an in-memory format optimized for GDS. This only needs to be done once.
Open the Neo4j Browser and run the following Cypher query:
CALL gds.graph.project(
  'fifa_graph',  // The name we give to the in-memory graph
  '*',           // Use all node labels
  '*'            // Use all relationship types
)
YIELD graphName, nodeCount, relationshipCount
This command creates an in-memory graph named fifa_graph, which will be used by the Jaccard similarity functions.
Create a .env file in the root of the project directory and add your credentials:
NEO4J_URI="bolt://localhost:7687"
NEO4J_USER="neo4j"
NEO4J_PASSWORD="your_password"
NEO4J_DATABASE="your_database_name"
GOOGLE_API_KEY="your_google_api_key"
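In application code these values are typically read with `python-dotenv` plus `os.environ`. A minimal sketch (the function name and defaults are illustrative, not the app's actual code; the variable names match the `.env` above):

```python
import os
# from dotenv import load_dotenv   # pip install python-dotenv
# load_dotenv()                    # reads the .env file into the environment

def load_neo4j_config(env=os.environ):
    """Read connection settings, falling back to sensible local defaults."""
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
    }

cfg = load_neo4j_config({"NEO4J_PASSWORD": "secret"})
print(cfg["uri"], cfg["user"])  # bolt://localhost:7687 neo4j
```

Keeping credentials in `.env` (and out of version control) means the same code runs unchanged against any local or remote instance.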
Once the setup is complete, launch the Streamlit app with the following command:
streamlit run app.py
├── 📂 data_preprocessing/ # Notebooks for preprocessing and merging datasets
│ ├── data_preprocessing_competition_stats.ipynb
│ ├── data_preprocessing_player_stats.ipynb
│ ├── data_preprocessing.ipynb
│ └── 📂 data/
│ ├── competition_stats.csv
│ ├── player_stats.csv
│ └── 📂 world_cup/
│ ├── world_cup_complete.csv
│ └── 📂 processed/
│ ├── class_mappings.csv
│ └── object_properties_mappings.csv
│
├── 📂 evaluation/
│ ├── 📂 images/ # Contains 5 images for each of the 32 players for evaluation
│ └── 📂 models/ # Assets for KGE model training and evaluation
│ ├── create_ground_truth.py # Script to create the ground truth for semantic sim. evaluation
│ ├── fifa_triplets.tsv # Triplets used for KGE model training
│ ├── ground_truth_images.csv # Ground truth for image search evaluation
│ ├── image_sim_evaluation.py # Evaluation script for image semantic similarity
│ ├── semantic_evaluation.py # Evaluation script for entity semantic similarity
│ └── semantic_ground_truth.csv # Ground truth for entity semantic similarity
│
├── 📂 models/
│ ├── complex_fifa.pt # Trained ComplEx model artifact
│ └── transe_fifa.pt # Trained TransE model artifact
│
├── 🖼️ player_images/ # Contains 10 images for players from each team in every group
│
├── app.py # Main Streamlit application file
├── fifa_triplets.tsv # Triplets for KGE model training (used by app.py)
├── fifa_wc_ontology.ttl # Domain ontology created with Protégé
├── image_embedder.py # Script to calculate image embeddings
├── image_embedding_loader.py # Script to load embeddings as properties for Player nodes
├── image_embeddings_avg.json # Calculated average player embeddings
├── kg_loader.py # Script for Ontology-aware KG creation in Neo4j
├── requirements.txt # Python dependencies for the virtual environment
└── train_kge.py # Script for training the KGE models

