Peace be upon you, and the mercy and blessings of Allah.
May Allah be good to you,
My name is Muhammed Emin Eser from Germany. I am currently working on an algorithm that classifies and summarizes tafsir books by returning only the authentic hadiths that relate to a given ayah. The goal is to make the benefits of complex works, such as Tafsir al-Hafiz Ibn Kathir, more accessible. These tafasir already establish contextual links between ayat and hadiths, which simplifies the task of connecting hadiths to verses and presenting them together in a coherent, Mushaf-like format. Filtering for authenticity is necessary because such works contain many weak narrations, and many readers have not mastered the science of hadith.
As part of this journey, I have already built an adaptive scraper that extracts machine-readable text from websites and stores it efficiently. After scraping Silsilah al-Sahihah and Silsilah al-Da‘ifah by Imam al-Albani, I structured the data so that each hadith occupies a single row. I then worked on extracting the matn from each hadith, which proved to be a very challenging task due to the wide variation in length, structure, and formatting.
I experimented with normalization, regular expressions, tokenization, and n-gram chunking, and tuned parameters such as similarity thresholds, minimum and maximum word counts, and minimum match counts. I ultimately chose a Jaccard-based similarity approach: because the word order within a hadith is highly likely to be preserved, higher minimum word thresholds become workable. Based on this, I decided to derive the n for the n-grams from a dynamically selected subset of each hadith, so that the similarity scoring adapts to each text. Even so, the results are not yet good enough, and I am continuing to refine the approach.
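To make the comparison concrete, here is a minimal sketch of Jaccard similarity over word n-grams with light Arabic normalization. It is an illustration only, not the actual pipeline: the function names (`normalize`, `word_ngrams`, `jaccard`), the normalization rules, and the fixed default `n=3` are all assumptions.

```python
import re

# Arabic short-vowel marks (tashkil), superscript alef, and tatweel.
# Which characters to strip is an assumption; real pipelines may differ.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def normalize(text: str) -> str:
    """Strip diacritics and unify a few letter variants (illustrative rules)."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    return " ".join(text.split())  # collapse whitespace


def word_ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams (as tuples) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Text shorter than n: fall back to one shingle holding all words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity |A & B| / |A | B| over word n-gram sets."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Using n-grams rather than single words is what makes the score sensitive to word order. One caveat: when a short matn is compared against a much longer text, the union grows and the Jaccard score shrinks, so a containment score (the intersection divided by the size of the matn's n-gram set) may be worth comparing.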
When I came across your repository, I was genuinely impressed by the results, which motivated me to reach out and propose a collaboration for the sake of Allah. I have since cloned the repository, converted the CSV files into normalized SQLite databases, and integrated tafasir sourced from the Ayat app by KSU (I consider Saudi Arabia a reliable source grounded in tawhid and the Sunnah, and therefore prefer its academic resources for this work). I also split the tafsir texts into smaller blocks to allow more accurate matching between hadith texts and tafsir segments.
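The block-splitting step can be sketched as below, assuming fixed-size word windows with overlap so that a hadith crossing a block boundary still appears whole in at least one block. The table name `tafsir_blocks`, its columns, and the default sizes are hypothetical and not taken from the repository or from my actual database.

```python
import sqlite3


def split_into_blocks(text: str, size: int = 40, overlap: int = 10) -> list:
    """Split text into word windows of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    blocks = []
    for start in range(0, len(words), step):
        blocks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return blocks


def store_blocks(conn: sqlite3.Connection, surah: int, ayah: int, text: str,
                 size: int = 40, overlap: int = 10) -> int:
    """Insert the blocks of one tafsir passage; returns the number stored."""
    # Hypothetical schema, chosen only for this illustration.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tafsir_blocks ("
        "surah INTEGER, ayah INTEGER, block_no INTEGER, text TEXT)"
    )
    blocks = split_into_blocks(text, size, overlap)
    conn.executemany(
        "INSERT INTO tafsir_blocks VALUES (?, ?, ?, ?)",
        [(surah, ayah, i, block) for i, block in enumerate(blocks)],
    )
    conn.commit()
    return len(blocks)
```

For a quick trial, `store_blocks(sqlite3.connect(":memory:"), 1, 1, passage_text)` writes the blocks of one passage to an in-memory database. The overlap should be at least as long as the shortest matn expected to match.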
My next step is to begin comparing hadith columns with the corresponding tafsir blocks. If you would be willing to share the source code you used to separate the mutun from the asanid, it would greatly accelerate my work and help me avoid reinventing solutions you have already refined.
I would be honored to collaborate with you on this project and look forward to your response.
Thank you very much for your time and cooperation.
May Allah reward you with good,
Muhammed Emin Eser