Description
I am still trying to find a way to have fscrawler index multiple folders at once while writing to the same index on my ES. If anything goes south in any of those jobs, for example because fscrawler crashed or some other error occurred, no _status.json gets written, and that's fine, but…
{ "name" : "352226", "lastrun" : "2019-10-14T16:36:06.748", "indexed" : 8037, "deleted" : 0 }
The one above doesn't really reveal its relation to what has and hasn't been indexed, or does it? So my question is: how does this file relate to the indexed content, and how does fscrawler determine which files still have to be indexed?
Either way, I would expect fscrawler to have to query ES to find out whether the file it is about to index has already been indexed, but I just can't wrap my head around what the lastrun value would have to do with that. Maybe it is used to constrain the search in the index, but that seems to have only limited benefit.
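To make the question more concrete, here is a minimal sketch of one way a lastrun timestamp could be used without asking ES at all: compare each file's modification time against it. This is purely an assumption on my side and not taken from fscrawler's code; the function names `load_lastrun` and `files_to_index` are hypothetical, and I am assuming lastrun is a local-time timestamp.

```python
import json
import os
from datetime import datetime
from pathlib import Path


def load_lastrun(status_path):
    """Return the lastrun timestamp from a status file, or None if the file is missing."""
    path = Path(status_path)
    if not path.exists():
        return None
    status = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: lastrun is an ISO 8601 local-time string, as in the example above.
    return datetime.fromisoformat(status["lastrun"])


def files_to_index(root, lastrun):
    """Yield files under root modified after lastrun; every file if lastrun is None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            mtime = datetime.fromtimestamp(os.path.getmtime(full))
            # Without a lastrun there is nothing to compare against,
            # so every file is treated as not yet indexed.
            if lastrun is None or mtime > lastrun:
                yield full
```

With a scheme like this, a missing status file would mean that every file gets selected again, which would match what I am seeing.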
However, if the _status.json is missing, it seems that ES doesn't even get checked for already indexed files, and all files are indexed again. This is what --restart is supposed to do, I guess, but this behaviour makes it impossible to follow up an unsuccessful run with another one that indexes only the files that didn't make it in the previous run. This is very cumbersome on a volume of some 14TB…
Hope this makes sense… ;)