How is _status.json supposed to relate to what work has already been done? #828

@budachst

I am still trying to find a way to have fscrawler index multiple folders at once while writing to the same index on my ES. If anything goes south in any of those jobs, say fscrawler crashed or some other error occurred, no _status.json gets written, and that's fine, but…

{ "name" : "352226", "lastrun" : "2019-10-14T16:36:06.748", "indexed" : 8037, "deleted" : 0 }

The one above doesn't really reveal its relation to what has been indexed and what hasn't, or does it? So my question is: how does this file relate to the work already done, and how does fscrawler determine which files still have to be indexed?

In any case, fscrawler would have to query ES to find out whether the file it is about to index has already been indexed, but I just can't wrap my head around what the lastrun value has to do with that. Maybe it is used to constrain the search in the index, but that seems to be of limited benefit.

However, if the _status.json is missing, it seems that ES doesn't even get checked for already indexed files, and all files are indexed again. I guess this is what --restart is supposed to do, but this behaviour makes it impossible to follow up an unsuccessful run with another one that indexes only the files that didn't make it the first time. This is very cumbersome on a 14 TB volume…
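To make my guess concrete, here is a minimal Python sketch of the behaviour I seem to be observing. It is purely my assumption and not fscrawler's actual code: the function names `parse_lastrun` and `files_to_index` are made up, and comparing the file modification time against lastrun is only one way such a timestamp could be used.

```python
import json
import os
from datetime import datetime
from typing import List, Optional


def parse_lastrun(value: str) -> float:
    """Convert a lastrun string such as '2019-10-14T16:36:06.748'
    into a POSIX timestamp (local time is assumed)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f").timestamp()


def files_to_index(root: str, status_path: str) -> List[str]:
    """Hypothetical selection logic: without a status file every file is
    a candidate; with one, only files modified after lastrun are."""
    cutoff: Optional[float] = None
    if os.path.exists(status_path):
        with open(status_path, encoding="utf-8") as fh:
            cutoff = parse_lastrun(json.load(fh)["lastrun"])

    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # No cutoff means no previous successful run is known,
            # so everything gets (re)indexed.
            if cutoff is None or os.path.getmtime(path) > cutoff:
                selected.append(path)
    return sorted(selected)
```

If the logic is roughly like this, a crashed run leaves no cutoff behind, which would explain why the next run starts over from scratch instead of picking up where the failed one stopped.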

Hope this makes sense… ;)
