How is _status.json supposed to relate to what work has already been done?

I am still experimenting to find a way to have fscrawler index multiple folders at once while writing to the same index on my ES. If anything goes south in any of those jobs, e.g. fscrawler crashed or some other error occurred, no _status.json gets written, and that's fine, but…

```json
{
  "name" : "352226",
  "lastrun" : "2019-10-14T16:36:06.748",
  "indexed" : 8037,
  "deleted" : 0
}
```

The file above doesn't really reveal its relation to what has and hasn't been indexed, or does it? So my question is: how does this file relate to the indexed data, and how does fscrawler determine which files still have to be indexed?

In any case, fscrawler would have to query ES to check whether the file it is about to index has already been indexed, but I just can't wrap my head around what the lastrun value would have to do with that. Maybe it's used to constrain the search in the index, but that seems to have only limited benefit.

However, if the _status.json is missing, it seems that ES doesn't even get checked for already indexed files and all files are indexed again… This is what --restart is supposed to do, I guess, but this behaviour makes it impossible to resume after an unsuccessful run and index only the files that didn't make it in the previous run. This is very cumbersome on a 14TB volume…
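To make my mental model concrete, here is a minimal Python sketch of what I assume might be happening. This is purely a guess on my part and not fscrawler's actual code: the helper names `load_lastrun` and `files_to_index` are hypothetical, and so is the assumption that lastrun is compared against each file's modification time.

```python
import json
import os
from datetime import datetime
from pathlib import Path


def load_lastrun(status_path):
    """Return the lastrun timestamp from a status file, or None if the file is missing."""
    path = Path(status_path)
    if not path.exists():
        return None
    status = json.loads(path.read_text())
    return datetime.fromisoformat(status["lastrun"])


def files_to_index(root, lastrun):
    """Yield files under root modified after lastrun; with no lastrun, yield every file.

    Assumption: the cutoff is compared to the file's mtime (hypothetical behaviour).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            modified = datetime.fromtimestamp(full.stat().st_mtime)
            if lastrun is None or modified > lastrun:
                yield full
```

If that is roughly right, a job that dies before _status.json is written leaves no cutoff at all, which would explain why every file gets indexed again on the next run.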

Hope this makes sense… ;)


How is _status.json supposed to relate to what work has already been done? #828
