
Links aren't extracted when they aren't in plain-text strings, i.e. href only #7

@ross-spencer

An href link whose URL is hidden behind anchor text, e.g. `some other text`, does not seem to be picked up in the plain-text analysis from Tika. It does look like it can be picked up in Tika's "structured" output. That said, the side effects of using this other method are unclear, e.g. could other structural links be surfaced, such as XML schema links and so forth?
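To make the difference concrete, here is a minimal sketch in Python. It assumes the "structured" output is XHTML-like markup and uses only the standard library. The sample document, `AnchorHrefCollector`, `extract_links`, and `URL_PATTERN` are all hypothetical names made up for illustration, not part of Tika or of this project. It also suggests one way to limit the side effects mentioned above: only `href` attributes on `<a>` elements are collected, so namespace declarations such as `xmlns` are not surfaced as links.

```python
import re
from html.parser import HTMLParser

# Rough URL matcher for plain text (an assumption, not this project's actual pattern).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


class AnchorHrefCollector(HTMLParser):
    """Collects href values from <a> tags and the document's text content."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Only look at anchors, so xmlns and similar attributes are ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_links(xhtml):
    """Return (links found in the plain text, links found in <a href> attributes)."""
    collector = AnchorHrefCollector()
    collector.feed(xhtml)
    collector.close()
    plain_text = "".join(collector.text_parts)
    return URL_PATTERN.findall(plain_text), collector.hrefs


# Hypothetical sample: one link hidden behind anchor text, one written out in full.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>See <a href="https://example.com/hidden">some other text</a> '
    'and https://example.com/visible for details.</p>'
    '</body></html>'
)

plain_links, href_links = extract_links(SAMPLE)
print("plain text:", plain_links)
print("href only: ", href_links)
```

In this sample the plain-text pass only sees the URL that is written out, while the anchor pass only sees the one hidden behind "some other text", so a real fix would probably need to merge and de-duplicate both lists.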

I will create a sample using Google Docs.

Kelly P has also mentioned link rot on Antistatic, and their slides file looks like a good candidate to analyze, both for fun and for this particular issue: here.
