DOMSplitter's children have wrong ContentType #95

@mariuspruski

Issue Description
I am using two DOMSplitters in sequence in my configuration. The configuration runs on pages with the ContentType text/html. However, the child documents created by the first DOMSplitter are assigned the default ContentType application/octet-stream, even though they are clearly HTML fragments. Because the DOMSplitter only runs on content of type text/html, the second DOMSplitter is skipped entirely.

Current Workaround
I have to manually override the ContentType of the child documents with the value text/html.

Suggestion
(1) As far as I can see, the children created by the DOMSplitter will always be HTML documents themselves. We can simply initialize them with the ContentType of their parent document, so the contentTypeDetector never has to look at them.
(2) Do not treat the DOM selector as if it were a filename. The reference of a document is not always the name of a file, so we would need to distinguish the cases.
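
To make suggestion (1) concrete, here is a minimal sketch of a child document inheriting its parent's ContentType. The names ChildContentTypeSketch, Doc and createChild are hypothetical and only serve the illustration; they are not the Importer's real classes, and the actual fix may look different.

```java
// Hypothetical types for illustration only; NOT the Importer's real classes.
class ChildContentTypeSketch {

    // Minimal stand-in for a document: a reference plus a nullable content type.
    static final class Doc {
        final String reference;
        final String contentType; // null means "not yet detected"

        Doc(String reference, String contentType) {
            this.reference = reference;
            this.contentType = contentType;
        }
    }

    // Creates a child whose reference is the DOM selector (as described in
    // this issue) and whose content type is copied from the parent, so no
    // detection step is needed for the child.
    static Doc createChild(Doc parent, String selector) {
        return new Doc(selector, parent.contentType);
    }

    public static void main(String[] args) {
        Doc parent = new Doc("http://example.com/page.html", "text/html");
        Doc child = createChild(parent, "div.article");
        System.out.println(child.reference + " -> " + child.contentType);
    }
}
```

With this approach the second DOMSplitter would see text/html on the children and run as expected, assuming the children really are always HTML fragments.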

Further information
The child documents created by the DOMSplitter carry no ContentType (it is null). For that reason, the automatic ContentType detection is subsequently run on them (Importer.java, line 227). The contentTypeDetector treats the document's reference (which is the DOM selector previously given to the DOMSplitter) as if it were a filename and tries to extract a file extension from it, so it picks up any CSS class in the selector. This misleads the contentTypeDetector, and it ends up classifying the document as application/octet-stream.
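
The effect described above can be reproduced with a small stand-alone sketch. It assumes a naive "text after the last dot" extension lookup; NaiveExtensionDetector, extensionOf and guessContentType are hypothetical names, and this is not the Importer's actual detection code, only an illustration of why a selector such as div.article is misread.

```java
import java.util.Map;

// Hypothetical illustration; this is NOT the Importer's actual detection code.
class NaiveExtensionDetector {

    // Tiny stand-in for an extension-to-type lookup table.
    private static final Map<String, String> TYPES_BY_EXTENSION = Map.of(
            "html", "text/html",
            "pdf", "application/pdf");

    // Returns the text after the last '.', or "" when there is none.
    static String extensionOf(String reference) {
        int dot = reference.lastIndexOf('.');
        return dot < 0 ? "" : reference.substring(dot + 1);
    }

    // Falls back to application/octet-stream for unknown extensions.
    static String guessContentType(String reference) {
        return TYPES_BY_EXTENSION.getOrDefault(
                extensionOf(reference), "application/octet-stream");
    }

    public static void main(String[] args) {
        // A real filename resolves as expected.
        System.out.println(guessContentType("page.html"));
        // A DOM selector: the CSS class "article" is taken for an extension.
        System.out.println(guessContentType("div.article"));
    }
}
```

In this sketch the CSS class after the dot is taken for a file extension, no known type matches it, and the lookup falls back to application/octet-stream.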
