Issue Description
I am using two DOMSplitters in sequence in my configuration. The configuration runs on pages with the ContentType text/html. However, the child documents created by the first DOMSplitter are assigned the default ContentType application/octet-stream, even though they are clearly HTML fragments. Because the DOMSplitter only runs on content of type text/html, the second DOMSplitter is skipped entirely.
Current Workaround
I have to manually override the ContentType of the child documents with the value text/html.
Suggestion
(1) As far as I can see, the children created by the DOMSplitter will always be HTML documents themselves. They could simply be initialized with the ContentType of their parent document, so that the contentTypeDetector never has to look at them.
(2) Do not treat the DOM Selector as if it were a filename. The reference of a document is not always the name of a file, so the detector would need to distinguish between these cases.
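Suggestion (1) could look roughly like the sketch below. This is not the Importer's actual code: `Doc`, `createChild` and `needsDetection` are hypothetical names invented for illustration, and the only point is that a child which inherits its parent's ContentType never reaches detection.

```java
public class ChildContentTypeSketch {

    // Hypothetical stand-in for a document: a reference plus an optional ContentType.
    static final class Doc {
        final String reference;
        final String contentType; // null means "not known yet"

        Doc(String reference, String contentType) {
            this.reference = reference;
            this.contentType = contentType;
        }
    }

    // Suggestion (1): the child inherits the ContentType of its parent
    // instead of starting out with null.
    static Doc createChild(Doc parent, String selector) {
        return new Doc(selector, parent.contentType);
    }

    // Detection is only needed when no ContentType has been set.
    static boolean needsDetection(Doc doc) {
        return doc.contentType == null;
    }

    public static void main(String[] args) {
        Doc parent = new Doc("http://example.com/page.html", "text/html");
        Doc child = createChild(parent, "div.article");
        System.out.println(child.contentType);     // text/html
        System.out.println(needsDetection(child)); // false
    }
}
```

With this approach the selector used as the child's reference no longer matters for type detection, which would also sidestep the problem described in (2) for DOMSplitter children.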
Further information
The child documents created by the DOMSplitter carry no ContentType (=null). For that reason, the automatic ContentType detection mechanism is subsequently executed on them (Importer.java, line 227). The contentTypeDetector treats the reference of the document (which is the DOM Selector previously given to the DOMSplitter) as if it were a filename. It tries to extract a file extension from the DOM Selector, so whatever follows the last dot in the selector, typically a CSS class name, is taken to be the extension. This misleads the contentTypeDetector, and it ends up classifying the document as application/octet-stream.
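The effect described above can be reproduced with a simplified, extension-only guesser. This is a minimal sketch and not the real contentTypeDetector: the class name, the method `guessFromReference`, the tiny extension table, and the example selector `div.article` are all assumptions made for illustration.

```java
import java.util.Map;

public class ExtensionGuessDemo {

    // Tiny, hypothetical extension table; a real detector knows far more types.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "html", "text/html",
            "htm", "text/html",
            "pdf", "application/pdf");

    // Treats the reference as a filename and looks only at what follows the last dot.
    static String guessFromReference(String reference) {
        int dot = reference.lastIndexOf('.');
        if (dot < 0 || dot == reference.length() - 1) {
            return "application/octet-stream"; // no usable extension
        }
        String ext = reference.substring(dot + 1).toLowerCase();
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        // A real file-like reference resolves as expected.
        System.out.println(guessFromReference("http://example.com/page.html"));
        // A DOM selector: the CSS class "article" is mistaken for a file extension.
        System.out.println(guessFromReference("div.article"));
    }
}
```

In this sketch the first call yields text/html and the second falls back to application/octet-stream, because "article" is not a known extension.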