Home
bconv has an API modeled after that of Python's pickle and json libraries.
In particular, there is a pair of top-level functions load/dump which convert between a format-specific serialisation and a Python representation in memory.
>>> import bconv
>>> with open('path/to/example.json', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_json')
>>> coll
<Collection with 37 documents at 0x7f1966e4b3c8>
>>> with open('path/to/example.conll', 'w', encoding='utf8') as f:
...     bconv.dump(coll, f, fmt='conll', tagset='IOBES', include_offsets=True)
Unlike json or pickle, bconv is not a parser/serialiser for a single format, but for a whole range of different formats.
The loaded contents aren't arbitrary Python types, but instances of a specific document model appropriate for representing annotated text.
From this internal representation, the contents can be exported into any of the supported formats.
The functions bconv.load, bconv.loads and bconv.fetch allow loading a document or collection of documents from a specific format into a Python representation.
bconv.load(
    source: str|Path|IO,
    fmt: str = None,
    mode: str = 'native',
    id: int|str = None,
    **options) -> bconv.Collection|bconv.Document
Load a document or collection from a file.
- source may be a path or a readable file-like object. If it is an open file, its type (text or binary) must match the expectation of the respective format (cf. the stream type format property).
- fmt specifies the format to use. It must be one of the format names listed here or in bconv.LOADERS. This parameter is semi-optional: if source is a path and the file extension is the same as the format name (e.g. path/to/file.conll), fmt may be omitted.
- mode determines the return type:
  - "native": a Document or Collection object, depending on the format (cf. the native type format property);
  - "collection": a Collection object wrapping all content;
  - "lazy": an iterator of Document objects, consumed lazily if possible.
- id can be an arbitrary identifier for the loaded document or collection. It will be accessible as an attribute id on the returned object. It is not particularly important to set this parameter (and it can even be automatically inferred from the content for some formats), but may be convenient in some cases.
- Any format-specific options can be passed as keyword arguments.
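The "semi-optional" rule for fmt can be pictured with a small stand-alone sketch. This is not bconv's actual code: the function infer_format and the set KNOWN_FORMATS are hypothetical names introduced here for illustration, and the set holds only a few of the format names mentioned on this page (the authoritative list is bconv.LOADERS).

```python
from pathlib import Path

# Illustrative subset of format names; the real list lives in bconv.LOADERS.
KNOWN_FORMATS = {"conll", "bioc_json", "bionlp", "txt"}


def infer_format(source, fmt=None, known=KNOWN_FORMATS):
    """Return fmt if given, else derive it from the file extension of a path."""
    if fmt is not None:
        return fmt
    # Only paths carry a file name; open files and strings of content do not.
    if isinstance(source, (str, Path)):
        ext = Path(source).suffix.lstrip(".")
        if ext in known:
            return ext
    raise ValueError("fmt is required: cannot infer a format from this source")


inferred = infer_format("path/to/file.conll")
explicit = infer_format("path/to/example.json", fmt="bioc_json")
```

Under these assumptions, a path ending in .conll needs no fmt, whereas path/to/example.json does, because json is not itself a format name.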
bconv.loads(
    source: str|bytes,
    fmt: str,
    mode: str = 'native',
    id: int|str = None,
    **options) -> bconv.Collection|bconv.Document
Load a document or collection from a str or bytes object.
This is a mere convenience function that internally wraps source in an io.StringIO or io.BytesIO object and passes it to bconv.load. The type of source (str/bytes) must match the expectation of the respective format (cf. the stream type format property). The fmt parameter is not optional, as there is no obvious way to reliably guess the file format without a file name.
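The wrapping described above can be sketched in a few lines. This is only an approximation of the idea, not the library's implementation: loads_sketch and fake_load are hypothetical names, and fake_load stands in for bconv.load so that the snippet runs without any input files.

```python
import io


def loads_sketch(source, load, **kwargs):
    """Wrap an in-memory str/bytes in a file-like object and pass it to `load`."""
    if isinstance(source, str):
        stream = io.StringIO(source)   # text stream for str input
    elif isinstance(source, (bytes, bytearray)):
        stream = io.BytesIO(source)    # binary stream for bytes input
    else:
        raise TypeError("source must be str or bytes")
    return load(stream, **kwargs)


def fake_load(stream, fmt):
    """Stand-in loader: returns the format name and the raw stream content."""
    return fmt, stream.read()


text_result = loads_sketch("some text", fake_load, fmt="txt")
bytes_result = loads_sketch(b"some bytes", fake_load, fmt="txt")
```

Because the stream type follows directly from the type of source, passing a str to a format that expects binary input (or vice versa) cannot be repaired by the wrapper; the caller has to supply the matching type.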
bconv.fetch(
    query: str|Sequence[str],
    fmt: str,
    mode: str = 'native',
    id: int|str = None,
    **options) -> bconv.Collection
Load a collection from a remote service.
Currently, PubMed abstracts and PMC articles can be fetched from NCBI's efetch service. Note: Even though it is not technically enforced, requests to NCBI should include the caller's e-mail address (specify it as a keyword parameter
- query specifies the documents to retrieve. It is a sequence or comma-separated list of PubMed or PMC IDs. Note that non-existing IDs are silently skipped; bconv does not attempt to check the returned collection for completeness.
- fmt may be "pubmed" or "pmc".
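Since skipped IDs are not reported, a caller may want to check completeness on their own side. The following is a minimal sketch of such a check, using two hypothetical helpers (normalise_query and missing_ids) that are not part of bconv. It assumes that the identifiers of the returned documents can be compared directly with the requested IDs, which may not hold for every format.

```python
def normalise_query(query):
    """Turn a comma-separated string or a sequence of IDs into a clean list."""
    if isinstance(query, str):
        query = query.split(",")
    # Strip whitespace and drop empty entries such as a trailing comma.
    return [str(q).strip() for q in query if str(q).strip()]


def missing_ids(requested, returned):
    """List requested IDs that have no counterpart among the returned IDs."""
    returned = {str(r) for r in returned}
    return [i for i in normalise_query(requested) if i not in returned]


ids = normalise_query("12345, 67890,")
gaps = missing_ids("1,2,3", ["1", "3"])
```

In practice, the second argument would be collected from the fetched collection, for example from the id attribute of each document, if that attribute carries the PubMed or PMC ID.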
The functions bconv.dump and bconv.dumps serialise a loaded document or collection to disk or memory.
bconv.dump(
    content: bconv.Collection|bconv.Document,
    dest: str|Path|IO,
    fmt: str = None,
    **options)
Serialise a document or collection to disk.
- content is a Document or Collection object to be serialised (see the note below for limitations to the choice of the type).
- dest is the destination for writing the data, given as a path or a writable open file. If it is an open file, its type (text or binary) must match the expectation of the respective format (cf. the stream type format property). If dest is a path to an existing directory, a file name is constructed based on content.id or (if it is None/empty) content.filename.
- fmt specifies the format to use. It must be one of the format names listed here or in bconv.EXPORTERS. This parameter is semi-optional: if dest is a path and the file extension is the same as the format name (e.g. path/to/file.bionlp), fmt may be omitted.
- Any format-specific options can be passed as keyword arguments.
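The directory case for dest can be illustrated with a short sketch. It is a guess at the general shape of the rule, not bconv's code: resolve_dest is a hypothetical helper, the SimpleNamespace objects merely imitate the id and filename attributes, and appending the format name as the file extension is an assumption made here.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def resolve_dest(dest, content, fmt):
    """Return the file path a dump to `dest` would write to (illustrative only)."""
    dest = Path(dest)
    if dest.is_dir():
        # Prefer content.id; fall back to content.filename if it is None/empty.
        stem = content.id or content.filename
        # Assumption: the format name is appended as the file extension.
        return dest / f"{stem}.{fmt}"
    return dest


# Stand-in objects exposing only the two attributes used above.
with_id = SimpleNamespace(id="PMC123", filename="article")
without_id = SimpleNamespace(id=None, filename="article")

with tempfile.TemporaryDirectory() as tmp:
    path_a = resolve_dest(tmp, with_id, "conll")
    path_b = resolve_dest(tmp, without_id, "conll")
    path_c = resolve_dest(Path(tmp) / "out.conll", with_id, "conll")
```

A dest that is not an existing directory is returned unchanged, so an explicit file path always takes precedence over the constructed name.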
bconv.dumps(
    content: bconv.Collection|bconv.Document,
    fmt: str,
    **options) -> str|bytes
Return the serialisation of a document or collection as a str or bytes object.
The parameters are the same as for bconv.dump. The fmt parameter is not optional though, as there is no file name to guess the format from. The type of the return value depends on the format (cf. the stream type format property).
Note: bconv.dump and bconv.dumps accept both Document and Collection objects.
However, not all formats are equally well suited to represent both levels.
For example, when serialising a collection to txt plain-text with brat stand-off annotations, the document boundaries are lost.
bconv's functions load, loads, fetch, dump and dumps are high-level wrappers around a hierarchy of loader and exporter classes, which can also be instantiated directly.