Creation and management of personal scientific corpora, often built by downloading from open repositories.
semantic_corpus is a Python tool designed for researchers to create and manage personal scientific corpora. It automates the process of searching, downloading, and organizing scientific papers from open-access repositories.
- Multi-Repository Search: Seamlessly search across Europe PMC and arXiv.
- Automated Downloads: Bulk download papers in multiple formats (PDF, XML, etc.).
- Corpus Management: Organize your research into structured, searchable corpora.
- BagIt Support: Optional long-term preservation using the BagIt standard.
- Flexible Configuration: Use YAML files to manage complex search and download tasks.
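
For readers unfamiliar with the preservation format mentioned above, a bag is a directory holding a `bagit.txt` declaration, a `data/` payload folder, and a checksum manifest. The following is a minimal sketch of that layout using only the standard library. The function name `write_minimal_bag` is hypothetical, and this is not necessarily how semantic_corpus builds its bags.

```python
import hashlib
from pathlib import Path
from typing import Dict


def write_minimal_bag(bag_dir: Path, payload: Dict[str, bytes]) -> Path:
    """Write payload files under data/, plus bagit.txt and a SHA-256 manifest.

    Hypothetical helper for illustration only; returns the manifest path.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line pairs a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # The bag declaration names the BagIt version and the tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(manifest_lines) + "\n", encoding="utf-8")
    return manifest
```

Recomputing the checksums later and comparing them with the manifest is what allows a bag to be verified after long-term storage or transfer.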
- Python 3.8+
- Libraries: requests, beautifulsoup4, lxml, tqdm, configargparse, pyyaml
# Clone the repository
git clone https://github.com/semanticClimate/semantic_corpus.git
cd semantic_corpus
# Install the package
pip install .

For development and testing:

pip install -e ".[dev]"

# Create a new corpus
semantic_corpus create --name "MyResearch"
# Search and download papers
semantic_corpus download --query "climate change" --repository europe_pmc --limit 5 --formats "pdf,xml"

The semantic_corpus CLI provides several subcommands. Use semantic_corpus [command] --help for more details.

- -c, --config PATH: Path to a YAML configuration file.
- -v, --verbose: Enable verbose output for debugging.

Initialize a new structured corpus directory.
| Flag | Short | Description | Default |
|---|---|---|---|
| --name | -n | (Required) The name of the corpus. | N/A |
| --path | -p | Specific directory path for the corpus. | temp/corpus/{name} |
| --verbose | -v | Enable verbose output. | False |
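
The default shown for --path can be read as a simple rule: use the given directory if there is one, otherwise fall back to temp/corpus/{name}. A minimal sketch of that rule, assuming a hypothetical helper `resolve_corpus_path` that is not part of the documented CLI:

```python
from pathlib import Path
from typing import Optional


def resolve_corpus_path(name: str, path: Optional[str] = None) -> Path:
    """Return the explicit path if one is given, else temp/corpus/{name}.

    Hypothetical helper mirroring the --name / --path defaults in the table.
    """
    if path:
        return Path(path)
    return Path("temp") / "corpus" / name


print(resolve_corpus_path("MyResearch"))
print(resolve_corpus_path("MyResearch", "corpora/climate"))
```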
Search for papers without downloading them. Results are saved to a JSON file.
| Flag | Short | Description | Default |
|---|---|---|---|
| --query | -q | (Required) Search query string. | N/A |
| --repository | -r | Data source (europe_pmc, arxiv). | europe_pmc |
| --limit | -l | Maximum number of results to return. | 10 |
| --output | -o | Directory to save search results. | temp/downloads |
| --verbose | -v | Enable verbose output. | False |
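
To make the --query, --repository, and --limit options concrete, the sketch below builds a request URL for the public search endpoints of Europe PMC and arXiv. The endpoints and parameter names belong to those services. The function `build_search_url` is hypothetical, and whether semantic_corpus issues its requests in exactly this form is an assumption.

```python
from urllib.parse import urlencode

# Public search endpoints of the two repositories.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
ARXIV_QUERY = "http://export.arxiv.org/api/query"


def build_search_url(query: str, repository: str = "europe_pmc", limit: int = 10) -> str:
    """Build a search URL for the chosen repository (illustrative only)."""
    if repository == "europe_pmc":
        params = {"query": query, "format": "json", "pageSize": limit}
        return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"
    if repository == "arxiv":
        # "all:" asks arXiv to search every metadata field.
        params = {"search_query": f"all:{query}", "start": 0, "max_results": limit}
        return f"{ARXIV_QUERY}?{urlencode(params)}"
    raise ValueError(f"unknown repository: {repository!r}")


print(build_search_url("climate change", "europe_pmc", 5))
print(build_search_url("climate change", "arxiv", 5))
```

Europe PMC can return JSON directly, while the arXiv API responds with an Atom XML feed, so a tool that supports both needs a separate parser for each.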
Search for and download papers in specified formats.
| Flag | Short | Description | Default |
|---|---|---|---|
| --query | -q | (Required) Search query string. | N/A |
| --repository | -r | Data source (europe_pmc, arxiv). | europe_pmc |
| --limit | -l | Maximum number of results to return. | 10 |
| --formats | -f | Comma-separated file formats (pdf, xml). | xml,pdf |
| --output | -o | Directory to save downloaded files. | temp/downloads |
| --verbose | -v | Enable verbose output. | False |
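
Formats are given as a comma-separated string on the command line, while a YAML file can express them more naturally as a list. A minimal sketch of how both forms could be normalized to one list is shown here. The helper `normalize_formats` is hypothetical and not part of the documented interface.

```python
from typing import Iterable, List, Union


def normalize_formats(formats: Union[str, Iterable[str]] = "xml,pdf") -> List[str]:
    """Turn "pdf,xml" or ["pdf", "xml"] into a clean, de-duplicated list.

    Hypothetical helper; the default mirrors the table's xml,pdf.
    """
    if isinstance(formats, str):
        formats = formats.split(",")
    cleaned: List[str] = []
    for item in formats:
        fmt = item.strip().lower()
        # Skip empty entries (e.g. from "pdf,,xml") and repeats, keeping order.
        if fmt and fmt not in cleaned:
            cleaned.append(fmt)
    return cleaned


print(normalize_formats("pdf,xml"))
print(normalize_formats(["pdf", "xml"]))
```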
Keep the options for a search or download task in a YAML configuration file:
# config.yaml
query: "artificial intelligence"
repository: "arxiv"
limit: 50
formats:
- pdf
- xml
output: "./my_downloads"

Run with:

semantic_corpus download --config config.yaml

semantic_corpus/
├── core/ # Corpus management and repository interfaces
├── repositories/ # Implementation for arXiv and Europe PMC
├── storage/ # BagIt and storage handlers
├── cli.py # Command-line interface
└── utils.py # Shared utility functions
We use pytest for testing.
# Run all tests
pytest