Crawl data from Goodreads using Selenium and Python.
- Python 3.
- Clone this repository.
$ git clone https://github.com/congdaoduy298/Crawl-Data.git
- Install dependencies.
$ cd Crawl-Data/
$ pip3 install -r requirements.txt
- Run the script from a terminal.
$ python crawl_books.py
Total running time: 6181 s (about 1 h 43 min).
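The crawl itself lives in crawl_books.py; the sketch below only illustrates what a Selenium crawl of a Goodreads list page can look like. The URL handling, the `a.bookTitle` selector, and the `parse_rating` helper (including the rating-string format it parses) are illustrative assumptions, not the repository's actual code.

```python
import re

def parse_rating(text):
    # Parse a Goodreads-style rating string such as
    # "4.05 avg rating — 1,234 ratings" into (average, count).
    # The exact string format is an assumption for illustration.
    m = re.search(r"([\d.]+) avg rating\s*[—-]*\s*([\d,]+) ratings", text)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2).replace(",", ""))

def crawl_book_titles(url):
    # Selenium is imported locally so parse_rating stays usable
    # without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        # "a.bookTitle" is a selector Goodreads list pages have used;
        # adjust it if the page layout has changed.
        elements = driver.find_elements(By.CSS_SELECTOR, "a.bookTitle")
        return [e.text for e in elements]
    finally:
        driver.quit()

if __name__ == "__main__":
    print(parse_rating("4.05 avg rating — 1,234 ratings"))
```

Keeping the parsing logic separate from the browser automation makes it easy to test without launching a driver.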
Vietnamese NER with VnCoreNLP and English NER with NLTK + spaCy.
- Python 3.4+ (< 3.8).
- Install all required libraries.
$ pip3 install -r requirements.txt
- Clone the VnCoreNLP repository and install the vncorenlp Python wrapper.
$ git clone https://github.com/vncorenlp/VnCoreNLP
- Java 1.8+.
- Place the file VnCoreNLP-1.1.1.jar (27 MB) and the models folder (115 MB) in the same working folder.
- NLTK library (not needed if you use the BERT-base model).
- spaCy library (not needed if you use the BERT-base model).
$ python3 -m spacy download en_core_web_sm
I. Using NLTK and spaCy
- Run the VnCoreNLP server.
$ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"
- Open a new terminal and run the script.
$ python3 get_ner.py
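get_ner.py is the repository's script; as a rough sketch of what such a pipeline does, the example below runs English NER through spaCy and queries the local VnCoreNLP server (started on port 9000 by the command above) for Vietnamese, then merges B-/I- tags into entity spans. The function names and the server address are assumptions; only the port comes from the command above.

```python
def merge_bio(tagged_tokens):
    # Merge (token, tag) pairs in the B-/I- scheme that VnCoreNLP emits
    # into (entity_text, label) spans; "O" tokens are skipped.
    spans = []
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            spans.append([token, tag[2:]])
        elif tag.startswith("I-") and spans:
            spans[-1][0] += " " + token
        # "O" tags and stray "I-" tags without a preceding "B-" are ignored.
    return [(text, label) for text, label in spans]

def english_ner(text):
    # spaCy NER; the import is kept local so merge_bio works without spaCy.
    import spacy
    nlp = spacy.load("en_core_web_sm")
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

def vietnamese_ner(text):
    # Query the running VnCoreNLP server. The vncorenlp wrapper's ner()
    # returns one (word, tag) list per sentence.
    from vncorenlp import VnCoreNLP
    annotator = VnCoreNLP(address="http://127.0.0.1", port=9000)
    return [merge_bio(sentence) for sentence in annotator.ner(text)]

if __name__ == "__main__":
    print(merge_bio([("Barack", "B-PER"), ("Obama", "I-PER"),
                     ("visited", "O"), ("Hà_Nội", "B-LOC")]))
```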
II. Using BERT-base
- Get NER for Vietnamese sentences with VnCoreNLP.
$ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"
$ python3 get_vn_ner.py
- Use a Google Colab GPU and run all cells in Bert_NER.ipynb.
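BERT tokenizers split rare words into WordPiece subtokens (marked with a `##` prefix), so token-level predictions have to be mapped back to whole words. The helper below is a minimal, self-contained sketch of that post-processing step; it is an assumption about what a notebook like Bert_NER.ipynb does, not its actual code.

```python
def merge_wordpieces(tokens, labels):
    # Rejoin WordPiece subtokens ("##..." pieces) into whole words,
    # keeping the label of each word's first piece, which is the usual
    # convention for BERT token classification.
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))

if __name__ == "__main__":
    tokens = ["Ha", "##noi", "is", "beautiful"]
    labels = ["B-LOC", "I-LOC", "O", "O"]
    print(merge_wordpieces(tokens, labels))
```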