
congdaoduy298/Crawl-Data


I. CRAWL DATA

Crawl data from Goodreads with Selenium and Python.

Installation and Run

  1. Install Python 3.

  2. Clone this repository.

$ git clone https://github.com/congdaoduy298/Crawl-Data.git

  3. Install the dependencies.

$ cd Crawl-Data/

$ pip3 install -r requirements.txt

  4. Run the script from a terminal.

$ python crawl_books.py
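The crawl can be sketched roughly as below. This is a hedged illustration, not the actual contents of crawl_books.py: the list URL, the `a.bookTitle` CSS selector, and all function names are assumptions.

```python
# Illustrative sketch only -- the real logic lives in crawl_books.py.
# BASE_URL, the CSS selector, and the function names are assumptions.

BASE_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever"

def list_page_url(page):
    """Build the URL for one page of a paginated Goodreads list."""
    return "%s?page=%d" % (BASE_URL, page)

def crawl_titles(pages=1):
    """Visit each list page with Selenium and collect the book titles."""
    # Imported lazily so the pure helper above works without Selenium.
    from selenium import webdriver            # pip3 install selenium
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # needs chromedriver on your PATH
    titles = []
    try:
        for page in range(1, pages + 1):
            driver.get(list_page_url(page))
            for link in driver.find_elements(By.CSS_SELECTOR, "a.bookTitle"):
                titles.append(link.text.strip())
    finally:
        driver.quit()
    return titles
```

A real crawl over many pages should also add waits and retries; at this scale (the run below took over 6000 s) politeness delays dominate the runtime.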

Result

Total running time: 6181 s (about 1 hour 43 minutes)

II. NAMED ENTITY RECOGNITION

Extract Vietnamese named entities with VnCoreNLP and English named entities with NLTK + spaCy.

Installation

  1. Python 3.4+ (< 3.8).

  2. Install all required libraries.

$ pip3 install -r requirements.txt

  3. Clone the VnCoreNLP repository and install the vncorenlp wrapper.

$ git clone https://github.com/vncorenlp/VnCoreNLP

  4. Java 1.8+.

  5. Place the file VnCoreNLP-1.1.1.jar (27 MB) and the models folder (115 MB) in the same working folder.

  6. NLTK library (not needed if you use the BERT-base model).

  7. spaCy library (not needed if you use the BERT-base model), plus its English model:

$ python3 -m spacy download en_core_web_sm

Run

I. Use NLTK and spaCy

  1. Run the VnCoreNLP server.

$ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"

  2. Open a new terminal and run:

$ python3 get_ner.py
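The English side of this step can be sketched as follows. This is illustrative only, not the actual contents of get_ner.py; the function names are assumptions, and both NLTK and spaCy are imported lazily or passed in so the pure helper works on its own.

```python
# Illustrative sketch of English NER with NLTK and spaCy -- the
# function names are assumptions, not taken from get_ner.py.

def nltk_entities(text):
    """Chunk named entities with NLTK (requires the punkt, tagger and
    maxent_ne_chunker data packages, fetched once via nltk.download)."""
    import nltk  # pip3 install nltk
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [(" ".join(tok for tok, _ in sub.leaves()), sub.label())
            for sub in tree if hasattr(sub, "label")]

def spacy_entities(nlp, text):
    """Return (entity_text, label) pairs from a loaded spaCy pipeline,
    e.g. nlp = spacy.load("en_core_web_sm")."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

def merge_unique(*entity_lists):
    """Combine entity lists from both tools, keeping first occurrences."""
    seen, merged = set(), []
    for ents in entity_lists:
        for pair in ents:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged
```

Running both tools and merging the results is one simple way to compare their outputs; NLTK and spaCy use different label sets (e.g. `GPE` vs `LOC`), so exact agreement is not expected.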

II. Use BERT-base

  1. Get NER for Vietnamese sentences with VnCoreNLP.
$ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"

$ python3 get_vn_ner.py
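The Vietnamese step can be sketched as below, assuming the VnCoreNLP server started above is listening on port 9000. The `vncorenlp` wrapper and its `annotate()` call are its documented interface, but the helper names here are illustrative, not taken from get_vn_ner.py.

```python
# Hedged sketch of talking to a running VnCoreNLP server; helper
# names are illustrative, not the actual contents of get_vn_ner.py.

def annotate(text):
    """Send text to the server on port 9000 and return its sentences."""
    from vncorenlp import VnCoreNLP  # pip3 install vncorenlp
    annotator = VnCoreNLP(address="http://127.0.0.1", port=9000)
    try:
        return annotator.annotate(text)["sentences"]
    finally:
        annotator.close()

def extract_entities(sentences):
    """Collapse per-token BIO nerLabel tags into (text, type) spans."""
    entities, current, label = [], [], None
    for sentence in sentences:
        for token in sentence:
            tag = token.get("nerLabel", "O")
            if tag.startswith("B-"):
                if current:
                    entities.append((" ".join(current), label))
                current, label = [token["form"]], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(token["form"])
            elif current:
                entities.append((" ".join(current), label))
                current, label = [], None
        if current:
            entities.append((" ".join(current), label))
            current, label = [], None
    return entities
```

Note that VnCoreNLP's word segmenter joins syllables with underscores (e.g. `Hà_Nội`), so the extracted spans keep that convention.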
  2. Use a GPU on Google Colab and run all cells in Bert_NER.ipynb.
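For orientation, BERT-based NER inference can be sketched as below. This is illustrative only: the real code is in Bert_NER.ipynb, the Hugging Face `transformers` pipeline is one common way to run a BERT token classifier (not necessarily the notebook's approach), and the model name is just an example to swap for your own.

```python
# Illustrative only -- the real training/inference code is in
# Bert_NER.ipynb; the pipeline usage and model name are assumptions.

def bert_entities(text, model_name="dslim/bert-base-NER"):
    """Run a BERT token-classification pipeline over text."""
    from transformers import pipeline  # heavy optional dependency
    ner = pipeline("ner", model=model_name, aggregation_strategy="simple")
    return [(ent["word"], ent["entity_group"]) for ent in ner(text)]

def merge_subwords(pieces):
    """Join WordPiece tokens: a '##'-prefixed piece attaches leftward.
    With aggregation_strategy="simple" the pipeline does this for you;
    shown here to make the subword tokenization explicit."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words
```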

REFERENCES

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit

Named Entity Recognition with NLTK and SpaCy
