📦 NODE.JS OCR PDF Document Extractor

A system that recognizes and extracts information from logistics documents using OCR and AI

🚀 Overview

A Node.js application built on the Express framework that combines OCR (Tesseract.js) with Google Gemini 2.5 to automatically recognize and extract information from PDF documents in the logistics domain.

✨ Features

📄 OCR Processing: Recognizes text in PDF files using Tesseract.js
🤖 AI Extraction: Extracts information with Google Gemini 2.5
🏷️ Field Detection: Automatically detects logistics data fields
📊 Structured Output: Returns structured JSON data
⚡ RESTful API: An API that is easy to use and integrate
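
Taken together, the features above describe a three-stage pipeline: render the PDF pages, run OCR on each page, then ask the model for structured fields. The following is a minimal sketch of that flow, not the repository's actual code. The stage functions (`renderPages`, `recognize`, `extractFields`) are hypothetical and are injected by the caller, so the sketch assumes nothing about how the real services are implemented.

```javascript
// Hypothetical orchestration of the PDF -> OCR -> AI pipeline.
// The three stage functions are supplied by the caller; their names
// are illustrative and do not come from this repository.
async function extractFromPdf(pdfBuffer, stages) {
  const started = Date.now();

  // 1. Convert the PDF into one image per page.
  const pageImages = await stages.renderPages(pdfBuffer);

  // 2. OCR each page in order and join the recognized text.
  const pageTexts = [];
  for (const image of pageImages) {
    pageTexts.push(await stages.recognize(image));
  }
  const ocrText = pageTexts.join("\n");

  // 3. Ask the language model for structured fields.
  const extractedFields = await stages.extractFields(ocrText);

  // Same envelope as the documented API response.
  return {
    success: true,
    data: { ocrText, extractedFields },
    processingTime: `${((Date.now() - started) / 1000).toFixed(1)}s`,
  };
}
```

Injecting the stages keeps the orchestration testable with stubs, without loading Tesseract or calling the Gemini API.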

🛠️ Tech stack

Runtime: Node.js
Framework: Express.js
OCR Engine: Tesseract.js (a JavaScript/WebAssembly port of the Tesseract C++ OCR engine)
AI Model: Google Gemini 2.5 Flash
PDF Processing: pdf-parse, pdf-poppler
Language: JavaScript/TypeScript

📋 System requirements

Node.js >= 18.x
npm or yarn
Google Gemini API Key
RAM >= 2GB (for OCR processing)

📦 Installation

1. Clone repository

git clone <repository-url>
cd logistics-ocr-extractor

2. Install dependencies

npm install

3. Configure the environment

Create a .env file in the project root:

PORT=3000
GEMINI_API_KEY=your_gemini_api_key_here
UPLOAD_DIR=./uploads
MAX_FILE_SIZE=10485760
NODE_ENV=development
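
For orientation, here is a hedged sketch of how these variables might be read and validated at startup. `loadConfig` is a hypothetical helper, not something confirmed to exist in this repository. The defaults mirror the sample values above, and `MAX_FILE_SIZE=10485760` corresponds to 10 MiB (10 × 1024 × 1024 bytes).

```javascript
// Hypothetical config loader for the variables listed in .env.
// Defaults mirror the sample file; only GEMINI_API_KEY is mandatory.
function loadConfig(env = process.env) {
  if (!env.GEMINI_API_KEY) {
    throw new Error("GEMINI_API_KEY is required");
  }

  const port = Number.parseInt(env.PORT ?? "3000", 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }

  // Upload limit in bytes; 10485760 = 10 MiB.
  const maxFileSize = Number.parseInt(env.MAX_FILE_SIZE ?? "10485760", 10);
  if (!Number.isInteger(maxFileSize) || maxFileSize <= 0) {
    throw new Error(`Invalid MAX_FILE_SIZE: ${env.MAX_FILE_SIZE}`);
  }

  return {
    port,
    geminiApiKey: env.GEMINI_API_KEY,
    uploadDir: env.UPLOAD_DIR ?? "./uploads",
    maxFileSize,
    nodeEnv: env.NODE_ENV ?? "development",
  };
}
```

Failing fast on a missing key or a malformed number surfaces configuration mistakes at startup rather than on the first request.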

4. Install Tesseract (optional, for production environments)

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-vie

macOS:

brew install tesseract tesseract-lang

Windows: download the installer from the Tesseract GitHub project

🚀 Usage

Start the server

# Development mode
npm run dev

# Production mode
npm start

The server listens on http://localhost:3000, matching the PORT value in .env.

API Endpoints

1. Upload and process a PDF

POST /api/ocr/extract
Content-Type: multipart/form-data

Request:

curl -X POST http://localhost:3000/api/ocr/extract \
  -F "file=@shipping_document.pdf" \
  -F "language=vie+eng"

Response:

{
  "success": true,
  "data": {
    "ocrText": "Full extracted text...",
    "extractedFields": {
      "billOfLading": "BOL123456",
      "shipperName": "ABC Logistics Co.",
      "consigneeName": "XYZ Import Co.",
      "origin": "Ho Chi Minh City, Vietnam",
      "destination": "Los Angeles, USA",
      "containerNumber": "ABCU1234567",
      "sealNumber": "SL789456",
      "cargoDescription": "Electronic Components",
      "weight": "15,000 KG",
      "volume": "28 CBM",
      "shipmentDate": "2025-09-15",
      "deliveryDate": "2025-10-20",
      "portOfLoading": "Cat Lai Port",
      "portOfDischarge": "Port of Long Beach",
      "incoterms": "FOB",
      "freightAmount": "3,500 USD"
    }
  },
  "processingTime": "5.2s"
}
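
The same request can be issued from Node.js 18 or later, which provides `fetch`, `FormData`, and `Blob` as globals. This is a minimal client sketch. `buildExtractForm` and `extractDocument` are illustrative names, and the error handling assumes that a failed extraction returns either a non-2xx status or `"success": false`, which this README does not document.

```javascript
const { readFile } = require("node:fs/promises");
const path = require("node:path");

// Build the multipart body with the two form fields used in the curl example.
function buildExtractForm(pdfBytes, filename, language = "vie+eng") {
  const form = new FormData();
  form.append("file", new Blob([pdfBytes], { type: "application/pdf" }), filename);
  form.append("language", language);
  return form;
}

// Upload a PDF and return the extractedFields object from the response.
async function extractDocument(pdfPath, baseUrl = "http://localhost:3000") {
  const form = buildExtractForm(await readFile(pdfPath), path.basename(pdfPath));
  const res = await fetch(`${baseUrl}/api/ocr/extract`, { method: "POST", body: form });
  if (!res.ok) {
    throw new Error(`Extraction failed with HTTP ${res.status}`);
  }
  const body = await res.json();
  if (!body.success) {
    // Assumed failure shape; adjust to the server's actual error format.
    throw new Error("Server reported an unsuccessful extraction");
  }
  return body.data.extractedFields;
}
```

With the server running, `extractDocument("shipping_document.pdf")` would resolve to an object like the `extractedFields` shown in the response above.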

2. Health Check

GET /api/health

Response:

{
  "status": "OK",
  "timestamp": "2025-10-01T10:30:00Z",
  "uptime": 3600
}
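
As an illustration of what produces this payload, here is a sketch built only on Node's standard `http` module. The project itself uses Express routes, so `healthPayload` and `createHealthServer` are hypothetical stand-ins. Note that `Date.prototype.toISOString()` includes milliseconds, so the timestamp would read `2025-10-01T10:30:00.000Z` rather than the shortened form in the sample.

```javascript
const http = require("node:http");

// Build the health response body; uptime is reported in whole seconds.
function healthPayload(now = new Date()) {
  return {
    status: "OK",
    timestamp: now.toISOString(),
    uptime: Math.floor(process.uptime()),
  };
}

// Hypothetical standalone server exposing only the health endpoint.
function createHealthServer() {
  return http.createServer((req, res) => {
    if (req.method === "GET" && req.url === "/api/health") {
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify(healthPayload()));
      return;
    }
    res.writeHead(404).end();
  });
}
```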

📂 Directory structure

logistics-ocr-extractor/
├── src/
│   ├── controllers/
│   │   └── ocrController.js
│   ├── services/
│   │   ├── ocrService.js
│   │   ├── geminiService.js
│   │   └── pdfService.js
│   ├── middleware/
│   │   ├── uploadMiddleware.js
│   │   └── errorHandler.js
│   ├── routes/
│   │   └── ocrRoutes.js
│   ├── utils/
│   │   ├── logger.js
│   │   └── validators.js
│   └── app.js
├── uploads/
├── .env
├── .gitignore
├── package.json
└── README.md

🔧 Advanced configuration

Customizing OCR

Edit src/services/ocrService.js:

const ocrConfig = {
  lang: "vie+eng", // Recognition languages
  oem: 3, // OCR Engine Mode
  psm: 6, // Page Segmentation Mode
};
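
In Tesseract, OEM 3 selects the default engine mode and PSM 6 treats the page as a single uniform block of text. The sketch below shows one way such a config object could be checked and mapped onto Tesseract settings. `toTesseractSettings` is a hypothetical helper. The commented wiring assumes Tesseract.js v5 or later, where `createWorker` accepts languages and an engine mode. It has not been checked against this repository's `ocrService.js`.

```javascript
// Validate an ocrConfig object and translate it into Tesseract settings.
function toTesseractSettings(ocrConfig) {
  // "vie+eng" -> ["vie", "eng"]
  const langs = String(ocrConfig.lang).split("+").filter(Boolean);
  if (langs.length === 0) {
    throw new Error("At least one OCR language is required");
  }
  // Tesseract defines engine modes 0-3 and page segmentation modes 0-13.
  if (![0, 1, 2, 3].includes(ocrConfig.oem)) {
    throw new RangeError("oem must be an integer from 0 to 3");
  }
  if (!Number.isInteger(ocrConfig.psm) || ocrConfig.psm < 0 || ocrConfig.psm > 13) {
    throw new RangeError("psm must be an integer from 0 to 13");
  }
  return {
    langs,
    oem: ocrConfig.oem,
    parameters: { tessedit_pageseg_mode: String(ocrConfig.psm) },
  };
}

// Assumed wiring with Tesseract.js v5+ (not executed in this sketch):
//   const { createWorker } = require("tesseract.js");
//   const s = toTesseractSettings(ocrConfig);
//   const worker = await createWorker(s.langs, s.oem);
//   await worker.setParameters(s.parameters);
//   const { data } = await worker.recognize(imagePath);
//   await worker.terminate();
```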

Customizing the Gemini prompt

Edit src/services/geminiService.js to extract custom fields:

const prompt = `
Extract logistics information from the following text:
- Bill of Lading Number
- Container Number
- Shipper/Consignee
- Origin/Destination
- Cargo details
... (add further fields as needed)
`;
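
A prompt like this still has to be combined with the OCR text, and the model's reply has to be turned back into JSON. The sketch below shows one possible approach. `FIELDS`, `buildPrompt`, and `parseModelReply` are hypothetical names. The key names are borrowed from the sample API response, and the fence-stripping step assumes the model may wrap its JSON in a Markdown code fence.

```javascript
// Keys borrowed from the sample response; extend as needed.
const FIELDS = [
  "billOfLading", "containerNumber", "shipperName", "consigneeName",
  "origin", "destination", "cargoDescription",
];

// Combine instructions, the requested keys, and the OCR text into one prompt.
function buildPrompt(ocrText, fields = FIELDS) {
  return [
    "Extract logistics information from the following text.",
    "Reply with a single JSON object using exactly these keys",
    "(use null when a value is not present):",
    ...fields.map((f) => `- ${f}`),
    "",
    "Text:",
    ocrText,
  ].join("\n");
}

// Parse the model reply, tolerating an optional Markdown code fence.
function parseModelReply(reply, fields = FIELDS) {
  const FENCE = "`".repeat(3);
  let text = reply.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence line, then the closing fence if present.
    text = text.slice(text.indexOf("\n") + 1);
    if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  }
  const parsed = JSON.parse(text); // throws on malformed JSON
  // Keep only the requested keys; missing ones become null.
  return Object.fromEntries(fields.map((f) => [f, parsed[f] ?? null]));
}
```

Restricting the result to the requested keys keeps the API response stable even if the model returns extra properties.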

📊 Extracted logistics fields

Field	Description	Example
🔢 Bill of Lading	Bill of lading number	BOL123456
📦 Container Number	Container number	ABCU1234567
🏢 Shipper	Party sending the goods	ABC Logistics
🏭 Consignee	Party receiving the goods	XYZ Import Co.
📍 Origin	Place of origin	Ho Chi Minh City
🎯 Destination	Destination	Los Angeles
⚖️ Weight	Weight	15,000 KG
📐 Volume	Volume	28 CBM
📅 Shipment Date	Shipment date	2025-09-15
🚢 Port of Loading	Port of loading	Cat Lai Port
🏁 Port of Discharge	Port of discharge	Port of Long Beach
💰 Freight Amount	Freight charge	3,500 USD
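
Weight, volume, and freight amount come back as display strings such as "15,000 KG". Downstream code often needs the number and the unit separately. The following is a small sketch with a hypothetical `parseQuantity` helper. It assumes commas are thousands separators and the unit is a plain alphabetic code, which fits the examples in the table but may not fit every document.

```javascript
// Split values such as "15,000 KG" or "3,500 USD" into a number and a unit.
// Returns null when the string does not follow the "<number> <unit>" pattern.
function parseQuantity(raw) {
  const match = /^\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)\s*$/.exec(String(raw));
  if (!match) return null;
  // Assumption: commas are thousands separators, not decimal marks.
  const value = Number(match[1].replace(/,/g, ""));
  return { value, unit: match[2].toUpperCase() };
}
```

For example, `parseQuantity("28 CBM")` yields `{ value: 28, unit: "CBM" }`, while free text such as `"n/a"` yields `null`.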

🧪 Testing

# Run unit tests
npm test

# Run with coverage
npm run test:coverage

# Test the API with a sample file
npm run test:api

🐛 Debug

Enable debug mode in .env:

DEBUG=true
LOG_LEVEL=debug

View the logs:

tail -f logs/app.log

🔒 Security

✅ Validate file upload (type, size)
✅ Sanitize input data
✅ Rate limiting
✅ API key encryption
✅ Secure file storage
✅ CORS configuration
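
As an illustration of the first item, upload validation can check the file's size and its leading bytes, not just the extension or the client-supplied MIME type. The sketch below is a strict check that assumes the `%PDF-` header sits at byte 0. `validateUpload` is a hypothetical helper, and the default limit mirrors `MAX_FILE_SIZE` from the sample .env.

```javascript
// Every PDF begins with the ASCII header "%PDF-".
const PDF_MAGIC = Buffer.from("%PDF-", "latin1");

// Check an uploaded file's size and signature before it reaches OCR.
function validateUpload(fileBuffer, maxFileSize = 10485760) {
  if (!Buffer.isBuffer(fileBuffer) || fileBuffer.length === 0) {
    return { ok: false, reason: "empty upload" };
  }
  if (fileBuffer.length > maxFileSize) {
    return { ok: false, reason: "file too large" };
  }
  // Compare the first bytes with the PDF signature.
  if (!fileBuffer.subarray(0, PDF_MAGIC.length).equals(PDF_MAGIC)) {
    return { ok: false, reason: "not a PDF" };
  }
  return { ok: true };
}
```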

📈 Performance

OCR Processing: ~3-5s for a 5-page PDF
AI Extraction: ~1-2s with Gemini 2.5
Throughput: ~10-15 requests/minute (depending on hardware)

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is distributed under the MIT License. See the LICENSE file for details.

👥 Authors

Your Name - Initial work

🙏 Acknowledgments

Tesseract.js - OCR library
Google Gemini - AI model
Express.js - Web framework

📞 Contact

📧 Email: your.email@example.com
🌐 Website: https://your-website.com
💼 LinkedIn: your-linkedin-profile

🔄 Changelog

Version 1.0.0 (2025-10-01)

✨ Initial release
🎉 OCR integration with Tesseract.js
🤖 Gemini 2.5 integration
📦 Support logistics document extraction

⭐ If this project is useful to you, don't forget to give it a star!
