This project collects Naver News comments to build a machine learning model that distinguishes between human-written and bot-generated comments, based on patterns inspired by the Dead Internet Theory. It involves web crawling, data storage, and analysis to enable accurate bot detection.

RealSan1/Machine-Learning-for-Bot-Detection

News Comment Authenticity (Bot / Human) Classification Model Research Project

This project aims to develop a machine learning model that uses Naver News comment data to determine whether a comment was written by a human or generated automatically by a bot.
Real user comments (human data) are collected, and bot comments are generated with an Ollama-based LLM, to build a dataset for comparative training.


Project Overview

  • Goal: develop a human/bot comment classification model
  • Data sources:
    • 1,942 human comments: collected through Naver's official comment API (cbox)
    • 1,982 bot comments: generated from articles with the Ollama gpt-oss:20b model, then paraphrased
  • Final output: a dataset with a judge label → machine learning model training

Tech Stack

| Category | Technology |
| --- | --- |
| Language | Python 3.10 |
| LLM | Ollama (gpt-oss:20b) |
| DB | MySQL + SQLAlchemy (ORM) |
| Crawling | Requests, BeautifulSoup |

Database Schema

NEWS table (article information)

| Column | Type | Description |
| --- | --- | --- |
| newID | INT PK | News article ID |
| Title | LONGTEXT | Article title |
| Content | LONGTEXT | Article body |

COMMENT table (comment information)

| Column | Type | Description |
| --- | --- | --- |
| commentID | INT PK | Comment ID |
| newID | INT FK | ID of the linked news article (NEWS.newID) |
| comment | LONGTEXT | Comment text |
| judge | INT | Label: 0 = human comment, 1 = bot comment |
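
As a rough illustration of the two tables and the foreign key between them, the sketch below recreates the schema in an in-memory SQLite database. SQLite is only a stand-in so the example runs without a server; the project itself uses MySQL with SQLAlchemy, and the placeholder rows and the `CHECK` constraint on `judge` are assumptions added for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; LONGTEXT becomes TEXT.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE NEWS (
    newID   INTEGER PRIMARY KEY,
    Title   TEXT,
    Content TEXT
);
CREATE TABLE COMMENT (
    commentID INTEGER PRIMARY KEY,
    newID     INTEGER REFERENCES NEWS(newID),
    comment   TEXT,
    judge     INTEGER CHECK (judge IN (0, 1))  -- 0 = human, 1 = bot
);
""")

# Placeholder rows (not real data) to show how the label column is used.
conn.execute(
    "INSERT INTO NEWS VALUES (?, ?, ?)",
    (1, "placeholder title", "placeholder body"),
)
conn.executemany(
    "INSERT INTO COMMENT (newID, comment, judge) VALUES (?, ?, ?)",
    [(1, "placeholder human comment", 0), (1, "placeholder bot comment", 1)],
)

# Count comments per label, as one would when checking class balance.
label_counts = dict(
    conn.execute("SELECT judge, COUNT(*) FROM COMMENT GROUP BY judge")
)
print(label_counts)
```

Grouping by `judge` is a quick way to confirm that the human and bot classes stay roughly balanced as data is added.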

Overall Data Pipeline

graph TD
    A[Crawl news articles<br>Title + Content] --> B[(MySQL: NEWS)]
    C[Naver cbox API<br>Collect real comments] --> D[(MySQL: COMMENT, judge=0)]
    B --> E[Ollama gpt-oss:20b<br>Generate 4 comments per article]
    E --> F[Paraphrase each comment<br>Preserve meaning + vary wording]
    F --> G[(MySQL: COMMENT, judge=1)]
    D & G --> H[Dataset complete<br>Human + bot comments]
    H --> I[Train ML model<br>Authenticity classification]

Key Features and Implementation

1. News Article Collection

  • Collect Naver News section URLs
  • Crawl article titles and bodies with BeautifulSoup + Requests
  • Store them in the NEWS table with SQLAlchemy

2. Real User Comment Collection

  • Use Naver's official cbox API
  • Load multiple pages by following the moreParam.next pagination cursor
  • Store the comments in the COMMENT table with judge=0

3. Bot Comment Generation (gpt-oss:20b)

import ollama

def generate_comment(article: dict) -> str:
    """Ask gpt-oss:20b for four comments on one article and return the raw reply."""
    # Prompt (Korean): "Referring to the article title and body below, write 4
    # natural, relatable comments. Vary the length between 40 and 100.
    # Do not use special symbols."
    prompt = f"""
    다음 기사 제목과 본문을 참고하여 자연스럽고 공감 가능한 댓글을 4개 작성,
    댓글의 길이는 최대 100 최소 40 길이로 다양하게 작성, 특수기호 사용 금지.
    제목: {article['title']}
    본문: {article['content']}
    """
    response = ollama.chat(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": prompt}]
    )
    # The reply is a single string containing all four comments.
    return response["message"]["content"].strip()
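
Because `generate_comment` returns all four comments in one string, some post-processing is needed before each comment can be stored as its own row. The sketch below is one possible approach, assuming the model puts one comment per line and may prefix lines with list markers; `split_comments` is a hypothetical helper, not part of the repository.

```python
import re


def split_comments(raw: str, expected: int = 4) -> list[str]:
    """Split a model reply into individual comments, one per non-empty line."""
    comments = []
    for line in raw.splitlines():
        # Drop list markers such as "1.", "2)", "-", "*" or "•" at the line start.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if text:
            comments.append(text)
    return comments[:expected]


sample = "1. first comment\n2) second comment\n\n- third comment\nfourth comment"
parsed = split_comments(sample)
print(parsed)
```

If the model formats its reply differently (for example, as a single paragraph), the splitting rule would need to change accordingly.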

4. Paraphrasing (for naturalness and diversity)

def paraphrase_comment(comment_text: str) -> str:
    """Ask gpt-oss:20b to reword one comment while keeping its meaning."""
    # Prompt (Korean): "Rewrite the following comment naturally while keeping
    # its meaning. Vary the length between 40 and 100 characters and do not
    # use special symbols."
    prompt = f"""
    다음 댓글을 자연스럽게 바꾸되, 의미는 유지하세요.
    길이는 최대 100자, 최소 40자로 다양하게 작성하고, 특수기호 사용 금지:
    댓글: {comment_text}
    """
    response = ollama.chat(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"].strip()

  • Each generated comment is paraphrased at least once
  • A paraphrase is stored only after checking that it preserves the meaning and reads naturally

Examples (translated from Korean):

| Original comment | Paraphrased comment |
| --- | --- |
| I expect a fair investigation. What matters is that this is a legitimate investigation, not a cut in military personnel. | I expect a fair investigation, and what matters is that the focus is on an actual investigation rather than on cutting military personnel. |
| I think the escape that former Mayor Hong talked about was a truly necessary moment | I think the escape that former Mayor Hong talked about was a truly necessary moment back then |
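
The length and symbol constraints stated in the prompts can also be checked mechanically before a comment is stored. The sketch below is only an assumption about how such a check might look: the README does not say how verification is performed, and the definition of a "special symbol" used here (anything other than word characters, whitespace, and basic sentence punctuation) is a guess. It does not test whether the meaning was preserved.

```python
import re

# Anything other than word characters, whitespace, and basic sentence
# punctuation counts as a "special symbol" here; the exact rule is an assumption.
_SPECIAL = re.compile(r"[^\w\s.,?!]")


def is_valid_comment(text: str, min_len: int = 40, max_len: int = 100) -> bool:
    """Check the 40-100 character range and the no-special-symbols rule."""
    text = text.strip()
    return min_len <= len(text) <= max_len and _SPECIAL.search(text) is None


ok = is_valid_comment("a" * 50)
too_short = is_valid_comment("short")
has_symbol = is_valid_comment("b" * 50 + "#")
print(ok, too_short, has_symbol)
```

In Python 3, `\w` matches Unicode word characters, so Hangul text passes the symbol check.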

5. Data Storage

  • Generated and paraphrased comments are stored with judge=1
  • The judge field serves as the ground-truth label for the classification model

Data Collection Execution Order

  1. Collect news articles → store in NEWS
  2. Collect real comments → COMMENT (judge=0)
  3. Generate bot comments (4 per article)
  4. Paraphrase the generated comments
  5. Store bot comments (judge=1)
  6. Dataset complete → ready for ML training

Dataset Sample

The sample rows below are translated from Korean.

news table

| newID | Title | Content |
| --- | --- | --- |
| 3 | Government cracks down on ticket scalping… "Fines of up to 30 times, rewards for informants" | Held on the 7th at Jamsil Baseball Stadium, the 2025 KBO League LG Twins and SSG (omitted) |
| 13 | Na Kyung-won snaps at host during live radio broadcast: "Are you Jung Sung-ho's spokesperson?" | "Prosecution's decision to drop the Daejang-dong appeal was outside pressure… Minister Jung's position is (omitted) |

comment table

| commentID | newID | comment | judge |
| --- | --- | --- | --- |
| 3082 | 3 | Since ticket scalping is such a common social problem, this government policy gave me hope, and I think we should stick to face value from now on even if ticket prices rise | 1 |
| 3832 | 3 | It is impressive that the government's approach to catching scalpers focuses on fines, but I think rewards for informants are definitely needed too | 1 |
| 100 | 13 | Na Kyung-won always runs short on logic, so she always ends with a personal attack. The PPP has never done well during the times Na Kyung-won stepped up to represent it | 0 |
| 113 | 13 | Looks like we've got a second Hong Joon-pyo.. The host can ask whatever question, and if you're not confident answering you can just say you don't know, so why stoop to attacking the host? You should look at the moon, what good is picking at the finger? Supposedly a five-term lawmaker.. acting like a first-termer... | 0 |

Overview of Sentence-BERT Embedding

The source CSV file contains the following columns:

  • News title (title)
  • News content (content)
  • Comment (comment)
  • Bot label (judge, 0: human comment, 1: bot comment)

Embedding Process

  1. Combine the news title and comment text into a single sentence-level text
  2. Generate sentence embeddings with a Sentence-BERT model optimized for Korean (snunlp/KR-SBERT-V40K-klueNLI-augSTS)
  3. Extract a 768-dimensional vector for each text
  4. Store the embedding vectors and the bot labels separately for use as training data
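
The steps above can be sketched as follows. To keep the example runnable without downloading a model, the encoder is passed in as a parameter and replaced by a dummy in the demo; in practice one would presumably pass `SentenceTransformer("snunlp/KR-SBERT-V40K-klueNLI-augSTS").encode` from the `sentence-transformers` package. The helper names and the single-space separator between title and comment are assumptions, since the README does not specify how the two texts are joined.

```python
from typing import Callable, Sequence


def build_texts(rows: Sequence[dict]) -> list[str]:
    """Join each news title with its comment into one input string."""
    # The separator is an assumption; the source only says the texts are combined.
    return [f"{row['title']} {row['comment']}" for row in rows]


def embed_rows(rows: Sequence[dict], encode: Callable[[list[str]], Sequence]):
    """Return (vectors, labels) so they can be saved separately."""
    texts = build_texts(rows)
    vectors = encode(texts)  # one vector per combined text
    labels = [row["judge"] for row in rows]
    return vectors, labels


# Dummy encoder producing 768-dimensional zero vectors, for demonstration only.
def fake_encode(texts: list[str]) -> list[list[float]]:
    return [[0.0] * 768 for _ in texts]


rows = [
    {"title": "title A", "comment": "comment 1", "judge": 0},
    {"title": "title A", "comment": "comment 2", "judge": 1},
]
vectors, labels = embed_rows(rows, fake_encode)
print(len(vectors), len(vectors[0]), labels)
```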

Features and Advantages

  • Embeddings that reflect context and sentence meaning are more expressive than word-averaging approaches
  • Sentence embeddings that combine the news title and the comment capture the comment's context
  • Contributes to better performance of the bot classification model

Usage

  • Train machine learning / deep learning models on the stored Sentence-BERT embedding vectors and labels
  • Used as the input features of the bot detection model, with the expectation of improved accuracy and recall

Model

graph TD

    A[Input] --> B[Linear input_dim to 32]
    B --> C[BatchNorm1d 32]
    C --> D[SiLU]
    D --> E[Dropout 0.1]
    E --> F[Linear 32 to 1]
    F --> G[Output]
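
The diagram corresponds to a small PyTorch module along the following lines. This is a sketch rather than the repository's code: the class name is hypothetical, and `input_dim=768` is an assumption based on the embedding size mentioned earlier.

```python
import torch
from torch import nn


class CommentClassifier(nn.Module):
    """Linear -> BatchNorm1d -> SiLU -> Dropout -> Linear, producing one logit."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 32, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Return raw logits; BCEWithLogitsLoss applies the sigmoid internally.
        return self.net(x).squeeze(-1)


model = CommentClassifier()
criterion = nn.BCEWithLogitsLoss()

# Random tensors stand in for a batch of embeddings and labels.
x = torch.randn(8, 768)
y = torch.randint(0, 2, (8,)).float()
logits = model(x)
loss = criterion(logits, y)
print(logits.shape, loss.item())
```

At inference time, `torch.sigmoid(logits)` turns the logits into bot probabilities, and `model.eval()` should be called first so that BatchNorm and Dropout switch to evaluation behavior.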


Multi-Layer Perceptron (MLP) Binary Classification

  • Layers: 5
  • Loss function: BCEWithLogitsLoss
  • Activation function: SiLU
  • Early stopping (patience=20)
  • Data split: Train 85% / Valid 7.5% / Test 7.5%
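
Early stopping with a patience value can be sketched as a small counter that resets whenever the monitored metric improves. The README only states `patience=20`; monitoring the validation loss, and the class below, are assumptions made for illustration. The demo uses a patience of 3 so the behavior is visible in a short sequence.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 20):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3)
losses = [0.9, 0.7, 0.8, 0.75, 0.71, 0.6]
# Index of the first epoch at which the stopper fires, or None if it never does.
stopped_at = next((i for i, loss in enumerate(losses) if stopper.step(loss)), None)
print(stopped_at, stopper.best_loss)
```

A training loop would typically also save the model weights whenever `best_loss` improves, so the best checkpoint can be restored after stopping.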

Train / Valid / Test = 3335 / 294 / 295
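
These counts are consistent with applying the 85% / 7.5% / 7.5% split to the 3,924 comments in the dataset (1,942 human + 1,982 bot). The sketch below reproduces the sizes with a plain shuffled split; the rounding rule and the seed are assumptions chosen to match the reported numbers, and the README does not say which splitting utility was actually used.

```python
import math
import random


def split_indices(n: int, valid_frac: float = 0.075, test_frac: float = 0.075, seed: int = 42):
    """Shuffle indices 0..n-1 and cut them into train / valid / test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    holdout = math.ceil(n * (valid_frac + test_frac))  # valid + test together
    n_test = math.ceil(n * test_frac)
    n_valid = holdout - n_test
    n_train = n - holdout
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(1942 + 1982)
print(len(train), len(valid), len(test))
```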

Model Performance

| Metric | Score |
| --- | --- |
| Accuracy | 0.9898 |
| Precision | 1.0000 |
| Recall | 0.9795 |
| F1-score | 0.9896 |
| ROC AUC | 0.9994 |
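
To make the relationship between these figures concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are not given in the source; they are inferred as a set that reproduces the reported values on a 295-sample test set, assuming bot (judge=1) is the positive class. ROC AUC depends on the predicted scores and cannot be reproduced this way.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Inferred counts (an assumption, not from the source): no false positives and
# three bot comments missed, out of 295 test samples.
metrics = binary_metrics(tp=143, fp=0, fn=3, tn=149)
rounded = {name: round(value, 4) for name, value in metrics.items()}
print(rounded)
```

Under this reading, every comment flagged as a bot was in fact a bot, and the only errors were three bot comments classified as human.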
