Suggests relevant resources, such as Stack Overflow posts and Medium articles, for a programming question

sid6i7/programming-guide

Programming Guide

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
    • App
    • Output
  4. References

About The Project

This project aims to answer your programming questions by finding and displaying similar questions from Stack Overflow and similar articles from Medium.

It uses the Stack Overflow and Medium articles datasets available on Kaggle for the recommendations.

The sentences are embedded using GloVe embeddings, the data is filtered by the tags provided, and cosine similarity is then used to get the top n most similar Stack Overflow questions and Medium articles.
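The three steps above (embed, filter by tag, rank by cosine similarity) can be sketched roughly as follows. This is only an illustration, not the code in model.py: the toy 3-dimensional vectors, the `embed` and `top_n` helpers, and averaging word vectors into a sentence vector are all assumptions made for the example.

```python
import math

# Toy 3-dimensional vectors standing in for real GloVe embeddings (made-up values).
TOY_EMBEDDINGS = {
    "reverse": [0.9, 0.1, 0.0],
    "string": [0.8, 0.2, 0.1],
    "sort": [0.1, 0.9, 0.0],
    "list": [0.2, 0.8, 0.1],
}


def embed(sentence, embeddings):
    """Average the vectors of the known words (an assumed pooling strategy)."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(question, tag, items, embeddings, n):
    """Keep items carrying the tag, then return the n highest-scoring titles."""
    query = embed(question, embeddings)
    if query is None:
        return []
    scored = []
    for item in items:
        if tag not in item["tags"]:
            continue  # tag filter
        vector = embed(item["title"], embeddings)
        if vector is not None:
            scored.append((cosine_similarity(query, vector), item["title"]))
    scored.sort(reverse=True)
    return scored[:n]


# Hypothetical records; the real datasets have more fields.
ITEMS = [
    {"title": "reverse a string", "tags": ["javascript"]},
    {"title": "sort a list", "tags": ["javascript"]},
    {"title": "reverse a string", "tags": ["python"]},
]
results = top_n("how to reverse a string", "javascript", ITEMS, TOY_EMBEDDINGS, n=1)
```

Words missing from the embeddings ("how", "to", "a" in this toy table) are simply skipped, so the question and the first title end up with the same averaged vector.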

Motivation

Instead of doing multiple searches on Stack Overflow, Medium, GeeksforGeeks, etc., I have always wanted a one-stop solution for any programming-related question I have had, like "how to reverse a string in javascript".

A simple Google search already achieves this, but I still wanted to understand how it does that. Ever since I started programming, I have dreamed of creating a search engine/recommendation engine only for programming. This project is a very small proof of concept that it's indeed possible to achieve that.

Flow

App

Built With

Data Source

Both datasets were sourced from Kaggle.

(back to top)

Project Structure

├── backend/
│   ├── model/
│   │   ├── config.py
│   │   ├── model.py
│   │   └── preprocess.py
│   ├── server/
│   │   └── api.py
│   ├── .dockerignore
│   ├── .gitignore
│   ├── Dockerfile
│   └── requirements.txt
├── client/
│   ├── nginx/
│   │   ├── default.conf
│   │   └── Dockerfile
│   ├── node_modules
│   ├── public
│   ├── src
│   ├── .dockerignore
│   ├── .gitignore
│   ├── Dockerfile
│   ├── package-lock.json
│   └── package.json
├── nginx/
│   ├── default.conf
│   └── Dockerfile
├── src/
│   ├── app.png
│   ├── banner.png
│   ├── flow.jpg
│   ├── medium_demo.png
│   └── stackoverflow_demo.png
├── .gitignore
├── docker-compose.yml
└── README.md

Getting Started

The project is dockerized and uses Docker Compose, so the only thing you need in order to run it is Docker.

Prerequisites

Installation

  1. Get a Kaggle API key. How to get a kaggle API key?

  2. Store the Kaggle API credentials in a .env file in the root directory, in the following format. Don't include any spaces.

    KAGGLE_USERNAME=your_kaggle_username
    KAGGLE_KEY=your_api_key
    
  3. Run docker compose in the root directory.

    docker compose up

The project should be up when you visit localhost.

Note: The first run might take a while because it downloads the data, loads it into memory, preprocesses it, and finally stores it for later use.
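If the containers fail to download the data, the .env format from step 2 is worth checking first. The sketch below is a hypothetical standalone helper (it is not part of this repository) that rejects lines containing spaces and confirms both Kaggle variables are present.

```python
def parse_env(text):
    """Parse KEY=value lines, rejecting any line that contains a space."""
    values = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if " " in line or "=" not in line:
            raise ValueError(f"line {line_no}: expected KEY=value with no spaces")
        key, value = line.split("=", 1)
        values[key] = value
    return values


def check_kaggle_env(text):
    """Return the parsed values, or raise if a Kaggle variable is missing or empty."""
    values = parse_env(text)
    missing = [k for k in ("KAGGLE_USERNAME", "KAGGLE_KEY") if not values.get(k)]
    if missing:
        raise ValueError("missing or empty: " + ", ".join(missing))
    return values


# Placeholder credentials, same shape as the format shown in step 2.
good = check_kaggle_env("KAGGLE_USERNAME=your_kaggle_username\nKAGGLE_KEY=your_api_key\n")
```

To check a real file, pass its contents, for example `check_kaggle_env(open(".env").read())`, from the root directory.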

About Config

The following variables can hugely affect the quality of recommendations. They can be changed in config.py.

SIMILARITY_THRESHOLD: Sets how similar the input question must be to the title of a Stack Overflow question or Medium article for that item to be recommended.

  • It can range from 0 to 100
  • Increase it to improve the quality of recommendations.
  • Increasing it too much might lead to no results in some cases.
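A minimal sketch of how such a threshold could act on scored results is shown below. It assumes cosine scores between 0 and 1 and assumes the 0 to 100 setting is compared after dividing by 100; the actual comparison in model.py may differ, and `apply_threshold` is a hypothetical name.

```python
def apply_threshold(scored, similarity_threshold):
    """Keep (score, title) pairs whose score reaches the threshold.

    Assumption: scores are cosine similarities in [0, 1] and the
    threshold is given on a 0-100 scale.
    """
    cutoff = similarity_threshold / 100
    return [(score, title) for score, title in scored if score >= cutoff]


# Made-up scores for illustration.
scored = [
    (0.92, "How do I reverse a string?"),
    (0.71, "Reverse words in a sentence"),
    (0.40, "Sort an array of numbers"),
]
kept_at_70 = apply_threshold(scored, 70)
kept_at_95 = apply_threshold(scored, 95)
```

With these made-up scores, a threshold of 70 keeps two titles while 95 keeps none, which is the "no results" case mentioned above.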

PERCENTAGE_MEDIUM_DATA: Percentage of the Medium articles data to use. If you change this, don't forget to delete the preprocessed CSV file for Medium articles found at ./backend/model/data/medium_articles/medium_processed.csv

  • It can range from 0 to 100
  • Increasing this might improve the quality of recommendations but will also increase RAM and CPU usage.

PERCENTAGE_STACKOVERFLOW_DATA: Percentage of the Stack Overflow data to use. If you change this, don't forget to delete the preprocessed CSV file for Stack Overflow questions found at ./backend/model/data/stackoverflow/stack_questions_processed.csv

  • It can range from 0 to 100
  • Increasing this might improve the quality of recommendations but will also increase RAM and CPU usage.
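The reason the preprocessed CSV has to be deleted after changing either PERCENTAGE_* value can be illustrated with a small cache sketch. It assumes the backend reuses the processed file whenever it exists, which is how the first-run note reads; `load_titles`, the lowercase "preprocessing", and taking the first rows rather than a random sample are stand-ins, not the project's code.

```python
import csv
import os
import tempfile


def load_titles(raw_titles, processed_path, percentage):
    """Return processed titles, reusing the cached CSV when it already exists."""
    if os.path.exists(processed_path):
        # Cache hit: the percentage argument is never consulted.
        with open(processed_path, newline="", encoding="utf-8") as f:
            return [row[0] for row in csv.reader(f) if row]
    keep = len(raw_titles) * percentage // 100
    processed = [title.lower() for title in raw_titles[:keep]]  # stand-in preprocessing
    with open(processed_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([title] for title in processed)
    return processed


raw = ["Reverse A String", "Sort A List", "Parse JSON", "Read A File"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed.csv")
    first = load_titles(raw, path, 50)    # processes 2 of 4 rows and caches them
    stale = load_titles(raw, path, 100)   # still 2 rows: the cache wins
    os.remove(path)                       # delete the processed file
    fresh = load_titles(raw, path, 100)   # now all 4 rows are processed
```

Under this assumption, a changed percentage has no effect until the processed file is removed and rebuilt.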

GLOVE_EMBEDDINGS_FILE_NAME: The GloVe embeddings file to use. Four options are available. The number in the file name (50, 100, 200, 300) is the length of the vector that represents each word.

Choosing higher-dimensional embeddings might lead to better recommendations but will again require more RAM and CPU.

Options are:

  • glove.6B.50d.txt
  • glove.6B.100d.txt
  • glove.6B.200d.txt
  • glove.6B.300d.txt
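For reference, each line of a GloVe text file is a word followed by its space-separated vector components, so the number in the file name is the count of values per line. The sketch below reads that format; `embedding_dim_from_name` and `load_glove` are hypothetical helpers, and the two sample lines use made-up 3-dimensional values instead of a real file.

```python
import io
import re


def embedding_dim_from_name(file_name):
    """Extract the vector length from a name such as glove.6B.100d.txt."""
    match = re.fullmatch(r"glove\.6B\.(\d+)d\.txt", file_name)
    if match is None:
        raise ValueError(f"unexpected GloVe file name: {file_name}")
    return int(match.group(1))


def load_glove(lines, expected_dim):
    """Build a word -> vector dict from lines of 'word v1 v2 ... vN'."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != expected_dim:
            raise ValueError(f"{word!r}: expected {expected_dim} values, got {len(values)}")
        embeddings[word] = values
    return embeddings


dim = embedding_dim_from_name("glove.6B.50d.txt")

# Made-up 3-dimensional sample in the same text format as the real files.
sample = io.StringIO("string 0.8 0.2 0.1\nreverse 0.9 0.1 0.0\n")
vectors = load_glove(sample, expected_dim=3)
```

Memory grows with the vector length because every word in the vocabulary stores that many floats.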

(back to top)

Usage

App

Question: Any programming question.

Language: The programming language associated with the question.

Count: The number of recommendations to get from each of Stack Overflow and Medium.


Output

Stackoverflow

App

Medium

App

References

(back to top)
