Programming Guide

About The Project

This project aims to answer your programming questions by finding and displaying similar questions from Stack Overflow and related articles from Medium.

It uses Stack Overflow and Medium article datasets available on Kaggle for the recommendations.

Sentences are embedded using GloVe embeddings, the data is filtered by the tags provided, and cosine similarity is then used to select the top n most similar Stack Overflow questions and Medium articles.
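
As a rough illustration of the pipeline described above, here is a minimal sketch in plain Python. It is not the project's actual code: the toy 3-dimensional vectors, the `embed` and `top_n` helpers, and the choice to embed a sentence by averaging its word vectors are all assumptions made for the example.

```python
import math

# Toy 3-dimensional vectors standing in for real GloVe embeddings (made-up values).
TOY_VECTORS = {
    "reverse": [0.9, 0.1, 0.0],
    "string": [0.8, 0.2, 0.1],
    "sort": [0.1, 0.9, 0.0],
    "list": [0.2, 0.8, 0.1],
    "array": [0.2, 0.7, 0.2],
}


def embed(sentence):
    """Average the vectors of known words (assumed strategy); None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(question, tag, documents, n):
    """Filter documents by tag, then rank the remaining titles by cosine similarity."""
    query = embed(question)
    if query is None:
        return []
    scored = []
    for doc in documents:
        if tag not in doc["tags"]:
            continue  # tag filter comes before any similarity computation
        vector = embed(doc["title"])
        if vector is not None:
            scored.append((cosine_similarity(query, vector), doc["title"]))
    scored.sort(reverse=True)
    return [title for _, title in scored[:n]]


# Hypothetical documents; the real data comes from the Kaggle datasets.
DOCUMENTS = [
    {"title": "Reverse a string", "tags": {"javascript"}},
    {"title": "Sort a list", "tags": {"python"}},
    {"title": "Sort an array", "tags": {"javascript"}},
]

results = top_n("how to reverse a string", "javascript", DOCUMENTS, 2)
print(results)
```

Filtering by tag first keeps the similarity computation limited to documents that can actually be returned, which matters once the candidate set is large.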

Motivation

Instead of running multiple searches on Stack Overflow, Medium, GeeksforGeeks, etc., I have always wanted a one-stop solution for any programming question I have, such as "how to reverse a string in javascript".

Even though a simple Google search already achieves this, I still wanted to understand how it does so. Ever since I started programming, I have dreamed of creating a search/recommendation engine dedicated to programming. This project is a very small proof of concept that it is indeed possible.

Flow

Built With

Frontend: ReactJS
Backend: FastAPI
Model: GloVe + Cosine Similarity
Web Server: Nginx

Data Source

Both datasets were sourced from Kaggle.

Stackoverflow: StackSample: 10% of Stack Overflow Q&A
Medium: 190k+ Medium Articles


Project Structure

├── backend/
│   ├── model/
│   │   ├── config.py
│   │   ├── model.py
│   │   └── preprocess.py
│   ├── server/
│   │   └── api.py
│   ├── .dockerignore
│   ├── .gitignore
│   ├── Dockerfile
│   └── requirements.txt
├── client/
│   ├── nginx/
│   │   ├── default.conf
│   │   └── Dockerfile
│   ├── node_modules
│   ├── public
│   ├── src
│   ├── .dockerignore
│   ├── .gitignore
│   ├── Dockerfile
│   ├── package-lock.json
│   └── package.json
├── nginx/
│   ├── default.conf
│   └── Dockerfile
├── src/
│   ├── app.png
│   ├── banner.png
│   ├── flow.jpg
│   ├── medium_demo.png
│   └── stackoverflow_demo.png
├── .gitignore
├── docker-compose.yml
└── README.md

Getting Started

The project is dockerized and uses Docker Compose, so Docker is the only thing you need to run it.

Prerequisites

Docker: How to install docker?

Installation

Get a Kaggle API key. How to get a kaggle API key?
Store the Kaggle API credentials in a .env file in the root directory, in the following format. Don't include any spaces.
```
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_api_key
```
Run docker compose in the root directory.
```
docker compose up
```

The project should be up when you visit localhost.

Note: The first run might take a while because it downloads the data, loads it into memory, preprocesses it, and stores it for later use.

About Config

The following variables can greatly affect the quality of recommendations. They can be changed in config.py.

SIMILARITY_THRESHOLD: Sets how similar the input question must be to the titles of the related Stack Overflow questions and Medium articles.

It can range from 0 to 100.
Increase it to improve the quality of recommendations.
Increasing it too much might lead to no results in some cases.
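
To make the trade-off concrete, here is a small sketch of how a 0 to 100 threshold could be applied to cosine scores. The `filter_by_threshold` helper, the example value of 60, and the idea that the cosine score is multiplied by 100 before comparison are assumptions for illustration; config.py may apply the threshold differently.

```python
SIMILARITY_THRESHOLD = 60  # hypothetical value on the 0-100 scale


def filter_by_threshold(scored_titles, threshold=SIMILARITY_THRESHOLD):
    """Keep (title, cosine) pairs whose cosine score, scaled to 0-100, meets the threshold."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return [(title, score) for title, score in scored_titles if score * 100 >= threshold]


# Made-up candidates with made-up cosine scores.
candidates = [
    ("Reverse a string in JavaScript", 0.92),
    ("Sort an array", 0.41),
    ("String slicing basics", 0.63),
]

kept = filter_by_threshold(candidates)
strict = filter_by_threshold(candidates, threshold=95)
print([title for title, _ in kept])
print(strict)
```

With the threshold at 95, none of the candidates survive, which is the "no results" case mentioned above.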

PERCENTAGE_MEDIUM_DATA: The percentage of the Medium articles data to use. If you change this, remember to delete the preprocessed CSV file for Medium articles at ./backend/model/data/medium_articles/medium_processed.csv

It can range from 0 to 100.
Increasing this might improve the quality of recommendations but will also increase RAM and CPU usage.

PERCENTAGE_STACKOVERFLOW_DATA: The percentage of the Stack Overflow data to use. If you change this, remember to delete the preprocessed CSV file for Stack Overflow questions at ./backend/model/data/stackoverflow/stack_questions_processed.csv

It can range from 0 to 100.
Increasing this might improve the quality of recommendations but will also increase RAM and CPU usage.
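
The two percentage settings and the cached CSV files interact, so a sketch may help. The helpers below are hypothetical: `take_percentage` assumes the first N percent of rows are kept (the project might sample randomly instead), and `invalidate_cache` simply deletes a stale preprocessed file so that it is rebuilt on the next run.

```python
import tempfile
from pathlib import Path


def take_percentage(rows, percentage):
    """Return the first `percentage` percent of rows (assumed strategy, not confirmed)."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    count = int(len(rows) * percentage / 100)
    return rows[:count]


def invalidate_cache(processed_csv):
    """Delete a preprocessed CSV if it exists; report whether a file was removed."""
    path = Path(processed_csv)
    if path.exists():
        path.unlink()
        return True
    return False


rows = list(range(200))            # stand-in for 200 dataset rows
subset = take_percentage(rows, 10)  # keep 10 percent

# Demonstrate cache invalidation on a throwaway file instead of the real data directory.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "medium_processed.csv"
    cached.write_text("title,tags\n")
    removed_first = invalidate_cache(cached)
    removed_second = invalidate_cache(cached)

print(len(subset), removed_first, removed_second)
```

In the real project, the path passed to such a helper would be one of the preprocessed CSV paths listed above.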

GLOVE_EMBEDDINGS_FILE_NAME: The GloVe embeddings file to use. Four options are available. The number in the file name (50, 100, 200, 300) indicates the length of the vector that represents each word.

Choosing higher-dimensional embeddings might lead to better recommendations but will again require more RAM and CPU.

Options are:

glove.6B.50d.txt
glove.6B.100d.txt
glove.6B.200d.txt
glove.6B.300d.txt
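
For context on why the dimension affects memory, each line of a GloVe text file holds a word followed by its vector values, separated by spaces. The loader below is a minimal sketch of reading that format; `load_glove` and its `expected_dim` check are hypothetical and not taken from the project's model code, and the sample uses 3 values per word only to keep the example short.

```python
import io


def load_glove(lines, expected_dim=None):
    """Parse GloVe-format lines ("word v1 v2 ... vd") into a dict of word -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError(f"{word!r} has {len(values)} values, expected {expected_dim}")
        embeddings[word] = values
    return embeddings


# In-memory sample in the same layout as a GloVe file, with made-up 3-dimensional vectors.
sample = io.StringIO("string 0.1 0.2 0.3\nreverse 0.4 0.5 0.6\n\n")
vectors = load_glove(sample, expected_dim=3)
print(sorted(vectors), len(vectors["string"]))
```

With a real file, the same function could be called as `load_glove(open("glove.6B.50d.txt", encoding="utf-8"), expected_dim=50)`. Every word stores one float per dimension, so the 300d file needs roughly six times the memory of the 50d file.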


Usage

App

Question: Any programming question.

Language: Programming language associated with the question.

Count: Number of recommendations to get from each of Stack Overflow and Medium.

Output

Stackoverflow

Medium

References
