- About The Project
- Getting Started
- Usage
- App
- Output
- References
This project aims to answer your programming questions by finding and displaying similar questions from Stack Overflow and related articles from Medium.
It uses Stack Overflow data and Medium article data available on Kaggle for the recommendations.
Sentences are embedded using GloVe embeddings, the data is filtered by the tags provided, and cosine similarity is then used to find the top n most similar Stack Overflow questions and Medium articles.
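The pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the toy 3-dimensional vectors stand in for real GloVe embeddings, averaging word vectors to get a sentence vector is an assumption, and the names `embed`, `cosine_similarity`, and `recommend` are hypothetical.

```python
import math

# Toy 3-dimensional vectors standing in for real GloVe embeddings (assumption).
TOY_VECTORS = {
    "reverse": [0.9, 0.1, 0.0],
    "string": [0.8, 0.2, 0.1],
    "sort": [0.1, 0.9, 0.0],
    "list": [0.2, 0.8, 0.1],
}


def embed(sentence, vectors):
    """Average the vectors of known words; unknown words are skipped."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(question, tag, documents, vectors, top_n=5):
    """Filter documents by tag, then rank their titles by cosine similarity."""
    query = embed(question, vectors)
    if query is None:
        return []
    scored = []
    for doc in documents:
        if tag not in doc["tags"]:
            continue  # tag filter happens before any similarity is computed
        title_vec = embed(doc["title"], vectors)
        if title_vec is None:
            continue
        scored.append((cosine_similarity(query, title_vec), doc["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


documents = [
    {"title": "reverse string", "tags": {"javascript"}},
    {"title": "sort list", "tags": {"javascript"}},
    {"title": "reverse list", "tags": {"python"}},
]
results = recommend("how to reverse a string", "javascript", documents, TOY_VECTORS, top_n=2)
```

Here the `python`-tagged document is dropped by the tag filter, and the remaining two titles are ranked with "reverse string" first.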
Instead of running multiple searches on Stack Overflow, Medium, GeeksforGeeks, etc., I have always wanted a one-stop solution for any programming question I have, such as "how to reverse a string in javascript".
Even though a simple Google search already achieves this, I still wanted to understand how it does that. Ever since I started programming, I have dreamed of creating a search/recommendation engine just for programming. This project is a very small proof of concept that it's indeed possible.
- Frontend: ReactJS
- Backend: FastAPI
- Model: GloVe + Cosine Similarity
- Web Server: Nginx
Both datasets were sourced from Kaggle.
- Stack Overflow: StackSample: 10% of Stack Overflow Q&A
- Medium: 190k+ Medium Articles
├── backend/
│ ├── model/
│ │ ├── config.py
│ │ ├── model.py
│ │ └── preprocess.py
│ ├── server/
│ │ └── api.py
│ ├── .dockerignore
│ ├── .gitignore
│ ├── Dockerfile
│ └── requirements.txt
├── client/
│ ├── nginx/
│ │ ├── default.conf
│ │ └── Dockerfile
│ ├── node_modules
│ ├── public
│ ├── src
│ ├── .dockerignore
│ ├── .gitignore
│ ├── Dockerfile
│ ├── package-lock.json
│ └── package.json
├── nginx/
│ ├── default.conf
│ └── Dockerfile
├── src/
│ ├── app.png
│ ├── banner.png
│ ├── flow.jpg
│ ├── medium_demo.png
│ └── stackoverflow_demo.png
├── .gitignore
├── docker-compose.yml
└── README.md
The project is dockerized and uses Docker Compose, so the only thing you need to run it is Docker.
- Docker: How to install docker?
- Get a Kaggle API key. How to get a kaggle API key?
- Store the Kaggle API credentials in a .env file in the root directory, in the following format. Don't include any spaces.

      KAGGLE_USERNAME=your_kaggle_username
      KAGGLE_KEY=your_api_key

- Run docker compose in the root directory.

      docker compose up
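If you want to check the .env file before starting the containers, a small validator along these lines could help. It is only a sketch: `parse_env` is a hypothetical helper, not part of the project, and it treats any space in a line as an error, matching the "no spaces" rule above.

```python
def parse_env(text):
    """Parse KEY=value lines, rejecting any line that contains a space."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if " " in line or "=" not in line:
            raise ValueError(f"malformed line: {line!r}")
        key, value = line.split("=", 1)
        values[key] = value
    return values


creds = parse_env("KAGGLE_USERNAME=your_kaggle_username\nKAGGLE_KEY=your_api_key\n")
```

A line such as `KAGGLE_KEY = your_api_key` would raise a `ValueError` instead of being silently misread.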
The project should be up when you visit localhost.
Note: The first run might take a while because it downloads the data, loads it into memory, preprocesses it, and finally stores it for later use.
The following variables can significantly affect the quality of recommendations. They can be changed in config.py.
SIMILARITY_THRESHOLD: How similar the input question must be to the titles of the related Stack Overflow questions and Medium articles for them to be recommended.
- It can range from 0 to 100
- Increase this to increase quality of recommendation
- Increasing it too much might lead to no results in some cases.
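The effect of the threshold can be illustrated with a short sketch. It assumes that cosine similarity scores are scaled to the 0 to 100 range before being compared; the value 60, the sample scores, and the helper `filter_by_threshold` are all illustrative and not taken from the project.

```python
SIMILARITY_THRESHOLD = 60  # illustrative value, not the project's default


def filter_by_threshold(scored_titles, threshold=SIMILARITY_THRESHOLD):
    """Keep (similarity, title) pairs whose similarity, scaled to 0-100, meets the threshold."""
    return [(score, title) for score, title in scored_titles if score * 100 >= threshold]


scored = [
    (0.92, "How do I reverse a string in JavaScript?"),
    (0.55, "Sorting an array of objects"),
    (0.61, "Reverse words in a sentence"),
]
kept = filter_by_threshold(scored)                 # drops the 0.55 entry
strict = filter_by_threshold(scored, threshold=95)  # nothing passes
```

The second call shows the trade-off mentioned above: a threshold set too high leaves no results at all.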
PERCENTAGE_MEDIUM_DATA: Percentage of the Medium articles data to use. If you change this, don't forget to delete the preprocessed CSV file for Medium articles found in ./backend/model/data/medium_articles/medium_processed.csv
- It can range from 0 to 100
- Increasing this might improve the quality of recommendations but will also increase RAM and CPU usage.
PERCENTAGE_STACKOVERFLOW_DATA: Percentage of the Stack Overflow data to use. If you change this, don't forget to delete the preprocessed CSV file for Stack Overflow questions found in ./backend/model/data/stackoverflow/stack_questions_processed.csv
- It can range from 0 to 100
- Increasing this might improve the quality of recommendations but will also increase RAM and CPU usage.
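Since both percentage settings require deleting a preprocessed CSV file, a small helper can do the cleanup in one step. This is a sketch under the assumption that it is run from the repository root; `clear_processed_cache` is a hypothetical name, and only the two file paths come from this README.

```python
from pathlib import Path

# Paths of the preprocessed files, relative to the repository root.
PROCESSED_FILES = [
    "backend/model/data/medium_articles/medium_processed.csv",
    "backend/model/data/stackoverflow/stack_questions_processed.csv",
]


def clear_processed_cache(root="."):
    """Delete the preprocessed CSV files under root and return the paths removed."""
    removed = []
    for relative in PROCESSED_FILES:
        path = Path(root) / relative
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `clear_processed_cache()` after editing config.py should force the data to be preprocessed again on the next run, as described above.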
GLOVE_EMBEDDINGS_FILE_NAME: The GloVe embeddings file to use. Four options are available; the number in the file name (50, 100, 200, 300) is the length of the vector that represents each word.
Choosing higher-dimensional embeddings might lead to better recommendations but will again require more RAM and CPU.
Options are:
- glove.6B.50d.txt
- glove.6B.100d.txt
- glove.6B.200d.txt
- glove.6B.300d.txt
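Each of these files is plain text with one word per line, followed by its vector components separated by spaces. A minimal loader, written here as a sketch rather than the project's actual code (`load_glove` is a hypothetical name and the sample values are made up), makes the memory trade-off concrete: the 300d file stores six times as many floats per word as the 50d file.

```python
def load_glove(lines):
    """Parse lines of 'word v1 v2 ... vd' into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up 3-dimensional sample in the same layout as the real files.
sample = ["the 0.1 -0.2 0.3", "code 0.4 0.5 -0.6"]
vectors = load_glove(sample)
```

With a real file, the same function can be fed an open file handle, for example `with open("glove.6B.50d.txt", encoding="utf-8") as f: vectors = load_glove(f)`.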
Question: Any programming question.
Language: Programming language associated with the question.
Count: Number of recommendations to return for each of Stack Overflow and Medium.



