Video-based Fall Detection using Multimodal Large Language Models

License: MIT · Python 3.11+ · Ruff · vLLM · Hydra · pre-commit

Project Overview

This project provides the code for a master's thesis on Video-based Fall Detection using Multimodal Large Language Models (MLLMs), specifically the detection of human falls and the subsequent fallen state. We also evaluate MLLMs jointly with general Human Activity classes like walking or standing to assess models on Human Activity Recognition (HAR).

The main experiments we conduct are:

  • Zero-shot: No exemplars are given, just the task instruction
  • Few-shot: A few (usually 1-10) video exemplars with their associated ground-truth labels are supplied for In-Context Learning (ICL)
  • Chain-of-Thought (CoT): Specifically, Zero-Shot CoT, i.e. no exemplars with reasoning traces are given; the model produces its own reasoning trace.

Quick Start

Requirements:

  1. Set up the environment with conda/uv (see Create the environment below)
  2. Set the required and recommended environment variables (see Environment variables below)

The main entry point is scripts/vllm_inference.py. Experiments are selected via a Hydra override, e.g. experiment=zeroshot (the default is debug).

To run zero-shot experiments with InternVL3.5-8B, execute

python scripts/vllm_inference.py experiment=zeroshot model=internvl model.params=8B

To run few-shot experiments with Qwen3-VL-4B, execute

python scripts/vllm_inference.py experiment=fewshot model=qwenvl model.params=4B

To run CoT experiments with the default model, execute

python scripts/vllm_inference.py experiment=zeroshot_cot

Configuration options

Besides selecting the experiment, the main configuration options are (see the combined example after this list):

  1. vLLM configs in config/vllm (default: default; for faster warm-up times, use debug)
  2. Sampling configs in config/sampling (e.g. greedy, qwen3_instruct)
  3. Model configs in config/model (default: qwenvl)
  4. Prompt configs in config/prompt (default: default), which uses text-based output and a role prompt
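
As a sketch, these config groups can be combined as Hydra overrides in a single call. The group and option names are taken from the defaults listed above, but not every combination has been verified:

# Combine experiment, model, vLLM, sampling, and prompt configs (illustrative)
python scripts/vllm_inference.py experiment=zeroshot model=internvl model.params=8B vllm=debug sampling=greedy prompt=default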

Other settings include (see the example after this list):

  1. Data processing options, e.g. data.size=224 or data.split=cv
  2. Hardware settings, notably
    • batch_size: how many videos are loaded into memory at once. Reduce this if RAM-constrained
    • num_workers: number of worker processes for data loading
  3. Wandb logging config, notably
    • wandb.mode (online, offline or disabled)
    • wandb.project (also configured by experiment)
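
A minimal sketch combining these settings; the numeric values are purely illustrative, and the exact placement of batch_size/num_workers in the config tree follows the overrides listed above:

# Few-shot run with smaller frames, cross-validation split, reduced memory footprint, and offline logging
python scripts/vllm_inference.py experiment=fewshot data.size=224 data.split=cv batch_size=4 num_workers=2 wandb.mode=offline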

Debugging options

  • num_samples (int): constrain the number of samples used for inference
  • vllm.use_mock (bool): if True, skip the vLLM engine and produce random predictions, so debugging does not depend on vLLM
  • vllm=debug for faster warm-up times
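
These options can be combined into a quick smoke test; this is only a sketch, and the sample count is an arbitrary illustrative value:

# Smoke test: a handful of samples, mock predictions (no vLLM engine needed), fast warm-up, logging disabled
python scripts/vllm_inference.py experiment=zeroshot num_samples=8 vllm=debug vllm.use_mock=true wandb.mode=disabled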

Tech Stack

We use the vLLM inference engine, which is optimized for high-throughput, memory-efficient LLM inference with multimodal support. Hydra is used for configuration management (see above).

Create the environment

  1. Install Conda
  2. Create and activate the environment:
conda env create -f environment.yml
conda activate cu129_vllm15
  3. Install the additional dependencies using uv (installed inside the conda environment):
uv pip install vllm==0.15.1 --torch-backend=cu129
MAX_JOBS=4 uv pip install flash-attn==2.8.3 --no-build-isolation
uv pip install -r requirements.txt
uv pip install -r requirements-dev.txt
uv pip install -e .

At the time of writing, vLLM is compiled for cu129 by default. If you need a different version of CUDA, you have to install vLLM from source.
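
As an optional sanity check (a sketch, assuming the conda environment above is active), verify that the CUDA-enabled stack imports correctly:

# Check vLLM, the CUDA build of PyTorch, and FlashAttention
python -c "import torch, vllm; print(vllm.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"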

Environment variables

Required

OMNIFALL_ROOT=path/to/omnifall
VLLM_WORKER_MULTIPROC_METHOD=spawn

Recommended

These variables should be set before launching the vLLM inference script.

CUDA_VISIBLE_DEVICES=0 # or e.g., 0,1
VLLM_CONFIGURE_LOGGING=0
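
Putting it together, a minimal shell setup before launching could look as follows (the OMNIFALL_ROOT path is a placeholder and the experiment choice is just an example):

# Required
export OMNIFALL_ROOT=path/to/omnifall
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Recommended
export CUDA_VISIBLE_DEVICES=0 # or e.g., 0,1
export VLLM_CONFIGURE_LOGGING=0

python scripts/vllm_inference.py experiment=zeroshot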
