ADVANCED Vision

Features:

  • Lifetime GitHub repo access
  • Access to future vision fine-tuning and data-prep scripts as they are uploaded
  • Ability to post Issues

Suitable for:

  • Fine-tuning vision models on custom image or text datasets
  • Improving performance on specific vision tasks
  • Deploying a Moondream (or fine-tuned Moondream) server

Repo contents (everything you need to fine-tune vision models):

  • Dataset Preparation
  • Fine-tuning LLaVA and IDEFICS
  • Fine-tuning Moondream
  • Moondream server (with simple batching; see the sketch below)
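
For intuition, here is a rough sketch of the "simple batching" idea: requests are queued and flushed to the model in groups, either when a batch fills or after a short wait. The batch size, timeout, and run_model stand-in below are illustrative assumptions, not the repo's actual server code.

```python
# Illustrative sketch of simple request batching for an inference server.
# run_model, BATCH_SIZE, and MAX_WAIT_S are placeholders, not the repo's code.
import asyncio

BATCH_SIZE = 8     # flush once this many requests are queued...
MAX_WAIT_S = 0.05  # ...or after this many seconds, whichever comes first

def run_model(prompts):
    # Stand-in for a real batched vision-language model call.
    return [f"answer for: {p}" for p in prompts]

async def batcher(queue: asyncio.Queue):
    while True:
        # Block for the first request, then top the batch up until it is
        # full or the deadline passes.
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = run_model([prompt for prompt, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

async def infer(queue: asyncio.Queue, prompt: str) -> str:
    # Each caller enqueues its prompt and awaits the batched result.
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    answers = await asyncio.gather(*(infer(queue, f"question {i}") for i in range(20)))
    print(answers[:3])

asyncio.run(main())
```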

GPU requirements for training

Moondream

  • Can be fine-tuned with less than 8 GB of VRAM (the sketch below shows the typical memory-saving setup).
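
As a rough illustration, the sketch below shows the kind of setup that typically gets fine-tuning under 8 GB: bf16 weights, gradient checkpointing, and an 8-bit optimiser (with LoRA or a frozen vision encoder reducing memory further). The model id is the public Hugging Face checkpoint; the repo's actual scripts may differ.

```python
# Sketch of a memory-frugal fine-tuning setup for Moondream; an illustration
# of the general technique, not the repo's training script.
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "vikhyatk/moondream2"  # public Moondream2 checkpoint (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,      # Moondream ships custom modelling code
    torch_dtype=torch.bfloat16,  # bf16 weights: roughly half the memory of fp32
).cuda()
model.train()

# Gradient checkpointing trades compute for activation memory; the custom
# modelling code may or may not support it.
try:
    model.gradient_checkpointing_enable()
except ValueError:
    pass

# An 8-bit optimiser keeps Adam's state small; combined with the above (and
# optionally LoRA or a frozen vision encoder) this is how a ~2B-parameter
# model fits into a sub-8 GB VRAM budget.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)
```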

LLaVA 1.5 (GPU requirements updated April 24th 2024)

  • liuhaotian/llava-v1.5-7b takes a minimum of 24 GB of VRAM to train and will run on a single A6000.
  • liuhaotian/llava-v1.5-13b requires just under 48 GB of VRAM and will run on 1x A6000 or 1x A100.

LLaVA 1.6 (GPU requirements updated April 24th 2024)

  • liuhaotian/llava-v1.6-mistral-7b takes a minimum of 24 GB of VRAM in bf16. 1x A6000 or 1x A100 (80 GB).
  • liuhaotian/llava-v1.6-34b takes about 80 GB of VRAM in bf16. 2x A6000 or 2x A100 (80 GB); very tight on 1x A100 but possible with a context length of 1024.

Note: Price for new buyers increases periodically as new content is added.


Video Tutorials

Tiny Text + Vision Models – Fine-tuning and API Setup

Fine-tuning Multi-modal LLaVA Vision and Language Models

22 thoughts on “ADVANCED Vision”

  1. I was hoping this was included in the Advanced Fine-tuning Scripts repo. Is there a discount for people who bought lifetime access to the Advanced Fine-tuning Scripts repo?

  2. Hello! This is a great source of information. I am still a student and I would like to access this repository for acquiring knowledge. Is there any discount available for students like me?

  3. Hi, before buying I just want to ask if this is the right repo I need for the full process of fine-tuning LLaVA on a custom dataset? I saw many links in your tutorial and got confused!

  4. Hi, I intend to implement this on a GCP Vertex AI VM. Can I use the repo with the following configuration:

    4 x A100 (40GB) GPUs
    CUDA version 11.8

    1. Howdy! Yes, for the 7B LLaVA 1.6 model that looks like it will work. Here are the GPU specs required:

      GPU requirements for training

      LLaVA 1.5

      – liuhaotian/llava-v1.5-7b will run on a single A6000
      – liuhaotian/llava-v1.5-13b REQUIRES VRAM OF 80 GB+. Will run on 2x A6000

      LLaVA 1.6

      – liuhaotian/llava-v1.6-mistral-7b takes about 100 GB of VRAM in bf16. 3x A6000 or 2x A100 (80GB).
      – liuhaotian/llava-v1.6-vicuna-7b takes about 100 GB of VRAM in bf16. 3x A6000 or 2x A100 (80GB).
      – liuhaotian/llava-v1.6-34b takes about 300 GB of VRAM in bf16. 7x A6000 or 4x A100 (80GB).

    1. Two main costs:
      – Running the GPU for fine-tuning (free if you use Colab, but it can be more practical/faster to pay ~$0.70/hr for a Runpod or Vast.ai GPU)
      – Generating synthetic data (if you need it), for which you would need OpenAI credits, unless you want to make the data manually with free ChatGPT (a rough sketch of this step is below).
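
      A rough sketch of the synthetic-data step with the OpenAI Python client (the model name, prompt, and output format are placeholder assumptions):

```python
# Generate one synthetic Q&A pair for an image via the OpenAI API.
# Model choice, prompt wording, and output format are illustrative only.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def qa_for_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write one training question and answer about this "
                         "image, formatted as 'Q: ... A: ...'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(qa_for_image("example.jpg"))
```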

  5. Thanks for the breakdown. How long would I need to run a Runpod GPU to get decent results, for example with 200 training images?

  6. Hi Ronan,

    You’re doing great work. I’m interested in buying access but wanted to know if you have a roadmap for adding new models?

    E.g. the new version of IDEFICS, Moondream2, Qwen-VL, etc.

    The space is developing very quickly and I’d like to work with something that uses SigLIP and a newer, smaller LLM.

    1. Howdy, yes, I’m working on both Moondream and IDEFICS right now.

      Qwen-VL I’m unlikely to cover, as it performs worse than LLaVA 1.6 and is badly supported by inference libraries. IDEFICS will soon be supported by TGI.

  7. Hello Ronan,

    I really like your great work on fine-tuning LLaVA. I was wondering… would it be possible to purchase your repo, follow your whole work pipeline, and later deploy one of my fine-tuned LLaVA models locally on a 24 GB NVIDIA GeForce RTX 3090 Ti? If feasible, do you think it would be slow at inference?

    Also, any advice regarding quantization?

    Thanks in advance!

    1. Yes, that is all possible. See my video posted today (22nd April) for a fast inference option with Moondream. Quantization isn’t recommended for production-grade use because it hurts quality and can also slow inference.

  8. Hello

    I need to train the model to recognize Windows UI, app UI, and web/browser UI.
    I want to build an AI agent that can use my PC like a human.
    Is this suitable for my use case?

  9. Hi Ronan

    I want to purchase this now but just want to confirm if possible first.

    Can I train this vision model to detect a button in an image and respond with the coordinates? I have a large number of images but want to see what approach I could take with this.

    Kind regards
    Jonathan

    1. Yes, that is the kind of problem that should be fine-tunable. Whether the smallest Moondream model is strong enough is a bit uncertain, but with a larger model it should be (although I recommend starting small). As a hypothetical illustration, a training sample could be phrased as in the sketch below.
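
      A hypothetical shape for one such training sample (field names and answer wording are illustrative, not the repo's actual schema):

```python
# Hypothetical coordinate-grounding sample appended to a JSONL training file.
import json

sample = {
    "image": "screenshots/login_page.png",  # illustrative path
    "question": "Where is the 'Sign in' button?",
    "answer": "The 'Sign in' button is at approximately (x=412, y=305).",
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```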
