


  • Lifetime GitHub repo access
  • Access to future uploaded Vision Fine-tuning and data prep scripts
  • Ability to post Issues

Suitable for:

  • Fine-tuning vision models (IDEFICS, Moondream, LLaVA) on custom image or text datasets.
  • Improving performance on specific vision tasks.
  • Running inference on IDEFICS 2 with Text Generation Inference (TGI).
  • Deploying a Moondream (or fine-tuned Moondream) server.

Repo Content = all you need to fine-tune vision models:

  • Dataset Preparation
  • Fine-tuning IDEFICS 2, Moondream and LLaVA
  • Moondream server (w/ simple batching)

GPU requirements for training


  • Can be fine-tuned with <16 GB in nf4.


  • Can be fine-tuned with less than 8 GB VRAM.

LLaVA Llama 3

  • weizhiwang/LLaVA-Llama-3-8B takes a minimum of 24 GB of VRAM in bf16. 1x A6000 or 1x A100 (80GB).

LLaVA 1.5 (GPU requirements updated April 24th 2024)

  • liuhaotian/llava-v1.5-7b takes a minimum of 24 GB of VRAM to train and will run on a single A6000.
  • liuhaotian/llava-v1.5-13b requires less than 48 GB of VRAM. Will run on 1x A6000 or 1x A100.

LLaVA 1.6 (GPU requirements updated April 24th 2024)

  • liuhaotian/llava-v1.6-mistral-7b takes a minimum of 24 GB of VRAM in bf16. 1x A6000 or 1x A100 (80GB).
  • liuhaotian/llava-v1.6-34b takes about 80 GB of VRAM in bf16. 2xA6000 or 2x A100 (80GB) – very tight for 1xA100 but possible with context of 1024.
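As a rough sanity check on the figures above, the memory needed just to hold the model weights can be estimated from the parameter count and the bytes per parameter. The sketch below is illustrative only: it assumes the nominal parameter counts implied by the model names, 2 bytes per parameter for bf16 and roughly 0.5 bytes for nf4, and it ignores activations, gradients, optimizer state and quantization overhead, so real training needs more than these numbers.

```python
# Rough weight-only memory arithmetic (assumption: nominal parameter counts
# taken from the model names; activations, gradients and optimizer state
# are NOT included, so actual training usage is higher).
BYTES_PER_PARAM = {"bf16": 2.0, "nf4": 0.5}  # nf4 stores about 4 bits per weight


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate GB (1e9 bytes) needed to hold the weights alone."""
    return params_billion * BYTES_PER_PARAM[dtype]


for name, size_b in [("7B", 7), ("13B", 13), ("34B", 34)]:
    bf16 = weight_memory_gb(size_b, "bf16")
    nf4 = weight_memory_gb(size_b, "nf4")
    print(f"{name}: ~{bf16} GB in bf16, ~{nf4} GB in nf4 (weights only)")
```

For a 34B model this gives about 68 GB of weights in bf16, which is in line with the roughly 80 GB quoted above once some working memory is added.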

Purchase Options

Individual Access/License – Single Repo

Join 92+ lifetime members

Provides repo access for one individual to all of the ADVANCED-vision content above.

Individual Access/License – Repo Bundle

Provides access for one individual to all ADVANCED-vision content above, plus the ADVANCED-inference, ADVANCED-fine-tuning and ADVANCED-transcription repositories.

For TEAM access, kindly post a comment below.

Video Tutorials

Tiny Text + Vision Models – Fine-tuning and API Setup

Fine-tuning Multi-modal LLava Vision and Language Models

33 thoughts on “ADVANCED Vision”

  1. I was hoping this was included in the Advanced Fine-tuning Scripts repo. Is there a discount for people that bought the lifetime access to the Advanced Fine-tuning Scripts repo?

  2. Hello! This is a great source of information. I am still a student and I would like to access this repository for acquiring knowledge. Is there any discount available for students like me?

      1. Does the student discount code still work? Where do I input the code? The checkout page doesn’t seem to have an option for that. thanks

  3. Hi, before buying I just want to ask if this is the right repo for the full process of fine-tuning LLaVA on a custom dataset? I saw many links in your tutorial and got confused!

  4. Hi, I intend to implement this on a GCP Vertex AI VM. Can I use the repo with the following configuration?

    4 x A100 (40GB) GPUs
    CUDA version 11.8

    1. Howdy! Yes, for the 7B LLaVA 1.6 model that looks like it will work. Here are the GPU specs required:

      ### GPU requirements for training

      LLaVA 1.5

      – liuhaotian/llava-v1.5-7b will run on a single A6000
      – liuhaotian/llava-v1.5-13b REQUIRES VRAM OF 80 GB+. Will run on 2x A6000

      LLaVA 1.6

      – liuhaotian/llava-v1.6-mistral-7b takes about 100 GB of VRAM in bf16. 3x A6000 or 2x A100 (80GB).
      – liuhaotian/llava-v1.6-vicuna-7b takes about 100 GB of VRAM in bf16. 3x A6000 or 2x A100 (80GB).
      – liuhaotian/llava-v1.6-34b takes about 300 GB of VRAM in bf16. 7x A6000 or 4x A100 (80GB).
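The comparison behind this reply can be sketched as simple arithmetic: pool the VRAM across the GPUs in the question and compare it with each quoted requirement. This is only a necessary condition, not a guarantee. It assumes the model can be sharded evenly across GPUs and ignores per-GPU overhead; the `fits` helper is hypothetical and not part of the repo.

```python
def fits(required_gb: float, gpu_count: int, gb_per_gpu: float) -> bool:
    """True if the pooled VRAM across all GPUs meets the quoted requirement.

    Assumption: even sharding with no per-GPU overhead, so this is an
    optimistic check rather than a guarantee.
    """
    return gpu_count * gb_per_gpu >= required_gb


# bf16 training figures quoted in the reply above.
requirements_gb = {
    "llava-v1.6-mistral-7b": 100,
    "llava-v1.6-vicuna-7b": 100,
    "llava-v1.6-34b": 300,
}

# Configuration from the question: 4 x A100 (40GB) = 160 GB pooled.
for model, need in requirements_gb.items():
    print(model, "fits" if fits(need, gpu_count=4, gb_per_gpu=40) else "does not fit")
```

With 160 GB pooled, the two 7B models clear the quoted 100 GB, while the 34B model at about 300 GB does not.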

    1. Two main costs:
      – Running the GPU for fine-tuning (free if you use Colab, but it can be more practical and faster to pay $0.7/hr for a RunPod or Vast.ai GPU)
      – Generating synthetic data (if you need it), for which you would need OpenAI credits, unless you want to make the data manually with free ChatGPT.

  5. Thanks for the breakdown. How long do I need to run RunPod to get decent results, for example with 200 training images?

  6. Hi Ronan,

    You’re doing great work. I’m interested in buying access but wanted to know if you have a roadmap for adding new models?

    E.g. the new version of idefics, moondream2, qwenVL, etc.

    The space is developing very quickly and I’d like to work with something that’s using SigLIP and a newer and smaller LLM.

    1. Howdy, yeah, working on both Moondream and IDEFICS right now.

      QwenVL I’m unlikely to cover as it’s worse than LLaVA 1.6 and badly supported by inference libraries. IDEFICS will soon be supported by TGI.

  7. Hello Ronan,

    I really like your great job at fine-tuning LLaVA. I was wondering… Would it be possible to purchase your repo, follow your whole work pipeline, and later deploy locally one of my fine-tuned LLaVA models with a 24GB NVIDIA GeForce RTX 3090 Ti? If feasible, do you think it would be slow at inference?

    Also, any advice regarding quantization?

    Thanks in advance!

    1. Yes, that is all possible. See my video posted today (22nd April) for a fast inference option with Moondream. Quantization isn’t recommended for production-grade use because it hurts quality and can also slow inference.

  8. Hello

    I need to train the model to recognize Windows UI, app UI, and web/browser UI.
    I want to build an AI agent that automatically uses my PC like a human.
    Is this suitable for my use case?

  9. Hi Ronan

    I want to purchase this now but just want to confirm that this is possible first.

    Can I train this vision model to detect a button in an image and respond with the coordinates? I have a large number of images and want to see what approach I could take with this.

    Kind regards

    1. Yes, that is the kind of problem that should be fine-tunable. Whether the smallest Moondream model is strong enough is a bit uncertain, but a larger model should be (although I recommend starting small).

  10. Hi Ronan,
    Before buying the repo, I want to mention my project.
    I want to use a VLM to extract fashion product attributes, for example collar type, sleeve type, etc. Will your solutions help me with that?
    I also want to run my VLM locally; I will have the resources for it.
    I would welcome your comments and advice. Thank you in advance.
    PS: this is great work, thank you.

  11. Hi Ronan
    My use case is a bit complicated: I am trying to train a model on SLDPRT/SLDASM (SolidWorks) files, or train a model on images and convert an image to a 3D object using SLDPRT/ASM files. Is it possible to do this using LLaVA or other multimodal models?


  12. Hi Ronan,

    You are a complete legend. I was wondering if you have any thoughts on if/how these vision models can be fine-tuned to accept multiple images as input (eg with a text question asking to compare/contrast)?


  13. Any thoughts on fine-tuning LLaVA on Apple Silicon, like an M2 with 64 GB RAM?
    Could the LLaVA 7B model be fine-tuned with 64 GB RAM?

    I want to classify Wikicommons images into about 30 category classes, with an NSFW label/probability and some description fields.

    I’m thinking about fine-tuning on Wikicommons images and adding better metadata fields to improve image search. It’s a lot of images, but the resolution would be low-ish, so bandwidth should be manageable.

    It’s a hobby project to make an open-source Wikicommons image dataset on HF.
