
Advanced Inference Scripts


  • Lifetime GitHub repo access
  • Access to future uploaded Inference Scripts
  • Ability to post Issues

Repo content: everything you need to set up an inference API:

  • Server Setup Guides: AWS EC2 instances, Runpod, or Vast.AI
  • API Setup Guides:
    • Platforms: Runpod, Vast.AI, or your own laptop
    • Inference Libraries: vLLM, TGI, llama.cpp
    • Serverless deployment with Runpod
  • API-Calling Scripts (OpenAI, Runpod and VastAI):
    • Simple API calling scripts
    • Function-calling scripts (handle function-call generation, execution, and feeding responses back to the model)
    • Speed tests (simple and concurrent requests)
    • Data extraction scripts (JSON and YAML) [purchase individually here]
    • ALL-SORT (LLM assisted retrieval) scripts [purchase individually here]
  • Sensitive Data Anonymization Scripts [purchase individually here]:
    • Named Entity Extraction with Presidio
    • Sensitive Info Extraction with Phi-3 Mini and Outlines
  • Monte Carlo Simulation for Improved LLM Accuracy [purchase individually here]
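To illustrate what the simplest of the API-calling scripts above does, here is a minimal sketch of an OpenAI-compatible chat request. This is not the repo's actual code: the endpoint URL and model name are placeholders, and the same pattern works against vLLM, TGI, or llama.cpp servers that expose the OpenAI-compatible route.

```python
import json
import urllib.request

# Placeholder endpoint: any OpenAI-compatible server (vLLM, TGI, llama.cpp).
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt, model="my-model", max_tokens=256):
    """Build the JSON payload for an OpenAI-compatible chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.1,
    }

def call_api(prompt):
    """POST the request and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(call_api("What is the capital of France?"))
```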

Purchase Options

Individual Access/License – ADVANCED-Inference Repo

Join 185+ lifetime members!

Provides lifetime (including any future additions) repo access for one individual to all of the ADVANCED-inference content above.

Individual Access/License – Repo Bundle

Provides lifetime (including any future additions) individual access to all ADVANCED-inference content above, PLUS the ADVANCED-fine-tuning, ADVANCED-transcription and ADVANCED-vision repositories.

Team Access/License – ADVANCED-Inference Repo

Provides lifetime (including any future additions) team access (up to 5 members) to all of the ADVANCED-inference content above.

Video Tutorials

Improving LLM Accuracy with Monte Carlo Tree Search

Anonymizing Sensitive Data in LLM Prompts

Deploying Serverless Endpoints

Improved Retrieval Augmented Generation with ALL-SORT

Data Extraction with LLMs

Serve a Custom LLM for Over 100 Customers

Deploy Llama 2 on an EC2 server

Run Llama 2 with 32k context length

Deploy a Llama 2 70B API in 5-clicks with AWQ

25 thoughts on “Advanced Inference Scripts”

    1. Yes, if you are running an EC2 server on AWS that has a GPU, this repo provides install instructions.

      Separately, the repo explains how you can use Runpod (which is cheaper per hour, unless you have lots of free AWS credits).

  1. Hello Ronan,

    Hope you are doing well.
    Quick question. If I were to purchase all of your repos, would you be willing to offer a bundle/better price? I believe your work will expedite our research.


  2. Hi Ronan,
    Apologies, I seem to have purchased the QLora_Fine_Tuning unsupervised notebook when I thought it was access to the GitHub repo. I suppose I have now seen the content of the notebook, but it definitely isn’t what I was looking for. Any chance you could redirect the order to the GitHub repo?

  3. Hi Ronan

    I have been watching your YouTube videos. They are very good and informative.

    I was wondering if it is possible to split the Mixtral LLM over multiple GPUs. I have some old 6 GB Ethereum mining rigs that I wanted to use for AI projects. They can take up to 6 GPUs, so I can run 36 GB of VRAM split over 6 cards.

    Many Thanks


    1. Cheers Andrew,

      I think you can run Mixtral locally across multiple GPUs using Text Generation Inference. A few steps:
      – Install TGI (take a look at my YouTube video on Enterprise Server Setup)
      – Run TGI (you can borrow the params I use in the Mixtral Runpod template)
      – Use quantization (--quantize bitsandbytes-nf4) to fit the model on your 6×6 GB setup.

      Now, this setup will probably work, but it may not be optimal. Optimal would be one expert per GPU, but Mixtral has 8 experts and you have 6 GPUs, so that won’t be possible. Either way, I think this should work overall.

    1. Yes, no problem. You get a receipt when you buy something, and if you need further info on the receipt just respond to that email. (I see you did respond and I’ll send over an updated receipt shortly).

    1. Yes, the inference repo does handle JSON outputs for function calling and uses that information to make function calls and feed the results back to the model.

      Is that what you are asking?

      If you look at the latest function calling video you’ll see me using the inference repo and how JSON outputs are handled.

      No, I don’t use the guidance library. Guidance is a good idea in principle, but it just isn’t well supported for integration with a lot of open-source models. I just use the Python json library.
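      To illustrate the pattern described above (a sketch, not the repo's actual code): parse the model's JSON output with the standard json library, execute the matching function, and append the result so it can be fed back to the model. The tool name, registry, and message format here are illustrative assumptions.

```python
import json

# Illustrative tool registry -- the repo's actual functions will differ.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub implementation

TOOLS = {"get_weather": get_weather}

def handle_function_call(model_output: str, messages: list) -> list:
    """Parse a JSON function call emitted by the model, execute it,
    and append the result so it can be fed back to the model."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        call = None
    if not isinstance(call, dict) or "name" not in call:
        # Not a function call; treat it as a normal assistant reply.
        messages.append({"role": "assistant", "content": model_output})
        return messages
    result = TOOLS[call["name"]](**call.get("arguments", {}))
    messages.append({"role": "assistant", "content": model_output})
    messages.append({"role": "function", "name": call["name"], "content": result})
    return messages

msgs = handle_function_call(
    '{"name": "get_weather", "arguments": {"city": "Dublin"}}', []
)
```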

    1. Howdy, we were in touch by email but I’m answering here for the benefit of others.

      Access is automatically granted to your GitHub username. If you don’t get an email, go to GitHub and click on the logo at the top right to see your activity.

      Any trouble, just respond to the receipt from your purchase and I’ll help.

  4. Hello Trelis, thanks for this hard work.

    My question is: if I pay $99, do I get access to all your repos or just to these Advanced Inference Scripts? I would like to have access to this repo too: Advanced Fine-tuning Scripts.
    One more question: I would like to manage data locally on my PC and send it to the Docker image created on Runpod. Can I do that with your API scripts? Is there a video of yours showing the API call in detail?

    Thanks a lot for your answer.

    1. Each repo is purchased separately. This page describes and provides a link for ADVANCED Inference Scripts. The ADVANCED Fine-tuning repo is separate.

      If you are inferencing a runpod API, then yes you can make calls from your local PC (including sending your data). The video “Serve a Custom LLM for Over 100 Customers” goes through this.
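      As a rough sketch of such a call from a local PC (not the repo's script): read your local data, wrap it in a request, and POST it to the pod's proxy URL. The pod ID, model name, and prompt below are placeholders.

```python
import json
import urllib.request
from pathlib import Path

# Placeholder pod ID -- replace with your own pod's ID from the Runpod console.
POD_ID = "abc123"
API_URL = f"https://{POD_ID}-8000.proxy.runpod.net/v1/completions"

def build_payload(local_text: str, model: str = "my-model") -> dict:
    """Wrap locally managed data into a completion request."""
    return {
        "model": model,
        "prompt": f"Summarise the following document:\n\n{local_text}",
        "max_tokens": 200,
    }

def send(path: str) -> str:
    """Read a local file and send its contents to the pod's API."""
    data = json.dumps(build_payload(Path(path).read_text())).encode()
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Example (requires a running pod):
# print(send("my_local_notes.txt"))
```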

      1. Dear Ronan,
        thanks a trillion for this reply of yours and, again, for all this effort.

        Am I correct in buying the Advanced Inference Scripts repo to use API calls?
        One quick thing: I have watched your video three times. Would you mind doing a short one about Runpod? To use the API with Mixtral, am I correct in setting up a pod using the template called “Mixtral Instruct API by Trelis”?

        Again and once more, thanks a million for any help and assistance you may want to give me about this topic. Priceless.

  5. Hello Trelis,
    I have fine-tuned the mistral-instruct-7b-gptq model and now want to speed up its inference time; I’m using Runpod.
    So, could you clarify which repository is preferable for my use case?

    1. Howdy! OK, so if you have a GPTQ model, then your best option is probably to run that model with vLLM and/or Runpod. You can go to the free repo, find a template for Mixtral, and then edit the template with your model name AND change awq to gptq (and, for vLLM, I think you may need to get rid of the --dtype half flag).

      GPTQ is quite well supported, but the quality is a lot worse than AWQ at the same speed. Ideally you would have fine-tuned in bf16 and then quantized with AWQ (see the YouTube video).
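      For anyone wondering what the bf16-to-AWQ step looks like, here is a sketch using the AutoAWQ library (pip install autoawq). This is not the repo's script: the model paths are placeholders and the quant config values are just common defaults. The imports sit inside main() because quantizing requires a GPU and the model weights.

```python
# Typical 4-bit AWQ settings; adjust for your model if needed.
QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

def main():
    # Heavy imports kept here: needs autoawq + transformers and a GPU.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "your-username/mistral-7b-instruct-finetuned"  # placeholder
    out_path = "mistral-7b-instruct-finetuned-awq"

    # Load the bf16 fine-tuned model, quantize it to AWQ, and save.
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(out_path)
    tokenizer.save_pretrained(out_path)

# main()  # uncomment to run (needs a GPU and the model weights)
```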

  6. Thanks for the reply, Ronan.

    I want a script to fine-tune mistral-instruct-7b on my custom data in bf16, then convert it to AWQ. How can I do that?

    Thank you

  7. Hi Ronan,
    I have fine-tuned a model (mistral-instruct-7b) in bf16 and then quantized it to AWQ. Now I want to set up a template for this model using vLLM on Runpod, so I can use the model as an API.

    How can we set this up? Any reference for the same?

  8. I’m trying to download YouTube videos for transcription. I watched your video on Whisper and have used it with no problem. But now there is the Distil-Whisper model. Do you have a solution for Distil-Whisper, and what would it cost? I’m really just trying to use an end solution, not to understand how it works. I’m not that smart.


Leave a Reply

Your email address will not be published. Required fields are marked *