How can you lower the cost of your LLM?

This question is aimed at those whose main product involves chatting with an LLM. What strategies do you use to lower LLM costs? Additionally, are you enhancing user input in any way? Thanks in advance!

Edit 1: The community here has shared some insights. Here’s a summary of the first 23 responses:

  • Using a smaller model, like ChatGPT 4o mini, is a common suggestion, though it may not fit all use cases.
  • Caching is advised, but it has limitations. Some users have mentioned that Anthropic’s caching is effective.
  • There’s a recommendation for route LLM, Llama by fireworks.ai, and arliai.
  • RAG (retrieval-augmented generation) has been noted as useful.
  • Batch API calls to ChatGPT are also recommended.
  • Fine-tuning has been mentioned as another option.

assuming that openai apis are being used…

Make advantage of gpt-40-mini. It suffices for the majority of tasks.

Make use of the openai batch api calls if your use case permits. provides an additional 50% off.

cache outputs and prompts. then you can display the outcomes of an earlier cached request.
Make an effort to improve your prompts. eliminate any superfluous spaces, etc.
I incorporate all of these strategies with my product at surveyloom.com.
if you need more assistance, I’d be pleased to examine your setup.

I just released a Telegram bot that uses GPT-4o mini to learn languages. Also took some time to determine the RAG length’s centre ground. It costs about $3 for 10,000 message generation. It’s not too horrible.

I needed some time to get a small model to provide meaningful signals. The idea was to limit the prompt to only the most important details and to make it brief.
Therefore, if I were you, I would assess whether using a less expensive model would still yield satisfactory results.

Llama and other models from Fireworks.ai are capable of achieving many of the tasks that Open AI can. Additionally, it is incredibly quick because to their clever caching techniques.

The answer to this question greatly depends on your work.

Simple solutions like “use the cheapest model available” may suffice, but they may not adequately handle the work at hand or other issues.

Depending on what you’re doing. Try using lemmatisation and eliminating stop words when summarising lengthy texts.
The, at, of, for, and other words are stop words.
Lemmatisation is the process of reducing words to their most basic form, as in Running to Run.

Even if the original text is altered, contemporary LLMs are still able to comprehend the context.
For this, there are Python packages like Spacy that are quite simple to use.

Oof, that depends on how accurate you want to be. Mini committed too many errors for our domain. Having said that, RAG can be quite helpful, particularly if it is constructed to accurately identify relevance for contextual retrieval.