
MyMagic AI May 22, 2024

Batch inference with MyMagic AI and LlamaIndex

This is a guest post from MyMagic AI.

MyMagic AI lets you process and analyze large datasets with AI. It offers a powerful API for batch inference (also known as offline or delayed inference) that brings various open-source Large Language Models (LLMs), such as Llama 70B, Mistral 7B, Mixtral 8x7B, and CodeLlama 70B, as well as advanced embedding models, to its users. Our framework is designed to perform data extraction, summarization, categorization, sentiment analysis, training data generation, and embedding, to name a few. And now it's integrated directly into LlamaIndex!

Part 1: Batch Inference

How It Works:

1. Setup:

  1. Organize Your Data in an AWS S3 or GCS Bucket:
    1. Create a folder named after the user ID assigned to you upon registration.
    2. Inside that folder, create another folder (called a "session") to store all the files you need for your tasks.
  2. Purpose of the 'Session' Folder:
    1. This "Session" folder keeps your files separate from others, making sure that your tasks run on the right set of files. You can name your session subfolder anything you like.
  3. Granting Access to MyMagic AI:
    1. To allow MyMagic AI to securely access your files in the cloud, follow the setup instructions provided in the MyMagic AI documentation.
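
The folder layout from step 1 can be sketched as object keys. This is only an illustration and not part of MyMagic AI's tooling: `build_object_key` is a hypothetical helper, and the user ID, session name, and file names are placeholders.

```python
from pathlib import PurePosixPath


def build_object_key(user_id: str, session: str, filename: str) -> str:
    """Return the bucket key <user_id>/<session>/<filename> (hypothetical helper)."""
    for part in (user_id, session, filename):
        # Each component must be a single, non-empty path segment.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath(user_id) / session / filename)


# Placeholder values; substitute your own user ID and session name.
keys = [
    build_object_key("user_123", "my-session", name)
    for name in ("review1.txt", "review2.txt")
]
print(keys)
```

Keeping every file for a task under the same session prefix is what lets a single request address the whole set.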

2. Install: Install both MyMagic AI’s API integration and the LlamaIndex library:

pip install llama-index
pip install llama-index-llms-mymagic

3. API Request: The LlamaIndex integration is a wrapper around MyMagic AI’s API. What it does under the hood is simple: it sends a POST request to the MyMagic AI API, specifying the model, storage provider, bucket name, session name, and other necessary details.

import asyncio
from llama_index.llms.mymagic import MyMagicAI

llm = MyMagicAI(
    api_key="user_...", # provided by MyMagic AI upon sign-up
    bucket_name="batch-bucket", # any name you like
    role_arn="arn:aws:iam::<your account id>:role/mymagic-role",
    system_prompt="You are an AI assistant that helps to summarize the documents without essential loss of information", # default prompt at https://docs.mymagic.ai/api-reference/endpoint/create
    # The storage provider and session name are also set here; see the MyMagic AI documentation for the exact arguments.
)

We have designed the integration so that the user sets up the bucket and data, together with the system prompt, when instantiating the llm object. Other inputs, such as question (your prompt), model, and max_tokens, are supplied per request when calling complete or acomplete.

# Synchronous request
resp = llm.complete(
    question="Summarize this in one sentence.",
    max_tokens=20,  # default is 10
)

# Asynchronous request
async def main():
    aresp = await llm.acomplete(
        question="Summarize this in one sentence.",
    )
    print(aresp)

asyncio.run(main())

Supplying these inputs per request lets developers experiment with different prompts and models in their workflow, and capping max_tokens limits the output length and therefore spend. MyMagic AI’s backend supports both synchronous requests (complete) and asynchronous requests (acomplete). We recommend using the async endpoints wherever possible, because batch jobs are inherently asynchronous and can take a long time to process, depending on the size of your data.
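
Because batch jobs can run for a long time, awaiting several of them concurrently keeps the caller from blocking on each one in turn. The sketch below illustrates that pattern with a stand-in coroutine: `fake_batch_job` and its delays are hypothetical and only simulate a long-running request, so in real code it would be replaced by a call such as `llm.acomplete(...)`.

```python
import asyncio


async def fake_batch_job(name: str, seconds: float) -> str:
    """Stand-in for a long-running batch request (hypothetical)."""
    await asyncio.sleep(seconds)  # simulates waiting for the job to finish
    return f"{name}: done"


async def run_jobs() -> list:
    # gather() awaits all jobs concurrently and returns results in argument order.
    return await asyncio.gather(
        fake_batch_job("summaries", 0.2),
        fake_batch_job("sentiment", 0.1),
    )


results = asyncio.run(run_jobs())
print(results)
```

The results come back in the order the jobs were passed to `gather`, regardless of which one finishes first.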

Currently, we do not support the chat or achat methods, as our API is not designed for a real-time interactive experience. However, we plan to add them in the future, and they will work in a “batch way”: the user queries will be aggregated and appended as one prompt (to give the chat context) and sent to all files at once.

Use Cases

While there are myriad use cases, here are a few to help motivate our users. Feel free to embed our API in any workflow that is a good fit for batch processing.

1. Extraction

Imagine needing to extract specific information from millions of files stored in a bucket. Information from all files will be extracted with one API call instead of a million sequential ones.

2. Classification

Businesses often need to classify customer reviews as positive, neutral, or negative. With one request, you can start processing the reviews over the weekend and have the results ready by Monday morning.

3. Embedding

Embedding text files for further machine learning applications is another powerful use case of MyMagic AI's API. You will be ready for your vector DB in a matter of days, not weeks.

4. Training (Fine-tuning) Data Generation

Imagine generating thousands of synthetic data samples for your fine-tuning tasks. With MyMagic AI’s API, you can reduce the generation time by a factor of 5-10 compared to GPT-3.5.

5. Transcription

MyMagic AI’s API supports different types of files, so it is also easy to batch transcribe many mp3 or mp4 files in your bucket.

Part 2: Integration with LlamaIndex’s RAG Pipeline

The output from batch inference processes, often voluminous, can seamlessly integrate into LlamaIndex's RAG pipeline for effective data storage and retrieval.

This section demonstrates how to use the Llama3 model from the Ollama library, coupled with BGE embeddings, to manage information storage and execute queries. Please ensure the following prerequisites are installed and the Llama3 model is pulled:

pip install llama-index-embeddings-huggingface
pip install llama-index-llms-ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3

For this demo, we ran a batch summarization job on 5 Amazon reviews (in a real scenario this could be millions) and saved the results as reviews_1_5.json:

{
  "id_review1": {
    "query": "Summarize the document!",
    "output": "The document describes a family with a young boy who believes there is a zombie in his closet, while his parents are constantly fighting. The movie is criticized for its inconsistent genre, described as a slow-paced drama with occasional thriller elements. The review praises the well-playing parents and the decent dialogs but criticizes the lack of a boogeyman-like horror element. The overall rating is 3 out of 10."
  },
  "id_review2": {
    "query": "Summarize the document!",
    "output": "The document is a positive review of a light-hearted Woody Allen comedy. The reviewer praises the witty dialogue, likable characters, and Woody Allen's control over his signature style. The film is noted for making the reviewer laugh more than any recent Woody Allen comedy and praises Scarlett Johansson's performance. It concludes by calling the film a great comedy to watch with friends."
  },
  "id_review3": {
    "query": "Summarize the document!",
    "output": "The document describes a well-made film about one of the great masters of comedy, filmed in an old-time BBC fashion that adds realism. The actors, including Michael Sheen, are well-chosen and convincing. The production is masterful, showcasing realistic details like the fantasy of the guard and the meticulously crafted sets of Orton and Halliwell's flat. Overall, it is a terrific and well-written piece."
  },
  "id_review4": {
    "query": "Summarize the document!",
    "output": "Petter Mattei's 'Love in the Time of Money' is a visually appealing film set in New York, exploring human relations in the context of money, power, and success. The characters, played by a talented cast including Steve Buscemi and Rosario Dawson, are connected in various ways but often unaware of their shared links. The film showcases the different stages of loneliness experienced by individuals in a big city. Mattei successfully portrays the world of these characters, creating a luxurious and sophisticated look. The film is a modern adaptation of Arthur Schnitzler's play on the same theme. Mattei's work is appreciated, and viewers look forward to his future projects."
  },
  "id_review5": {
    "query": "Summarize the document!",
    "output": "The document describes the TV show 'Oz', set in the Oswald Maximum Security State Penitentiary. Known for its brutality, violence, and lack of privacy, it features an experimental section of the prison called Em City, where all the cells have glass fronts and face inwards. The show goes where others wouldn't dare, featuring graphic violence, injustice, and the harsh realities of prison life. The viewer may become comfortable with uncomfortable viewing if they can embrace their darker side."
  },
  "token_count": 3391
}

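Before indexing, it can help to inspect the results file directly. The sketch below assumes the structure shown above (one entry per document ID with `query` and `output` fields, plus a top-level `token_count`). The abbreviated inline sample and the `extract_outputs` helper are illustrative and not part of either library.

```python
import json

# Abbreviated inline sample in the same shape as reviews_1_5.json.
raw = """
{
  "id_review1": {"query": "Summarize the document!", "output": "Rated 3 out of 10."},
  "id_review2": {"query": "Summarize the document!", "output": "A great comedy."},
  "token_count": 3391
}
"""


def extract_outputs(results: dict) -> dict:
    """Map each document ID to its output, skipping non-document keys."""
    return {
        doc_id: entry["output"]
        for doc_id, entry in results.items()
        if isinstance(entry, dict) and "output" in entry
    }


results = json.loads(raw)
outputs = extract_outputs(results)
for doc_id, text in outputs.items():
    print(doc_id, "->", text)
print("token_count:", results["token_count"])
```

Filtering on the entry type keeps bookkeeping fields such as `token_count` out of the per-document view.
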
Now let’s embed and store this document and ask questions using LlamaIndex’s query engine. Bring in our dependencies:

import os

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.indices.vector_store import VectorStoreIndex
from llama_index.core.settings import Settings
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.llms.ollama import Ollama

Configure the embedding model and Llama3 model:

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
llm = Ollama(model="llama3", request_timeout=300.0)

Update settings for the indexing pipeline:

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512 # This parameter defines the size of text chunks for embedding

documents = SimpleDirectoryReader(input_files=["reviews_1_5.json"]).load_data()  # input_files takes file paths; modify the path for your case

Now create our index, our query engine and run a query:

index = VectorStoreIndex.from_documents(documents, show_progress=True)

query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query("What is the least favourite movie?")
print(response)


Based on query results, the least favourite movie is: review 1 with a rating of 3 out of 10.

Now we know that review 1 describes the least favorite movie among these reviews.
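
Conceptually, `similarity_top_k=3` asks the retriever for the three chunks whose embeddings are closest to the query embedding. The toy sketch below shows that idea with cosine similarity over hand-made 3-dimensional vectors. It is not LlamaIndex's implementation, and the vectors and the `top_k` helper are invented for illustration.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query: list, chunks: dict, k: int) -> list:
    """Return the IDs of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]), reverse=True)
    return ranked[:k]


# Made-up embeddings; real ones come from the embedding model.
chunks = {
    "review1": [0.9, 0.1, 0.0],
    "review2": [0.1, 0.9, 0.0],
    "review3": [0.0, 0.2, 0.9],
    "review4": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
best = top_k(query, chunks, k=3)
print(best)
```

Only the retrieved chunks are passed to the LLM as context, so a larger k gives the model more material at the cost of a longer prompt.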

Next Steps

This shows how batch inference combined with real-time inference can be a powerful tool for analyzing, storing and retrieving information from massive amounts of data. Get started with MyMagic AI’s API today!