Using mistral-small-3.1-24b-instruct for a vision task

What is the name of the model you’re running?

mistral-small-3.1-24b-instruct

What is the error number?

3030

What is the error message?

030: 3 validation errors for ValidatorIterator 0.typed-dict Input should be a valid dictionary [type=dict_type, input_value='type', input_type=str]

What is the issue or error you’re encountering?

I can’t figure out exactly what the API expects when doing visual tasks with mistral-small-3.1-24b-instruct.

What steps have you taken to resolve the issue?

I have successfully used @cf/meta/llama-3.2-11b-vision-instruct for a vision task, and I’m now trying to use @cf/mistralai/mistral-small-3.1-24b-instruct, since the Cloudflare page describes it as a state-of-the-art model for language and vision: mistral-small-3.1-24b-instruct · Cloudflare Workers AI docs

I’ve also used the mistralai model purely for language tasks.

The main roadblock is that I cannot figure out the API. The only documentation included says that the message content should look like:

{
  type (string): Type of the content provided
  text (string):
  image_url: { url (string): image URI with data (e.g. data:image/jpeg;base64,/9j/...). HTTP URL will not be accepted }
}

It doesn’t really say what’s expected to be in type, though.
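My best guess, and it is only a guess, is that type is not the MIME type but a discriminator for the kind of content part ("text" vs "image_url"), and that content should be an array of such parts rather than a single object, similar to the OpenAI-style multimodal message format. The shape below is my assumption, not something the docs spell out:

      // Assumption: content is a list of parts, and "type" selects the kind of each part.
      const content = [
        { type: "text", text: "Here is the image" },
        { type: "image_url", image_url: { url: "data:image/jpeg;base64,/9j/..." } },
      ];

If that reading is right, the error would also make sense: a ValidatorIterator complaining that 'type' (a string) should be a dictionary looks like the validator iterating over my single content object and getting its three string keys back, which would explain the 3 validation errors.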

Here is my whole code, which calls Llama and then Mistral for comparison. The Llama code does work.

      // GET the image from its URL
      const response = await fetch(URL)

    
      // XXXXX (mtourne): this only needs to be done once, to accept the license.
      //  Note: https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct/
      //  To use Llama 3.2 11b Vision Instruct, you need to agree to the Meta License and Acceptable Use Policy. To do so, please send an initial request to @cf/meta/llama-3.2-11b-vision-instruct with "prompt": "agree". After that, you'll be able to use the model as normal.
      // 
      // await c.env.AI.run("@cf/meta/llama-3.2-11b-vision-instruct", {
      //   prompt: "agree"
      // });

      // Read the response body as raw bytes.
      const blob = await response.arrayBuffer();
      const blob_u8 = new Uint8Array(blob);

      // Encode raw bytes as a base64 string (used below to build the data: URI for Mistral).
      function encodeBase64Bytes(bytes: Uint8Array): string {
        return btoa(
          bytes.reduce((acc, current) => acc + String.fromCharCode(current), "")
        );
      }

      // Llama expects the image as a plain array of numbers.
      const image = [...blob_u8];
      

      const results = await c.env.AI.run("@cf/meta/llama-3.2-11b-vision-instruct", {
        messages: [
          {
            role: "system",
            content: "You are an expert at labeling images, tell me what you see in the following image."
          },
        ],
        image: image,
      });

      await ctx.reply("Response Llama vision:")
      await ctx.reply(results['response'])


      const content_type = 'image/jpeg';
      const image_b64 = encodeBase64Bytes(blob_u8);

      // Build a data: URI, since the docs say an HTTP URL will not be accepted.
      const uri_encoded_image = `data:${content_type};base64,${image_b64}`;

      
      // XX (mtourne): we're trying to send the image to Mistral, which should also be
      // capable of vision reasoning, but I can't figure out the API format.
      const results2 = await c.env.AI.run("@cf/mistralai/mistral-small-3.1-24b-instruct", {
        messages: [
          {
            role: "system",
            content: "You are an expert at labeling images, tell me what you see in the following image."
          },
          {
            role: "user",
            content: {
              type: content_type,
              text: "Here is the image",
              image_url: { url: uri_encoded_image },
            },
          },
        ],
      });

      await ctx.reply("Response Mistral 24b:")
      return ctx.reply(results2['response']);
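
For reference, this is the untested variant of the Mistral call I’d try next in place of results2, assuming the content-as-array reading above is correct:

      // Untested guess: same call as results2, but with content as an array of typed parts.
      const results3 = await c.env.AI.run("@cf/mistralai/mistral-small-3.1-24b-instruct", {
        messages: [
          {
            role: "system",
            content: "You are an expert at labeling images, tell me what you see in the following image."
          },
          {
            role: "user",
            content: [
              { type: "text", text: "Here is the image" },
              { type: "image_url", image_url: { url: uri_encoded_image } },
            ],
          },
        ],
      });

If someone knows whether that is the expected shape, or what type is actually supposed to contain, I’d appreciate a pointer.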
