Introducing WebMarker

Reid Barber

Mark web pages for use with multimodal large language models



Overview

WebMarker is a JavaScript library that adds visual markers and labels to elements on a web page. The marked page can then be used for Set-of-Mark prompting, which improves the visual grounding of vision-enabled large language models such as GPT-4o, Claude 3.5, and Google Gemini 1.5. The library aims to:

  • Improve LLM performance on vision tasks referencing web pages
  • Enable reliable web page interactions based on LLM responses

How it works

1. Call the mark() function

This marks the interactive elements on the page and returns an object containing the marked elements.

The mark styles and label values are fully customizable.

In the returned object, each key is a mark label string, and each value is an object with the following properties:

  • element: The interactive element that was marked.
  • markElement: The label added for that element.
  • boundingBoxElement: The bounding box added over the element.

A data-mark-label attribute containing the label is also added to each marked element.

You can use this information to build your prompt for the large language model.
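
As a minimal sketch of how the returned object could feed a prompt, the helper below turns each entry into one line of text. The name describeMarkedElements is hypothetical and not part of WebMarker; it assumes only the shape described above (label keys, an element property on each value) and the standard DOM properties tagName and textContent.

```javascript
// Hypothetical helper (not part of WebMarker): summarize the object
// returned by mark() as one line of text per marked element.
function describeMarkedElements(markedElements) {
  return Object.entries(markedElements)
    .map(([label, { element }]) => {
      const tag = element.tagName.toLowerCase();
      // Collapse whitespace and truncate so long elements don't bloat the prompt.
      const text = (element.textContent || '')
        .replace(/\s+/g, ' ')
        .trim()
        .slice(0, 40);
      return text ? `- ${label}: <${tag}> "${text}"` : `- ${label}: <${tag}>`;
    })
    .join('\n');
}
```

For example, describeMarkedElements(await mark()) could stand in for a plain list of labels when the model benefits from knowing what each label points at.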

2. Send a screenshot of the marked page to a large language model, along with your prompt

Example prompt:

let markedElements = await mark()

let prompt = `The following is a screenshot of a web page.

Interactive elements have been marked with red bounding boxes and labels.

When referring to elements, use the labels to identify them.

Return an action and element to perform the action on.

Available actions: click, hover

Available elements:
${Object.keys(markedElements)
  .map((label) => `- ${label}`)
  .join('\n')}

Example response: click 0
`
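
The model's reply then has to be turned back into something actionable. The following is a sketch, assuming the reply follows the "click 0" format requested in the prompt above; parseAction and AVAILABLE_ACTIONS are hypothetical names, not part of WebMarker.

```javascript
// Hypothetical helper: parse a model reply such as "click 0" into an
// action and a label, rejecting anything outside the allowed sets.
const AVAILABLE_ACTIONS = ['click', 'hover'];

function parseAction(response, markedElements) {
  const match = response.trim().match(/^(\w+)\s+(\S+)$/);
  if (!match) {
    throw new Error(`Unrecognized response: ${response}`);
  }
  const action = match[1].toLowerCase();
  const label = match[2];
  if (!AVAILABLE_ACTIONS.includes(action)) {
    throw new Error(`Unsupported action: ${action}`);
  }
  // Only accept labels that mark() actually returned.
  if (!Object.prototype.hasOwnProperty.call(markedElements, label)) {
    throw new Error(`Unknown label: ${label}`);
  }
  return { action, label };
}
```

Validating against the keys of markedElements guards against the model referring to a label that was never placed on the page.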

3. Programmatically interact with the marked elements

In a browser automation environment (e.g., Playwright), you can interact with elements as needed and generate accurate selectors from the object returned by mark().
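
One way to do this, sketched here under the assumption that the page is still marked, is to select elements by the data-mark-label attribute described in step 1. The helper names selectorForLabel and performAction are hypothetical; page.locator(), locator.click(), and locator.hover() are standard Playwright calls.

```javascript
// Hypothetical helpers, assuming a Playwright `page` whose document has
// already been marked, so each element carries a data-mark-label attribute.
function selectorForLabel(label) {
  // JSON.stringify wraps the value in double quotes; this is adequate
  // for simple labels such as "0" or "a1".
  return `[data-mark-label=${JSON.stringify(String(label))}]`;
}

async function performAction(page, { action, label }) {
  const locator = page.locator(selectorForLabel(label));
  if (action === 'click') {
    await locator.click();
  } else if (action === 'hover') {
    await locator.hover();
  } else {
    throw new Error(`Unsupported action: ${action}`);
  }
}
```

For example, performAction(page, { action: 'click', label: '0' }) would click the element labeled 0.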

For prompting or web agent ideas, see the WebVoyager paper.

Playwright example

// Inject the WebMarker library into the page
await page.addScriptTag({
  url: 'https://cdn.jsdelivr.net/npm/webmarker-js/dist/main.js',
})

// Mark the page and get the marked elements
let markedElements = await page.evaluate(async () => await WebMarker.mark())

// (Optional) Check if page is marked
let isMarked = await page.evaluate(async () => await WebMarker.isMarked())

// (Optional) Unmark the page
await page.evaluate(async () => await WebMarker.unmark())
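
To connect this to step 2, here is a sketch of capturing the marked page as an image. captureMarkedPage is a hypothetical name; page.evaluate() and page.screenshot() are standard Playwright calls, and WebMarker.mark() is used as in the example above. It assumes the library has already been injected, and it returns only the label strings from the page so that the result is plain, serializable data.

```javascript
// Hypothetical helper: mark the page, then capture a screenshot that
// can be sent to a vision-enabled model along with the prompt.
async function captureMarkedPage(page) {
  // Return just the labels (strings) from the browser context.
  const labels = await page.evaluate(async () =>
    Object.keys(await WebMarker.mark())
  );
  // page.screenshot() resolves to a Buffer containing the image data.
  const screenshot = await page.screenshot({ type: 'png' });
  return { labels, screenshot };
}
```

Most model APIs accept images as base64 strings, which screenshot.toString('base64') provides.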

Development status

WebMarker is currently in alpha and is converging on a stable API and feature set. Feel free to open feature requests or report issues in the GitHub repository.