Computer Use (preview) in Azure OpenAI

2025-07-21

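Use this article to learn how to work with Computer Use in Azure OpenAI. Computer Use is a specialized AI tool that uses a specialized model to perform tasks by interacting with computer systems and applications through their UIs. With Computer Use, you can create an agent that handles complex tasks and makes decisions by interpreting visual elements and acting on on-screen content.

Computer Use provides:

Autonomous navigation: For example, it opens applications, clicks buttons, fills out forms, and navigates multi-page workflows.
Dynamic adaptation: It interprets UI changes and adjusts its actions accordingly.
Cross-application task execution: It operates across web-based and desktop applications.
Natural language interface: Users describe a task in plain language, and the Computer Use model determines the correct UI interactions to execute.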

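Request access

Access to the computer-use-preview model requires registration, and access is granted based on Microsoft's eligibility criteria. Customers who already have access to other limited access models still need to request access for this model.

Request access: computer-use-preview limited access model application

Once access has been granted, you need to create a deployment for the model.

Regional support

Computer Use is available in the following regions: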

eastus2
swedencentral
southindia

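Sending an API call to the Computer Use model using the Responses API

The Computer Use tool is accessed through the Responses API. The tool operates in a continuous loop in which the model sends actions, such as typing text or performing a click. Your code executes these actions on a computer and sends screenshots of the outcome back to the model.

In this way, your code simulates the actions of a human using a computer interface, while the model uses the screenshots to understand the state of the environment and suggest the next actions.

The following examples show a basic API call.

Note

You need an Azure OpenAI resource with a computer-use-preview model deployment.

Python
REST API

To send requests, you need to install the following Python packages.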

pip install openai
pip install azure-identity

import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

#from openai import OpenAI
token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(  
  base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",  
  azure_ad_token_provider=token_provider,
  api_version="preview"
)

response = client.responses.create(
    model="computer-use-preview", # set this to your model deployment name
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser" # other possible values: "mac", "windows", "ubuntu"
    }],
    input=[
        {
            "role": "user",
            "content": "Check the latest AI news on bing.com."
        }
    ],
    truncation="auto"
)

print(response.output)

Output

[
    ResponseComputerToolCall(
        id='cu_67d841873c1081908bfc88b90a8555e0', 
        action=ActionScreenshot(type='screenshot'), 
        call_id='call_wwEnfFDqQr1Z4Edk62Fyo7Nh', 
        pending_safety_checks=[], 
        status='completed', 
        type='computer_call'
    )
]

curl "${MY_ENDPOINT}/openai/v1/responses?api-version=preview" \
  -H "Content-Type: application/json" \
  -H "api-key: $MY_API_KEY" \
  -d '{ 
    "model": "computer-use-preview", 
    "input": [ 
      { 
        "type": "message", 
        "role": "user", 
        "content": "Check the latest AI news on bing.com." 
      }
    ],
    "tools": [{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser" 
    }],
    "truncation":"auto"
  }'

Output

{
  "id": "resp_xxxxxxxxxxxxxxxxxxxxxxxx",
  "object": "response",
  "created_at": 1742227653,
  "status": "completed",
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "max_output_tokens": null,
  "model": "computer-use-preview",
  "output": [
    {
      "type": "computer_call",
      "id": "cu_xxxxxxxxxxxxxxxxxxxxxxxxxx",
      "call_id": "call_xxxxxxxxxxxxxxxxxxxxxxx",
      "action": {
        "type": "screenshot"
      },
      "pending_safety_checks": [],
      "status": "completed"
    }
  ],
  "parallel_tool_calls": true,
  "previous_response_id": null,
  "reasoning": {
    "effort": "medium",
    "generate_summary": null
  },
  "store": true,
  "temperature": 1.0,
  "text": {
    "format": {
      "type": "text"
    }
  },
  "tools": [
    {
      "type": "computer_use_preview",
      "display_height": 768,
      "display_width": 1024,
      "environment": "browser"
    }
  ],
  "top_p": 1.0,
  "truncation": "auto",
  "usage": {
    "input_tokens": 519,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 7,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 526
  },
  "user": null,
  "metadata": {}
}

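Once the initial API request is sent, you run a loop in which your application code performs the specified action and sends a screenshot with each turn, so the model can evaluate the updated state of the environment.

Python
REST API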


# response.output is the previous response from the model
computer_calls = [item for item in response.output if item.type == "computer_call"]
if not computer_calls:
    print("No computer call found. Output from model:")
    for item in response.output:
        print(item)
    # Nothing to act on, so stop before indexing into the empty list
    raise SystemExit(0)

computer_call = computer_calls[0]
last_call_id = computer_call.call_id
action = computer_call.action

# Your application would now perform the action suggested by the model
# And create a screenshot of the updated state of the environment before sending another response

response_2 = client.responses.create(
    model="computer-use-preview",
    previous_response_id=response.id,
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser" # other possible values: "mac", "windows", "ubuntu"
    }],
    input=[
        {
            "call_id": last_call_id,
            "type": "computer_call_output",
            "output": {
                "type": "input_image",
                # The screenshot must be base64-encoded; replace <base64_string> with your data
                "image_url": "data:image/png;base64,<base64_string>"
            }
        }
    ],
    truncation="auto"
)

curl "${MY_ENDPOINT}/openai/v1/responses?api-version=preview" \
  -H "Content-Type: application/json" \
  -H "api-key: $MY_API_KEY" \
  -d '{
    "model": "computer-use-preview",
    "previous_response_id": "<previous_response_id>",
    "tools": [{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser"
    }],
    "input": [
      {
        "call_id": "<call_id>",
        "type": "computer_call_output",
        "output": {
            "type": "input_image",
            "image_url": "data:image/png;base64,<base64_string>"
        }
      }
    ],
    "truncation": "auto"
  }'

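Understanding the Computer Use integration

When working with the Computer Use tool, you typically perform the following steps to integrate it into your application.

Send a request to the model that includes a call to the Computer Use tool, along with the display size and environment. You can also include a screenshot of the initial state of the environment in the first API request.
Receive a response from the model. If the response contains action items, those items hold suggested actions that make progress toward the specified goal. For example, an action might be screenshot, so the model can assess the current state from an updated screenshot, or click, with X/Y coordinates indicating where the mouse should be moved.
Execute the action using your application code on your computer or browser environment.
After executing the action, capture the updated state of the environment as a screenshot.
Send a new request with the updated state as a computer_call_output, and repeat this loop until the model stops requesting actions or you decide to stop.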
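The steps above can be sketched as a small driver loop. This is a minimal illustration rather than the service's own implementation: `run_computer_use_loop` and its three callables (`send_request`, `execute_action`, `capture_screenshot`) are hypothetical hooks your application would supply, and responses are assumed to be plain dictionaries shaped like the JSON output shown earlier.

```python
from typing import Callable, Optional


def run_computer_use_loop(
    send_request: Callable[[Optional[dict], Optional[str]], dict],
    execute_action: Callable[[dict], None],
    capture_screenshot: Callable[[], str],
    max_turns: int = 10,
) -> dict:
    """Drive the request/act/screenshot loop until the model stops asking for actions.

    send_request(previous_call, screenshot) returns a response dict with an "output" list.
    All three callables are hypothetical hooks provided by the application.
    """
    # Step 1: initial request (no previous call, no screenshot yet)
    response = send_request(None, None)
    for _ in range(max_turns):
        # Step 2: look for a suggested action in the response
        calls = [item for item in response.get("output", [])
                 if item.get("type") == "computer_call"]
        if not calls:
            break  # the model stopped requesting actions
        call = calls[0]
        execute_action(call["action"])             # Step 3: perform the action
        screenshot = capture_screenshot()          # Step 4: capture the new state
        response = send_request(call, screenshot)  # Step 5: report back and repeat
    return response
```

The `max_turns` cap is a design choice of this sketch: it returns control to the caller even if the model keeps requesting actions.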

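Handling conversation history

You can use the previous_response_id parameter to link the current request to the previous response. Using this parameter is recommended if you don't want to manage the conversation history yourself.

If you don't use this parameter, make sure to include all the items returned in the response output of the previous request in your input array. This includes reasoning items, if present.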
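As a rough sketch of managing the history yourself, the helper below builds the next input array from the previous one. The function name `build_next_input` is hypothetical, items are assumed to be plain dictionaries (for example, parsed from a REST response), and carrying the earlier input forward alongside the previous output is an assumption about how an application might track history.

```python
def build_next_input(previous_input, previous_output, call_id, screenshot_data_url):
    """Return the input array for the next request when previous_response_id is not used.

    previous_input:  items sent in the previous request (list of dicts)
    previous_output: every item from the previous response's output,
                     including reasoning items if present (list of dicts)
    """
    next_input = list(previous_input)   # copy so the caller's list is not mutated
    next_input.extend(previous_output)  # keep all output items, reasoning included
    next_input.append({
        "type": "computer_call_output",
        "call_id": call_id,
        "output": {"type": "input_image", "image_url": screenshot_data_url},
    })
    return next_input
```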

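Safety checks

The API has safety checks that help protect against prompt injection and model mistakes. These checks include:

Malicious instruction detection: The system evaluates the screenshot image and checks whether it contains adversarial content that might change the model's behavior.
Irrelevant domain detection: The system evaluates current_url (if provided) and checks whether the current domain is considered relevant given the conversation history.
Sensitive domain detection: The system checks current_url (if provided) and raises a warning when it detects that the user is on a sensitive domain.

If one or more of the above checks is triggered, a safety check is raised when the model returns the next computer_call, with the pending_safety_checks parameter.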

"output": [
    {
        "type": "reasoning",
        "id": "rs_67cb...",
        "summary": [
            {
                "type": "summary_text",
                "text": "Exploring 'File' menu option."
            }
        ]
    },
    {
        "type": "computer_call",
        "id": "cu_67cb...",
        "call_id": "call_nEJ...",
        "action": {
            "type": "click",
            "button": "left",
            "x": 135,
            "y": 193
        },
        "pending_safety_checks": [
            {
                "id": "cu_sc_67cb...",
                "code": "malicious_instructions",
                "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed."
            }
        ],
        "status": "completed"
    }
]

To proceed, pass the safety checks back as acknowledged_safety_checks in your next request, as shown in the following example.

"input":[
        {
            "type": "computer_call_output",
            "call_id": "<call_id>",
            "acknowledged_safety_checks": [
                {
                    "id": "<safety_check_id>",
                    "code": "malicious_instructions",
                    "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed."
                }
            ],
            "output": {
                "type": "computer_screenshot",
                "image_url": "<image_url>"
            }
        }
    ],
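The following sketch shows one way to assemble that input item from a previous computer_call. The helper name `build_call_output` is hypothetical, and the checks are assumed to be plain dictionaries with the id, code, and message fields shown above.

```python
def build_call_output(call_id, screenshot_url, pending_safety_checks=()):
    """Build a computer_call_output item, echoing back any checks the user acknowledged."""
    item = {
        "type": "computer_call_output",
        "call_id": call_id,
        "output": {"type": "computer_screenshot", "image_url": screenshot_url},
    }
    if pending_safety_checks:
        # Copy only the three fields shown in the example request above
        item["acknowledged_safety_checks"] = [
            {"id": check["id"], "code": check["code"], "message": check["message"]}
            for check in pending_safety_checks
        ]
    return item
```

Only call this with the pending checks after the end user has reviewed and accepted them.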

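Safety check handling

In all cases where pending_safety_checks are returned, hand the action over to the end user to confirm proper model behavior and accuracy.

malicious_instructions and irrelevant_domain: End users should review the model's actions and confirm that the model is behaving as intended.
sensitive_domain: Ensure that an end user is actively monitoring the model's actions on these sites. The exact implementation of this "watch mode" can vary by application, but one potential example is collecting user impression data on the site to confirm that there is active end-user engagement with the application.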
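One way to encode this guidance is a small lookup that tells the application which kind of end-user involvement each pending check needs. This is an illustrative sketch: the labels "review" and "watch", the function name `required_handling`, and the conservative default for unknown codes are all assumptions, not part of the API.

```python
# Codes taken from the guidance above; the grouping labels are hypothetical.
REVIEW_CODES = {"malicious_instructions", "irrelevant_domain"}
WATCH_CODES = {"sensitive_domain"}


def required_handling(pending_safety_checks):
    """Return the set of end-user involvement types needed for the pending checks."""
    handling = set()
    for check in pending_safety_checks:
        code = check["code"]
        if code in WATCH_CODES:
            handling.add("watch")   # user must actively monitor the session
        elif code in REVIEW_CODES:
            handling.add("review")  # user must review and confirm the action
        else:
            handling.add("review")  # unknown code: fall back to explicit review (assumption)
    return handling
```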

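Playwright integration

This section provides a simple example script that integrates Azure OpenAI's computer-use-preview model with Playwright to automate basic browser interactions. Combining the model with Playwright lets the model see the browser screen, make decisions, and perform actions such as clicking, typing, and navigating websites. Use caution when running this example code. It's designed to run locally, but it should only be executed in a test environment. Have a human confirm decisions, and don't give the model access to sensitive data.

First, install the Python library for Playwright.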

pip install playwright

Once the package is installed, you also need to run:

playwright install

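Imports and configuration

First, import the necessary libraries and define the configuration parameters. Because this code uses asyncio, run it outside of Jupyter notebooks. The code is walked through in chunks first, followed by a demonstration of how to use it.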

import os
import asyncio
import base64
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from playwright.async_api import async_playwright, TimeoutError

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)


# Configuration

BASE_URL = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/"
MODEL = "computer-use-preview" # Set to model deployment name
DISPLAY_WIDTH = 1024
DISPLAY_HEIGHT = 768
API_VERSION = "preview" #Use this API version or later
ITERATIONS = 5 # Max number of iterations before returning control to human supervisor

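Key mapping for browser interaction

Next, set up mappings for the special keys that the model might need to pass to Playwright. Ultimately, the model never performs actions itself. It passes representations of commands, and you have to provide the final integration layer that takes those commands and executes them in your chosen environment.

This isn't an exhaustive list of possible key mappings. You can expand this list as needed. This dictionary is specific to integrating the model with Playwright. If you integrate the model with an alternate library that provides API access to your operating system's keyboard and mouse, you need to provide a mapping specific to that library.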

# Key mapping for special keys in Playwright
KEY_MAPPING = {
    "/": "Slash", "\\": "Backslash", "alt": "Alt", "arrowdown": "ArrowDown",
    "arrowleft": "ArrowLeft", "arrowright": "ArrowRight", "arrowup": "ArrowUp",
    "backspace": "Backspace", "ctrl": "Control", "delete": "Delete", 
    "enter": "Enter", "esc": "Escape", "shift": "Shift", "space": " ",
    "tab": "Tab", "win": "Meta", "cmd": "Meta", "super": "Meta", "option": "Alt"
}

This dictionary translates user-friendly key names into the format expected by Playwright's keyboard API.
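As a brief usage illustration, the lookup can be wrapped in a helper that falls back to the original key name when no mapping exists. The helper name `map_keys` is hypothetical, and the dictionary here is a shortened copy of the one above so the snippet runs on its own.

```python
# Shortened copy of the mapping above, for illustration only
KEY_MAPPING = {"ctrl": "Control", "enter": "Enter", "esc": "Escape", "cmd": "Meta"}


def map_keys(keys):
    """Translate model key names to Playwright names; unknown keys pass through unchanged."""
    return [KEY_MAPPING.get(key.lower(), key) for key in keys]


print(map_keys(["CTRL", "a"]))  # ['Control', 'a']
print(map_keys(["F5"]))         # ['F5']
```

The lookup lowercases the key first, so "CTRL" and "ctrl" resolve to the same Playwright name.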

Coordinate validation function

To make sure that any mouse actions passed from the model stay within the browser window boundaries, add the following utility function:

def validate_coordinates(x, y):
    """Ensure coordinates are within display bounds."""
    return max(0, min(x, DISPLAY_WIDTH)), max(0, min(y, DISPLAY_HEIGHT))

This simple utility attempts to prevent out-of-bounds errors by clamping coordinates to the window dimensions.
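A short usage example makes the clamping behavior concrete. The function is repeated here, with the same display constants as the configuration above, so the snippet is self-contained.

```python
DISPLAY_WIDTH = 1024
DISPLAY_HEIGHT = 768


def validate_coordinates(x, y):
    """Ensure coordinates are within display bounds."""
    return max(0, min(x, DISPLAY_WIDTH)), max(0, min(y, DISPLAY_HEIGHT))


print(validate_coordinates(1500, -20))  # (1024, 0): both values are clamped
print(validate_coordinates(512, 300))   # (512, 300): already in range, unchanged
```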

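Action handling

The core of the browser automation is the action handler, which processes the various types of user interactions and converts them into actions within the browser.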

async def handle_action(page, action):
    """Handle different action types from the model."""
    action_type = action.type
    
    if action_type == "drag":
        print("Drag action is not supported in this implementation. Skipping.")
        return
        
    elif action_type == "click":
        button = getattr(action, "button", "left")
        # Validate coordinates
        x, y = validate_coordinates(action.x, action.y)
        
        print(f"\tAction: click at ({x}, {y}) with button '{button}'")
        
        if button == "back":
            await page.go_back()
        elif button == "forward":
            await page.go_forward()
        elif button == "wheel":
            await page.mouse.wheel(x, y)
        else:
            button_type = {"left": "left", "right": "right", "middle": "middle"}.get(button, "left")
            await page.mouse.click(x, y, button=button_type)
            try:
                await page.wait_for_load_state("domcontentloaded", timeout=3000)
            except TimeoutError:
                pass
        
    elif action_type == "double_click":
        # Validate coordinates
        x, y = validate_coordinates(action.x, action.y)
        
        print(f"\tAction: double click at ({x}, {y})")
        await page.mouse.dblclick(x, y)
        
    elif action_type == "scroll":
        scroll_x = getattr(action, "scroll_x", 0)
        scroll_y = getattr(action, "scroll_y", 0)
        # Validate coordinates
        x, y = validate_coordinates(action.x, action.y)
        
        print(f"\tAction: scroll at ({x}, {y}) with offsets ({scroll_x}, {scroll_y})")
        await page.mouse.move(x, y)
        await page.evaluate(f"window.scrollBy({{left: {scroll_x}, top: {scroll_y}, behavior: 'smooth'}});")
        
    elif action_type == "keypress":
        keys = getattr(action, "keys", [])
        print(f"\tAction: keypress {keys}")
        mapped_keys = [KEY_MAPPING.get(key.lower(), key) for key in keys]
        
        if len(mapped_keys) > 1:
            # For key combinations (like Ctrl+C)
            for key in mapped_keys:
                await page.keyboard.down(key)
            await asyncio.sleep(0.1)
            for key in reversed(mapped_keys):
                await page.keyboard.up(key)
        else:
            for key in mapped_keys:
                await page.keyboard.press(key)
                
    elif action_type == "type":
        text = getattr(action, "text", "")
        print(f"\tAction: type text: {text}")
        await page.keyboard.type(text, delay=20)
        
    elif action_type == "wait":
        ms = getattr(action, "ms", 1000)
        print(f"\tAction: wait {ms}ms")
        await asyncio.sleep(ms / 1000)
        
    elif action_type == "screenshot":
        print("\tAction: screenshot")
        
    else:
        print(f"\tUnrecognized action: {action_type}")

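This function attempts to handle the various types of actions. It translates between the commands that the computer-use-preview model generates and the Playwright library calls that execute them. For more information, refer to the reference documentation for ComputerAction.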
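The handler above skips drag actions. If you wanted to support them, one possible approach is to press the mouse button at the first point, move through the remaining points, and release. This is a sketch under assumptions: the helper name `handle_drag` is hypothetical, and it takes the path as plain (x, y) tuples, so you would need to convert whatever structure the drag action provides into that form.

```python
import asyncio


async def handle_drag(page, path):
    """Drag the mouse along a path of (x, y) points using Playwright's mouse API.

    path is a sequence of (x, y) tuples; converting the model's drag action
    into this form is left to the caller (assumption of this sketch).
    """
    if not path:
        return
    start_x, start_y = path[0]
    await page.mouse.move(start_x, start_y)
    await page.mouse.down()            # press and hold the left button
    for x, y in path[1:]:
        await page.mouse.move(x, y)    # move through each intermediate point
    await page.mouse.up()              # release at the final point
```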

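Screenshot capture

For the model to see what it's interacting with, you need a way to capture screenshots. This code uses Playwright to capture the screenshots and limits the view to just the content in the browser window. The screenshot doesn't include the URL bar or other aspects of the browser GUI. If you need the model to see outside the main browser window, you can augment the model by creating your own screenshot function.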

# Cache of the most recent successful screenshot (None until the first capture succeeds)
last_successful_screenshot = None

async def take_screenshot(page):
    """Take a screenshot and return base64 encoding with caching for failures."""
    global last_successful_screenshot
    
    try:
        screenshot_bytes = await page.screenshot(full_page=False)
        last_successful_screenshot = base64.b64encode(screenshot_bytes).decode("utf-8")
        return last_successful_screenshot
    except Exception as e:
        print(f"Screenshot failed: {e}")
        if last_successful_screenshot:
            print("Using cached screenshot from previous successful capture")
            return last_successful_screenshot
        # No cached screenshot to fall back to, so surface the failure
        raise

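This function captures the current browser state as an image and returns it as a base64-encoded string, ready to be sent to the model. The loop does this after each step so the model can see whether the command it tried to execute succeeded, and then adjust based on the contents of the screenshot. You could let the model decide whether it needs to take a screenshot, but for simplicity this example forces a screenshot on each iteration.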
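The base64 string is later embedded in a data URL before it is sent to the model. A minimal sketch of that encoding step, with a matching decoder that is handy for saving screenshots while debugging, might look like this. The helper names `to_data_url` and `from_data_url` are hypothetical.

```python
import base64


def to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URL suitable for the image_url field."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return f"data:image/png;base64,{encoded}"


def from_data_url(data_url: str) -> bytes:
    """Recover the raw bytes from a data URL produced by to_data_url."""
    _header, _, payload = data_url.partition(",")
    return base64.b64decode(payload)
```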

Model response handling

This function processes the model's responses and executes the requested actions:

async def process_model_response(client, response, page, max_iterations=ITERATIONS):
    """Process the model's response and execute actions."""
    for iteration in range(max_iterations):
        if not hasattr(response, 'output') or not response.output:
            print("No output from model.")
            break
        
        # Safely access response id
        response_id = getattr(response, 'id', 'unknown')
        print(f"\nIteration {iteration + 1} - Response ID: {response_id}\n")
        
        # Print text responses and reasoning
        for item in response.output:
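            # Handle text output: message items carry a list of content parts
            if hasattr(item, 'type') and item.type == "message":
                for part in getattr(item, 'content', None) or []:
                    if getattr(part, 'type', None) == "output_text":
                        print(f"\nModel message: {part.text}\n")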
                
            # Handle reasoning output
            if hasattr(item, 'type') and item.type == "reasoning":
                # Extract meaningful content from the reasoning
                meaningful_content = []
                
                if hasattr(item, 'summary') and item.summary:
                    for summary in item.summary:
                        # Handle different potential formats of summary content
                        if isinstance(summary, str) and summary.strip():
                            meaningful_content.append(summary)
                        elif hasattr(summary, 'text') and summary.text.strip():
                            meaningful_content.append(summary.text)
                
                # Only print reasoning section if there's actual content
                if meaningful_content:
                    print("=== Model Reasoning ===")
                    for idx, content in enumerate(meaningful_content, 1):
                        print(f"{content}")
                    print("=====================\n")
        
        # Extract computer calls
        computer_calls = [item for item in response.output 
                         if hasattr(item, 'type') and item.type == "computer_call"]
        
        if not computer_calls:
            print("No computer call found in response. Reverting control to human operator")
            break
        
        computer_call = computer_calls[0]
        if not hasattr(computer_call, 'call_id') or not hasattr(computer_call, 'action'):
            print("Computer call is missing required attributes.")
            break
        
        call_id = computer_call.call_id
        action = computer_call.action
        
        # Handle safety checks
        acknowledged_checks = []
        if hasattr(computer_call, 'pending_safety_checks') and computer_call.pending_safety_checks:
            pending_checks = computer_call.pending_safety_checks
            print("\nSafety checks required:")
            for check in pending_checks:
                print(f"- {check.code}: {check.message}")
            
            if input("\nDo you want to proceed? (y/n): ").lower() != 'y':
                print("Operation cancelled by user.")
                break
            
            acknowledged_checks = pending_checks
        
        # Execute the action
        try:
           await page.bring_to_front()
           await handle_action(page, action)
           
           # Check if a new page was created after the action
           if action.type in ["click"]:
               await asyncio.sleep(1.5)
               # Get all pages in the context
               all_pages = page.context.pages
               # If we have multiple pages, check if there's a newer one
               if len(all_pages) > 1:
                   newest_page = all_pages[-1]  # Last page is usually the newest
                   if newest_page != page and newest_page.url not in ["about:blank", ""]:
                       print(f"\tSwitching to new tab: {newest_page.url}")
                       page = newest_page  # Update our page reference
           elif action.type != "wait":
               await asyncio.sleep(0.5)
               
        except Exception as e:
           print(f"Error handling action {action.type}: {e}")
           import traceback
           traceback.print_exc()    

        # Take a screenshot after the action
        screenshot_base64 = await take_screenshot(page)

        print("\tNew screenshot taken")
        
        # Prepare input for the next request
        input_content = [{
            "type": "computer_call_output",
            "call_id": call_id,
            "output": {
                "type": "input_image",
                "image_url": f"data:image/png;base64,{screenshot_base64}"
            }
        }]
        
        # Add acknowledged safety checks if any
        if acknowledged_checks:
            acknowledged_checks_dicts = []
            for check in acknowledged_checks:
                acknowledged_checks_dicts.append({
                    "id": check.id,
                    "code": check.code,
                    "message": check.message
                })
            input_content[0]["acknowledged_safety_checks"] = acknowledged_checks_dicts
        
        # Add current URL for context
        try:
            current_url = page.url
            if current_url and current_url != "about:blank":
                input_content[0]["current_url"] = current_url
                print(f"\tCurrent URL: {current_url}")
        except Exception as e:
            print(f"Error getting URL: {e}")
        
        # Send the screenshot back for the next step
        try:
            response = client.responses.create(
                model=MODEL,
                previous_response_id=response_id,
                tools=[{
                    "type": "computer_use_preview",
                    "display_width": DISPLAY_WIDTH,
                    "display_height": DISPLAY_HEIGHT,
                    "environment": "browser"
                }],
                input=input_content,
                truncation="auto"
            )

            print("\tModel processing screenshot")
        except Exception as e:
            print(f"Error in API call: {e}")
            import traceback
            traceback.print_exc()
            break
    
    else:
        # Runs only when the loop ended without a break, meaning the iteration budget was used up
        print("Reached maximum number of iterations. Stopping.")

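In this section, we added code that does the following:

Extracts and displays text and reasoning from the model.
Processes computer action calls.
Handles potential safety checks that require user confirmation.
Executes the requested action.
Captures a new screenshot.
Sends the updated state back to the model and defines the ComputerTool.
Repeats this process for multiple iterations.

Main function

The main function coordinates the entire process: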

async def main():
    # Initialize OpenAI client
    client = AzureOpenAI(
        base_url=BASE_URL,
        azure_ad_token_provider=token_provider,
        api_version=API_VERSION
    )
    
    # Initialize Playwright
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(
            headless=False,
            args=[f"--window-size={DISPLAY_WIDTH},{DISPLAY_HEIGHT}", "--disable-extensions"]
        )
        
        context = await browser.new_context(
            viewport={"width": DISPLAY_WIDTH, "height": DISPLAY_HEIGHT},
            accept_downloads=True
        )
        
        page = await context.new_page()
        
        # Navigate to starting page
        await page.goto("https://www.bing.com", wait_until="domcontentloaded")
        print("Browser initialized to Bing.com")
        
        # Main interaction loop
        try:
            while True:
                print("\n" + "="*50)
                user_input = input("Enter a task to perform (or 'exit' to quit): ")
                
                if user_input.lower() in ('exit', 'quit'):
                    break
                
                if not user_input.strip():
                    continue
                
                # Take initial screenshot
                screenshot_base64 = await take_screenshot(page)
                print("\nTake initial screenshot")
                
                # Initial request to the model
                response = client.responses.create(
                    model=MODEL,
                    tools=[{
                        "type": "computer_use_preview",
                        "display_width": DISPLAY_WIDTH,
                        "display_height": DISPLAY_HEIGHT,
                        "environment": "browser"
                    }],
                    instructions = "You are an AI agent with the ability to control a browser. You can control the keyboard and mouse. You take a screenshot after each action to check if your action was successful. Once you have completed the requested task you should stop running and pass back control to your human operator.",
                    input=[{
                        "role": "user",
                        "content": [{
                            "type": "input_text",
                            "text": user_input
                        }, {
                            "type": "input_image",
                            "image_url": f"data:image/png;base64,{screenshot_base64}"
                        }]
                    }],
                    reasoning={"generate_summary": "concise"},
                    truncation="auto"
                )
                print("\nSending model initial screenshot and instructions")

                # Process model actions
                await process_model_response(client, response, page)
                
        except Exception as e:
            print(f"An error occurred: {e}")
            import traceback
            traceback.print_exc()
        
        finally:
            # Close browser
            await context.close()
            await browser.close()
            print("Browser closed.")

if __name__ == "__main__":
    asyncio.run(main())

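The main function does the following:

Initializes the AzureOpenAI client.
Sets up the Playwright browser.
Starts at Bing.com.
Enters a loop that accepts user tasks.
Captures the initial state.
Sends the task and screenshot to the model.
Processes the model's response.
Repeats until the user exits.
Ensures the browser is closed properly.

Complete script

Caution

This code is experimental and for demonstration purposes only. It's only intended to illustrate the basic flow of the Responses API and the computer-use-preview model. While you can run this code on your local computer, we strongly recommend running it on a low-privilege virtual machine with no access to sensitive data. This code is for basic testing purposes only.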

import os
import asyncio
import base64
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from playwright.async_api import async_playwright, TimeoutError


token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

# Configuration

BASE_URL = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/"
MODEL = "computer-use-preview"
DISPLAY_WIDTH = 1024
DISPLAY_HEIGHT = 768
API_VERSION = "preview"
ITERATIONS = 5 # Max number of iterations before forcing the model to return control to the human supervisor

# Key mapping for special keys in Playwright
KEY_MAPPING = {
    "/": "Slash", "\\": "Backslash", "alt": "Alt", "arrowdown": "ArrowDown",
    "arrowleft": "ArrowLeft", "arrowright": "ArrowRight", "arrowup": "ArrowUp",
    "backspace": "Backspace", "ctrl": "Control", "delete": "Delete", 
    "enter": "Enter", "esc": "Escape", "shift": "Shift", "space": " ",
    "tab": "Tab", "win": "Meta", "cmd": "Meta", "super": "Meta", "option": "Alt"
}

def validate_coordinates(x, y):
    """Ensure coordinates are within display bounds."""
    return max(0, min(x, DISPLAY_WIDTH)), max(0, min(y, DISPLAY_HEIGHT))

async def handle_action(page, action):
    """Handle different action types from the model."""
    action_type = action.type
    
    if action_type == "drag":
        print("Drag action is not supported in this implementation. Skipping.")
        return
        
    elif action_type == "click":
        button = getattr(action, "button", "left")
        # Validate coordinates
        x, y = validate_coordinates(action.x, action.y)
        
        print(f"\tAction: click at ({x}, {y}) with button '{button}'")
        
        if button == "back":
            await page.go_back()
        elif button == "forward":
            await page.go_forward()
        elif button == "wheel":
            await page.mouse.wheel(x, y)
        else:
            button_type = {"left": "left", "right": "right", "middle": "middle"}.get(button, "left")
            await page.mouse.click(x, y, button=button_type)
            try:
                await page.wait_for_load_state("domcontentloaded", timeout=3000)
            except TimeoutError:
                pass
        
    elif action_type == "double_click":
        # Validate coordinates
        x, y = validate_coordinates(action.x, action.y)
        
        print(f"\tAction: double click at ({x}, {y})")
        await page.mouse.dblclick(x, y)
        
    elif action_type == "scroll":
        scroll_x = getattr(action, "scroll_x", 0)
        scroll_y = getattr(action, "scroll_y", 0)
        # Validate coordinates
        x, y = validate_coordinates(action.x, action.y)
        
        print(f"\tAction: scroll at ({x}, {y}) with offsets ({scroll_x}, {scroll_y})")
        await page.mouse.move(x, y)
        await page.evaluate(f"window.scrollBy({{left: {scroll_x}, top: {scroll_y}, behavior: 'smooth'}});")
        
    elif action_type == "keypress":
        keys = getattr(action, "keys", [])
        print(f"\tAction: keypress {keys}")
        mapped_keys = [KEY_MAPPING.get(key.lower(), key) for key in keys]
        
        if len(mapped_keys) > 1:
            # For key combinations (like Ctrl+C)
            for key in mapped_keys:
                await page.keyboard.down(key)
            await asyncio.sleep(0.1)
            for key in reversed(mapped_keys):
                await page.keyboard.up(key)
        else:
            for key in mapped_keys:
                await page.keyboard.press(key)
                
    elif action_type == "type":
        text = getattr(action, "text", "")
        print(f"\tAction: type text: {text}")
        await page.keyboard.type(text, delay=20)
        
    elif action_type == "wait":
        ms = getattr(action, "ms", 1000)
        print(f"\tAction: wait {ms}ms")
        await asyncio.sleep(ms / 1000)
        
    elif action_type == "screenshot":
        print("\tAction: screenshot")
        
    else:
        print(f"\tUnrecognized action: {action_type}")

# Cache of the most recent successful screenshot (None until the first capture succeeds)
last_successful_screenshot = None

async def take_screenshot(page):
    """Take a screenshot and return base64 encoding with caching for failures."""
    global last_successful_screenshot
    
    try:
        screenshot_bytes = await page.screenshot(full_page=False)
        last_successful_screenshot = base64.b64encode(screenshot_bytes).decode("utf-8")
        return last_successful_screenshot
    except Exception as e:
        print(f"Screenshot failed: {e}")
        if last_successful_screenshot:
            print("Using cached screenshot from previous successful capture")
            return last_successful_screenshot
        # No cached screenshot to fall back to, so surface the failure
        raise


async def process_model_response(client, response, page, max_iterations=ITERATIONS):
    """Process the model's response and execute actions."""
    for iteration in range(max_iterations):
        if not hasattr(response, 'output') or not response.output:
            print("No output from model.")
            break
        
        # Safely access response id
        response_id = getattr(response, 'id', 'unknown')
        print(f"\nIteration {iteration + 1} - Response ID: {response_id}\n")
        
        # Print text responses and reasoning
        for item in response.output:
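            # Handle text output: message items carry a list of content parts
            if hasattr(item, 'type') and item.type == "message":
                for part in getattr(item, 'content', None) or []:
                    if getattr(part, 'type', None) == "output_text":
                        print(f"\nModel message: {part.text}\n")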
                
            # Handle reasoning output
            if hasattr(item, 'type') and item.type == "reasoning":
                # Extract meaningful content from the reasoning
                meaningful_content = []
                
                if hasattr(item, 'summary') and item.summary:
                    for summary in item.summary:
                        # Handle different potential formats of summary content
                        if isinstance(summary, str) and summary.strip():
                            meaningful_content.append(summary)
                        elif hasattr(summary, 'text') and summary.text.strip():
                            meaningful_content.append(summary.text)
                
                # Only print reasoning section if there's actual content
                if meaningful_content:
                    print("=== Model Reasoning ===")
                    for idx, content in enumerate(meaningful_content, 1):
                        print(f"{content}")
                    print("=====================\n")
        
        # Extract computer calls
        computer_calls = [item for item in response.output 
                         if hasattr(item, 'type') and item.type == "computer_call"]
        
        if not computer_calls:
            print("No computer call found in response. Reverting control to human supervisor")
            break
        
        computer_call = computer_calls[0]
        if not hasattr(computer_call, 'call_id') or not hasattr(computer_call, 'action'):
            print("Computer call is missing required attributes.")
            break
        
        call_id = computer_call.call_id
        action = computer_call.action
        
        # Handle safety checks
        acknowledged_checks = []
        if hasattr(computer_call, 'pending_safety_checks') and computer_call.pending_safety_checks:
            pending_checks = computer_call.pending_safety_checks
            print("\nSafety checks required:")
            for check in pending_checks:
                print(f"- {check.code}: {check.message}")
            
            if input("\nDo you want to proceed? (y/n): ").lower() != 'y':
                print("Operation cancelled by user.")
                break
            
            acknowledged_checks = pending_checks
        
        # Execute the action
        try:
            await page.bring_to_front()
            await handle_action(page, action)

            # Check if a new page was created after the action
            if action.type == "click":
                await asyncio.sleep(1.5)
                # Get all pages in the context
                all_pages = page.context.pages
                # If we have multiple pages, check if there's a newer one
                if len(all_pages) > 1:
                    newest_page = all_pages[-1]  # Last page is usually the newest
                    if newest_page != page and newest_page.url not in ["about:blank", ""]:
                        print(f"\tSwitching to new tab: {newest_page.url}")
                        page = newest_page  # Update our page reference
            elif action.type != "wait":
                await asyncio.sleep(0.5)

        except Exception as e:
            print(f"Error handling action {action.type}: {e}")
            import traceback
            traceback.print_exc()

        # Take a screenshot after the action
        screenshot_base64 = await take_screenshot(page)

        print("\tNew screenshot taken")
        
        # Prepare input for the next request
        input_content = [{
            "type": "computer_call_output",
            "call_id": call_id,
            "output": {
                "type": "input_image",
                "image_url": f"data:image/png;base64,{screenshot_base64}"
            }
        }]
        
        # Add acknowledged safety checks if any
        if acknowledged_checks:
            acknowledged_checks_dicts = []
            for check in acknowledged_checks:
                acknowledged_checks_dicts.append({
                    "id": check.id,
                    "code": check.code,
                    "message": check.message
                })
            input_content[0]["acknowledged_safety_checks"] = acknowledged_checks_dicts
        
        # Add current URL for context
        try:
            current_url = page.url
            if current_url and current_url != "about:blank":
                input_content[0]["current_url"] = current_url
                print(f"\tCurrent URL: {current_url}")
        except Exception as e:
            print(f"Error getting URL: {e}")
        
        # Send the screenshot back for the next step
        try:
            response = client.responses.create(
                model=MODEL,
                previous_response_id=response_id,
                tools=[{
                    "type": "computer_use_preview",
                    "display_width": DISPLAY_WIDTH,
                    "display_height": DISPLAY_HEIGHT,
                    "environment": "browser"
                }],
                input=input_content,
                truncation="auto"
            )

            print("\tModel processed screenshot")
        except Exception as e:
            print(f"Error in API call: {e}")
            import traceback
            traceback.print_exc()
            break
    
    else:
        # for/else: runs only when the loop finished without hitting a break
        print("Reached maximum number of iterations. Stopping.")
        
async def main():    
    # Initialize OpenAI client
    client = AzureOpenAI(
        base_url=BASE_URL,
        azure_ad_token_provider=token_provider,
        api_version=API_VERSION
    )
    
    # Initialize Playwright
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(
            headless=False,
            args=[f"--window-size={DISPLAY_WIDTH},{DISPLAY_HEIGHT}", "--disable-extensions"]
        )
        
        context = await browser.new_context(
            viewport={"width": DISPLAY_WIDTH, "height": DISPLAY_HEIGHT},
            accept_downloads=True
        )
        
        page = await context.new_page()
        
        # Navigate to starting page
        await page.goto("https://www.bing.com", wait_until="domcontentloaded")
        print("Browser initialized to Bing.com")
        
        # Main interaction loop
        try:
            while True:
                print("\n" + "="*50)
                user_input = input("Enter a task to perform (or 'exit' to quit): ")
                
                if user_input.lower() in ('exit', 'quit'):
                    break
                
                if not user_input.strip():
                    continue
                
                # Take initial screenshot
                screenshot_base64 = await take_screenshot(page)
                print("\nInitial screenshot taken")
                
                # Initial request to the model
                response = client.responses.create(
                    model=MODEL,
                    tools=[{
                        "type": "computer_use_preview",
                        "display_width": DISPLAY_WIDTH,
                        "display_height": DISPLAY_HEIGHT,
                        "environment": "browser"
                    }],
                    instructions = "You are an AI agent with the ability to control a browser. You can control the keyboard and mouse. You take a screenshot after each action to check if your action was successful. Once you have completed the requested task you should stop running and pass back control to your human supervisor.",
                    input=[{
                        "role": "user",
                        "content": [{
                            "type": "input_text",
                            "text": user_input
                        }, {
                            "type": "input_image",
                            "image_url": f"data:image/png;base64,{screenshot_base64}"
                        }]
                    }],
                    reasoning={"generate_summary": "concise"},
                    truncation="auto"
                )
                print("\nSent initial screenshot and instructions to the model")

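                # Process model actions
                await process_model_response(client, response, page)

                # The action loop may have switched to a new tab internally;
                # start the next task from the most recently opened page
                if context.pages:
                    page = context.pages[-1]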
                
        except Exception as e:
            print(f"An error occurred: {e}")
            import traceback
            traceback.print_exc()
        
        finally:
            # Close browser
            await context.close()
            await browser.close()
            print("Browser closed.")

if __name__ == "__main__":
    asyncio.run(main())
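
The item that the loop sends back after each action could be factored into a small pure function, which makes the payload shape easier to unit test without a browser or an API call. The following is a minimal sketch, not part of the sample: `build_computer_call_output` is a hypothetical helper name, the values in the usage lines are stand-ins, and it assumes safety check objects expose `id`, `code`, and `message` attributes in the same way the loop above reads them.

```python
from types import SimpleNamespace


def build_computer_call_output(call_id, screenshot_base64, current_url=None, acknowledged_checks=None):
    """Build the computer_call_output input list (hypothetical helper, mirrors the loop above)."""
    item = {
        "type": "computer_call_output",
        "call_id": call_id,
        "output": {
            "type": "input_image",
            "image_url": f"data:image/png;base64,{screenshot_base64}",
        },
    }
    # Echo back any safety checks that the user acknowledged
    if acknowledged_checks:
        item["acknowledged_safety_checks"] = [
            {"id": c.id, "code": c.code, "message": c.message} for c in acknowledged_checks
        ]
    # Include the current URL for context, skipping blank pages
    if current_url and current_url != "about:blank":
        item["current_url"] = current_url
    return [item]


# Stand-in values for illustration only
check = SimpleNamespace(id="sc_1", code="malicious_instructions", message="Review required")
payload = build_computer_call_output("call_123", "AAAA", "https://www.bing.com", [check])
print(payload[0]["call_id"])
```

The returned list could be passed as `input=` in the follow-up `client.responses.create` call in place of the inline `input_content` dictionary.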