Measuring OpenAI Vision Token Counts

crossfit_wod 2024. 11. 5. 18:10

Completed code

At work I needed to estimate how many tokens a request uses when calling OpenAI Vision. This post walks through how the Vision token count is calculated.

Vision

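The method for calculating Vision token counts is described in the official documentation.

In short, the number of Vision tokens is determined by the size of the image and the detail option.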

detail option

A Vision request looks like the example below. The part to focus on is messages.content.image_url.detail.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            "detail": "high"
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)

low

According to the official documentation, detail accepts low, high, or auto. With low, the image always costs a fixed 85 tokens.

high

With high, the token count depends on the image size, so it is larger than with low.

auto

The model automatically chooses between the low and high detail modes based on the size and importance of the image.

image size

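How much the image size matters depends on detail. With low, the cost is fixed at 85 tokens regardless of size.

With high, the image goes through the steps below. The cost starts from a base of 85 tokens, plus 170 tokens for each of the N 512px x 512px tiles needed to cover the resized image: 85 + (170 * N)

Fit within a 2048 x 2048 square: the image is resized, keeping its aspect ratio, so that neither the width nor the height exceeds 2048.
Scale the shortest side to 768px: after the first resize, the image is shrunk so that its shortest side is 768px.
Split into 512px x 512px tiles: the number of 512px x 512px tiles covering the final image is counted. Each tile costs 170 tokens.
Add the fixed 85 tokens: a fixed 85 tokens is always added at the end.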
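
To make the four steps concrete, here is a minimal sketch that prints the intermediate sizes for two sample images. The function name trace_high_detail is my own, and it assumes both resize steps only ever shrink an image (never enlarge it), which is how the code later in this post behaves.

```python
from math import ceil

def trace_high_detail(width: int, height: int) -> int:
    """Print each intermediate size for detail=high and return the token count."""
    # Step 1: fit within 2048 x 2048, keeping the aspect ratio (assumed: shrink only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    print(f"after 2048 fit: {w} x {h}")

    # Step 2: shrink so the shortest side is 768px (assumed: shrink only).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    print(f"after 768 fit:  {w} x {h}")

    # Step 3: count the 512px tiles needed to cover the image.
    tiles = ceil(w / 512) * ceil(h / 512)
    print(f"tiles: {tiles}")

    # Step 4: 170 tokens per tile plus the fixed 85.
    return 85 + 170 * tiles

print(trace_high_detail(1024, 1024))  # 768 x 768 -> 4 tiles -> 765
print(trace_high_detail(2048, 4096))  # 768 x 1536 -> 6 tiles -> 1105
```

A 1024 x 1024 image skips the first step, shrinks to 768 x 768, and needs 4 tiles. A 2048 x 4096 image shrinks to 1024 x 2048, then to 768 x 1536, and needs 6 tiles.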

결론

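detail: low always costs 85 tokens, regardless of image size
detail: high resizes and tiles the image according to its size and aspect ratio, so the cost is dynamic: 170 tokens per tile plus a fixed 85 tokens

The Python code for calculating the tokens is below. Pay attention to each resized size and the resulting token count. Every example uses an image with width=3024 and height=4032.

Vision tokens with both the longest-side and shortest-side limits applied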

from openai import OpenAI
from math import ceil
import os
import tiktoken
import sys

sys.stdout.reconfigure(encoding='utf-8')

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "{put your SECRET_KEY here}"))

print('---------------------------------------------------------------------------------------------')
print('Both limits applied / image token formula---------------------------------------------------------------------------------------------')
def adjust_image_size(width: int, height: int):
   # Step 1: if either side exceeds 2048, scale down while keeping the aspect ratio
   if width > 2048 or height > 2048:
      aspect_ratio = width / height
      if width > height:
         width, height = 2048, int(2048 / aspect_ratio)
      else:
         width, height = int(2048 * aspect_ratio), 2048

   # Step 2: if the shortest side exceeds 768, scale down again while keeping the aspect ratio
   if width >= height and height > 768:
      width, height = int((768 / height) * width), 768
   elif height > width and width > 768:
      width, height = 768, int((768 / width) * height)
   print('width, height after resizing', width, height)
   return width, height

def calculate_gpt4o_image_tokens(width: int, height: int, detail: str = "high"):
   # low detail: return the fixed 85 tokens
   if detail == "low":
      return 85

   # high detail: resize, then count tiles
   width, height = adjust_image_size(width, height)

   # Number of 512px tiles along each axis
   tiles_width = ceil(width / 512)
   tiles_height = ceil(height / 512)

   # Token count: fixed 85 plus 170 per tile
   total_tokens = 85 + 170 * (tiles_width * tiles_height)

   return total_tokens

width = 3024
height = 4032

print("high detail mode - image token count:", calculate_gpt4o_image_tokens(width, height, detail="high"))
print("low detail mode - image token count:", calculate_gpt4o_image_tokens(width, height, detail="low"))

Vision tokens with only the longest-side limit applied

print('---------------------------------------------------------------------------------------------')
print('Longest side only / image token formula---------------------------------------------------------------------------------------------')
def adjust_image_size(width: int, height: int):
   # Step 1: cap the longest side at 2048 and scale the other side proportionally
   if width > height:
      if width > 2048:
         aspect_ratio = height / width
         width, height = 2048, int(2048 * aspect_ratio)
   else:
      if height > 2048:
         aspect_ratio = width / height
         width, height = int(2048 * aspect_ratio), 2048

   print('width, height after resizing', width, height)
   return width, height

def calculate_gpt4o_image_tokens(width: int, height: int, detail: str = "high"):
   # low detail: return the fixed 85 tokens
   if detail == "low":
      return 85

   # high detail: resize, then count tiles
   width, height = adjust_image_size(width, height)

   # Number of 512px tiles along each axis
   tiles_width = ceil(width / 512)
   tiles_height = ceil(height / 512)

   # Token count: fixed 85 plus 170 per tile
   total_tokens = 85 + 170 * (tiles_width * tiles_height)

   return total_tokens

width = 3024
height = 4032

print("high detail mode - image token count:", calculate_gpt4o_image_tokens(width, height, detail="high"))
print("low detail mode - image token count:", calculate_gpt4o_image_tokens(width, height, detail="low"))

Vision tokens with no limits applied

print('---------------------------------------------------------------------------------------------')
print('Image token formula---------------------------------------------------------------------------------------------')
def calculate_gpt4o_image_tokens(width: int, height: int, detail: str = "high"):
   # low detail: return the fixed 85 tokens
   if detail == "low":
      return 85

   # high detail: count tiles on the original size, with no resizing
   tiles_width = ceil(width / 512)
   tiles_height = ceil(height / 512)

   # Token count: fixed 85 plus 170 per tile
   total_tokens = 85 + 170 * (tiles_width * tiles_height)

   return total_tokens

width = 3024
height = 4032

print("high detail mode - image token count:", calculate_gpt4o_image_tokens(width, height, detail="high"))
print("low detail mode - image token count:", calculate_gpt4o_image_tokens(width, height, detail="low"))
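
To see the three variants side by side, here is a compact sketch that makes each resize step switchable. The function high_detail_tokens and its fit_long / fit_short flags are names I made up for this comparison, and it uses integer division instead of float ratios, so treat it as an illustration rather than the reference calculation.

```python
from math import ceil

def high_detail_tokens(width: int, height: int,
                       fit_long: bool = True, fit_short: bool = True) -> int:
    """Return 85 + 170 * tiles, applying each resize step only if enabled."""
    # Optional step 1: cap the longest side at 2048 (shrink only).
    if fit_long and max(width, height) > 2048:
        longest = max(width, height)
        width, height = width * 2048 // longest, height * 2048 // longest

    # Optional step 2: cap the shortest side at 768 (shrink only).
    if fit_short and min(width, height) > 768:
        shortest = min(width, height)
        width, height = width * 768 // shortest, height * 768 // shortest

    # 512px tiles covering the final size, 170 tokens each, plus the fixed 85.
    return 85 + 170 * ceil(width / 512) * ceil(height / 512)

variants = [
    ("both limits", True, True),
    ("longest side only", True, False),
    ("no resizing", False, False),
]
for label, fit_long, fit_short in variants:
    print(label, high_detail_tokens(3024, 4032, fit_long, fit_short))
# both limits 765
# longest side only 2125
# no resizing 8245
```

For the 3024 x 4032 image, the final sizes are 768 x 1024 (4 tiles), 1536 x 2048 (12 tiles), and the untouched 3024 x 4032 (48 tiles). Only the first variant applies both resize steps from the list earlier in this post.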

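Token limits by GPT version

Each GPT version has its own token limits, listed in the official documentation. You need to do the calculation for the model you use and keep the request within those limits. (Otherwise the request fails with an error. I learned that the hard way.)

context window: the maximum number of tokens the model can keep in view at once, input and output combined
max output tokens: the maximum number of tokens the model can generate in a single response
gpt-4o-2024-05-13: 4,096 max output tokens
gpt-4o-mini: 16,384 max output tokens

In short, the output tokens must stay within the model's max output tokens, and System prompt tokens + User prompt tokens + vision tokens + output tokens must fit within its context window. For gpt-4o-mini, for example, output tokens <= 16,384, and the total must fit within the context window (128,000 tokens in the official model list).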
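
A pre-flight check along these lines can catch an oversized request before it is sent. This is only a sketch: fits_budget is a hypothetical helper, and the limits passed in below (context_window=128_000, output_limit=16_384) are values I am assuming for gpt-4o-mini, so check the official model list for the model you actually use.

```python
def fits_budget(system_tokens: int, user_tokens: int, vision_tokens: int,
                max_output_tokens: int, context_window: int, output_limit: int) -> bool:
    """Return True if the request stays within both model limits."""
    input_tokens = system_tokens + user_tokens + vision_tokens
    # The requested output must not exceed the model's max output tokens.
    if max_output_tokens > output_limit:
        return False
    # Input plus requested output must fit in the context window.
    return input_tokens + max_output_tokens <= context_window

# Assumed limits for gpt-4o-mini; one high-detail 3024 x 4032 image costs 765 tokens.
print(fits_budget(600, 600, 765, 300, context_window=128_000, output_limit=16_384))     # True
print(fits_budget(600, 600, 765, 20_000, context_window=128_000, output_limit=16_384))  # False
```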

TMI

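How text is split into tokens

I was curious how OpenAI splits text into tokens, so I followed the official documentation and built it myself. In the integer arrays (token_integers) you will sometimes see the same number more than once; that happens when the same characters or the same whitespace appear again.

The result comes back as token_integers because the text has to be converted into a sequence of integer tokens before an OpenAI model can process it.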

import tiktoken
import sys

sys.stdout.reconfigure(encoding='utf-8')

def num_tokens_from_string(string: str, encoding_name: str) -> int:
   """Returns the number of tokens in a text string."""
   encoding = tiktoken.get_encoding(encoding_name)
   num_tokens = len(encoding.encode(string))
   return num_tokens

def compare_encodings(example_string: str) -> None:
   """Prints a comparison of three string encodings."""
   # print the example string
   print(f'\nExample string: "{example_string}"')
   # for each encoding, print the # of tokens, the token integers, and the token bytes
   for encoding_name in ["r50k_base", "p50k_base", "cl100k_base", "o200k_base"]:
      encoding = tiktoken.get_encoding(encoding_name)
      token_integers = encoding.encode(example_string)
      num_tokens = len(token_integers)
      token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
      print()
      print(f"{encoding_name}: {num_tokens} tokens")
      print(f"token integers: {token_integers}")
      print(f"token bytes: {token_bytes}")

def num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18"):
   """Return the number of tokens used by a list of messages."""
   try:
      encoding = tiktoken.encoding_for_model(model)
   except KeyError:
      print("Warning: model not found. Using o200k_base encoding.")
      encoding = tiktoken.get_encoding("o200k_base")
   if model in {
      "gpt-3.5-turbo-0125",
      "gpt-4-0314",
      "gpt-4-32k-0314",
      "gpt-4-0613",
      "gpt-4-32k-0613",
      "gpt-4o-mini-2024-07-18",
      "gpt-4o-2024-08-06"
      }:
      tokens_per_message = 3
      tokens_per_name = 1
   elif "gpt-3.5-turbo" in model:
      print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.")
      return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0125")
   elif "gpt-4o-mini" in model:
      print("Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.")
      return num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18")
   elif "gpt-4o" in model:
      print("Warning: gpt-4o and gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-2024-08-06.")
      return num_tokens_from_messages(messages, model="gpt-4o-2024-08-06")
   elif "gpt-4" in model:
      print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
      return num_tokens_from_messages(messages, model="gpt-4-0613")
   else:
      raise NotImplementedError(
         f"""num_tokens_from_messages() is not implemented for model {model}."""
      )
   num_tokens = 0
   for message in messages:
      num_tokens += tokens_per_message
      for key, value in message.items():
         num_tokens += len(encoding.encode(value))
         if key == "name":
               num_tokens += tokens_per_name
   num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
   return num_tokens

The code above contains the list ["r50k_base", "p50k_base", "cl100k_base", "o200k_base"]. These encoding names come from the official documentation.

# compare_encodings() prints its own output and returns None, so call it directly
compare_encodings("antidisestablishmentarianism") # 1
compare_encodings("2 + 2 = 4") # 2

japanese = "お誕生日おめでとう"
compare_encodings(japanese) # 3

Example string: "antidisestablishmentarianism"

r50k_base: 5 tokens

token integers: [415, 29207, 44390, 3699, 1042]

token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']

p50k_base: 5 tokens

token integers: [415, 29207, 44390, 3699, 1042]

token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']

cl100k_base: 6 tokens

token integers: [519, 85342, 34500, 479, 8997, 2191]

token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']

o200k_base: 6 tokens

token integers: [493, 129901, 376, 160388, 21203, 2367]

token bytes: [b'ant', b'idis', b'est', b'ablishment', b'arian', b'ism']

Example string: "2 + 2 = 4"

r50k_base: 5 tokens

token integers: [17, 1343, 362, 796, 604]

token bytes: [b'2', b' +', b' 2', b' =', b' 4']

p50k_base: 5 tokens

token integers: [17, 1343, 362, 796, 604]

token bytes: [b'2', b' +', b' 2', b' =', b' 4']

cl100k_base: 7 tokens

token integers: [17, 489, 220, 17, 284, 220, 19]

token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']

o200k_base: 7 tokens

token integers: [17, 659, 220, 17, 314, 220, 19]

token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']

Example string: "お誕生日おめでとう"

r50k_base: 14 tokens

token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]

token bytes: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']

p50k_base: 14 tokens

token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]

token bytes: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']

cl100k_base: 9 tokens

token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]

token bytes: [b'\xe3\x81\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97\xa5', b'\xe3\x81\x8a', b'\xe3\x82\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8\xe3\x81\x86']

o200k_base: 8 tokens

token integers: [8930, 9697, 243, 128225, 8930, 17693, 4344, 48669]

token bytes: [b'\xe3\x81\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f\xe6\x97\xa5', b'\xe3\x81\x8a', b'\xe3\x82\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8\xe3\x81\x86']

Token counts per GPT model

Now that we know how text is split into tokens, let's look at code that measures the token count for each GPT model.

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "put your secret key here"))
example_messages = [
   { # 608
      "role": "system",
      "content": "super power gd~~~~~~~~asdasd....."
   },
   {  # 610
      "role": "user",
      "content": "asdasdasd......"
   },
]

for model in [
   "gpt-3.5-turbo",
   "gpt-4-0613",
   "gpt-4",
   "gpt-4o",
   "gpt-4o-mini"
   ]:
   print(model)
   # example token count from the function defined above
   print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
   # example token count from the OpenAI API
   response = client.chat.completions.create(model=model,
   messages=example_messages,
   temperature=0,
   max_tokens=1)
   print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')
   print()

print('---------------------------------------------------------------------------------------------')
print('Token count when using tools---------------------------------------------------------------------------------------------')
def num_tokens_for_tools(functions, messages, model):
   # Initialize function settings to 0
   func_init = 0
   prop_init = 0
   prop_key = 0
   enum_init = 0
   enum_item = 0
   func_end = 0
   
   if model in [
      "gpt-4o",
      "gpt-4o-mini"
   ]:
      
      # Set function settings for the above models
      func_init = 7
      prop_init = 3
      prop_key = 3
      enum_init = -3
      enum_item = 3
      func_end = 12
   elif model in [
      "gpt-3.5-turbo",
      "gpt-4"
   ]:
      # Set function settings for the above models
      func_init = 10
      prop_init = 3
      prop_key = 3
      enum_init = -3
      enum_item = 3
      func_end = 12
   else:
      raise NotImplementedError(
         f"""num_tokens_for_tools() is not implemented for model {model}."""
      )
   
   try:
      encoding = tiktoken.encoding_for_model(model)
   except KeyError:
      print("Warning: model not found. Using o200k_base encoding.")
      encoding = tiktoken.get_encoding("o200k_base")
   
   func_token_count = 0
   if len(functions) > 0:
      for f in functions:
         func_token_count += func_init  # Add tokens for start of each function
         function = f["function"]
         f_name = function["name"]
         f_desc = function["description"]
         if f_desc.endswith("."):
               f_desc = f_desc[:-1]
         line = f_name + ":" + f_desc
         func_token_count += len(encoding.encode(line))  # Add tokens for set name and description
         if len(function["parameters"]["properties"]) > 0:
               func_token_count += prop_init  # Add tokens for start of each property
               for key in list(function["parameters"]["properties"].keys()):
                  func_token_count += prop_key  # Add tokens for each set property
                  p_name = key
                  p_type = function["parameters"]["properties"][key]["type"]
                  p_desc = function["parameters"]["properties"][key]["description"]
                  if "enum" in function["parameters"]["properties"][key].keys():
                     func_token_count += enum_init  # Add tokens if property has enum list
                     for item in function["parameters"]["properties"][key]["enum"]:
                           func_token_count += enum_item
                           func_token_count += len(encoding.encode(item))
                  if p_desc.endswith("."):
                     p_desc = p_desc[:-1]
                  line = f"{p_name}:{p_type}:{p_desc}"
                  func_token_count += len(encoding.encode(line))
      func_token_count += func_end
      
   messages_token_count = num_tokens_from_messages(messages, model)
   total_tokens = messages_token_count + func_token_count
   
   return total_tokens
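
To show the shape of the functions argument and where the fixed numbers come in, here is a small sketch that adds up only the structural overhead for one tool, leaving out the encoded text so it runs without tiktoken. The tool definition is a made-up example, structural_overhead is my own helper name, and the constants are simply the gpt-4o / gpt-4o-mini values copied from num_tokens_for_tools() above.

```python
# Constants for gpt-4o / gpt-4o-mini, copied from num_tokens_for_tools() above.
FUNC_INIT, PROP_INIT, PROP_KEY = 7, 3, 3
ENUM_INIT, ENUM_ITEM, FUNC_END = -3, 3, 12

# Hypothetical tool definition in the shape the function expects.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {
                        "type": "string",
                        "description": "The temperature unit",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

def structural_overhead(functions: list) -> int:
    """Sum the fixed per-structure tokens, excluding the encoded names and descriptions."""
    total = 0
    for f in functions:
        total += FUNC_INIT  # start of each function
        props = f["function"]["parameters"]["properties"]
        if props:
            total += PROP_INIT  # start of the property list
            for spec in props.values():
                total += PROP_KEY  # each property
                if "enum" in spec:
                    total += ENUM_INIT + ENUM_ITEM * len(spec["enum"])
    if functions:
        total += FUNC_END  # added once, after all functions
    return total

print(structural_overhead(tools))  # 7 + 3 + 3 + (3 - 3 + 6) + 12 = 31
```

The full count from num_tokens_for_tools() is this overhead plus the encoded length of each "name:description" and "name:type:description" line and each enum item, plus the message tokens.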