image_lingo.json

Overview

The image_lingo.json file defines the configuration and DSL (Domain-Specific Language) specification for ImageLingo, a consumer application designed to perform real-time image-based text recognition and multilingual translation. The core functionality of ImageLingo is to allow users to snap photos containing text—such as menus, signs, or documents—and instantly extract, recognize, and translate the text into a target language. It leverages advanced AI-powered OCR (Optical Character Recognition) and translation technology with context-aware adaptations for natural and meaningful translations.

This file primarily serves as a declarative specification describing:

The application's metadata (e.g., id, title, description)
The DSL graph that models the flow of components and messages
Component configurations, including an AI Agent responsible for image-to-text translation
Global system variables and message passing structure

Detailed Explanation

Metadata Section

id: Unique identifier of this app configuration (13).
title: Localized titles of the app.
- "en": "ImageLingo"
- "zh": "图片解析" (Image analysis)
description: Localized descriptions explaining the app’s purpose.
- "en": Describes ImageLingo’s ability to snap photos and translate text instantly using AI.
- "zh": Similar description in Chinese emphasizing multimodal large-model capabilities.
canvas_type: "Consumer App" indicating the target platform or user context.

DSL Section (Core Application Logic)

The DSL models ImageLingo’s workflow using interconnected components. Each component has configurations and parameters governing its behavior.

1. Components

a. Begin Component (`begin`)

Type: Begin
Purpose: Starting node that initiates the user interaction.
Parameters:
- enablePrologue: true — enables an introductory message.
- prologue: A friendly greeting string introducing ImageLingo to the user.
- mode: "task" — denotes the mode of operation.
Downstream: Passes control to the Agent component.
Upstream: None (start of the graph).

Usage Example:
When the app launches, it greets the user:

"Hi there! I’m ImageLingo, your on-the-go image translation assistant—just snap a photo, and I’ll instantly translate and adapt it into your language."

b. Agent:CoolPandasCrash (Translation Agent with Vision)

Type: Agent
Purpose: The AI-powered translation assistant that accepts image files, extracts text, detects source and target languages, and produces a translated output.
Parameters:
- llm_id: "qwen-vl-plus@Tongyi-Qianwen" — specifies the large language model used.
- llm_filter: "image2text" — filters the model capabilities focused on image-to-text tasks.
- max_tokens: 256 — max token limit for LLM responses.
- max_retries: 3 — retries on failure.
- max_rounds: 5 — max conversational turns.
- temperature: 0.1 — low randomness for stable outputs.
- frequency_penalty: 0.7
- presence_penalty: 0.4
- message_history_window_size: 12 — context window size.
- prompts: Includes a user prompt template:
```
The user query is {sys.query}

The input files are {sys.files}
```
- sys_prompt: A detailed system instruction guiding the agent to:
  - Detect and extract all text from images regardless of style.
  - Identify source language.
  - Determine target language (user-specified or auto-detected).
  - Translate preserving meaning, tone, formatting.
  - Adapt translations contextually for signage, menus, labels.
  - Return well-formatted paragraphs; provide transliteration if requested.
  - Ask clarifying questions if image or language detection is unclear.
- visual_files_var: "sys.files" — variable referring to the input image files.
Downstream: Sends output content to the Message component.
Upstream: Receives input from the Begin component.

Return Values:

content (string): The translated text result.
structured_output (object): Placeholder for structured response (empty by default).

Usage Example:
User snaps a photo of a Japanese train schedule. The agent extracts and translates the text:

7:30 AM – 東京駅 (Tokyo Station)
8:15 AM – 新大阪 (Shin-Osaka)

c. Message:CurlyApplesRelate (Message Display)

Type: Message
Purpose: Displays the translated content produced by the Agent.
Parameters:
- content: Injects the output content from the Agent component ({Agent:CoolPandasCrash@content}).
Downstream: None (end of this chain).
Upstream: Receives translated content from the Agent.

2. Globals

Defines system-wide variables used during conversation and processing:

sys.conversation_turns: Starts at 0, tracks dialog turns.
sys.files: List of input image files.
sys.query: User query string.
sys.user_id: User identifier.

3. Graph Structure

Edges:
- begin → Agent:CoolPandasCrash
- Agent:CoolPandasCrash → Message:CurlyApplesRelate
Nodes: The components described above, positioned for visualization.

Important Implementation Details and Algorithms

The core translation logic is delegated to a large multimodal language model (qwen-vl-plus@Tongyi-Qianwen) configured with custom system prompts that specify a 7-step procedure for image text extraction and translation.
The system prompt instructs the agent to handle varied fonts, orientations, and styles, and to produce contextually appropriate translations, preserving formatting and tone.
The agent is designed to interactively ask clarifying questions if input images are unclear or if the target language is ambiguous.
Frequency and presence penalties are used to control repetitiveness in generated text.
The message history window size controls conversational context length, ensuring relevant prior messages inform the current translation.

Interaction with Other System Components

User Interface (UI): The Begin component corresponds to the app’s entry point UI, welcoming users.
Image Input: Users provide or snap images which are stored in sys.files.
AI Agent: The agent component processes the images and user queries, producing translations.
Message Display: Translated text is passed to the message component and shown to the user.
This file acts as the configuration for the AI workflow pipeline inside the app, defining the flow and parameters but not implementing the actual OCR or translation algorithms directly.

Visual Diagram

The following Mermaid class diagram represents the structure of the main components and their interactions in the ImageLingo DSL:

classDiagram
    class Begin {
        +enablePrologue: bool
        +prologue: string
        +mode: string
    }
    class Agent {
        +llm_id: string
        +llm_filter: string
        +max_tokens: int
        +max_retries: int
        +max_rounds: int
        +temperature: float
        +frequency_penalty: float
        +presence_penalty: float
        +message_history_window_size: int
        +prompts: list
        +sys_prompt: string
        +visual_files_var: string
        +processImageAndTranslate(sys.files, sys.query)
    }
    class Message {
        +content: string
        +displayTranslatedText()
    }

    Begin --> Agent : passes user query & files
    Agent --> Message : sends translated content

Summary

image_lingo.json is a declarative configuration file specifying the ImageLingo app’s AI-driven image translation pipeline.
It defines a small graph of components:
- Begin: welcomes user and starts the task.
- Agent: AI component performing image text extraction and translation.
- Message: outputs translated text to the user.
Configurations include large language model settings, prompt engineering, and global variables.
The file enables seamless integration of image recognition and multilingual translation for consumer use.
The architecture supports interactive clarification and context-aware translation, with a focus on preserving formatting and tone.

This documentation should help developers, analysts, and technical writers understand how the ImageLingo app is structured, how the AI agent operates, and how data flows through the system.