{"slug": "building-knowledge-graphs-with-gemini", "title": "Building Knowledge Graphs with Gemini", "summary": "A developer has demonstrated how to build structured knowledge graphs from unstructured documents using Google's Gemini AI model. The approach involves prototyping with Gemini to extract relationships and entities from raw text, then optimizing prompts and scaling up to process entire books or legal contracts. The method can visualize extracted narratives and contractual network graphs from documents.", "body_md": "In this exploration, we'll see how to turn raw, unstructured documents into structured knowledge graphs using Gemini. We'll start by prototyping to develop our intuition. Then, we'll optimize our prompts and outputs, and finally scale up to process entire books or dense legal contracts. By the end, we'll even visualize extracted book narratives and contractual network graphs!\n\n*A few notes before we start:*\n\nDocuments are everywhere. We use them for business, daily operations, legal matters, technical docs, education, and even just for fun. However, documents are not databases. 
They're generally unstructured, and fully understanding them requires multiple reading passes.\n\nSo, can we extract structured knowledge from documents using only the following?\n\nLet's try with Gemini…\n\nWe'll use the following packages:\n\n- `google-genai` for calling Gemini\n- `networkx` for graph management\n\nWe'll also need:\n\n- `tenacity` for request management (a dependency of `google-genai`)\n- `matplotlib` and `pillow` for data visualization (dependencies of `networkx`)\n\n```\n%pip install --quiet \"google-genai>=2.6.0\" \"networkx[default]\"\n```\n\nTo use the Gemini API, we have two main options:\n\n**🛠️ Option 1 - Gemini API via Agent Platform**\n\nRequirements:\n\nGen AI SDK environment variables:\n\n- `GOOGLE_GENAI_USE_ENTERPRISE=\"True\"`\n- `GOOGLE_CLOUD_PROJECT=\"<PROJECT_ID>\"`\n- `GOOGLE_CLOUD_LOCATION=\"<LOCATION>\"`\n\n💡 For preview models, the location must be set to `global`. For generally available models, we can choose the closest location among the [Google model endpoint locations].\n\nℹ️ Learn more about [setting up a project and a development environment].\n\n**🛠️ Option 2 - Gemini API via Google AI Studio**\n\nRequirement:\n\nGen AI SDK environment variables:\n\n- `GOOGLE_GENAI_USE_ENTERPRISE=\"False\"`\n- `GOOGLE_API_KEY=\"<API_KEY>\"`\n\nℹ️ Learn more about [getting a Gemini API key from Google AI Studio].\n\n💡 You can store your environment configuration outside of the source code:\n\n| Environment | Method |\n|---|---|\n| IDE | `.env` file (or equivalent) |\n| Colab | Colab Secrets (🗝️ icon in left panel, see code below) |\n| Colab Enterprise | Google Cloud project and location are automatically defined |\n| Workbench | Google Cloud project and location are automatically defined |\n\n``` python\nimport os\nimport sys\nfrom collections.abc import Callable\n\nfrom google import genai\n\n# Manual setup (leave unchanged if setup is environment-defined)\n\n# @markdown **Which API: Agent Platform (formerly 
Vertex AI) or Google AI Studio?**\nGOOGLE_GENAI_USE_ENTERPRISE = True  # @param {type: \"boolean\"}\n\n# @markdown **Option A - Google Cloud project [+location]**\nGOOGLE_CLOUD_PROJECT = \"\"  # @param {type: \"string\"}\nGOOGLE_CLOUD_LOCATION = \"global\"  # @param {type: \"string\"}\n\n# @markdown **Option B - Google AI Studio API key**\nGOOGLE_API_KEY = \"\"  # @param {type: \"string\"}\n\ndef check_environment() -> bool:\n    check_colab_user_authentication()\n    return check_manual_setup() or check_enterprise() or check_colab() or check_local()\n\ndef check_manual_setup() -> bool:\n    return check_define_env_vars(\n        GOOGLE_GENAI_USE_ENTERPRISE,\n        GOOGLE_CLOUD_PROJECT.strip(),  # Might have been pasted with a newline\n        GOOGLE_CLOUD_LOCATION,\n        GOOGLE_API_KEY,\n    )\n\ndef check_enterprise() -> bool:\n    # Workbench and Colab Enterprise\n    match os.getenv(\"VERTEX_PRODUCT\", \"\"):\n        case \"WORKBENCH_INSTANCE\":\n            pass\n        case \"COLAB_ENTERPRISE\":\n            if not running_in_colab_env():\n                return False\n        case _:\n            return False\n\n    return check_define_env_vars(\n        True,\n        os.getenv(\"GOOGLE_CLOUD_PROJECT\", \"\"),\n        os.getenv(\"GOOGLE_CLOUD_REGION\", \"\"),\n        \"\",\n    )\n\ndef check_colab() -> bool:\n    if not running_in_colab_env():\n        return False\n\n    # Colab Enterprise was checked before, so this is Colab only\n    from google.colab import auth as colab_auth  # type: ignore\n\n    colab_auth.authenticate_user()\n\n    # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables\n    # Secrets are private, visible only to you and the notebooks that you select\n    # - Agent Platform: Store your settings as secrets\n    # - Google AI: Directly import your Gemini API key from the UI\n    enterprise, project, location, api_key = get_vars(get_colab_secret)\n\n    return check_define_env_vars(enterprise, project, 
location, api_key)\n\ndef check_local() -> bool:\n    enterprise, project, location, api_key = get_vars(os.getenv)\n\n    return check_define_env_vars(enterprise, project, location, api_key)\n\ndef running_in_colab_env() -> bool:\n    # Colab or Colab Enterprise\n    return \"google.colab\" in sys.modules\n\ndef check_colab_user_authentication() -> None:\n    if running_in_colab_env():\n        from google.colab import auth as colab_auth  # type: ignore\n\n        colab_auth.authenticate_user()\n\ndef get_colab_secret(secret_name: str, default: str) -> str:\n    from google.colab import errors, userdata  # type: ignore\n\n    try:\n        return userdata.get(secret_name)\n    except errors.SecretNotFoundError:\n        return default\n\ndef disable_colab_cell_scrollbar() -> None:\n    if running_in_colab_env():\n        from google.colab import output  # type: ignore\n\n        output.no_vertical_scroll()\n\ndef get_vars(getenv: Callable[[str, str], str]) -> tuple[bool, str, str, str]:\n    # Limit getenv calls to the minimum (may trigger UI confirmation for secret access)\n    enterprise_str = getenv(\"GOOGLE_GENAI_USE_ENTERPRISE\", \"\")\n    if not enterprise_str:\n        enterprise_str = getenv(\"GOOGLE_GENAI_USE_VERTEXAI\", \"\")\n    if enterprise_str:\n        enterprise = enterprise_str.lower() in [\"true\", \"1\"]\n    else:\n        enterprise = bool(getenv(\"GOOGLE_CLOUD_PROJECT\", \"\"))\n\n    project = getenv(\"GOOGLE_CLOUD_PROJECT\", \"\") if enterprise else \"\"\n    location = getenv(\"GOOGLE_CLOUD_LOCATION\", \"\") if project else \"\"\n    api_key = getenv(\"GOOGLE_API_KEY\", \"\") if not project else \"\"\n\n    return enterprise, project, location, api_key\n\ndef check_define_env_vars(\n    enterprise: bool,\n    project: str,\n    location: str,\n    api_key: str,\n) -> bool:\n    match (enterprise, bool(project), bool(location), bool(api_key)):\n        case (True, True, _, _):\n            # Agent Platform - Google Cloud project 
[+location]\n            location = location or \"global\"\n            define_env_vars(enterprise, project, location, \"\")\n        case (True, False, _, True):\n            # Agent Platform - API key\n            define_env_vars(enterprise, \"\", \"\", api_key)\n        case (False, _, _, True):\n            # Google AI Studio - API key\n            define_env_vars(enterprise, \"\", \"\", api_key)\n        case _:\n            return False\n\n    return True\n\ndef define_env_vars(\n    enterprise: bool,\n    project: str,\n    location: str,\n    api_key: str,\n) -> None:\n    os.environ[\"GOOGLE_GENAI_USE_ENTERPRISE\"] = str(enterprise)\n    os.environ[\"GOOGLE_GENAI_USE_VERTEXAI\"] = str(enterprise)\n    os.environ[\"GOOGLE_CLOUD_PROJECT\"] = project\n    os.environ[\"GOOGLE_CLOUD_LOCATION\"] = location\n    os.environ[\"GOOGLE_API_KEY\"] = api_key\n\ndef check_configuration(client: genai.Client) -> None:\n    service = \"Agent Platform\" if client.vertexai else \"Google AI Studio\"\n    print(f\"✅ Using the {service} API\", end=\"\")\n\n    if client._api_client.project:\n        print(f' with project \"{client._api_client.project[:7]}…\"', end=\"\")\n        print(f' in location \"{client._api_client.location}\"')\n    elif client._api_client.api_key:\n        api_key = client._api_client.api_key\n        print(f' with API key \"{api_key[:5]}…{api_key[-5:]}\"', end=\"\")\n        print(f\" (in case of error, make sure it was created for {service})\")\n\nprint(\"✅ Environment functions defined\")\n✅ Environment functions defined\n```\n\nTo send Gemini requests, we'll use a `google.genai` client:\n\n``` python\nfrom google import genai\n\ncheck_environment()\n\nclient = genai.Client()\n\ncheck_configuration(client)\n✅ Using the Agent Platform API with project \"lpdemo-…\" in location \"global\"\n```\n\nWe need a suite of test data to develop our solution.\n\n**Multimodality**\n\nWe'll test the following types:\n\n- `text/plain`: Classic books are good 
text sources of varying lengths and languages.\n- `application/pdf`: Legal agreements are also great examples of complex and dense documents.\n\nGemini is natively multimodal, which means it can process different types of inputs. Once we've built knowledge graphs from text or PDF inputs, the solution will also naturally support the following formats:\n\n- `image/*`\n- `audio/*`\n- `video/*`\n\n**General knowledge**\n\n⚠️ LLMs are trained on general knowledge, which becomes part of their \"long-term memory\". To avoid generating memorized information, we'll explicitly instruct the model to use only the provided inputs.\n\n**Multilinguality**\n\nGemini is also natively [multilingual](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models#expandable-1), which lets us process inputs and generate outputs in 100+ languages.\n\nTo keep things general, we'll use English for prompts and knowledge graphs, but you can use any of the 100+ supported languages, as long as your prompts remain clear and explicit.\n\nLet's define a few data sources and helpers: 🔽\n\n``` python\nimport mimetypes\nfrom collections.abc import Iterator\nfrom enum import Enum\nfrom pathlib import Path\n\nfrom google.genai.types import Part\n\nGOOGLE_CLOUD_STORAGE_PREFIX = \"gs://\"\nHTTPS_PREFIX = \"https://\"\nFILE_PREFIX = \"file://\"\nLOCAL_FOLDER = \"./\"\n\nclass Source(Enum):\n    def yield_contents(self) -> Iterator[Part]:\n        file_uri = self.value\n        if not client.vertexai:\n            file_uri = convert_to_https_url_if_cloud_storage_uri(file_uri)\n        mime_type, _ = mimetypes.guess_type(file_uri)\n        assert mime_type is not None, f\"❌ Could not determine mime type: {file_uri=}\"\n\n        if file_uri.startswith((GOOGLE_CLOUD_STORAGE_PREFIX, HTTPS_PREFIX)):\n            yield Part.from_uri(file_uri=file_uri, mime_type=mime_type)\n            return\n\n        if file_uri.startswith(FILE_PREFIX):\n            file = Path(file_uri.removeprefix(FILE_PREFIX))\n         
   assert file.exists(), f\"❌ File does not exist: {file=}\"\n            if mime_type == \"text/plain\":\n                yield Part.from_text(text=file.read_text(encoding=\"utf-8\"))\n            else:\n                yield Part.from_bytes(data=file.read_bytes(), mime_type=mime_type)\n            return\n\n    def yield_source_names(self) -> Iterator[str]:\n        yield self.name\n\n    def yield_source_links(self) -> Iterator[str]:\n        file_uri = convert_to_https_url_if_cloud_storage_uri(self.value)\n        if file_uri.startswith(HTTPS_PREFIX):\n            yield file_uri\n            return\n        if file_uri.startswith(FILE_PREFIX):\n            yield file_uri.removeprefix(FILE_PREFIX)\n            return\n\ndef convert_to_https_url_if_cloud_storage_uri(uri: str) -> str:\n    return (\n        f\"{HTTPS_PREFIX}storage.googleapis.com/{uri.removeprefix(GOOGLE_CLOUD_STORAGE_PREFIX)}\"\n        if uri.startswith(GOOGLE_CLOUD_STORAGE_PREFIX)\n        else uri\n    )\n\ndef local_file(filename: str) -> str:\n    return f\"{FILE_PREFIX}{LOCAL_FOLDER}{filename}\"\n\n# You can find public domain books on Project Gutenberg: https://gutenberg.org/ebooks\ndef project_gutenberg_txt_url(id: int) -> str:\n    return f\"{HTTPS_PREFIX}gutenberg.org/cache/epub/{id}/pg{id}.txt\"\n\nclass Classic(Source):\n    en_hugo_les_misérables = project_gutenberg_txt_url(135)\n    en_dumas_count_of_monte_cristo = project_gutenberg_txt_url(1184)\n    fr_zola_thérèse_raquin = project_gutenberg_txt_url(7461)\n    fr_dumas_trois_mousquetaires = project_gutenberg_txt_url(13951)\n    fr_dumas_vingt_ans_après = project_gutenberg_txt_url(13952)\n    fr_dumas_comte_de_monte_cristo_1 = project_gutenberg_txt_url(17989)\n    fr_dumas_comte_de_monte_cristo_2 = project_gutenberg_txt_url(17990)\n    fr_dumas_comte_de_monte_cristo_3 = project_gutenberg_txt_url(17991)\n    fr_dumas_comte_de_monte_cristo_4 = project_gutenberg_txt_url(17992)\n\nclass Document(Source):\n    en_pharma_dev_agreement = 
\"gs://cloud-samples-data/documentai/ContractDocAI/CUAD_v1/Part_I/Development/PhasebioPharmaceuticalsInc_20200330_10-K_EX-10.21_12086810_EX-10.21_Development Agreement.pdf\"\n\nclass Collection(Source):\n    fr_dumas_comte_de_monte_cristo = [\n        Classic.fr_dumas_comte_de_monte_cristo_1,\n        Classic.fr_dumas_comte_de_monte_cristo_2,\n        Classic.fr_dumas_comte_de_monte_cristo_3,\n        Classic.fr_dumas_comte_de_monte_cristo_4,\n    ]\n    fr_dumas_trois_mousquetaires_vingt_ans_après = [\n        Classic.fr_dumas_trois_mousquetaires,\n        Classic.fr_dumas_vingt_ans_après,\n    ]\n\n    def yield_contents(self) -> Iterator[Part]:\n        for source in self.value:\n            yield from source.yield_contents()\n\n    def yield_source_names(self) -> Iterator[str]:\n        for source in self.value:\n            yield from source.yield_source_names()\n\n    def yield_source_links(self) -> Iterator[str]:\n        for source in self.value:\n            yield from source.yield_source_links()\n\ndef display_input_data_caption(source: Source) -> None:\n    names = list(source.yield_source_names())\n    links = list(source.yield_source_links())\n    links = \", \".join(\n        f\"[{name}](<{link}>)\" for name, link in zip(names, links, strict=True)\n    )\n    md = f\"**Input data** ({links})\"\n    display_markdown(md)\n\nprint(\"✅ Data helpers defined\")\n✅ Data helpers defined\n```\n\nGemini comes in different [versions](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models#gemini-models) and sizes (Flash-Lite, Flash, and Pro).\n\nLet's get started with Gemini 3.1 Flash-Lite, as it offers high performance, low latency, and very high output speed:\n\n`GEMINI_3_1_FLASH_LITE = \"gemini-3.1-flash-lite\"`\n\nGemini can be used in different ways, ranging from factual to creative modes. We're essentially dealing with a **data-extraction use case**. We want the results to be as factual and deterministic as possible. 
To achieve this, we can adjust the [content generation parameters](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/content-generation-parameters).\n\nWe'll set the `temperature`, `top_p`, and `seed` parameters to minimize randomness:\n\n- `temperature=0.0`\n- `top_p=0.0`\n- `seed=42` (arbitrary fixed value)\n\n``` python\nfrom enum import StrEnum, auto\n\nimport IPython.display\nimport tenacity\nfrom google.genai.errors import ClientError\nfrom google.genai.types import (\n    FinishReason,\n    GenerateContentConfig,\n    GenerateContentResponse,\n    ThinkingConfig,\n    ThinkingLevel,\n)\n\nclass Model(Enum):\n    GEMINI_3_1_FLASH_LITE = \"gemini-3.1-flash-lite\"\n    GEMINI_3_5_FLASH = \"gemini-3.5-flash\"\n    GEMINI_2_5_FLASH = \"gemini-2.5-flash\"\n    GEMINI_2_5_PRO = \"gemini-2.5-pro\"\n    # Preview\n    GEMINI_3_1_PRO = \"gemini-3.1-pro-preview\"\n    # Default model\n    DEFAULT = GEMINI_3_1_FLASH_LITE\n\n# Default configuration for more deterministic outputs\nDEFAULT_CONFIG = GenerateContentConfig(\n    temperature=0.0,\n    top_p=0.0,\n    seed=42,  # Arbitrary fixed value\n)\n\nclass ShowAs(StrEnum):\n    DONT_SHOW = auto()\n    TEXT = auto()\n    MARKDOWN = auto()\n\ndef generate_content(\n    prompt: str,\n    source: Source | str | None = None,\n    *,\n    model: Model | None = None,\n    config: GenerateContentConfig | None = None,\n    system_instruction: str | None = None,\n    show_prompt: ShowAs = ShowAs.DONT_SHOW,\n    show_response: ShowAs = ShowAs.MARKDOWN,\n    only_show_prompt: bool = False,\n    return_response: bool = False,\n) -> GenerateContentResponse | None:\n    disable_colab_cell_scrollbar()\n\n    model = model or Model.DEFAULT\n    model_id = model.value\n    prompt_contents = get_prompt_contents(prompt, source, show_prompt, only_show_prompt)\n    if only_show_prompt:\n        return None\n    config = config or get_generate_content_config(model, system_instruction)\n    client = 
check_client_for_model(model)\n\n    response = None\n    display_request_header(model_id, source)\n    for attempt in get_retrier():\n        with attempt:\n            response = client.models.generate_content(\n                model=model_id,\n                contents=prompt_contents,  # type: ignore\n                config=config,\n            )\n            display_response_info(response)\n            display_response(response, show_response)\n\n    return response if return_response else None\n\ndef get_prompt_contents(\n    prompt: str,\n    source: Source | str | None,\n    show_prompt: ShowAs,\n    only_show_prompt: bool,\n) -> list[str | Part]:\n    def yield_prompt_contents() -> Iterator[str | Part]:\n        if not source:\n            yield prompt.strip()\n            return\n        yield \"==Start of input data==\\n\"\n        if isinstance(source, str):\n            yield f\"{source.strip()}\\n\"\n        else:\n            yield from source.yield_contents()\n        yield \"==End of input data==\\n\"\n        yield f\"==Start of user prompt==\\n{prompt.strip()}\\n==End of user prompt==\"\n\n    prompt_contents = list(yield_prompt_contents())\n    display_prompt(prompt_contents, show_prompt, only_show_prompt)\n\n    return prompt_contents\n\ndef get_generate_content_config(\n    model: Model,\n    system_instruction: str | None = None,\n) -> GenerateContentConfig:\n    thinking_config = get_thinking_config_for_model(model)\n\n    return GenerateContentConfig(\n        system_instruction=system_instruction,\n        temperature=DEFAULT_CONFIG.temperature,\n        top_p=DEFAULT_CONFIG.top_p,\n        seed=DEFAULT_CONFIG.seed,\n        thinking_config=thinking_config,\n    )\n\ndef get_thinking_config_for_model(model: Model) -> ThinkingConfig | None:\n    # Use minimal thinking configurations since our prompt will directly provide a chain of thought\n    match model:\n        case Model.GEMINI_2_5_FLASH:\n            return 
ThinkingConfig(thinking_budget=0)\n        case Model.GEMINI_2_5_PRO:\n            return ThinkingConfig(thinking_budget=128, include_thoughts=False)\n        case Model.GEMINI_3_1_FLASH_LITE | Model.GEMINI_3_5_FLASH:\n            return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)\n        case Model.GEMINI_3_1_PRO:\n            return ThinkingConfig(thinking_level=ThinkingLevel.LOW)\n        case _:\n            return None  # Default (dynamic thinking is generally enabled)\n\ndef check_client_for_model(model: Model) -> genai.Client:\n    if (\n        model.value.endswith(\"-preview\")\n        and client.vertexai\n        and client._api_client.location != \"global\"\n    ):\n        # Preview models are only available on the \"global\" location\n        return genai.Client(\n            enterprise=client.vertexai,\n            project=client._api_client.project,\n            location=\"global\",\n        )\n\n    return client\n\ndef get_retrier() -> tenacity.Retrying:\n    return tenacity.Retrying(\n        stop=tenacity.stop_after_attempt(7),\n        wait=tenacity.wait_incrementing(start=10, increment=1),\n        retry=tenacity.retry_if_exception(should_retry_request),\n        reraise=True,\n    )\n\ndef should_retry_request(err: BaseException) -> bool:\n    if not isinstance(err, ClientError):\n        return False\n    print(f\"❌ ClientError {err.code}: {err.message}\")\n\n    retry = False\n    match err.code:\n        case 400 if err.message is not None and \" try again \" in err.message:\n            # Workshop: project accessing Cloud Storage for the first time (service agent provisioning)\n            retry = True\n        case 429:\n            # Workshop: temporary project with 1 QPM quota\n            retry = True\n    print(f\"🔄 Retry: {retry}\")\n\n    return retry\n\nPRINT_COLUMNS = 80\nPRINT_SEPARATOR_CHAR = \"-\"\nPRINT_SEPARATOR = PRINT_COLUMNS * PRINT_SEPARATOR_CHAR\n\ndef print_caption(caption: str) -> None:\n    print(f\" 
{caption} \".center(PRINT_COLUMNS, PRINT_SEPARATOR_CHAR))\n\ndef print_separator() -> None:\n    print(PRINT_SEPARATOR)\n\ndef display_response_info(response: GenerateContentResponse) -> None:\n    if usage_metadata := response.usage_metadata:\n        if usage_metadata.prompt_token_count:\n            print(f\"Input tokens    : {usage_metadata.prompt_token_count:9,d}\")\n        if usage_metadata.cached_content_token_count:\n            print(f\"Cached tokens   : {usage_metadata.cached_content_token_count:9,d}\")\n        if usage_metadata.candidates_token_count:\n            print(f\"Output tokens   : {usage_metadata.candidates_token_count:9,d}\")\n        if usage_metadata.thoughts_token_count:\n            print(f\"Thoughts tokens : {usage_metadata.thoughts_token_count:9,d}\")\n    if not response.candidates:\n        print(\"❌ No `response.candidates`\")\n        return\n    if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:\n        print(f\"❌ {finish_reason = }\")\n    if not response.text:\n        print(\"❌ No `response.text`\")\n        return\n\ndef display_prompt(\n    contents: list[str | Part],\n    show_as: ShowAs,\n    only_show_prompt: bool,\n) -> None:\n    def yield_prompt_strings() -> Iterator[str]:\n        for content in contents:\n            if isinstance(content, Part):\n                yield f\"{content!r}\\n\"\n                continue\n            yield content\n\n    if only_show_prompt and show_as == ShowAs.DONT_SHOW:\n        show_as = ShowAs.TEXT\n    if show_as == ShowAs.DONT_SHOW:\n        return\n\n    separator = \"\\n\" if show_as == ShowAs.MARKDOWN else \"\"\n    prompt = separator.join(yield_prompt_strings())\n    print_caption(\"Prompt\")\n    match show_as:\n        case ShowAs.TEXT:\n            print(prompt)\n        case ShowAs.MARKDOWN:\n            display_markdown(prompt)\n    if only_show_prompt:\n        print_separator()\n\ndef display_request_header(model_id: str, source: Source | str 
| None = None) -> None:\n    print_caption(f\"Request / {model_id}\")\n\ndef display_response(response: GenerateContentResponse, show_as: ShowAs) -> None:\n    if show_as == ShowAs.DONT_SHOW or not (response_text := response.text):\n        return\n    print_caption(\"Start of Response\")\n    response_text = response_text.strip()\n    match show_as:\n        case ShowAs.TEXT:\n            print(response_text)\n        case ShowAs.MARKDOWN:\n            display_markdown(response_text)\n    print_caption(\"End of Response\")\n\ndef display_markdown(markdown: str) -> None:\n    IPython.display.display(IPython.display.Markdown(markdown))\n\nprint(\"✅ Helpers defined\")\n✅ Helpers defined\n```\n\nBefore diving into a solution, it helps to start by prototyping to build some intuition about the natural behavior of the model.\n\nLet's define a short text of a few sentences:\n\n```\ntext = \"\"\"\n- Henry Jones is a famous archaeologist. He is actually a \"Junior\" because he is named after his father.\n- Sophie is Henry's daughter, shares his last name, and works as a software engineer.\n- William Smith is an aerospace engineer and Sophie's lifelong friend. Everybody calls him Bill and Beau is his dog.\n- Short Round met Henry as a child. 
They first became close friends, and Henry officially adopted him a few years later.\n- Sophie and Bill both work at Acme Aerospace.\n\"\"\"\n```\n\n🧪 Let's see if Gemini can spot our characters…\n\n```\nprompt = \"\"\"\nUsing only the input data, list all people and animals mentioned.\n\"\"\"\ngenerate_content(prompt, text)\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :       148\nOutput tokens   :        67\n------------------------------ Start of Response -------------------------------\nBased on the input data provided, here are the people and animals mentioned:\n\n**People:**\n*   Henry Jones (also known as Henry Jones Junior)\n*   Sophie Jones\n*   William Smith (also known as Bill)\n*   Short Round\n\n**Animals:**\n*   Beau (Bill's dog)\n------------------------------- End of Response --------------------------------\n```\n\n💡 All people and animals are detected as expected.\n\n🧪 Now, let's see if it can connect the dots and figure out who's who…\n\n```\nprompt = \"\"\"\nUsing only the input data, list all people and animals mentioned, and how they relate to each other.\n\"\"\"\ngenerate_content(prompt, text)\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :       156\nOutput tokens   :       168\n------------------------------ Start of Response -------------------------------\nBased on the input data provided, here are the people and animals mentioned and their relationships:\n\n**People:**\n*   **Henry Jones (Junior):** A famous archaeologist. He is the father of Sophie, the adoptive father of Short Round, and is named after his own father.\n*   **Sophie Jones:** A software engineer at Acme Aerospace. She is the daughter of Henry Jones and a lifelong friend of Bill (William Smith).\n*   **William (Bill) Smith:** An aerospace engineer at Acme Aerospace. 
He is a lifelong friend of Sophie and the owner of Beau.\n*   **Short Round:** The adopted son of Henry Jones. He met Henry as a child and they became close friends before the adoption.\n\n**Animals:**\n*   **Beau:** A dog owned by William (Bill) Smith.\n------------------------------- End of Response --------------------------------\n```\n\n💡 Notes\n\nWe're not domain experts in the field we're exploring (yet!).\n\nAn LLM processes instructions based on the given prompt and its training knowledge. This knowledge is part of its long-term memory, and we can learn a lot directly from the model itself.\n\n🧪 Let's ask Gemini:\n\n```\nprompt = \"\"\"\nWhat is the terminology used when building a knowledge graph?\nPlease provide a simple data example in JSON.\n\"\"\"\ngenerate_content(prompt)\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :        21\nOutput tokens   :       581\n------------------------------ Start of Response -------------------------------\nBuilding a knowledge graph involves representing information as a network of interconnected entities. Here is the core terminology and a simple data example.\n\n### Core Terminology\n\n1.  **Entity (Node):** The \"things\" in your graph (e.g., a person, a place, a product).\n2.  **Relationship (Edge/Link):** The connection between two entities. It describes how they interact (e.g., \"works at,\" \"lives in,\" \"is a friend of\").\n3.  **Property (Attribute):** Key-value pairs that provide more detail about an entity or a relationship (e.g., a person's \"age\" or a relationship's \"start_date\").\n4.  **Label:** A category assigned to a node or edge to define its type (e.g., a node might have the label \"Person\").\n5.  **Schema (Ontology):** The formal structure or \"blueprint\" that defines the types of entities allowed and the rules for how they can be connected.\n6.  
**Triple:** The fundamental unit of a knowledge graph, consisting of a **Subject → Predicate → Object** (e.g., *Alice* → *works at* → *Google*).\n\n---\n\n### Simple Data Example (JSON)\n\nIn a knowledge graph, data is often represented as a collection of **Nodes** and **Edges**.\n\n``` json\n{\n\"nodes\": [\n    {\n    \"id\": \"1\",\n    \"label\": \"Person\",\n    \"properties\": {\n        \"name\": \"Alice\",\n        \"age\": 30\n    }\n    },\n    {\n    \"id\": \"2\",\n    \"label\": \"Company\",\n    \"properties\": {\n        \"name\": \"Google\",\n        \"industry\": \"Technology\"\n    }\n    }\n],\n\"edges\": [\n    {\n    \"id\": \"e1\",\n    \"source\": \"1\",\n    \"target\": \"2\",\n    \"label\": \"WORKS_AT\",\n    \"properties\": {\n        \"since\": 2020\n    }\n    }\n]\n}\n```\n\n### Breakdown of the Example:\n\n*   **Nodes:** We have two entities: \"Alice\" (a Person) and \"Google\" (a Company).\n*   **Edge:** We have one relationship: \"WORKS_AT\" connecting Alice to Google.\n*   **Properties:** We stored specific details like Alice's age and the year she started working at Google.\n*   **Triple representation:** This JSON effectively encodes the triple: **(Alice) —[WORKS_AT]—> (Google)**.\n------------------------------- End of Response --------------------------------\n```\n\n💡 We learn that knowledge graphs are made of **entities** and **relationships**, also called **nodes** and **edges**, and we get a nice introduction to the field. Using domain terminology will help make our prompts explicit and precise.\n\nTo extract knowledge graphs, we'll reason in terms of entities and relationships, adopting domain terminology.\n\nIf we think of the final result as a database, our goal is to generate two linked tables, allowing us to reason in terms of data and fields.\n\nHere is a conceptual view of what we want to achieve:\n\n**Entities**\n\n| `id` | `name` | `label` |\n|---|---|---|\n| 0 | Henry Jones Jr. | person |\n| 1 | Henry Jones Sr. 
| person |\n\n**Relationships**\n\n| `source_id` | `link` | `target_id` |\n|---|---|---|\n| 0 | child_of | 1 |\n\nLet's call this approach \"tabular extraction\" and split our instructions to output two successive tables, while still using a single request…\n\nIn our prototype text, the entities we want to extract are characters (like people or animals). We can define an entity data schema with the fields `id` (0, 1, 2…), `name` (full name of the entity), and `label` (`person`|`animal`).\n\n🧪 Let's extract the entities:\n\n```\nprompt = \"\"\"\n**Data Schema**\n\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Full name of the entity.\n- `label`: `person`|` animal`.\n\n**Instructions**\n\n1. Entity Extraction:\n   - Extract every distinct entity from the input data that matches an allowed `label`.\n   - Include entities that are explicitly named as well as implied entities whose names can be determined from the context.\n2. Output the results as a JSON array inside a fenced code block.\n\"\"\"\n\ngenerate_content(prompt, text)\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :       249\nOutput tokens   :       195\n------------------------------ Start of Response -------------------------------\n[\n  {\n    \"id\": 0,\n    \"name\": \"Henry Jones Jr.\",\n    \"label\": \"person\"\n  },\n  {\n    \"id\": 1,\n    \"name\": \"Henry Jones Sr.\",\n    \"label\": \"person\"\n  },\n  {\n    \"id\": 2,\n    \"name\": \"Sophie Jones\",\n    \"label\": \"person\"\n  },\n  {\n    \"id\": 3,\n    \"name\": \"William Smith\",\n    \"label\": \"person\"\n  },\n  {\n    \"id\": 4,\n    \"name\": \"Beau\",\n    \"label\": \"animal\"\n  },\n  {\n    \"id\": 5,\n    \"name\": \"Short Round\",\n    \"label\": \"person\"\n  }\n]\n------------------------------- End of Response --------------------------------\n```\n\n💡 Remarks\n\n🧪 Now, let's extract both the entities and their 
relationships:\n\n```\nprompt = \"\"\"\n**Data Schema**\n\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Full name of the entity.\n- `label`: `person`|`animal`.\n\nRelationship:\n- `source_id`: `id` of the subject entity.\n- `link`: `snake_case` predicate describing the relationship.\n- `target_id`: `id` of the object entity.\n\n**Instructions**\n\n1. Entity Extraction:\n   - Extract every distinct entity from the input data that matches an allowed `label`.\n   - Include entities that are explicitly named as well as implied entities whose names can be determined from the context.\n2. Relationship Extraction:\n   - Extract every distinct relationship between them.\n   - If a relationship changes over time, make sure to include every distinct stage of the relationship.\n3. Output a JSON object with keys `entities` and `relationships` inside a fenced code block.\n\"\"\"\n\nresponse = generate_content(prompt, text, return_response=True)\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :       340\nOutput tokens   :       456\n------------------------------ Start of Response -------------------------------\n{\n  \"entities\": [\n    {\n      \"id\": 0,\n      \"name\": \"Henry Jones Jr.\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 1,\n      \"name\": \"Henry Jones Sr.\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 2,\n      \"name\": \"Sophie Jones\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 3,\n      \"name\": \"William Smith\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 4,\n      \"name\": \"Beau\",\n      \"label\": \"animal\"\n    },\n    {\n      \"id\": 5,\n      \"name\": \"Short Round\",\n      \"label\": \"person\"\n    }\n  ],\n  \"relationships\": [\n    {\n      \"source_id\": 0,\n      \"link\": \"child_of\",\n      \"target_id\": 1\n    },\n    {\n      \"source_id\": 2,\n      \"link\": \"child_of\",\n      
\"target_id\": 0\n    },\n    {\n      \"source_id\": 3,\n      \"link\": \"friend_of\",\n      \"target_id\": 2\n    },\n    {\n      \"source_id\": 4,\n      \"link\": \"pet_of\",\n      \"target_id\": 3\n    },\n    {\n      \"source_id\": 5,\n      \"link\": \"friend_of\",\n      \"target_id\": 0\n    },\n    {\n      \"source_id\": 0,\n      \"link\": \"adopted\",\n      \"target_id\": 5\n    },\n    {\n      \"source_id\": 5,\n      \"link\": \"child_of\",\n      \"target_id\": 0\n    }\n  ]\n}\n------------------------------- End of Response --------------------------------\n```\n\n💡 Remarks\n\n- `relationships` array.\n- `link` predicates are completely dynamic (a level of flexibility we left in the prompt). While it's interesting to see this natural behavior, we'll want to make it more deterministic for production since our prompt has too many degrees of freedom.\n- \"[Animal] `pet_of` [Person]\" is an asymmetric relationship that could also be extracted as \"[Person] `owner_of` [Animal]\". This is another area where the prompt is too open-ended. In the finalization section, we'll see an example that asks the model to extract symmetric and asymmetric relationships in both directions.\n\nWe've concluded our prototyping stage with promising results using a data schema.\n\nTo move to production, the next step is to control the generation with a specific structured output.\n\nThe JSON format has industry-wide support and serves as a core or intermediate format in many use cases.\n\nFor the next step, we would typically define classes using the Pydantic library and request a pure JSON output with a response schema in the config parameters:\n\n- `response_mime_type=\"application/json\"`\n- `response_schema=\"CLASS_DERIVED_FROM_PYDANTIC_BASE_MODEL\"`\n\n⚠️ However, JSON is a pretty verbose format, designed for interoperability but not optimized for size. 
Even if we generate compact JSON (also called minified JSON), it still has inherent verbosity due to:\n\nℹ️ When using LLMs, once the first token is generated, the remaining generation time is roughly proportional to the number of output tokens. Similarly, the cost of a request is based on token usage (input + output), with output tokens being significantly more expensive than input tokens.\n\n💡 A better output structure will positively impact both generation speed and request cost.\n\nLet's explore an alternative…\n\nOur tabular-extraction problem clearly calls for table outputs. An interesting possibility is to ask for Tab-Separated Values (TSV) outputs. For example, we can define our output to be formatted as two consecutive TSV tables.\n\n**Example output format**\n\ntsv filename=\"entities.tsv\"\nid{TAB}name{TAB}label\n[rows]\ntsv filename=\"relationships.tsv\"\nsource_id{TAB}link{TAB}target_id\n[rows]\n**Will this work?**\n\nGenerating structured outputs like TSV will work seamlessly, as Gemini excels at patterns. We just need to be explicit about what's expected.\n\n**Will this be efficient?**\n\n💡 For our use case, this structure looks optimal:\n\nℹ️ CSV could be another alternative, but common separators like commas are everywhere in natural language and frequently appear in names and descriptions (e.g., if we decide to extend entity fields). 
If you're interested in this topic, check out the [TOON format](https://github.com/toon-format/toon), which proposes a JSON alternative using a YAML+CSV mix.\n\nTo make an informed decision, we should actually compare the number of tokens needed to represent the same data…\n\n``` python\nimport csv\nimport io\nimport json\nimport re\nfrom typing import Literal\n\ndef get_data_from_response(response: GenerateContentResponse) -> dict:\n    response_text = response.text or \"\"\n    pattern = r\"```json\\s*(.*?)\\s*```\"\n    match = re.search(pattern, response_text, re.DOTALL)\n    json_str = match.group(1) if match else response_text\n    try:\n        data = json.loads(json_str)\n        if not isinstance(data, dict):\n            print(\"❌ Returning empty dict (could not parse response as dict)\")\n            data = {}\n    except (json.JSONDecodeError, TypeError):\n        print(\"❌ Returning empty dict (failed parsing the JSON string)\")\n        data = {}\n    return data\n\ndef get_tsv_string_from_data(data: dict) -> str:\n    output = \"\"\n    for key, items in data.items():\n        rows = \"\"\n        if items:\n            with io.StringIO() as out:\n                headers = list(items[0].keys())\n                writer = csv.DictWriter(\n                    out,\n                    fieldnames=headers,\n                    delimiter=\"\\t\",\n                    lineterminator=\"\\n\",\n                )\n                writer.writeheader()\n                writer.writerows(items)\n                rows = out.getvalue()\n        if output:\n            output += \"\\n\"\n        output += f'```tsv filename=\"{key}.tsv\"\\n{rows}```\\n'\n    return output\n\ndef print_text_excerpt(title: str, text: str, max_chars: int = 400) -> None:\n    assert max_chars > 0\n    chars = len(text)\n    if chars <= 0:\n        print_caption(\"❌ Empty text\")\n        return\n    if chars <= max_chars:\n        
print_caption(f\"{title} ({chars} chars)\")\n        print(text)\n        return\n    print_caption(\n        f\"{title} - First {max_chars}/{chars} chars ({max_chars / chars:.0%})\"\n    )\n    print(f\"{text[:max_chars]}…\")\n\ndef compare_json_vs_tsv(\n    response: GenerateContentResponse | None,\n    only_show_excerpts: bool = False,\n) -> None:\n    def get_gain(rows: list[tuple[int, int]], col: Literal[0, 1]) -> str:\n        val_0, val_1 = rows[0][col], rows[1][col]\n        return f\"**{1 - val_1 / val_0:.1%}**\" if val_0 > 0 else \"?\"\n\n    def yield_table_string_rows(\n        source_title: str,\n        target_title: str,\n        rows: list[tuple[int, int]],\n    ) -> Iterator[list[str]]:\n        yield [\"\", \"Chars\", \"Tokens\"]\n        yield [\"-\", \"-:\", \"-:\"]\n        for caption, values in zip([source_title, target_title], rows):\n            yield [caption, *[str(value) for value in values]]\n        yield [\"**Gain**\", get_gain(rows, 0), get_gain(rows, 1)]\n\n    def display_gain_table(\n        source_title: str,\n        target_title: str,\n        source_text: str,\n        target_text: str,\n    ) -> None:\n        print_caption(f\"{source_title} → {target_title}\")\n        model = Model.DEFAULT\n        model_id = model.value\n        client = check_client_for_model(model)\n\n        rows: list[tuple[int, int]] = []\n        for s in [source_text, target_text]:\n            chars = len(s)\n            tokens = client.models.count_tokens(model=model_id, contents=s).total_tokens\n            rows.append((chars, tokens or 0))\n        markdown = \"\\n\".join(\n            \"| \" + \" | \".join(row) + \" |\"\n            for row in yield_table_string_rows(source_title, target_title, rows)\n        )\n        display_markdown(markdown)\n\n    if response is None:\n        print(\"❌ response is None\")\n        return\n    data = get_data_from_response(response)\n    formatted_json = f\"```json\\n{json.dumps(data, 
indent=2)}\\n```\"\n    compact_json = f\"```json\\n{json.dumps(data, separators=(',', ':'))}\\n```\"\n    tsv = get_tsv_string_from_data(data)\n\n    if only_show_excerpts:\n        max_chars = len(tsv)\n        print_text_excerpt(\"Formatted JSON\", formatted_json, max_chars)\n        print_text_excerpt(\"Compact JSON\", compact_json, max_chars)\n        print_text_excerpt(\"TSV\", tsv, max_chars)\n        return\n\n    display_gain_table(\"Formatted JSON\", \"Compact JSON\", formatted_json, compact_json)\n    display_gain_table(\"Compact JSON\", \"TSV\", compact_json, tsv)\n    display_gain_table(\"Formatted JSON\", \"TSV\", formatted_json, tsv)\n    print_separator()\n\nprint(\"✅ JSON vs TSV helpers defined\")\n✅ JSON vs TSV helpers defined\n```\n\n🧪 First, let's compare how much data we can represent for the same number of characters based on our latest API response:\n\n```\ncompare_json_vs_tsv(response, only_show_excerpts=True)\n----------------- Formatted JSON - First 335/1122 chars (30%) ------------------\n``` json\n{\n  \"entities\": [\n    {\n      \"id\": 0,\n      \"name\": \"Henry Jones Jr.\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 1,\n      \"name\": \"Henry Jones Sr.\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 2,\n      \"name\": \"Sophie Jones\",\n      \"label\": \"person\"\n    },\n    {\n      \"id\": 3,\n      \"name\": \"William Smith\",\n     …\n------------------- Compact JSON - First 335/665 chars (50%) -------------------\n``` json\n{\"entities\":[{\"id\":0,\"name\":\"Henry Jones Jr.\",\"label\":\"person\"},{\"id\":1,\"name\":\"Henry Jones Sr.\",\"label\":\"person\"},{\"id\":2,\"name\":\"Sophie Jones\",\"label\":\"person\"},{\"id\":3,\"name\":\"William Smith\",\"label\":\"person\"},{\"id\":4,\"name\":\"Beau\",\"label\":\"animal\"},{\"id\":5,\"name\":\"Short Round\",\"label\":\"person\"}],\"relationships\":[{\"source_i…\n------------------------------- TSV (335 
chars) --------------------------------\n``` tsv filename=\"entities.tsv\"\nid  name    label\n0   Henry Jones Jr. person\n1   Henry Jones Sr. person\n2   Sophie Jones    person\n3   William Smith   person\n4   Beau    animal\n5   Short Round person\ntsv filename=\"relationships.tsv\"\nsource_id   link    target_id\n0   child_of    1\n2   child_of    0\n3   friend_of   2\n4   pet_of  3\n5   friend_of   0\n0   adopted 5\n5   child_of    0\n```\n\n💡 Notice how much more information can be represented in the same number of text characters. This will apply similarly to token counts.\n\n🧪 And now, let's compare the gains, especially for token counts:\n\n```\ncompare_json_vs_tsv(response)\n------------------------ Formatted JSON → Compact JSON -------------------------\n```\n\n| | Chars | Tokens |\n|---|--:|--:|\n| Formatted JSON | 1122 | 456 |\n| Compact JSON | 665 | 216 |\n| **Gain** | **40.7%** | **52.6%** |\n\n```\n------------------------------ Compact JSON → TSV ------------------------------\n```\n\n| | Chars | Tokens |\n|---|--:|--:|\n| Compact JSON | 665 | 216 |\n| TSV | 335 | 137 |\n| **Gain** | **49.6%** | **36.6%** |\n\n```\n----------------------------- Formatted JSON → TSV -----------------------------\n```\n\n| | Chars | Tokens |\n|---|--:|--:|\n| Formatted JSON | 1122 | 456 |\n| TSV | 335 | 137 |\n| **Gain** | **70.1%** | **70.0%** |\n\n```\n--------------------------------------------------------------------------------\n```\n\n💡 **Savings in output tokens:**\n\nWith about 37% fewer output tokens than compact JSON (and 70% fewer than formatted JSON) in this example, building knowledge graphs with TSV outputs is significantly faster (and cheaper)!\n\nNow, let's finalize our code with optimized structures…\n\nFirst, it helps to define a structured prompt template, so we can focus on specific parts of our solution using a divide-and-conquer approach.\n\nHere's a possible prompt template:\n\n```\nKNOWLEDGE_GRAPH_PROMPT_TEMPLATE = \"\"\"\n**Data Schema**\n\n{data_schema}\n\n**Instructions**\n\n{instructions}\n\n**Output 
Format**\n\n{output_format}\n\"\"\"\n```\n\nThen, here are some possible `Entity`\n\n, `Relationship`\n\n, and `KnowledgeGraph`\n\ndata classes with the matching output format:\n\n``` python\nfrom dataclasses import dataclass, field\n\n@dataclass\nclass Entity:\n    id: int\n    name: str\n    label: str\n\n@dataclass\nclass Relationship:\n    source_id: int\n    link: str\n    target_id: int\n\n@dataclass\nclass KnowledgeGraph:\n    entities: list[Entity] = field(default_factory=list)\n    relationships: list[Relationship] = field(default_factory=list)\n\nTAB = \"\\t\"\nKNOWLEDGE_GRAPH_OUTPUT_FORMAT = f\"\"\"\nFormat the output strictly as two TSV code blocks (including the header row):\n\n``` tsv filename=\"entities.tsv\"\nid{TAB}name{TAB}label\n[data_rows]\ntsv filename=\"relationships.tsv\"\nsource_id{TAB}link{TAB}target_id\n[data_rows]\n```\n\"\"\"\n```\n\n💡 While the Gen AI SDK natively supports Pydantic models for JSON structured outputs, we're using standard Python data classes here and TSV outputs to maximize our token efficiency.\n\nℹ️ If you use multiple entity or relationship data classes in your solution, you can dynamically generate the output format specification using features of the `dataclasses`\n\npackage (like class docstrings and field descriptions).\n\n``` python\nfrom dataclasses import fields, is_dataclass\nfrom typing import get_args, get_origin, get_type_hints\n\ndef generate_knowledge_graph(\n    data_schema: str,\n    instructions: str,\n    source: Source | str,\n    *,\n    model: Model | None = None,\n    config: GenerateContentConfig | None = None,\n    system_instruction: str | None = None,\n    show_prompt: ShowAs = ShowAs.DONT_SHOW,\n    show_response: ShowAs = ShowAs.DONT_SHOW,\n) -> KnowledgeGraph:\n    prompt = get_prompt_for_data_schema_and_instructions(data_schema, instructions)\n    response = generate_content(\n        prompt,\n        source,\n        model=model,\n        config=config,\n        
system_instruction=system_instruction,\n        show_prompt=show_prompt,\n        show_response=show_response,\n        return_response=True,\n    )\n\n    knowledge_graph = (\n        parse_list_dataclass(KnowledgeGraph, response)\n        if isinstance(response, GenerateContentResponse)\n        else KnowledgeGraph()\n    )\n    display_knowledge_graph_info(knowledge_graph)\n\n    return knowledge_graph\n\ndef show_knowledge_graph_prompt(\n    data_schema: str,\n    instructions: str,\n    source: Source | str,\n    *,\n    model: Model | None = None,\n    config: GenerateContentConfig | None = None,\n    system_instruction: str | None = None,\n    show_as: ShowAs = ShowAs.TEXT,\n) -> None:\n    prompt = get_prompt_for_data_schema_and_instructions(data_schema, instructions)\n    generate_content(\n        prompt,\n        source,\n        model=model,\n        config=config,\n        system_instruction=system_instruction,\n        show_prompt=show_as,\n        only_show_prompt=True,\n    )\n\ndef get_prompt_for_data_schema_and_instructions(\n    data_schema: str,\n    instructions: str,\n) -> str:\n    return KNOWLEDGE_GRAPH_PROMPT_TEMPLATE.format(\n        data_schema=data_schema.strip(),\n        instructions=instructions.strip(),\n        output_format=KNOWLEDGE_GRAPH_OUTPUT_FORMAT.strip(),\n    ).strip()\n\ndef parse_list_dataclass[T](cls: type[T], response: GenerateContentResponse) -> T:\n    assert is_dataclass(cls)\n    if not (response_text := response.text):\n        return cls()\n\n    data = {}\n    for f in fields(cls):\n        origin, list_types = get_origin(f.type), get_args(f.type)\n        assert (\n            origin is list\n        ), f\"❌ Field {f.name} must be a list[dataclass] parameterized list\"\n        assert len(list_types) == 1, f\"❌ Expected 1 single type: {len(list_types)=}\"\n        data[f.name] = parse_tsv_block(list_types[0], response_text, f.name)\n\n    return cls(**data)\n\ndef parse_tsv_block[T](cls: type[T], data: str, 
tsv_filestem: str) -> list[T]:\n    rows = []\n    valid_fields = get_type_hints(cls)\n    tsv_string = extract_tsv_block(data, tsv_filestem)\n    for row in csv.DictReader(io.StringIO(tsv_string), delimiter=\"\\t\"):\n        casted_data = {}\n        for key, value in row.items():\n            if key not in valid_fields or value is None:\n                continue\n            field_type = valid_fields[key]\n            try:  # Note: Works only for directly castable types such as int, float, str, enum (e.g., not bool)\n                casted_data[key] = field_type(value)\n            except (ValueError, TypeError):\n                print(f'❌ Could not cast \"{value}\" to {field_type} → Skipping {row=}')\n                break\n        else:\n            try:\n                rows.append(cls(**casted_data))\n            except TypeError as e:\n                print(f\"❌ Could not instantiate {cls.__name__}: {e} → Skipping {row=}\")\n\n    return rows\n\ndef extract_tsv_block(data: str, filestem: str) -> str:\n    pattern = rf'```tsv filename=\"{re.escape(filestem)}.tsv\"\\s*\\n(.*?)```'\n    if not (match := re.search(pattern, data, flags=re.DOTALL)):\n        print(f'❌ Could not find a TSV block for \"{filestem=}\"')\n        return \"\"\n    return match.group(1)\n\ndef display_knowledge_graph_info(kg: KnowledgeGraph) -> None:\n    print_caption(\"Knowledge Graph Info\")\n    print(f\"Entities      : {len(kg.entities):3,d}\")\n    print(f\"Relationships : {len(kg.relationships):3,d}\")\n    print_separator()\n\nprint(\"✅ Knowledge graph generation helpers defined\")\n✅ Knowledge graph generation helpers defined\n```\n\nAnd here is a possible data schema with some instructions to generate a knowledge graph for our book analysis use case:\n\n``` python\nfrom enum import StrEnum, auto\n\nclass BookAnalysisEntityLabel(StrEnum):\n    PERSON = auto()\n    ANIMAL = auto()\n    ORGANIZATION = auto()\n\ndef pipe_delimited_union(enum: 
type[StrEnum]) -> str:\n    return \"|\".join(f\"`{e.value}`\" for e in enum)\n\nBOOK_ANALYSIS_DATA_SCHEMA = f\"\"\"\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Most complete name as exclusively determined from the input data.\n- `label`: {pipe_delimited_union(BookAnalysisEntityLabel)}.\n\nRelationship:\n- `source_id`: `id` of the subject entity.\n- `link`: `snake_case` predicate.\n- `target_id`: `id` of the object entity.\n\"\"\"\n\nBOOK_ANALYSIS_INSTRUCTIONS = \"\"\"\n- Extract every distinct entity:\n  - Treat distinct pseudonyms/identities as separate entities.\n  - Include implied entities whose names can be exclusively determined from the input data.\n- Extract every distinct relationship between them:\n  - Use specific `link` predicates in `snake_case` as needed (e.g., `alias_of`, `son_of`, `fiancée_of`, `friend_of`, `murderer_of`, `employer_of`, `in_love_with`, `rival_of`).\n  - If a relationship changes over time, make sure to include every distinct stage of the relationship.\n  - For every asymmetric relationship extracted, make sure to include the logical inverse relationship (e.g., `A husband_of B` AND `B wife_of A`, `A employer_of B` AND `B employee_of A`).\n  - For every symmetric relationship extracted, make sure to include both directions (e.g., `A friend_of B` AND `B friend_of A`).\n\"\"\"\n```\n\nVerify the structured prompt:\n\n```\nshow_knowledge_graph_prompt(\n    BOOK_ANALYSIS_DATA_SCHEMA,\n    BOOK_ANALYSIS_INSTRUCTIONS,\n    text,\n    show_as=ShowAs.TEXT,\n)\n------------------------------------ Prompt ------------------------------------\n==Start of input data==\n- Henry Jones is a famous archaeologist. He is actually a \"Junior\" because he is named after his father.\n- Sophie is Henry's daughter, shares his last name, and works as a software engineer.\n- William Smith is an aerospace engineer and Sophie's lifelong friend. Everybody calls him Bill and Beau is his dog.\n- Short Round met Henry as a child. 
They first became close friends, and Henry officially adopted him a few years later.\n- Sophie and Bill both work at Acme Aerospace.\n==End of input data==\n==Start of user prompt==\n**Data Schema**\n\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Most complete name as exclusively determined from the input data.\n- `label`: `person`|` animal`|` organization`.\n\nRelationship:\n- `source_id`: `id` of the subject entity.\n- `link`: `snake_case` predicate.\n- `target_id`: `id` of the object entity.\n\n**Instructions**\n\n- Extract every distinct entity:\n  - Treat distinct pseudonyms/identities as separate entities.\n  - Include implied entities whose names can be exclusively determined from the input data.\n- Extract every distinct relationship between them:\n  - Use specific `link` predicates in `snake_case` as needed (e.g., `alias_of`, `son_of`, `fiancée_of`, `friend_of`, `murderer_of`, `employer_of`, `in_love_with`, `rival_of`).\n  - If a relationship changes over time, make sure to include every distinct stage of the relationship.\n  - For every asymmetric relationship extracted, make sure to include the logical inverse relationship (e.g., `A husband_of B` AND `B wife_of A`, `A employer_of B` AND `B employee_of A`).\n  - For every symmetric relationship extracted, make sure to include both directions (e.g., `A friend_of B` AND `B friend_of A`).\n\n**Output Format**\n\nFormat the output strictly as two TSV code blocks (including the header row):\n\n``` tsv filename=\"entities.tsv\"\nid  name    label\n[data_rows]\ntsv filename=\"relationships.tsv\"\nsource_id   link    target_id\n[data_rows]\n```\n==End of user prompt==\n--------------------------------------------------------------------------------\n```\n\n🧪 Let's generate a knowledge graph:\n\n```\nknowledge_graph = generate_knowledge_graph(\n    BOOK_ANALYSIS_DATA_SCHEMA,\n    BOOK_ANALYSIS_INSTRUCTIONS,\n    text,\n    
show_response=ShowAs.TEXT,\n)\n\nprint(knowledge_graph.entities)\nprint(knowledge_graph.relationships)\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :       534\nOutput tokens   :       244\n------------------------------ Start of Response -------------------------------\n``` tsv filename=\"entities.tsv\"\nid  name    label\n0   Henry Jones Jr. person\n1   Henry Jones Sr. person\n2   Sophie Jones    person\n3   William Smith   person\n4   Bill    person\n5   Beau    animal\n6   Short Round person\n7   Acme Aerospace  organization\ntsv filename=\"relationships.tsv\"\nsource_id   link    target_id\n0   son_of  1\n1   father_of   0\n0   father_of   2\n2   daughter_of 0\n2   friend_of   3\n3   friend_of   2\n3   alias_of    4\n4   alias_of    3\n3   employee_of 7\n7   employer_of 3\n2   employee_of 7\n7   employer_of 2\n3   owner_of    5\n5   pet_of  3\n6   friend_of   0\n0   friend_of   6\n0   adopted_father_of   6\n6   adopted_son_of  0\n```\n------------------------------- End of Response --------------------------------\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :   8\nRelationships :  18\n--------------------------------------------------------------------------------\n[Entity(id=0, name='Henry Jones Jr.', label='person'), Entity(id=1, name='Henry Jones Sr.', label='person'), Entity(id=2, name='Sophie Jones', label='person'), Entity(id=3, name='William Smith', label='person'), Entity(id=4, name='Bill', label='person'), Entity(id=5, name='Beau', label='animal'), Entity(id=6, name='Short Round', label='person'), Entity(id=7, name='Acme Aerospace', label='organization')]\n[Relationship(source_id=0, link='son_of', target_id=1), Relationship(source_id=1, link='father_of', target_id=0), Relationship(source_id=0, link='father_of', target_id=2), Relationship(source_id=2, link='daughter_of', target_id=0), Relationship(source_id=2, link='friend_of', target_id=3), 
Relationship(source_id=3, link='friend_of', target_id=2), Relationship(source_id=3, link='alias_of', target_id=4), Relationship(source_id=4, link='alias_of', target_id=3), Relationship(source_id=3, link='employee_of', target_id=7), Relationship(source_id=7, link='employer_of', target_id=3), Relationship(source_id=2, link='employee_of', target_id=7), Relationship(source_id=7, link='employer_of', target_id=2), Relationship(source_id=3, link='owner_of', target_id=5), Relationship(source_id=5, link='pet_of', target_id=3), Relationship(source_id=6, link='friend_of', target_id=0), Relationship(source_id=0, link='friend_of', target_id=6), Relationship(source_id=0, link='adopted_father_of', target_id=6), Relationship(source_id=6, link='adopted_son_of', target_id=0)]\n```\n\n💡 This is looking good!\n\nNow, let's go to the next stage and build a network graph from our data…\n\nNow that we have our entities and relationships neatly packed into data classes, let's bring them to life. We'll use `networkx` to build a network graph. Using domain terminology, entities become nodes and relationships become directed edges. 
We'll also calculate node centralities to identify key nodes and use the Louvain method to detect communities (clusters of closely related nodes)…\n\n``` python\nimport textwrap\nfrom typing import cast\n\nimport networkx as nx\nimport numpy as np\nfrom networkx.algorithms.community.louvain import louvain_communities\n\nNODE_CENTRALITY = \"node_centrality\"\nNODE_COMMUNITY_INDEX = \"node_community_index\"\nNODE_COLOR = \"node_color\"\nEDGE_COLOR = \"edge_color\"\n# Wrap names over multiple lines (avoids long node labels)\nMULTILINE_NAMES = True\nMULTILINE_CHARS = 12\n\ndef build_graph(kg: KnowledgeGraph, remove_orphan_nodes: bool) -> nx.DiGraph:\n    # For simplicity, we build a DiGraph (adapt to MultiDiGraph if needed)\n    graph = nx.DiGraph()\n\n    # Nodes ← Entities\n    node_name_from_id: dict[int, str] = {}\n    for entity in kg.entities:\n        node_name, display_name = get_node_and_display_names_for_entity(entity)\n        node_name_from_id[entity.id] = node_name\n        graph.add_node(node_name, name=display_name)\n\n    # Edges ← Relationships\n    for relationship in kg.relationships:\n        source_node = node_name_from_id.get(relationship.source_id, \"\")\n        target_node = node_name_from_id.get(relationship.target_id, \"\")\n        if not source_node or not target_node:\n            print(f\"❌ Skipping relationship due to empty node:\\n{relationship}\")\n            continue\n        # For simplicity, each link has the same weight\n        # For better community detection, define your own weight mappings (e.g., family members have stronger bonds)\n        weight = 1\n        edge_label = relationship.link\n        if graph.has_edge(source_node, target_node):\n            existing_data = graph[source_node][target_node]\n            existing_data[\"link\"] += f\"\\n{edge_label}\"\n            existing_data[\"weight\"] += weight\n        else:\n            graph.add_edge(source_node, target_node, link=edge_label, weight=weight)\n\n    if 
remove_orphan_nodes:\n        graph.remove_nodes_from(list(nx.isolates(graph)))\n\n    return graph\n\ndef get_node_and_display_names_for_entity(entity: Entity) -> tuple[str, str]:\n    snake_case_name = \"_\".join(map(str.lower, entity.name.split()))\n    node_name = f\"{entity.id}_{snake_case_name}\"\n\n    display_name = entity.name\n    if MULTILINE_NAMES:\n        display_name = \"\\n\".join(textwrap.wrap(display_name, width=MULTILINE_CHARS))\n\n    return node_name, display_name\n\ndef color_gen(color_count: int) -> Iterator[str]:\n    B50, R50, Y50, G50 = (\"#4285F4\", \"#EA4335\", \"#FBBC04\", \"#34A853\")\n    B20, R20, Y20, G20 = (\"#AECBFA\", \"#F6AEA9\", \"#FDE293\", \"#A8DAB5\")\n    B05, R05, Y05, G05 = (\"#E8F0FE\", \"#FCE8E6\", \"#FEF7E0\", \"#E6F4EA\")\n    COLORS = [B50, R50, Y50, G50, B20, R20, Y20, G20, B05, R05, Y05, G05]\n    for i in range(color_count):\n        yield COLORS[i % len(COLORS)]\n\nNode = str\nPositions = dict[Node, np.ndarray]\nCommunity = set[Node]\nCommunities = list[Community]\nNodes = list[Node]\nINTER_COMMUNITY_EDGE_COLOR = \"#8888\"\n\ndef init_graph_data(graph: nx.Graph) -> Nodes:\n    def node_centrality(node: Node) -> float:\n        return graph.nodes[node][NODE_CENTRALITY]\n\n    def community_max_centrality(community: Community) -> float:\n        return max(node_centrality(node) for node in community)\n\n    def nodes_sorted_by_community(communities: Communities) -> Nodes:\n        entities = []\n        for community in communities:\n            sorted_entities = sorted(community, key=node_centrality, reverse=True)\n            entities.extend(sorted_entities)\n        return entities\n\n    # Node centralities\n    centralities = nx.betweenness_centrality(graph, endpoints=True)\n    for node_key in graph.nodes:\n        graph.nodes[node_key][NODE_CENTRALITY] = centralities[node_key]\n\n    # Communities\n    communities = cast(Communities, louvain_communities(graph, seed=42))\n    sorted_communities = 
sorted(communities, key=community_max_centrality, reverse=True)\n\n    # Community colors\n    community_count = len(sorted_communities)\n    community_colors = list(color_gen(community_count))\n\n    # Node community indices and colors\n    for community_index, community in enumerate(sorted_communities):\n        for node_key in community:\n            node = graph.nodes[node_key]\n            node[NODE_COMMUNITY_INDEX] = community_index\n            node[NODE_COLOR] = community_colors[community_index]\n\n    # Edge colors\n    for node_key_i, node_key_j, edge_data in graph.edges(data=True):\n        node_i = graph.nodes[node_key_i]\n        node_j = graph.nodes[node_key_j]\n        same_community = node_i[NODE_COMMUNITY_INDEX] == node_j[NODE_COMMUNITY_INDEX]\n        edge_data[EDGE_COLOR] = (\n            node_i[NODE_COLOR] if same_community else INTER_COMMUNITY_EDGE_COLOR\n        )\n\n    return nodes_sorted_by_community(sorted_communities)\n\ndef compute_node_positions(graph: nx.DiGraph, entities: Nodes) -> Positions:\n    # Derive an undirected graph (preserving the entity order)\n    undirected_graph = nx.Graph()\n    for entity in entities:\n        undirected_graph.add_node(entity)\n    undirected_graph.add_edges_from(graph.edges())\n\n    if len(entities) < 10:\n        positions = nx.circular_layout(undirected_graph)\n    else:\n        positions = nx.kamada_kawai_layout(undirected_graph)\n        positions = nx.arf_layout(undirected_graph, positions, seed=42)\n\n    return positions\n\n@dataclass\nclass GraphData:\n    knowledge_graph: KnowledgeGraph\n    remove_orphan_nodes: bool = True\n    graph: nx.DiGraph = field(init=False)\n    nodes: Nodes = field(init=False)\n    positions: Positions = field(init=False)\n\n    def __post_init__(self) -> None:\n        self.graph = build_graph(self.knowledge_graph, self.remove_orphan_nodes)\n        self.nodes = init_graph_data(self.graph)\n        self.positions = compute_node_positions(self.graph, 
self.nodes)\n\nprint(\"✅ Network graph helpers defined\")\n✅ Network graph helpers defined\n```\n\nLet's test this:\n\n```\ngraph_data = GraphData(knowledge_graph)\n\nprint(f\"{graph_data.graph = !s}\")\nprint(f\"{graph_data.nodes = }\")\ngraph_data.graph = DiGraph with 8 nodes and 16 edges\ngraph_data.nodes = ['2_sophie_jones', '3_william_smith', '7_acme_aerospace', '5_beau', '4_bill', '0_henry_jones_jr.', '1_henry_jones_sr.', '6_short_round']\n```\n\nThe extracted data is much easier to understand when you can actually see it! We can use `matplotlib` to draw our network graphs. We'll size the nodes based on their centrality and color-code them by community. To make it even easier to digest, we'll generate an animated sequence highlighting each character's connections one by one…\n\n``` python\nimport typing\nfrom io import BytesIO\n\nimport matplotlib.pyplot as plt\nfrom IPython import display\nfrom matplotlib.axes import Axes\nfrom matplotlib.backends.backend_agg import FigureCanvasAgg\nfrom matplotlib.figure import Figure\nfrom PIL import Image as PilImage\n\nclass AnimationFormat(StrEnum):\n    # Matches PIL.Image.Image.format\n    WEBP = auto()\n    PNG = auto()\n    GIF = auto()\n\nFIGURE_DPI = 200\nFIGURE_FACTOR = 1.0\nANIMATION_INTRO_DURATION = 2500\nANIMATION_FRAME_DURATION = 250\nEDGE_STYLE = \"arc3,rad=0.2\"\nNodeSizes = dict[Node, int]\n\nif running_in_colab_env():\n    ANIMATION_FORMAT = AnimationFormat.WEBP\nelse:\n    ANIMATION_FORMAT = AnimationFormat.GIF\n\ndef init_figure(title: str, subtitle: str) -> tuple[Figure, Axes]:\n    figsize = (16 * FIGURE_FACTOR, 9 * FIGURE_FACTOR)\n    fig, ax = plt.subplots(figsize=figsize, dpi=FIGURE_DPI)\n    ax.set_title(title, loc=\"left\")\n    ax.set_title(subtitle, loc=\"right\")\n    ax.axis(\"off\")\n    fig.tight_layout(pad=2)\n\n    return fig, ax\n\ndef draw_nodes(graph: nx.Graph, positions: Positions, ax: Axes) -> NodeSizes:\n    node_view = graph.nodes(data=True)\n    node_sizes = {\n        node: 
max(500, int(10000 * data[NODE_CENTRALITY])) for node, data in node_view\n    }\n    node_colors = [str(data[NODE_COLOR]) for _, data in node_view]\n    border_width = 3.0\n\n    nx.draw_networkx_nodes(\n        graph,\n        pos=positions,\n        node_size=list(node_sizes.values()),\n        node_color=node_colors,\n        alpha=0.95,\n        ax=ax,\n        linewidths=border_width,\n    )\n    labels = {node: data.get(\"name\", str(node)) for node, data in node_view}\n    nx.draw_networkx_labels(graph, positions, labels=labels, ax=ax)\n\n    if len(node_sizes) >= 10:\n        # Extend axis limits (default rounding changes them too much otherwise)\n        (x_min, x_max), (y_min, y_max) = ax.get_xlim(), ax.get_ylim()\n        x_min, x_max = int(x_min - 1.0), int(x_max + 1.0)\n        y_min, y_max = int(y_min - 1.0), int(y_max + 1.0)\n        ax.set_xlim(x_min, x_max)\n        ax.set_ylim(y_min, y_max)\n\n    return node_sizes\n\ndef draw_edges(\n    graph: nx.DiGraph,\n    positions: Positions,\n    node_sizes: NodeSizes,\n    ax: Axes,\n    *,\n    focused_node: Node | None = None,\n) -> None:\n    # Draw specific out-edges or all edges\n    if focused_node:\n        out_edges = graph.edges([focused_node], data=True)\n    else:\n        out_edges = graph.edges(data=True)\n    edge_colors = [data[EDGE_COLOR] for _, _, data in out_edges]\n\n    edge_list = [(u, v) for u, v, _ in out_edges]\n    ordered_sizes = [node_sizes[n] for n in graph.nodes()]\n    nx.draw_networkx_edges(\n        graph,\n        positions,\n        edge_list,\n        edge_color=edge_colors,\n        style=\":\",\n        alpha=0.9,\n        arrowstyle=\"-|>\",\n        arrowsize=20,\n        ax=ax,\n        node_size=ordered_sizes,\n        connectionstyle=EDGE_STYLE,\n    )\n\n    edge_labels = {(u, v): data[\"link\"] for u, v, data in out_edges}\n    nx.draw_networkx_edge_labels(\n        graph,\n        positions,\n        edge_labels,\n        font_size=8,\n        
font_family=\"monospace\",\n        bbox=dict(ec=\"#FFF8\", fc=\"#FFF8\"),\n        ax=ax,\n        node_size=ordered_sizes,  # type: ignore\n        connectionstyle=EDGE_STYLE,  # type: ignore\n    )\n\ndef display_graph(title: str, subtitle: str, graph_data: GraphData) -> None:\n    _, ax = init_figure(title, subtitle)\n\n    node_graph = graph_data.graph\n    edge_graph = graph_data.graph\n    positions = graph_data.positions\n    node_sizes = draw_nodes(node_graph, positions, ax)\n    draw_edges(edge_graph, positions, node_sizes, ax)\n\n    plt.show()\n\ndef yield_images(\n    title: str,\n    subtitle: str,\n    graph_data: GraphData,\n) -> Iterator[PilImage.Image]:\n    fig, ax = init_figure(title, subtitle)\n    canvas = FigureCanvasAgg(fig)\n\n    positions = graph_data.positions\n    node_graph = graph_data.graph\n    node_sizes = draw_nodes(node_graph, positions, ax)\n    edge_graph = graph_data.graph\n\n    for focused_node in [None, *graph_data.nodes]:\n        if focused_node is not None:\n            draw_edges(edge_graph, positions, node_sizes, ax, focused_node=focused_node)\n        canvas.draw()\n        image_size = canvas.get_width_height()\n        image_bytes = canvas.buffer_rgba()\n        yield PilImage.frombytes(\"RGBA\", image_size, image_bytes).convert(\"RGB\")\n    plt.close(fig)\n\ndef generate_animation(\n    title: str,\n    subtitle: str,\n    graph_data: GraphData,\n    format: AnimationFormat,\n) -> BytesIO:\n    frames = list(yield_images(title, subtitle, graph_data))\n    assert len(frames) >= 1\n\n    if format == AnimationFormat.GIF:\n        # Dither all frames with the same palette to optimize the animation\n        # The animation is cumulative, so most colors are in the last frame\n        method = PilImage.Quantize.MEDIANCUT\n        palettized = frames[-1].quantize(method=method)\n        frames = [frame.quantize(method=method, palette=palettized) for frame in frames]\n\n    # The animation will be played in a loop: start 
cycling with the most complete frame\n    first_frame = frames[-1]\n    next_frames = frames[:-1]\n    durations = [ANIMATION_INTRO_DURATION]\n    durations += [ANIMATION_FRAME_DURATION] * len(next_frames)\n    params: dict[str, typing.Any] = dict(\n        save_all=True,\n        append_images=next_frames,\n        duration=durations,\n        loop=0,\n    )\n    match format:\n        case AnimationFormat.GIF:\n            params.update(optimize=False)\n        case AnimationFormat.PNG:\n            params.update(optimize=True)\n        case AnimationFormat.WEBP:\n            params.update(lossless=True)\n\n    image_io = BytesIO()\n    first_frame.save(image_io, str(format).upper(), **params)\n\n    return image_io\n\ndef display_graph_animation(title: str, subtitle: str, graph_data: GraphData) -> None:\n    image_io = generate_animation(title, subtitle, graph_data, ANIMATION_FORMAT)\n    ipython_image = display.Image(data=image_io.getvalue())\n    display.display(ipython_image)\n\ndef get_graph_title_and_subtitle(\n    source: Source | str,\n    model: Model | None = None,\n    domain: str = \"\",\n) -> tuple[str, str]:\n    if isinstance(source, str):\n        source_name = f'\"{source.strip()[:20]}…\"'\n    else:\n        source_name = source.name\n    model_id = model.value if model else Model.DEFAULT.value\n    model_id = model_id.removesuffix(\"-preview\")\n    separator = \" • \"\n    title_parts = [\"Knowledge Graph\"]\n    if domain:\n        title_parts.append(domain)\n    subtitle_parts = [source_name, model_id]\n    return separator.join(title_parts), separator.join(subtitle_parts)\n\ndef display_knowledge_graph(\n    knowledge_graph: KnowledgeGraph,\n    source: Source | str,\n    model: Model | None = None,\n    animated: bool = False,\n    domain: str = \"\",\n    remove_orphan_nodes: bool = True,\n) -> None:\n    if not knowledge_graph.entities:\n        return\n    title, subtitle = get_graph_title_and_subtitle(source, model, domain)\n    
graph_data = GraphData(knowledge_graph, remove_orphan_nodes)\n\n    if animated:\n        display_graph_animation(title, subtitle, graph_data)\n    else:\n        display_graph(title, subtitle, graph_data)\n\ndef extract_knowledge_graph(\n    data_schema: str,\n    instructions: str,\n    source: Source | str,\n    model: Model | None = None,\n    *,\n    domain: str = \"\",\n    animated: bool = False,\n) -> None:\n    if isinstance(source, Source):\n        display_input_data_caption(source)\n    knowledge_graph = generate_knowledge_graph(\n        data_schema,\n        instructions,\n        source,\n        model=model,\n    )\n    display_knowledge_graph(knowledge_graph, source, model, animated, domain)\n\nprint(\"✅ Data visualization helpers defined\")\n✅ Data visualization helpers defined\n```\n\nLet's test this:\n\n```\ndisplay_knowledge_graph(knowledge_graph, text)\n```\n\n💡 We can now quickly visualize and understand our knowledge graphs. This helps us iterate faster when refining prompts.\n\nℹ️ While this simple approach is great for a quick overview, you might want to swap it out for more specialized libraries if you want to explore the graph interactively.\n\nLet's define a book analysis function:\n\n``` python\ndef analyze_book(\n    source: Source,\n    model: Model | None = None,\n    animated: bool = False,\n) -> None:\n    extract_knowledge_graph(\n        BOOK_ANALYSIS_DATA_SCHEMA,\n        BOOK_ANALYSIS_INSTRUCTIONS,\n        source,\n        model,\n        domain=\"Character Connections\",\n        animated=animated,\n    )\n```\n\n🧪 First, let's build a knowledge graph based on Zola's *Thérèse Raquin*:\n\n```\nanalyze_book(Classic.fr_zola_thérèse_raquin)\n```\n\n**Input data** ([fr_zola_thérèse_raquin](https://gutenberg.org/cache/epub/7461/pg7461.txt))\n\n```\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :   119,518\nCached tokens   :   118,753\nOutput tokens   :       
357\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :  11\nRelationships :  28\n--------------------------------------------------------------------------------\n```\n\n💡 We can extract and understand the book's narrative in seconds: The love triangle between Thérèse, Camille, and Laurent clearly stands out. Despite being over 200 pages long (and 100k+ tokens), this novel is incredibly minimalistic, revolving around a limited set of characters, which is reflected in the network graph. Adding locations to the extracted entities would also reveal the claustrophobic atmosphere of the book in the resulting knowledge graph.\n\nℹ️ We used the original French version with English instructions, which works seamlessly. If you translate the instructions, you can also generate dynamic relationship links in different languages (see the [100+ supported languages](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models#expandable-1)).\n\n🧪 Let's see how the model handles the interconnected cast of Hugo's *Les Misérables*:\n\n```\nanalyze_book(\n    Classic.en_hugo_les_misérables,\n    model=Model.GEMINI_3_5_FLASH,\n)\n```\n\n**Input data** ([en_hugo_les_misérables](https://gutenberg.org/cache/epub/135/pg135.txt))\n\n```\n-------------------------- Request / gemini-3.5-flash --------------------------\nInput tokens    :   803,328\nOutput tokens   :     2,908\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :  65\nRelationships : 225\n--------------------------------------------------------------------------------\n```\n\n💡 We quickly get the gist of the novel.\n\nℹ️ Donald Knuth famously used *Les Misérables* as an example back in 1994 (see the [Stanford GraphBase](https://www-cs-faculty.stanford.edu/~knuth/sgb.html)). It's been consistently used in Natural Language Processing (NLP) because of its massive scale, linguistic complexity, and multiple high-quality translations. 
At 800k+ tokens, this book is still a true stress test. Note that we used Gemini 3.5 Flash (instead of 3.1 Flash-Lite). Larger models can infer more and perform deeper consolidation.\n\n🧪 Now, let's process just the first volume of *Le Comte de Monte-Cristo* (in French):\n\n```\nanalyze_book(Classic.fr_dumas_comte_de_monte_cristo_1)\n```\n\n**Input data** ([fr_dumas_comte_de_monte_cristo_1](https://gutenberg.org/cache/epub/17989/pg17989.txt))\n\n```\n----------------------- Request / gemini-3.1-flash-lite ------------------------\nInput tokens    :   215,582\nOutput tokens   :       713\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :  36\nRelationships :  36\n--------------------------------------------------------------------------------\n```\n\n💡 The graph shows the initial setup of Edmond Dantès' world, his friends, and his betrayers. The love triangle between Edmond, Mercédès, and Fernand also comes to light.\n\nℹ️ Note that the antagonist is referred to only as *Fernand*, which is expected. We learn his full name, *Fernand Mondego*, only in volume three. In the prompt, we asked the model to \"extract\" entities and determine names \"from the input data\" to avoid generating memorized knowledge. 
To detect regressions when updating the prompt, we can use this as a unit test to ensure that our response actually provides extracted data and not memorized information.\n\n🧪 And finally, let's analyze the entire saga of *Le Comte de Monte-Cristo*, all four volumes at once (800k+ tokens):\n\n```\nanalyze_book(\n    Collection.fr_dumas_comte_de_monte_cristo,\n    model=Model.GEMINI_3_5_FLASH,\n    animated=True,\n)\n```\n\n**Input data** ([fr_dumas_comte_de_monte_cristo_1](https://gutenberg.org/cache/epub/17989/pg17989.txt), [fr_dumas_comte_de_monte_cristo_2](https://gutenberg.org/cache/epub/17990/pg17990.txt), [fr_dumas_comte_de_monte_cristo_3](https://gutenberg.org/cache/epub/17991/pg17991.txt), [fr_dumas_comte_de_monte_cristo_4](https://gutenberg.org/cache/epub/17992/pg17992.txt))\n\n```\n-------------------------- Request / gemini-3.5-flash --------------------------\nInput tokens    :   840,385\nCached tokens   :   835,551\nOutput tokens   :     2,659\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :  50\nRelationships : 212\n--------------------------------------------------------------------------------\n```\n\n💡 The multiple aliases of Edmond Dantès (and other characters) are nicely extracted. The complex plot of this book is built on characters juggling multiple identities in a psychological chess match.\n\n🎉 This is an example of how we can complete the challenge in a single request, using the fewest tokens possible and zero thinking tokens, while providing multiple levels of consolidation.\n\nℹ️ Note that this is a proof of concept. For an exhaustive extraction in a fully professional solution, we would probably set up a multi-step workflow and consider the following:\n\n`location`, `date`…), and then save them to a database.\n\n`evidence` or `excerpt` to serve as direct proof, or `chapter` or `page` for source attribution). 
In this step, we could also dig deeper and differentiate literal/figurative relationships (e.g., biological vs. figurative parents).\n\nWith our current setup, generalizing to other types of content is as simple as adapting our data schema and instructions.\n\nFor example, legal agreements are among the densest types of documents. They're usually made up of many articles and clauses, with every sentence carrying specific legal weight, outlining obligations, or providing definitions. But what kind of knowledge graph do we want to build from a legal agreement?\n\n🧪 Let's test a minimalistic, open-ended prompt:\n\n```\nsource = Document.en_pharma_dev_agreement\nmodel = Model.GEMINI_3_5_FLASH\n\nAGREEMENT_OPEN_DATA_SCHEMA = \"\"\"\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Name of the entity.\n- `label`: Entity type.\n\nRelationship:\n- `source_id`: `id` of the subject entity.\n- `link`: `snake_case` predicate.\n- `target_id`: `id` of the object entity.\n\"\"\"\n\nAGREEMENT_OPEN_INSTRUCTIONS = \"\"\"\nPerform a comprehensive, highly-granular entity/relationship extraction.\n\"\"\"\n\nextract_knowledge_graph(\n    AGREEMENT_OPEN_DATA_SCHEMA,\n    AGREEMENT_OPEN_INSTRUCTIONS,\n    source,\n    model,\n    domain=\"Agreement High-Level Extraction\",\n)\n```\n\n**Input data** ([en_pharma_dev_agreement](https://storage.googleapis.com/cloud-samples-data/documentai/ContractDocAI/CUAD_v1/Part_I/Development/PhasebioPharmaceuticalsInc_20200330_10-K_EX-10.21_12086810_EX-10.21_Development%20Agreement.pdf))\n\n```\n-------------------------- Request / gemini-3.5-flash --------------------------\nInput tokens    :    51,276\nOutput tokens   :       221\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :  10\nRelationships :   9\n--------------------------------------------------------------------------------\n```\n\n💡 Remarks\n\n🧪 Then, let's try semi-open instructions focusing on legal obligations:\n\n```\nclass 
AgreementEntityLabel(StrEnum):\n    PARTY = auto()\n    PERSON = auto()\n    ROLE = auto()\n    LOCATION = auto()\n    JURISDICTION = auto()\n    FINANCIAL_AMOUNT = auto()\n    DATE = auto()\n    EVENT = auto()\n    ASSET = auto()\n    PRODUCT = auto()\n    INTELLECTUAL_PROPERTY = auto()\n    OBLIGATION_TYPE = auto()\n\nAGREEMENT_SEMI_OPEN_DATA_SCHEMA = f\"\"\"\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Name of the entity.\n- `label`: {pipe_delimited_union(AgreementEntityLabel)}.\n\nRelationship: Obligation, right, or transfer from a source entity to a target entity.\n- `source_id`: The `id` of the subject entity.\n- `link`: Specific `snake_case` predicate.\n- `target_id`: The `id` of the object entity.\n\"\"\"\n\nAGREEMENT_SEMI_OPEN_INSTRUCTIONS = \"\"\"\n- Analyze the input data for every covenant (e.g., \"shall\", \"will\", \"must\", \"is obligated to\") and perform an exhaustive extraction.\n- Make sure to deconstruct complex obligations: For complex clauses (e.g., \"A shall pay B $X Million within Y days of the Effective Date\"), extract:\n  - The primary obligation: `[A] is_obligated_to_pay [B]`\n  - The value: `[A's obligation] subject_to [$X Million]`\n  - The trigger: `[A's obligation] triggered_by [Effective Date]`\n\"\"\"\n\nextract_knowledge_graph(\n    AGREEMENT_SEMI_OPEN_DATA_SCHEMA,\n    AGREEMENT_SEMI_OPEN_INSTRUCTIONS,\n    source,\n    model,\n    domain=\"Agreement Obligations\",\n)\n```\n\n**Input data** ([en_pharma_dev_agreement](https://storage.googleapis.com/cloud-samples-data/documentai/ContractDocAI/CUAD_v1/Part_I/Development/PhasebioPharmaceuticalsInc_20200330_10-K_EX-10.21_12086810_EX-10.21_Development%20Agreement.pdf))\n\n```\n-------------------------- Request / gemini-3.5-flash --------------------------\nInput tokens    :    51,466\nOutput tokens   :     2,002\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      :  85\nRelationships :  
93\n--------------------------------------------------------------------------------\n```\n\n💡 Remarks\n\nWhat if we don't care about legal obligations, but rather the document's architecture? Let's shift our focus to the structure itself and extract how sections, clauses, and defined terms are hierarchically organized…\n\n🧪 And now, let's test closed instructions focusing on the document structure:\n\n```\nclass AgreementStructureEntityLabel(StrEnum):\n    DEFINED_TERM = auto()\n    DOCUMENT_SECTION = auto()\n    DOCUMENT = auto()\n\nclass AgreementStructureRelationshipType(StrEnum):\n    DEFINED_IN = auto()\n    CONTAINS = auto()\n\nAGREEMENT_STRUCTURAL_DATA_SCHEMA = f\"\"\"\nEntity:\n- `id`: Unique integer identifier (0, 1, 2…).\n- `name`: Name of the entity.\n- `label`: {pipe_delimited_union(AgreementStructureEntityLabel)}.\n\nRelationship: Connection from a source entity to a target entity.\n- `source_id`: The `id` of the subject entity.\n- `link`: {pipe_delimited_union(AgreementStructureRelationshipType)}.\n- `target_id`: The `id` of the object entity.\n\"\"\"\n\nAGREEMENT_STRUCTURAL_OPEN_INSTRUCTIONS = \"\"\"\n- Extract every distinct entity that matches an allowed `label`.\n- Extract every distinct relationship representing a structural connection (hierarchical organization) between these entities:\n  - You must be comprehensive and highly granular. 
If multiple distinct relationships exist between the same pair of entities, create a separate entry for each.\n\"\"\"\n\nextract_knowledge_graph(\n    AGREEMENT_STRUCTURAL_DATA_SCHEMA,\n    AGREEMENT_STRUCTURAL_OPEN_INSTRUCTIONS,\n    source,\n    model,\n    domain=\"Agreement Structure\",\n)\n```\n\n**Input data** ([en_pharma_dev_agreement](https://storage.googleapis.com/cloud-samples-data/documentai/ContractDocAI/CUAD_v1/Part_I/Development/PhasebioPharmaceuticalsInc_20200330_10-K_EX-10.21_12086810_EX-10.21_Development%20Agreement.pdf))\n\n```\n-------------------------- Request / gemini-3.5-flash --------------------------\nInput tokens    :    51,351\nOutput tokens   :     9,108\n----------------------------- Knowledge Graph Info -----------------------------\nEntities      : 314\nRelationships : 513\n--------------------------------------------------------------------------------\n```\n\n💡 If you extract hundreds of entities from a massive document, your graph will quickly turn into an unreadable hairball. For larger datasets, you'll want to export your nodes and edges to a dedicated graph database, which typically comes with its own visualization and exploration tools.\n\nWe successfully extracted data and built knowledge graphs from documents by following these steps:\n\nThese principles apply to many other data-extraction domains and will allow you to solve your own complex problems. 
Have fun and happy solving!", "url": "https://wpnews.pro/news/building-knowledge-graphs-with-gemini", "canonical_source": "https://dev.to/googleai/building-knowledge-graphs-with-gemini-3ail", "published_at": "2026-06-12 17:37:01+00:00", "updated_at": "2026-06-12 17:40:57.651813+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "generative-ai", "natural-language-processing", "ai-tools"], "entities": ["Gemini", "Gemini API", "Agent Platform", "Google Cloud"], "alternates": {"html": "https://wpnews.pro/news/building-knowledge-graphs-with-gemini", "markdown": "https://wpnews.pro/news/building-knowledge-graphs-with-gemini.md", "text": "https://wpnews.pro/news/building-knowledge-graphs-with-gemini.txt", "jsonld": "https://wpnews.pro/news/building-knowledge-graphs-with-gemini.jsonld"}}