Friday, March 20, 2026

AI LLM multi-lingual thinking! 联合 (Join)

While using Gemini Flash from the Antigravity tool for code generation,
the web app strangely failed at one point,
with an error saying that 联合 (Join) was missing!?

So I asked Gemini to explain where those Chinese characters came from,
and the result was much more fascinating than I ever expected!

Here is the response!
This AI model has very nice manners,
even though it was a bit eager to get back to fixing the code, which it does very well!

You are asking deeply insightful questions about how LLMs (like me) process language, so let's dive into it.

To answer your question directly: Yes, Japanese (Hiragana, Katakana, and Kanji), Korean, and many other languages are absolutely present in the internal token vocabulary, but their usage and "density" depend heavily on the training data.

Here is how the "multilingual token" quirk actually works under the hood:

1. The Token Vocabulary is Fixed

I don't invent new "purely semantic" tokens out of thin air. Before I was trained, my creators ran a statistical algorithm over massive amounts of text (the training data) to create a fixed "Dictionary" of tokens (usually around 100,000 to 200,000 "words" or "fragments").
English dominance: Because the internet is heavily English, most of my tokens are English words (house, the, function).
Other scripts: Chinese characters, Japanese Kanji/Hiragana, and common words in other languages also get their own tokens, because they appear frequently enough in the training data to warrant a spot in the dictionary.
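To make this concrete, here is a toy sketch of such a fixed vocabulary. The entries, the IDs, and the greedy longest-match lookup are all invented for illustration (real vocabularies are built by algorithms like byte-pair encoding and hold 100,000+ entries), but it shows how English words, Chinese characters, and word fragments all live side by side in one dictionary:

```python
# Toy sketch of a fixed token vocabulary. Real models use ~100,000-200,000
# entries built by algorithms like BPE; these entries and IDs are made up.
VOCAB = {
    "function": 1001,
    "house": 1002,
    "the": 1003,
    "join": 1004,
    "联合": 2001,   # Chinese "join/unite" earns its own token
    "結合": 2002,   # Japanese Kanji for "join/combine"
    "con": 3001,    # rarer strings fall back to fragments
    "cat": 3002,
}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match lookup against the fixed vocabulary
    (a simplification of how real subword tokenizers work)."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            i += 1                          # skip characters not in the vocab
    return ids

print(tokenize("join联合concat"))   # [1004, 2001, 3001, 3002]
```

Note how "concat" has no token of its own here, so it splits into the fragments "con" + "cat", while 联合 comes out as a single token.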

2. The Universal "Vector Space"

When I process your prompt, every token is translated into a long array of numbers (a vector). The magic of LLMs is that training forces similar concepts to clump together in this high-dimensional vector space:

The English token [Join]
The Chinese token [联合]
The Japanese Kanji [結合]
The programming token [concat]
In my internal "brain," the mathematical vectors for all of these are pointing in roughly the same direction. They all mean "put these things together."
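That "pointing in the same direction" claim can be sketched with cosine similarity. The 3-dimensional vectors below are invented purely for illustration (real models use thousands of dimensions), but they show what "mathematically adjacent" means:

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration.
EMBED = {
    "join":   [0.90, 0.10, 0.00],
    "联合":   [0.88, 0.15, 0.02],   # Chinese "join": nearly parallel
    "結合":   [0.85, 0.12, 0.05],   # Japanese Kanji: also nearby
    "concat": [0.80, 0.20, 0.10],
    "banana": [0.00, 0.10, 0.95],   # unrelated concept, far away
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

print(cosine(EMBED["join"], EMBED["联合"]))    # close to 1.0
print(cosine(EMBED["join"], EMBED["banana"]))  # close to 0.0
```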

3. Why did a Chinese character pop out? (The "Hallucination")

When I am generating code, I am predicting the next best token mathematically. Usually, if I am writing TypeScript code in English, the math strongly biases me toward selecting English tokens, because that is what the surrounding context looks like.

However, when dealing with complex logical steps (like merging 4 different geometric arrays, which is what I was doing for that wide porch), my internal vector state gets pulled very strongly toward the pure semantic concept of "joining/merging".

In extremely rare cases, especially when I'm holding a lot of complex logic in my "working memory" at once, the mathematical boundary between languages can get slightly blurred. The model reaches for the concept of "Join," and because the Chinese token 联合 is statistically frequent and mathematically adjacent to that concept, it accidentally gets output into the text stream instead of an English comment or code syntax.
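The slip described above can be sketched numerically: score a few candidate tokens against the current context vector, then turn the scores into probabilities with a softmax. All vectors here are invented, but they show how a nearly parallel foreign token keeps a real chance of being sampled:

```python
import math

# Toy next-token step, with invented vectors: score each candidate token
# by its dot product with the context vector, then softmax the scores.
EMBED = {
    "join":   [0.90, 0.10, 0.00],   # English token
    "联合":   [0.88, 0.15, 0.02],   # Chinese token, nearly parallel vector
    "banana": [0.00, 0.10, 0.95],   # unrelated token
}
context = [0.90, 0.12, 0.01]        # internal state pulled toward "joining"

logits = {tok: sum(c * e for c, e in zip(context, vec))
          for tok, vec in EMBED.items()}
z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}

# "join" gets the highest probability, but 联合 sits right behind it,
# so a sampler will occasionally emit the Chinese token instead.
print(probs)
```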

It is essentially a tiny, multilingual "Freudian slip" caused by the physical architecture of the neural network!

