Pages

Tuesday, November 18, 2025

Best practices for text segmentation

Best practices for text segmentation in translation include using Computer-Assisted Translation (CAT) tools to break text into meaningful, logical units such as sentences or phrases; keeping segments concise enough to hold in short-term memory while still carrying a complete unit of meaning; and prioritizing consistency through defined segmentation rules and Translation Memories (TMs) to improve quality and efficiency. Proper source document formatting, with clear structure and no unnecessary formatting, also helps CAT tools parse content correctly and create cleaner segments.

Understanding Text Segmentation

  • Definition: Text segmentation is the process of dividing a source text into smaller, translatable units, called "segments".
  • Purpose: It makes translation faster, easier, and more consistent by allowing translators to focus on smaller, logical chunks of text.
  • Tools: Segmentation is a foundational step in Computer-Assisted Translation (CAT) tools and is configured using specific segmentation rules.

Best Practices for Text Segmentation

  1. Break into meaningful units: Segment text at natural linguistic boundaries, such as sentences, rather than at arbitrary points like every 5-10 words.
  2. Keep segments concise: Segments should be short enough for a translator to easily retain the information in short-term memory.
  3. Ensure completeness: Each segment should represent a complete thought or unit of meaning to avoid confusion and unnatural translation.
  4. Utilize CAT tools & Translation Memories: CAT tools, when configured with appropriate segmentation rules, help identify and manage these units. This promotes consistency and allows reuse of previously translated segments stored in a Translation Memory.
  5. Format source content well: Clear, well-organized source documents with consistent formatting (e.g., proper use of paragraph breaks, hard returns, and page breaks) ensure that CAT tools can parse the content correctly and produce unambiguous segments.
  6. Define segmentation rules: Establish clear rules, often in a format like SRX (Segmentation Rules Exchange), to define how text should be broken down for specific projects or language pairs.
  7. Perform automated QA checks: After segmentation and translation, automated quality assurance checks help identify and correct errors like misspellings or incorrect terminology, preventing them from entering the Translation Memory.
  8. Prioritize consistency: Over time, consistent segmentation practices, combined with well-maintained TMs, significantly increase content reuse, reduce translation costs, and boost overall translation quality.
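As an illustration of rule-based segmentation, the sketch below splits text into sentences using a break rule (sentence-ending punctuation) plus a no-break exception list of abbreviations, in the spirit of SRX break/no-break rules. This is a simplified JavaScript illustration, not an SRX implementation; the abbreviation list and regex are assumptions.

```javascript
// Simplified, SRX-style sentence segmentation: a "no-break" list of
// abbreviations plus a "break" rule after sentence-ending punctuation.
const NO_BREAK = new Set(["Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc."]);

function segmentSentences(text) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const segments = [];
  let current = [];
  for (const token of tokens) {
    current.push(token);
    // Break after ., !, or ? (optionally followed by a closing quote or
    // bracket), unless the token is a known abbreviation.
    const endsSentence = /[.!?]["')\]]?$/.test(token) && !NO_BREAK.has(token);
    if (endsSentence) {
      segments.push(current.join(" "));
      current = [];
    }
  }
  if (current.length) segments.push(current.join(" "));
  return segments;
}
```

A real SRX rule set would carry many more exceptions per language pair, which is why rule 6 above recommends defining them explicitly per project.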

New kid on the block

 Try inter-linear.com

Enter text or upload a file, select a translation service, and submit the source for translation; the result is displayed for copying or download. The display, in contrast to side-by-side or parallel texts, can be segmented by sentences, paragraphs, words/phrases, or characters per line.

Additionally, there are tools for converting a source text and its translation into inter-linear displays, as well as for segmenting text into sentences.

Expect output suitable for everyday language enthusiasts, not for scholarly or biblical exegesis.

Provide feedback on your experience with inter-linear displays and the results you got or wanted, or just enjoy the possibilities.

 

Thursday, October 2, 2025

Best strategy for an AI inter-linea display?

 The best strategy for producing an interlinear display of a translation using Google Translate is to translate the complete submission first and then use client-side logic (JavaScript/CSS) for display conversion.


Analysis of Strategies

  1. Translate Complete Submission First (Recommended)

    • Process: Send the full sentence or segment to the Google Cloud Translation API. The API uses Neural Machine Translation (NMT), which translates the entire sentence at once to ensure contextual accuracy and capture nuances, idiomatic expressions, and complex grammar. The complete translated text is returned. Client-side code (JS/CSS) is then used to align the original text with the translated text for an interlinear (line-by-line or word-under-word) display, often by segmenting the text into smaller chunks (e.g., words, phrases) and attempting to map them, or simply displaying the source text and the full translation on separate lines.

    • Pros:

      • Highest Quality Translation: NMT produces the most human-like and contextually accurate result.

      • Efficiency: Fewer API calls are made (one per sentence/segment), which is generally faster and more cost-effective.

    • Cons:

      • Alignment Challenge: The primary difficulty is that Google's NMT does not provide a direct word-to-word or phrase-to-phrase mapping in its standard API response. Perfect alignment for an interlinear display is often an extremely complex, if not impossible, computational challenge without the underlying translation model data.

  2. Translate Words and Phrases Iteratively (Not Recommended)

    • Process: Break the source text down into individual words or small phrases and make a separate API call for each one. Then, display the original word and its corresponding translation.

    • Pros:

      • Simple Alignment: Alignment is trivial since each call corresponds to a single word/phrase.

    • Cons:

      • Poor Translation Quality: Google's NMT is specifically designed to work on full sentences to understand context. Translating isolated words or short phrases loses context, resulting in poor, literal, and often incorrect translations.

      • Inefficiency & Cost: This dramatically increases the number of API calls, leading to higher latency and significantly greater costs.
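As a minimal sketch of the recommended strategy's display step: assuming the full translation has already been returned by the API and segmented into sentences matching the source, the function below pairs each source segment with its translation on the line beneath it. The function name and input shape are hypothetical; the API call itself is omitted.

```javascript
// Pair each source segment with its full translation for an interlinear
// (line-under-line) display. Sentence-level pairing sidesteps the
// word-alignment problem: each source line simply gets the whole
// translated sentence beneath it.
function toInterlinear(sourceSegments, translatedSegments) {
  if (sourceSegments.length !== translatedSegments.length) {
    throw new Error("source and translation must have the same segment count");
  }
  return sourceSegments
    .map((src, i) => `${src}\n${translatedSegments[i]}`)
    .join("\n\n");
}
```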


Conclusion

Since Google Translate's core strength is its contextual Neural Machine Translation, you should prioritize translation quality by sending the full segment. The trade-off is that you will need to implement sophisticated (and likely imperfect) client-side logic to attempt to align the words for the interlinear display.

The other strategy involves losing the key benefit of Google Translate's technology.

To see how to get started with the Google Translate API, watch ... video: How to use the Google API to translate texts. This video is relevant because it demonstrates using the Google Translate API, which is the necessary first step for either strategy.


Saturday, September 20, 2025

Environment and AI solutions


20.09.25, ref. per Gemini

Question. Which is more friendly to the environment, for example in terms of electricity usage, carbon footprint, etc., client-side or server-side AI solutions? 

Response. In general, client-side AI solutions are more environmentally friendly than server-side solutions, but the answer is complex and depends on many factors. Client-side solutions, which run on a user's local device, eliminate or reduce the need for massive, energy-intensive data centers for inference (the use of a trained AI model). However, the training of these models, regardless of where they are ultimately deployed, remains a major environmental concern.

***

### Server-Side AI: The Environmental Burden

Server-side AI relies on large-scale data centers, which are one of the biggest contributors to the environmental footprint of AI. These data centers consume immense amounts of electricity for three main purposes:

* **Training and Inference:** The process of training a large language model (LLM) like GPT-4 can consume a staggering amount of energy. While a single query (inference) uses a tiny fraction of that energy, the sheer volume of billions of daily queries for popular models adds up to a massive cumulative carbon footprint.
* **Cooling:** Data centers generate a tremendous amount of heat. To prevent hardware from overheating, they require sophisticated and highly energy-intensive cooling systems, which can also consume millions of gallons of water annually.
* **Idle Capacity:** To ensure high availability and responsiveness, data centers must maintain a degree of idle capacity, which still consumes energy while waiting to handle traffic spikes.

The carbon footprint of server-side AI also depends on the **location** of the data center and its energy source. A data center running on a grid powered by renewable energy like wind or solar will have a significantly lower carbon footprint than one relying on fossil fuels.

### Client-Side AI: The Greener Alternative

Client-side, or on-device, AI runs directly on a user's device, such as a smartphone or laptop. Because the processing happens locally, it avoids the energy consumption and carbon emissions associated with data transmission and large data centers.

While client-side AI is more efficient for **inference**, it's crucial to understand a few key points:

* **Training is still a major factor:** The initial training of the AI model, whether it's for on-device or server-side use, is the most energy-intensive part of the process. This training is almost always done in a data center.
* **Limited Capabilities:** On-device models are typically smaller and less powerful than their server-side counterparts. This is because consumer devices lack the massive computational power and memory of a data center.
* **Device-level Consumption:** While they bypass the data center, on-device models still use a device's battery and CPU, which consumes energy. This can lead to faster battery drain and generate some heat.

### A Nuanced Comparison

Ultimately, the choice between client-side and server-side AI for a given application is a trade-off between performance and environmental impact. For computationally demanding tasks like training a large generative AI model, a server-side solution is currently the only option. For tasks that can be performed by smaller models, such as basic text completion or image classification, client-side AI is a much more energy-efficient and environmentally responsible choice.

As on-device hardware becomes more powerful and AI models are optimized to be smaller and more efficient, we'll likely see a shift toward more client-side solutions, reducing the reliance on energy-hungry data centers for many everyday AI applications.

Musing futures

Long-form interlinear translations (that is, ones not concerned with biblical or scriptural texts) are available online. Fiction mostly, but there is some non-fiction to be found. The interlinear examples, many published for online or physical book reading, show words/phrases aligned with the source text. This is a first step toward getting closer and closer to contextual meaning, e.g., based on historical understanding of language items, hermeneutics, etymology of a word, ngram data (frequency of word usage over time), etc.

Which of the current tools included in the set of experiments shows promise for use in creating long-form interlinear translations? (Of course, we refer to the softened understanding of interlinear, here termed inter-linea.) For discussion purposes, see https://jkm.free.nf/inter-linea/w4word.html.

A product of the above word/phrase approach, based on a character-count limitation per line, is below: the alignment is not entirely correct and there is no glossing, but it is serviceable, I believe. If so, perhaps a more rigorous(?) translation service might be employed, something like DeepL? Then the word/phrase approach (versus others on the same site) might function as a prototype for long-form translations . . . to publish?
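The character-count approach can be sketched as a greedy line-packer: whole words are packed into lines up to a character limit, and each line is then sent for translation separately. This is an illustrative reconstruction, not the site's actual code; the 48-character default is an assumption.

```javascript
// Greedy word/phrase chunking: pack whole words into lines up to a
// character limit. Each chunk would then be translated and placed
// under its source line. A single word longer than the limit gets its
// own over-long line.
function chunkByChars(text, maxChars = 48) {
  const words = text.split(/\s+/).filter(Boolean);
  const lines = [];
  let line = "";
  for (const word of words) {
    const candidate = line ? `${line} ${word}` : word;
    if (candidate.length <= maxChars) {
      line = candidate;
    } else {
      if (line) lines.push(line);
      line = word;
    }
  }
  if (line) lines.push(line);
  return lines;
}
```

Chunking by character count rather than by sentence boundary is what produces the mid-phrase breaks visible in the display below, and hence the imperfect German alignment.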

In essence, Bohr and Heisenberg argued that science had finally caught up with the conceptual problems involved in the description of reality that philosophers had been warning of for centuries. Bohr is quoted as saying: "There is no quantum world. There is only an abstract quantum physical description. It is wrong to think that the task of physics is to find out how nature is. Physics concerns what we can say about nature." This vaguely positivist statement was echoed by Heisenberg:. "[W]e have to remember that what we observe is not nature in itself but nature exposed to our method of questioning." 

Their broadly anti-realist "Copenhagen interpretation" "denying" that the wave function represents the real physical state of a quantum system quickly became the dominant way of thinking about quantum mechanics. More recent variations of such anti-realist interpretations suggest that the wave function is simply a way of "coding" our experience, or our subjective beliefs derived from our experience of the physics, allowing us to use what we've learned in the past to predict the future.

The word/phrase display.

In essence, Bohr and Heisenberg argued that
Im Wesentlichen argumentierten Bohr und Heisenberg das

science had finally caught up with the conceptual
Die Wissenschaft hatte endlich das Konzept eingeholt

problems involved in the description of reality
Probleme, die an der Beschreibung der Realität verbunden sind

that philosophers had been warning of for
dass Philosophen vor gewarnt hatten

centuries. Bohr is quoted as saying: "There is no
Jahrhunderte. Bohr wird mit den Worten zitiert: "Es gibt keine

quantum world. There is only an abstract quantum
Quantenwelt. Es gibt nur ein abstraktes Quantum

physical description. It is wrong to think that
Physische Beschreibung. Es ist falsch zu denken, dass

the task of physics is to find out how nature is.
Die Aufgabe der Physik ist es, herauszufinden, wie die Natur ist.

Physics concerns what we can say about nature."
Die Physik betrifft das, was wir über die Natur sagen können. "

This vaguely positivist statement was echoed by
Diese vage positivistische Aussage wurde von wiederholt von

Heisenberg:. "[W]e have to remember that what we
Heisenberg:. "[Wir müssen uns daran erinnern, was wir

observe is not nature in itself but nature
beobachten ist nicht die Natur an sich, sondern die Natur

exposed to our method of questioning."
unserer Befragungsmethode ausgesetzt. "

Their broadly anti-realist "Copenhagen
Ihr weitläufig anti-realistischer "Kopenhagen"

interpretation" "denying" that the wave function
Interpretation "" leugnen ", dass die Wellenfunktion fungiert

represents the real physical state of a quantum
repräsentiert den wirklichen physischen Zustand eines Quanten

system quickly became the dominant way of
Das System wurde schnell zur dominierenden Art von

thinking about quantum mechanics. More recent
Denken Sie an die Quantenmechanik. Neuer

variations of such anti-realist interpretations
Variationen solcher antirrealistischer Interpretationen

suggest that the wave function is simply a way of
Schlagen Sie vor, dass die Wellenfunktion einfach eine Möglichkeit ist

"coding" our experience, or our subjective
"codieren" unsere Erfahrung oder unsere subjektive

beliefs derived from our experience of the
Überzeugungen abgeleitet aus unserer Erfahrung der

physics, allowing us to use what we've learned in
Physik, damit wir das verwenden können, was wir gelernt haben

the past to predict the future.
die Vergangenheit, die Zukunft vorherzusagen.

An enhancement to this display might be to show the target before the source, where one wanted to create an inter-linea for more than personal use. And one of the two languages might be set in a smaller font or colored in some way to reduce distraction from, or dependence on, one language rather than the other.
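That enhancement could be sketched as follows: pairs are rendered target-first, with the source line wrapped in a class that a stylesheet could set smaller or muted (e.g., `.source-line { font-size: smaller; color: #888; }`). The class names and markup are illustrative assumptions.

```javascript
// Render target-before-source pairs as HTML, tagging the source line
// with a class so CSS can de-emphasize it (smaller font, muted color).
function renderPairs(pairs) {
  return pairs
    .map(([source, target]) =>
      `<p class="pair"><span class="target-line">${target}</span><br>` +
      `<span class="source-line">${source}</span></p>`)
    .join("\n");
}
```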