Fido Language Detection Under the Hood

Problem Statement

Correct language markup of text content is essential for users of screen readers (speech and braille) and read-aloud tools. Marking up text manually can be time-consuming and error-prone.

Hypothesis

Language models could help identify paragraphs and phrases in each language and tag them with the appropriate ISO 639-1 codes. This functionality could be useful across multiple formats (DAISY, EPUB, HTML, Word, etc.).

Initial Research with LLMs

A quick feasibility test with LLM chatbots (ChatGPT, Claude, Gemini) showed some promise.

Conversation with ChatGPT 5

Prompt: "You are a linguistic expert. Your task is to example a phrase and identify the language that is being used. Respond with the appropriate ISO 639-1 code." Response: "Understood. Please provide the phrase you’d like me to analyze, and I’ll return the appropriate ISO 639-1 language code."

Prompt: "My tailor is rich." Response: "en"

Prompt: "Mein Bruder hat die gleiche Harpune wie du" Response: "de"

Prompt: "Dw i'n hoffi coffi" Response: "cy"

Initial Research with Shallow Machine Learning and Neural Network Techniques

The quantity of text that needs checking in accessible book production/mediation will be large, so processing all of it through an LLM would be expensive and slow. Perhaps the task, or at least an initial filtering pass, can be done using statistical or shallow neural classifiers.

Three possibilities were identified as being fast, local solutions that could be easily integrated into a Python codebase.

Tool | Languages | Model | Type | Accuracy
langid.py | 97 | Shallow ML language classifier | Naive Bayes (char n-grams) | Medium
langdetect | 55+ | Shallow ML language classifier | Naive Bayes (char n-grams) | Medium
fastText | 176 | Task-specific neural model | Shallow NN (embeddings + linear classifier) | High

Sliding Window

Tools like langid.py, langdetect and fastText usually classify entire chunks of text. If a paragraph contains multiple languages, the classifier often identifies only the dominant language. Some will report additional languages with a probability score, but greater accuracy is needed for this application.

To identify phrases in different languages a sliding window approach was employed. Instead of passing the whole paragraph to the classifier, it is broken into smaller overlapping chunks (windows). The window is slid across the text, classifying each chunk separately. This allows the detection of changes in language at the sub-sentence level. A few parameters were instrumented, including window size in characters, minimum word length and confidence levels.
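
The sliding window idea can be sketched roughly as follows. This is a minimal illustration, not the Fido implementation: the function name, the parameter names and their default values are assumptions, and toy_classify is a hypothetical stand-in for a real detector such as fastText so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# A classifier takes a chunk of text and returns (language code, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def sliding_window_detect(
    text: str,
    classify: Classifier,
    window: int = 40,            # window size in characters (assumed default)
    step: int = 20,              # distance between window starts (assumed default)
    min_confidence: float = 0.6, # windows below this confidence are dropped
) -> List[Tuple[int, int, str, float]]:
    """Classify overlapping windows; return (start, end, lang, confidence) tuples."""
    last_start = max(len(text) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the text is covered
    results = []
    for start in starts:
        chunk = text[start:start + window]
        lang, confidence = classify(chunk)
        if confidence >= min_confidence:
            results.append((start, start + len(chunk), lang, confidence))
    return results


def toy_classify(chunk: str) -> Tuple[str, float]:
    """Hypothetical stand-in detector: counts a few German marker words."""
    german = {"mein", "bruder", "hat", "die", "gleiche", "wie", "du"}
    words = [w.strip(".,").lower() for w in chunk.split()]
    if not words:
        return ("en", 0.0)
    ratio = sum(w in german for w in words) / len(words)
    return ("de", ratio) if ratio >= 0.5 else ("en", 1 - ratio)
```

Windows that cut through words or straddle a language boundary tend to produce low or misleading confidence values, which is one plausible source of the instability described in the next paragraph.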

However, based on the samples being used, this technique was found to be unreliable. If the parameters were adjusted too far in one direction there were false positives of exotic languages. Too far in the other direction and phrases in other languages were missed.

Sentence/Clause Detection

Ultimately it was decided to implement a proof-of-concept solution based on sentence- or clause-level detection.

The input text is first split at ‘strong’ punctuation (. ? !) and the CJK equivalents. If a resulting sentence is too short, it is concatenated with the following sentence.

If a sentence is longer than 200 characters, it is split further at ‘weak’ punctuation (, ; : —) and the CJK equivalents.
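
The two splitting passes described above might look something like this. It is a sketch under assumptions: the 200-character limit comes from the text, but the minimum length of 20 characters, the exact CJK punctuation sets, the function names, and the choice to attach a short final sentence to the previous one are all illustrative guesses. Every piece keeps its trailing punctuation and whitespace, so joining the pieces reproduces the input and character offsets stay valid.

```python
import re
from typing import List

STRONG = r"[.?!。？！]+"   # strong punctuation plus assumed CJK equivalents
WEAK = r"[,;:—，；：、]"    # weak punctuation plus assumed CJK equivalents


def _split_after(text: str, pattern: str) -> List[str]:
    """Split after each punctuation match, keeping it (and trailing spaces) on the left piece."""
    pieces, last = [], 0
    for match in re.finditer(pattern + r"\s*", text):
        pieces.append(text[last:match.end()])
        last = match.end()
    if last < len(text):
        pieces.append(text[last:])
    return pieces


def split_phrases(text: str, min_len: int = 20, max_len: int = 200) -> List[str]:
    """Split into sentences, merge short ones forward, then split long ones into clauses."""
    merged: List[str] = []
    buffer = ""
    for sentence in _split_after(text, STRONG):
        buffer += sentence
        if len(buffer.strip()) >= min_len:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A short final sentence has no successor; attach it to the previous one (assumption).
        if merged:
            merged[-1] += buffer
        else:
            merged.append(buffer)

    phrases: List[str] = []
    for sentence in merged:
        if len(sentence) > max_len:
            phrases.extend(_split_after(sentence, WEAK))
        else:
            phrases.append(sentence)
    return phrases
```

A simple pattern like this will also split at abbreviations and decimal points; a production splitter would need extra rules for those cases.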

The phrases are then classified using the language detector (currently fastText is implemented). The output is a language code and a confidence level, for example:

'en', 0.95

If the confidence is low or the phrase is very short, then algorithmically we decide whether the phrase inherits the language of the previous phrase or the base language of the document.
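
The exact decision rule is not spelled out here, so the following is only one plausible reading: an unreliable phrase inherits the language of the previous phrase when there is one, and otherwise falls back to the base language. The function name and both thresholds are assumptions made for illustration.

```python
from typing import Optional


def resolve_language(
    phrase: str,
    detected: str,
    confidence: float,
    previous_lang: Optional[str],
    base_lang: str,
    min_confidence: float = 0.5,  # assumed threshold
    min_chars: int = 12,          # assumed minimum phrase length
) -> str:
    """Return the language to assign to a phrase, with fallbacks for weak detections."""
    reliable = confidence >= min_confidence and len(phrase.strip()) >= min_chars
    if reliable:
        return detected
    # Unreliable detection: inherit from the previous phrase if there is one,
    # otherwise use the document's base language (assumed ordering).
    return previous_lang if previous_lang is not None else base_lang
```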

Adjacent phrases in the same language are joined together, and finally each segment is returned with:

lang: ISO 639-1 code

start_index: Character position (zero-based)

end_index: Character position (exclusive)

confidence: Average confidence (0-1)

snippet: First 3 words of segment

end_snippet: Last 3 words of segment
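
As an illustration of the joining step, the sketch below merges consecutive phrases that share a language and builds the fields listed above. It assumes the phrases are contiguous pieces of one paragraph, in order, and that the average confidence is a plain mean over the merged phrases; the function name and the input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_segments(phrases: List[Tuple[str, str, float]]) -> List[Dict]:
    """Merge consecutive (text, lang, confidence) phrases into language segments."""
    segments: List[Dict] = []
    position = 0
    for text, lang, confidence in phrases:
        end = position + len(text)
        if segments and segments[-1]["lang"] == lang:
            segments[-1]["end_index"] = end            # extend the current segment
            segments[-1]["_confs"].append(confidence)
        else:
            segments.append({
                "lang": lang,
                "start_index": position,               # zero-based
                "end_index": end,                      # exclusive
                "_confs": [confidence],
            })
        position = end

    full_text = "".join(text for text, _, _ in phrases)
    for segment in segments:
        confs = segment.pop("_confs")
        segment["confidence"] = sum(confs) / len(confs)  # simple mean (assumption)
        words = full_text[segment["start_index"]:segment["end_index"]].split()
        segment["snippet"] = " ".join(words[:3])
        segment["end_snippet"] = " ".join(words[-3:])
    return segments
```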

Language Markup Using Large Language Models

The input stage is simple relative to the shallow ML/neural network technique.

The tool receives the base language and a list of paragraphs, each comprising an ID and the paragraph text.

These are submitted to the LLM together with the prompt.

Role: You are a high-precision linguistic analysis tool specialized in language transition detection and character-level indexing.

Task: Analyze the provided JSON object containing multiple paragraphs. For each paragraph, identify every segment of text by its language.

Constraints:

  1. Detect every transition from one language to another.
  2. Provide the ISO 639-1 code for every segment.
  3. Provide ‘start_index’: the EXACT zero-based character index of the first character of the segment.
  4. Provide ‘end_index’: the EXACT zero-based character index of the last character of the segment.
  5. Provide ‘snippet’: the first 3 words of the segment.
  6. Provide ‘end_snippet’: the last 2 words of the segment.
  7. If a paragraph is 100% in one language, return one segment covering the full range from index 0 to the final character.
  8. If the language cannot be determined, use the base_language of the document which is ‘{base_language}’.
  9. Output ONLY valid JSON. No markdown formatting, no preamble, no explanation.

The response is:

{
  "paragraphs": [
    {
      "ID": "string",
      "segments": [
        {
          "lang": "iso-code",
          "start_index": integer,
          "end_index": integer,
          "start_snippet": "string",
          "end_snippet": "string"
        }
      ]
    }
  ]
}
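
Because the model is asked for strict JSON, the reply can be checked before it is used. The sketch below is a hypothetical validator, not part of Fido: it assumes the key names shown in the response format above and raises ValueError when the structure does not match.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("lang", "start_index", "end_index", "start_snippet", "end_snippet")


def parse_response(raw: str) -> Dict[str, List[dict]]:
    """Parse the LLM reply and return {paragraph ID: list of segments}."""
    data = json.loads(raw)  # invalid JSON raises json.JSONDecodeError, a ValueError
    try:
        paragraphs = data["paragraphs"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no 'paragraphs' list") from exc

    result: Dict[str, List[dict]] = {}
    for paragraph in paragraphs:
        for segment in paragraph["segments"]:
            missing = [key for key in REQUIRED_KEYS if key not in segment]
            if missing:
                raise ValueError(f"segment is missing keys: {missing}")
            if not isinstance(segment["start_index"], int) or not isinstance(segment["end_index"], int):
                raise ValueError("start_index and end_index must be integers")
        result[paragraph["ID"]] = paragraph["segments"]
    return result
```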

We ask for both a zero-based index and a snippet for the start and end of each language segment. LLMs are unreliable at counting characters, so the reported indices are treated as approximate and the snippets are used to determine the precise character positions.
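
One way to turn snippets into exact positions is to search for each snippet in the paragraph and, when it occurs more than once, pick the occurrence closest to the index the model reported. The sketch below takes that approach; the function names are hypothetical, and returning an exclusive end position is a choice made here for convenience with Python slicing.

```python
from typing import Optional, Tuple


def _nearest(text: str, needle: str, hint: int, lo: int = 0) -> int:
    """Index of the occurrence of needle (at or after lo) closest to hint, or -1."""
    best = -1
    pos = text.find(needle, lo)
    while pos != -1:
        if best == -1 or abs(pos - hint) < abs(best - hint):
            best = pos
        pos = text.find(needle, pos + 1)
    return best


def locate_segment(
    paragraph: str,
    start_snippet: str,
    end_snippet: str,
    start_hint: int,   # start_index as reported by the LLM (may be off)
    end_hint: int,     # end_index as reported by the LLM (may be off)
) -> Optional[Tuple[int, int]]:
    """Return (start, end_exclusive) located via the snippets, or None if not found."""
    if not start_snippet or not end_snippet:
        return None
    start = _nearest(paragraph, start_snippet, start_hint)
    if start == -1:
        return None
    # The end snippet begins roughly len(end_snippet) characters before the reported end.
    end_pos = _nearest(paragraph, end_snippet, end_hint - len(end_snippet), lo=start)
    if end_pos == -1:
        return None
    return start, end_pos + len(end_snippet)
```

When a snippet cannot be found at all, for example because the model paraphrased the text, the caller would need a fallback such as keeping the paragraph's base language.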

Putting it Together

To test the application of these two techniques in document production they were implemented in Fido and exposed in the Word add-in. This is accessed via a button on the ribbon. The language tool provides the facility to review and edit language tags. The techniques discussed here are then used for the AI language detection feature.

The techniques were implemented thus:

Feature | Approach
Quick detection | Scan using the ML/NN technique
Smart detection | Scan using the ML/NN technique, then scan any paragraphs identified as having more than one language with the LLM technique
Deep scan | Scan using the LLM technique
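
The three features can be pictured as a small dispatcher over the two techniques. This is a sketch only: the mode names, the function signature and the shape of the detector callables are assumptions, and the real detectors would be passed in as ml_detect and llm_detect.

```python
from typing import Callable, Dict, List

# A detector takes paragraph text and returns a list of segment dicts with a "lang" key.
Detector = Callable[[str], List[dict]]


def detect_languages(
    paragraphs: Dict[str, str],   # paragraph ID -> paragraph text
    mode: str,                    # "quick", "smart" or "deep" (hypothetical names)
    ml_detect: Detector,
    llm_detect: Detector,
) -> Dict[str, List[dict]]:
    """Run the selected scan and return segments per paragraph ID."""
    if mode == "quick":
        return {pid: ml_detect(text) for pid, text in paragraphs.items()}
    if mode == "deep":
        return {pid: llm_detect(text) for pid, text in paragraphs.items()}
    if mode == "smart":
        results = {}
        for pid, text in paragraphs.items():
            segments = ml_detect(text)
            # Only paragraphs that look multilingual are escalated to the LLM.
            if len({segment["lang"] for segment in segments}) > 1:
                segments = llm_detect(text)
            results[pid] = segments
        return results
    raise ValueError(f"unknown mode: {mode!r}")
```

The smart mode keeps LLM cost proportional to the number of mixed-language paragraphs rather than to the size of the document.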

Performance

Language detection on sample texts covering 30 languages appears to be excellent. The ML/NN technique is fast and quite accurate, and phrase detection and context-aware detection with the LLM also seem to perform very well. However, these are early results, and greater test coverage (and validation by native speakers) is required.

Next Steps

  • Widen test coverage with real-life documents.
  • Consider applying the Fido implementation to plain text, EPUB, DTBook, etc.
  • Measure performance with different LLM services to determine differences in accuracy, speed and cost.
  • Consider implementing alternative ML/NN solutions to provide users with more choices.