Automated PDF Translation Pipeline

An intelligent, two-pass workflow using the Gemini LLM to ensure consistent, high-quality translation of large documents.

The Challenge: Translation Inconsistency

When a large document is translated chunk by chunk, independent LLM calls often render the same key term in different ways. Each call has no memory of earlier translations, so the final text reads as disjointed.

The Two-Pass Solution

This pipeline solves the problem by first building a glossary of key terms, then supplying that glossary to the LLM as context during the final translation.

PHASE 1

Discovery & Glossary Creation

The first pass analyzes the entire document with Gemini to identify and extract key terms, then builds a translation glossary from them.

Output: A comprehensive glossary file, built and refined as the script processes the book.
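
A minimal sketch of how the glossary might be accumulated block by block. The names build_glossary and extract_terms are hypothetical; extract_terms stands in for whatever Gemini call the script makes, and the "first translation wins" merge rule is an assumption, not something the pipeline description confirms.

```python
from typing import Callable, Dict, Iterable

Glossary = Dict[str, str]  # source term -> approved translation

def build_glossary(
    blocks: Iterable[str],
    extract_terms: Callable[[str, Glossary], Glossary],
) -> Glossary:
    """Walk the document once and accumulate a term glossary.

    extract_terms is a hypothetical wrapper around the LLM call. It receives
    the current block and a copy of the glossary built so far, and returns
    the terms it found in that block with proposed translations.
    """
    glossary: Glossary = {}
    for block in blocks:
        found = extract_terms(block, dict(glossary))
        for source, target in found.items():
            # Assumption: keep the first translation seen for each term,
            # so later blocks cannot silently change an earlier choice.
            glossary.setdefault(source, target)
    return glossary
```

Passing the glossary built so far into each call lets the extractor skip or reuse terms it has already handled, which is one way the glossary could be "refined" as processing continues.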

PHASE 2

Consistent Final Translation

The second pass translates the whole document again. For every block, it provides the entire glossary to Gemini and instructs the model to use the pre-approved translated terms.

Output: A final translated file with consistent translations.
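
The second pass can be sketched as follows, assuming the glossary is a simple mapping of source terms to translations. The prompt wording and the names format_glossary, build_prompt, and translate_document are illustrative only, and translate is a hypothetical callable wrapping the actual Gemini request.

```python
from typing import Callable, Dict, Iterable, List

def format_glossary(glossary: Dict[str, str]) -> str:
    """Render the glossary as one 'source => target' line per term."""
    return "\n".join(f"{src} => {dst}" for src, dst in sorted(glossary.items()))

def build_prompt(block: str, glossary: Dict[str, str]) -> str:
    """Prepend the entire glossary to the block to be translated."""
    return (
        "Translate the text below. Use these glossary translations exactly:\n"
        f"{format_glossary(glossary)}\n\n"
        f"Text:\n{block}"
    )

def translate_document(
    blocks: Iterable[str],
    glossary: Dict[str, str],
    translate: Callable[[str], str],
) -> List[str]:
    """Translate every block, sending the same glossary with each request.

    translate is a hypothetical function that sends a prompt to the LLM
    and returns the model's text response.
    """
    return [translate(build_prompt(block, glossary)) for block in blocks]
```

Because every request carries the same glossary, each block is translated against the same set of approved terms regardless of the order in which blocks are processed.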

End-to-End Data Pipeline

1. Extract

PDF to JSON

2. Translate (Two-Pass)

Generates Glossary & Final Text

3. Load

JSON to Firestore
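
The three stages above could be wired together roughly as shown here. This is a sketch under stated assumptions: run_pipeline and the extract, translate, and load callables are hypothetical placeholders for the real PDF extractor, the two-pass translator, and the Firestore writer, and the {"blocks": [...]} JSON shape is invented for illustration.

```python
import json
from typing import Callable, Dict, List

def run_pipeline(
    pdf_path: str,
    extract: Callable[[str], List[str]],
    translate: Callable[[List[str]], List[str]],
    load: Callable[[Dict[str, List[str]]], None],
) -> str:
    """Run Extract -> Translate -> Load, handing JSON between stages."""
    # 1. Extract: PDF to JSON (assumed shape: {"blocks": [...]})
    extracted_json = json.dumps({"blocks": extract(pdf_path)}, ensure_ascii=False)

    # 2. Translate (two-pass): glossary creation and final text happen
    #    inside the hypothetical translate callable.
    source_blocks = json.loads(extracted_json)["blocks"]
    translated_json = json.dumps(
        {"blocks": translate(source_blocks)}, ensure_ascii=False
    )

    # 3. Load: hand the translated JSON document to the storage layer
    #    (Firestore in this pipeline, abstracted here as a callable).
    load(json.loads(translated_json))
    return translated_json
```

Keeping each stage behind a plain callable means the stages can be tested with in-memory fakes before any PDF parser or Firestore client is involved.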