diff --git a/FULL_BENCHMARKS.md b/FULL_BENCHMARKS.md
new file mode 100644
index 0000000..a220015
--- /dev/null
+++ b/FULL_BENCHMARKS.md
@@ -0,0 +1,105 @@
+# Full 90-Language Benchmark
+
+This is a comprehensive multilingual evaluation covering 90 languages, comparing Chandra 2 against Gemini 2.5 Flash. The average scores are lower than the [43-language benchmark](README.md#multilingual-benchmark-table) because this includes many lower-resource languages.
+
+## Overall Scores
+
+| | Chandra 2 | Gemini 2.5 Flash |
+|---|:---:|:---:|
+| **Average** | **72.7% +/- 1.2%** | **60.8% +/- 1.3%** |
+
+## Results by Language
+
+| Language | Chandra 2 | Gemini 2.5 Flash |
+|----------|:--------:|:----------------:|
+| af | 80.4% | 85.8% |
+| am | 34.4% | 0.5% |
+| ar | 68.4% | 84.4% |
+| as | 35.8% | 23.1% |
+| az | 75.2% | 74.0% |
+| be | 80.7% | 66.4% |
+| bg | 83.1% | 64.3% |
+| bn | 72.8% | 55.3% |
+| br | 90.0% | 69.4% |
+| bs | 84.8% | 85.1% |
+| ca | 85.1% | 88.0% |
+| cs | 85.3% | 79.1% |
+| cy | 82.2% | 77.6% |
+| da | 91.1% | 86.0% |
+| de | 94.8% | 88.3% |
+| el | 85.6% | 83.5% |
+| en | 96.6% | 90.3% |
+| eo | 80.1% | 71.9% |
+| es | 89.3% | 86.8% |
+| et | 75.2% | 73.7% |
+| eu | 80.2% | 74.6% |
+| fa | 75.1% | 61.8% |
+| fi | 83.4% | 86.0% |
+| fr | 93.7% | 86.1% |
+| fy | 81.2% | 70.1% |
+| ga | 80.9% | 70.1% |
+| gd | 71.8% | 59.5% |
+| gl | 80.9% | 80.9% |
+| gu | 70.8% | 47.6% |
+| ha | 72.1% | 59.1% |
+| he | 70.4% | 50.9% |
+| hi | 78.4% | 82.7% |
+| hr | 90.1% | 88.2% |
+| hu | 82.1% | 84.5% |
+| hy | 64.2% | 42.1% |
+| id | 91.6% | 88.3% |
+| is | 77.3% | 72.2% |
+| it | 94.6% | 85.7% |
+| ja | 86.9% | 80.0% |
+| jv | 73.2% | 80.4% |
+| ka | 77.0% | 39.3% |
+| kk | 80.5% | 77.2% |
+| km | 46.1% | 6.3% |
+| kn | 63.2% | 24.5% |
+| ko | 81.5% | 84.8% |
+| ku | 62.0% | 63.2% |
+| ky | 81.2% | 69.8% |
+| la | 73.8% | 70.5% |
+| lo | 60.9% | 13.3% |
+| lt | 79.8% | 70.5% |
+| lv | 76.9% | 81.5% |
+| mg | 81.2% | 78.4% |
+| mk | 83.5% | 77.4% |
+| ml | 64.3% | 23.8% |
+| mn | 88.4% | 71.4% |
+| mr | 75.0% | 69.7% |
+| ms | 79.3% | 79.8% |
+| my | 55.9% | 15.8% |
+| ne | 45.3% | 43.0% |
+| nl | 88.6% | 87.5% |
+| no | 90.5% | 87.8% |
+| or | 31.1% | 11.2% |
+| pa | 48.3% | 22.4% |
+| pl | 91.5% | 91.1% |
+| ps | 12.6% | 13.3% |
+| pt | 95.2% | 89.4% |
+| ro | 84.5% | 76.7% |
+| ru | 85.5% | 82.8% |
+| sa | 51.1% | 44.6% |
+| sd | 50.0% | 29.3% |
+| si | 62.4% | 26.2% |
+| sk | 77.3% | 81.2% |
+| sl | 81.0% | 80.1% |
+| so | 82.4% | 69.9% |
+| sq | 75.3% | 77.1% |
+| sr | 90.3% | 89.7% |
+| su | 85.7% | 96.4% |
+| sv | 93.3% | 91.1% |
+| sw | 88.9% | 80.9% |
+| ta | 77.7% | 53.9% |
+| te | 58.6% | 33.3% |
+| th | 62.6% | 66.7% |
+| tr | 84.1% | 84.1% |
+| ug | 25.8% | 5.4% |
+| uk | 91.0% | 87.9% |
+| ur | 44.1% | 57.6% |
+| uz | 77.2% | 52.8% |
+| vi | 82.6% | 89.5% |
+| xh | 82.1% | 62.1% |
+| yi | 24.9% | 6.8% |
+| zh | 88.7% | 70.0% |
diff --git a/README.md b/README.md
index 35915b1..8fd23a8 100644
--- a/README.md
+++ b/README.md
@@ -65,7 +65,7 @@ Multilingual performance was a focus for us with Chandra 2. There isn't a good
-See full scores [below](#multilingual-benchmark-table).
+See full scores [below](#multilingual-benchmark-table). We also have a [full 90-language benchmark](FULL_BENCHMARKS.md).
We also benchmarked Chandra 2 with the widely accepted olmocr benchmark:
@@ -144,7 +144,7 @@ chandra ./documents ./output --method hf
- `--max-workers INTEGER`: Parallel workers for vLLM
- `--include-images/--no-images`: Extract and save images (default: include)
- `--include-headers-footers/--no-headers-footers`: Include page headers/footers (default: exclude)
-- `--batch-size INTEGER`: Pages per batch (default: 1)
+- `--batch-size INTEGER`: Pages per batch (default: 28 for `vllm`, 1 for `hf`)
**Output Structure:**
@@ -152,7 +152,7 @@ Each processed file creates a subdirectory with:
- `.md` - Markdown output
- `.html` - HTML output
- `_metadata.json` - Metadata (page info, token count, etc.)
-- `images/` - Extracted images from the document
+- Extracted images are saved directly in the output directory
### Streamlit Web App
@@ -176,7 +176,7 @@ This launches a Docker container with optimized inference settings. Configure vi
- `VLLM_MODEL_NAME`: Model name for the server (default: `chandra`)
- `VLLM_GPUS`: GPU device IDs (default: `0`)
-You can also start your own vllm server with the `datalab-to/chandra` model.
+You can also start your own vLLM server with the `datalab-to/chandra-ocr-2` model.
### Configuration
@@ -184,7 +184,7 @@ Settings can be configured via environment variables or a `local.env` file:
```bash
# Model settings
-MODEL_CHECKPOINT=datalab-to/chandra
+MODEL_CHECKPOINT=datalab-to/chandra-ocr-2
MAX_OUTPUT_TOKENS=8192
# vLLM settings
@@ -218,6 +218,8 @@ This code is Apache 2.0, and our model weights use a modified OpenRAIL-M license
# Multilingual benchmark table
+The table below covers the 43 most common languages, benchmarked across multiple models. For a comprehensive evaluation across 90 languages (Chandra 2 vs Gemini 2.5 Flash only), see the [full 90-language benchmark](#full-90-language-benchmark-table).
+
| Language | Datalab API | Chandra 2 | Chandra 1 | Gemini 2.5 Flash | GPT-5 Mini |
|---|:---:|:---:|:---:|:---:|:---:|
| ar | 67.6% | 68.4% | 34.0% | 84.4% | 55.6% |
@@ -264,11 +266,17 @@ This code is Apache 2.0, and our model weights use a modified OpenRAIL-M license
| zh | 87.8% | 88.7% | 88.3% | 70.0% | 70.4% |
| **Average** | **80.4%** | **77.8%** | **69.4%** | **67.6%** | **60.5%** |
+# Full 90-language benchmark table
+
+We also have a more comprehensive evaluation covering 90 languages, comparing Chandra 2 against Gemini 2.5 Flash. The average scores are lower than the 43-language table above because this includes many lower-resource languages. Chandra 2 averages **72.7%** vs Gemini 2.5 Flash at **60.8%**.
+
+See the [full 90-language results](FULL_BENCHMARKS.md).
+
# Credits
Thank you to the following open source projects:
- [Huggingface Transformers](https://github.com/huggingface/transformers)
- [VLLM](https://github.com/vllm-project/vllm)
-- [olmocr](github.com/allenai/olmocr)
-- [Qwen 3 VL](https://github.com/QwenLM/Qwen3)
\ No newline at end of file
+- [olmocr](https://github.com/allenai/olmocr)
+- [Qwen 3.5](https://github.com/QwenLM/Qwen3)
\ No newline at end of file
diff --git a/chandra/model/hf.py b/chandra/model/hf.py
index 3439a28..3f4ebf0 100644
--- a/chandra/model/hf.py
+++ b/chandra/model/hf.py
@@ -16,7 +16,7 @@ def generate_hf(
if max_output_tokens is None:
max_output_tokens = settings.MAX_OUTPUT_TOKENS
- conversations = [[process_batch_element(item, bbox_scale)] for item in batch]
+ conversations = [[process_batch_element(item)] for item in batch]
inputs = model.processor.apply_chat_template(
conversations,
@@ -45,12 +45,12 @@ def generate_hf(
return results
-def process_batch_element(item: BatchInputItem, bbox_scale: int):
+def process_batch_element(item: BatchInputItem):
prompt = item.prompt
prompt_type = item.prompt_type
if not prompt:
- prompt = PROMPT_MAPPING[prompt_type].replace("{bbox_scale}", str(bbox_scale))
+ prompt = PROMPT_MAPPING[prompt_type]
content = []
image = scale_to_fit(item.image) # Guarantee max size
diff --git a/chandra/model/vllm.py b/chandra/model/vllm.py
index d640adb..94ad975 100644
--- a/chandra/model/vllm.py
+++ b/chandra/model/vllm.py
@@ -56,9 +56,7 @@ def generate_vllm(
def _generate(item: BatchInputItem, temperature, top_p) -> GenerationResult:
prompt = item.prompt
if not prompt:
- prompt = PROMPT_MAPPING[item.prompt_type].replace(
- "{bbox_scale}", str(bbox_scale)
- )
+ prompt = PROMPT_MAPPING[item.prompt_type]
content = []
image = scale_to_fit(item.image)
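With `bbox_scale` gone, both call sites (`hf.py` and `vllm.py`) now resolve the prompt with a plain dictionary lookup instead of `str.replace` templating. A minimal sketch of the resulting behavior (the `PROMPT_MAPPING` contents and the `resolve_prompt` helper are illustrative, not the library's actual values):

```python
# Illustrative stand-in for chandra's PROMPT_MAPPING; the real prompts differ.
PROMPT_MAPPING = {"ocr": "Transcribe this page to HTML."}

def resolve_prompt(prompt, prompt_type):
    # After this change: no "{bbox_scale}" substitution, just a lookup.
    if not prompt:
        prompt = PROMPT_MAPPING[prompt_type]
    return prompt

print(resolve_prompt(None, "ocr"))      # falls back to the mapping
print(resolve_prompt("custom", "ocr"))  # an explicit prompt wins
```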
diff --git a/chandra/output.py b/chandra/output.py
index 3fb3283..030a4fd 100644
--- a/chandra/output.py
+++ b/chandra/output.py
@@ -27,7 +27,7 @@ def extract_images(html: str, chunks: dict, image: Image.Image):
for idx, chunk in enumerate(chunks):
div_idx += 1
if chunk["label"] in ["Image", "Figure"]:
- img = chunk["content"].find("img")
+ img = BeautifulSoup(chunk["content"], "html.parser").find("img")
if not img:
continue
bbox = chunk["bbox"]
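The `extract_images` fix addresses a real pitfall: `chunk["content"]` is an HTML string, so `chunk["content"].find("img")` calls `str.find`, which returns a character index (an `int`, or `-1` on a miss) that is always truthy or falsy in the wrong way, never a tag. Parsing the string with BeautifulSoup first yields an actual `<img>` element. A small sketch (the HTML content here is illustrative):

```python
from bs4 import BeautifulSoup

content = '<figure><img src="page_0_img_1.png"></figure>'

# str.find: returns the index of the substring, not a tag object.
idx = content.find("img")
print(idx)  # 9, an int; -1 if "img" were absent

# BeautifulSoup.find: returns the <img> tag itself, or None if absent.
img = BeautifulSoup(content, "html.parser").find("img")
print(img["src"])  # page_0_img_1.png
```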