English → Azerbaijani Translator
Quality-Optimized NLLB-200 Model • 23.5M Training Pairs • RTL-Enhanced • Production Ready
About This Translator
This is a quality-optimized English to South Azerbaijani (azb, Arabic script) translation system, fine-tuned on 23,499,996 parallel sentence pairs, with advanced generation parameters and RTL post-processing for correct Arabic-script rendering.
Quality Features:
- Multi-beam search with diversity penalty for better alternatives
- Length penalty optimization to prevent truncation
- Repetition control for natural-sounding output
- Advanced decoding with temperature and top-k/top-p sampling (see the sketch below)
- GPU acceleration with mixed precision (when available)
- RTL post-processing for proper Arabic script display
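The sampling options above can be combined in a single generate() call. A minimal sketch, with illustrative parameter values rather than the app's exact settings:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Good morning, see you at 15:00.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    # NLLB selects the target language via a forced BOS token
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),
    do_sample=True,   # sample instead of beam search
    temperature=0.8,  # values below 1.0 sharpen the token distribution
    top_k=50,         # keep only the 50 most likely tokens per step
    top_p=0.95,       # nucleus sampling over the top 95% of probability mass
    max_length=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Sampling trades determinism for variety; for most documents the beam-search presets described below are the safer default.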
Model Specifications:
- Training Data: 23,499,996 parallel sentence pairs
- Parameters: 615M (NLLB-200 Distilled)
- Quality Grade: A+ (Production Ready)
- Language Pair: English → South Azerbaijani (azb_Arab)
RTL Enhancements:
- Automatic spacing correction around punctuation
- Zero-width character removal
- Proper RTL embedding markers
- Time and number formatting fixes
🇬🇧 Input (English)
🇦🇿 Output (Azerbaijani - Arabic Script)
Choose your quality preset based on your needs:
- Maximum: Best quality for important documents (8 beams, diverse search)
- High: Excellent quality, recommended default (5 beams)
- Balanced: Good quality with faster speed (3 beams)
- Fast: Quick translations for casual use (1 beam)
Or customize the beam count for finer control.
Try These Examples
Quality Optimization Features
This translator uses advanced generation techniques to ensure the highest quality translations:
Beam Search Optimization
- Diverse Beam Search: Generates multiple diverse alternatives
- Length Penalty: Prevents premature truncation
- Repetition Penalty: Avoids unnatural repetitions
- Early Stopping: Ends beam search once enough complete candidates are found
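The diverse alternatives mentioned above can be surfaced with num_return_sequences. A minimal, self-contained sketch with illustrative values (the app's exact settings are not published):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("The report is due on Friday.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),
    num_beams=8,
    num_beam_groups=4,       # num_beams must be divisible by num_beam_groups
    diversity_penalty=0.5,   # pushes beam groups toward different wordings
    num_return_sequences=4,  # one candidate per beam group
    length_penalty=1.2,
    no_repeat_ngram_size=4,
    early_stopping=True,
    max_length=256,
)
for i, seq in enumerate(outputs):
    print(i, tokenizer.decode(seq, skip_special_tokens=True))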
RTL Post-Processing
- Punctuation Spacing: Automatic correction around Arabic punctuation (، ؛ ؟)
- Zero-Width Cleanup: Removes invisible characters causing encoding issues
- Number Formatting: Proper spacing for times (15:00) and ranges
- RTL Markers: Unicode embedding for correct browser rendering
- Whitespace Normalization: Removes extra spaces and artifacts
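The time and range fixes are the one step not shown in the code example further down, so here is a hypothetical helper illustrating what such rules can look like (fix_times_and_ranges and its patterns are assumptions, not the app's published rules):

import re

def fix_times_and_ranges(text):
    # Collapse stray spaces inside clock times: "15 : 00" -> "15:00"
    text = re.sub(r'(\d{1,2})\s*:\s*(\d{2})', r'\1:\2', text)
    # Tighten numeric ranges: "10 - 20" -> "10-20"
    text = re.sub(r'(\d)\s*-\s*(\d)', r'\1-\2', text)
    return text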
Performance Optimizations
- Mixed Precision: FP16 on GPU for faster inference
- KV Caching: Speeds up sequential generation
- Batch Processing: Efficient multi-text translation
- Gradient Checkpointing: Memory-efficient fine-tuning (a training-time technique, not used at inference)
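A sketch of how the inference-time optimizations above combine for batched translation; the sentence list and beam count are illustrative, and FP16 is only selected when a GPU is present:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

sentences = ["Good morning.", "The meeting starts at 15:00.", "Thank you!"]
batch = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
with torch.inference_mode():  # no gradients needed at inference time
    outputs = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),
        num_beams=5,
        use_cache=True,  # KV caching: reuse attention states across decode steps
        max_length=256,
    )
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))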
Quality Presets
| Preset | Beams | Features | Best For |
|---|---|---|---|
| Maximum | 8 | Diverse search, high diversity penalty | Legal, medical, critical documents |
| High | 5 | Balanced quality and speed | General documents, articles |
| Balanced | 3 | Good quality, faster | Emails, chat, casual content |
| Fast | 1 | Greedy decoding | Quick translations, drafts |
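One way to encode this table as generate() keyword arguments; the exact values behind each preset are assumptions inferred from the table, not the app's published configuration:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed preset-to-parameters mapping (inferred from the table above)
PRESETS = {
    "maximum":  dict(num_beams=8, num_beam_groups=4, diversity_penalty=0.5),
    "high":     dict(num_beams=5),
    "balanced": dict(num_beams=3),
    "fast":     dict(num_beams=1),  # a single beam is greedy decoding
}

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Please review the attached file.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),
    max_length=256,
    **PRESETS["high"],
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))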
Model Architecture
- Model Name: NLLB EN→AZB Fine-tuned 23M
- Base Model: facebook/nllb-200-distilled-600M
- Parameters: 615M
- Encoder/Decoder Layers: 12
- Hidden Size: 1024
- Attention Heads: 16
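These figures can be checked directly against the model config; the attribute names follow the NLLB/M2M-100 config class in transformers:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("taksa1990/nllb-en-azb-finetuned-23M")
print(config.encoder_layers, config.decoder_layers)  # expected: 12 12
print(config.d_model)                                # expected: 1024
print(config.encoder_attention_heads)                # expected: 16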
Training Details
- Dataset: taksa1990/azb-en-translation-large
- Training Samples: 23.5M pairs
- Test Samples: 4,000
- Language Pair: English → South Azerbaijani (azb_Arab)
- Script: Arabic script
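The dataset can be loaded with the datasets library; the split name below is an assumption, so check the dataset card for the actual schema:

from datasets import load_dataset

ds = load_dataset("taksa1990/azb-en-translation-large", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one English/Azerbaijani pair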
Recommended Use Cases
✅ Excellent for:
- News articles and journalism
- Technical documentation and manuals
- Website and app localization
- Professional emails and correspondence
- Business documents and reports
- Academic and educational content
- E-commerce product descriptions
- Social media content
⚠️ Requires review for:
- Legal contracts (use Maximum quality + human review)
- Medical documents (professional translation required)
- Financial statements (certified translation needed)
❌ Not recommended for:
- Poetry and creative writing (artistic nuance required)
- Regional dialects (trained on standard Azerbaijani)
- Idiomatic expressions (translations may be too literal)
Code Examples with RTL Processing
Maximum Quality Translation with RTL
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import re
model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")  # be explicit about the source language
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # use torch.float32 when running on CPU
    device_map="auto"
)
def post_process_rtl(text):
    # Remove zero-width characters
    text = text.replace('\u200c', '').replace('\u200d', '')
    # Fix spacing around Arabic punctuation (comma and question mark)
    text = re.sub(r'\s*،\s*', '، ', text)
    text = re.sub(r'\s*؟\s*', '؟ ', text)
    # Wrap in RTL embedding markers (U+202B ... U+202C)
    return f"\u202B{text}\u202C"
text = "Hello, how are you today?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Maximum quality settings
outputs = model.generate(
    **inputs,
    # NLLB selects the target language via a forced BOS token
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),
    max_length=256,
    num_beams=8,
    num_beam_groups=4,
    diversity_penalty=0.5,
    length_penalty=1.2,
    early_stopping=True,
    no_repeat_ngram_size=4,
    repetition_penalty=1.3
)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
translation = post_process_rtl(translation)
print(f"EN: {text}")
print(f"AZB: {translation}")
Citation
@misc{nllb-en-azb-23m-rtl,
  title={Quality-Optimized NLLB English to South Azerbaijani with RTL Processing},
  author={tayden1990},
  year={2025},
  url={https://huggingface.co/taksa1990/nllb-en-azb-finetuned-23M},
  base_model={facebook/nllb-200-distilled-600M},
  dataset={taksa1990/azb-en-translation-large},
  training_samples={23499996},
  optimization={quality-first-rtl-enhanced}
}
Links & Resources
Model Card • Training Dataset • Base Model • GitHub
Model: taksa1990/nllb-en-azb-finetuned-23M
Training Data: 23,499,996 pairs • Optimization: Quality-First + RTL
Made with ❤️ by @tayden1990 • Quality-Optimized • RTL-Enhanced • Last Updated: 2025-10-16 03:12:45 UTC