🌍 English β†’ Azerbaijani Translator

Quality-Optimized NLLB-200 Model β€’ 23.5M Training Pairs β€’ RTL-Enhanced β€’ Production Ready

✨ Maximum Quality Mode πŸš€ 23M Dataset ⚑ GPU Accelerated πŸ“ RTL Optimized

🎯 About This Translator

This is a quality-optimized English to South Azerbaijani (azb, Arabic script) translation system, fine-tuned on 23,499,996 parallel sentence pairs and combined with tuned generation parameters and RTL post-processing for correct Arabic-script rendering.

✨ Quality Features:

  • Multi-beam search with diversity penalty for better alternatives
  • Length penalty optimization to prevent truncation
  • Repetition control for natural-sounding output
  • Optional sampling-based decoding with temperature and top-k/top-p (sketched after this list)
  • GPU acceleration with mixed precision (when available)
  • RTL post-processing for proper Arabic script display
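
For illustration, the short sketch below shows the sampling-based decoding mentioned above (temperature plus top-k/top-p) as an alternative to beam search. It is a minimal example rather than the Space's exact configuration; the model ID is the one published below, and NLLB selects the target language through forced_bos_token_id.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")

# Sampling-based decoding; the temperature/top-k/top-p values are illustrative,
# not the presets used by the Space.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),  # target language
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    max_length=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))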

πŸ“Š Model Specifications:

  • Training Data: 23,499,996 parallel sentence pairs
  • Parameters: 615M (NLLB-200 Distilled)
  • Quality Grade: A+ (Production Ready)
  • Language Pair: English β†’ South Azerbaijani (azb_Arab)

πŸ”§ RTL Enhancements:

  • Automatic spacing correction around punctuation
  • Zero-width character removal
  • Proper RTL embedding markers
  • Time and number formatting fixes

πŸ‡¬πŸ‡§ Input (English)

πŸ‡¦πŸ‡Ώ Output (Azerbaijani - Arabic Script)

Choose your quality preset based on your needs:

  • Maximum: Best quality for important documents (8 beams, diverse search)
  • High: Excellent quality, recommended default (5 beams)
  • Balanced: Good quality with faster speed (3 beams)
  • Fast: Quick translations for casual use (1 beam)

Or set a custom beam count for finer control.

Quality Optimization Features

This translator uses advanced generation techniques to ensure the highest quality translations:

🎯 Beam Search Optimization

  • Diverse Beam Search: Generates multiple diverse alternatives
  • Length Penalty: Prevents premature truncation
  • Repetition Penalty: Avoids unnatural repetitions
  • Early Stopping: Ends beam search once enough complete hypotheses are found

πŸ“ RTL Post-Processing

  • Punctuation Spacing: Automatic correction around Arabic punctuation (ΨŒΨ›ΨŸ)
  • Zero-Width Cleanup: Removes invisible characters causing encoding issues
  • Number Formatting: Proper spacing for times (15:00) and ranges
  • RTL Markers: Unicode embedding for correct browser rendering
  • Whitespace Normalization: Removes extra spaces and artifacts (see the combined sketch after this list)
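
A minimal sketch of this post-processing pass is shown below; it extends the post_process_rtl function from the full example at the end of the page. The time and range rules are illustrative assumptions, so the exact regexes used by the Space may differ.

import re

def post_process_rtl(text):
    # Zero-width cleanup: strip ZWNJ, ZWJ, and zero-width space.
    text = re.sub(r'[\u200b\u200c\u200d]', '', text)
    # Punctuation spacing: no space before, one space after Arabic comma/semicolon/question mark.
    text = re.sub(r'\s*([،؛؟])\s*', r'\1 ', text)
    # Number formatting: keep times like 15:00 and ranges like 9-17 tight (illustrative rule).
    text = re.sub(r'(\d)\s*:\s*(\d)', r'\1:\2', text)
    text = re.sub(r'(\d)\s*-\s*(\d)', r'\1-\2', text)
    # Whitespace normalization: collapse runs of spaces and trim.
    text = re.sub(r'\s+', ' ', text).strip()
    # RTL markers: wrap in Right-to-Left Embedding / Pop Directional Formatting.
    return f"\u202B{text}\u202C"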

⚑ Performance Optimizations

  • Mixed Precision: FP16 on GPU for faster inference
  • KV Caching: Speeds up sequential generation
  • Batch Processing: Efficient multi-text translation (see the sketch after this list)
  • Gradient Checkpointing: Memory-efficient processing during fine-tuning
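
A minimal sketch of batched, mixed-precision inference under these optimizations is shown below, assuming a CUDA GPU is available; the model ID and language codes come from this page, while the batch contents and beam settings are illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # mixed precision on GPU
).to(device)
model.eval()

texts = ["Good morning.", "The meeting starts at 15:00.", "Thank you for your help."]

# Tokenize the whole batch with padding so it can be translated in one generate call.
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),
        num_beams=5,      # "High" preset
        max_length=256,
        use_cache=True,   # KV caching (enabled by default; shown here for clarity)
    )

for src, ids in zip(texts, outputs):
    print(src, "->", tokenizer.decode(ids, skip_special_tokens=True))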

πŸ“Š Quality Presets

Preset   | Beams | Features                               | Best For
Maximum  | 8     | Diverse search, high diversity penalty | Legal, medical, critical documents
High     | 5     | Balanced quality and speed             | General documents, articles
Balanced | 3     | Good quality, faster                   | Emails, chat, casual content
Fast     | 1     | Greedy decoding                        | Quick translations, drafts
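
A minimal sketch of how these presets might map onto transformers generation arguments; the beam counts follow the table above, while the remaining values mirror the maximum-quality example later on this page and are assumptions rather than the Space's exact settings.

# Hypothetical preset-to-kwargs mapping; beam counts follow the table above,
# the other values are illustrative.
PRESETS = {
    "maximum":  dict(num_beams=8, num_beam_groups=4, diversity_penalty=0.5,
                     length_penalty=1.2, no_repeat_ngram_size=4, repetition_penalty=1.3),
    "high":     dict(num_beams=5, length_penalty=1.2, no_repeat_ngram_size=4),
    "balanced": dict(num_beams=3, length_penalty=1.1),
    "fast":     dict(num_beams=1),  # greedy decoding
}

def generation_kwargs(preset, max_length=256):
    kwargs = dict(PRESETS[preset])
    kwargs["max_length"] = max_length
    kwargs["early_stopping"] = kwargs.get("num_beams", 1) > 1
    return kwargs

These kwargs can then be passed straight to model.generate, as in the full example under "Code Examples with RTL Processing".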

Model Architecture

  • Model Name: NLLB ENβ†’AZB Fine-tuned 23M
  • Base Model: facebook/nllb-200-distilled-600M
  • Parameters: 615M
  • Encoder/Decoder Layers: 12
  • Hidden Size: 1024
  • Attention Heads: 16
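
These numbers can be checked directly against the published checkpoint; the sketch below assumes the config exposes the standard NLLB (M2M-100) field names.

from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
config = AutoConfig.from_pretrained(model_id)

print("encoder layers:", config.encoder_layers)
print("decoder layers:", config.decoder_layers)
print("hidden size:", config.d_model)
print("attention heads:", config.encoder_attention_heads)

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
print("parameters:", sum(p.numel() for p in model.parameters()))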

Training Details

  • Dataset: taksa1990/azb-en-translation-large
  • Training Samples: 23.5M pairs
  • Test Samples: 4,000
  • Language Pair: English β†’ South Azerbaijani (azb_Arab)
  • Script: Arabic script
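
The training corpus is published on the Hub and can be loaded with the datasets library; the split name and record layout below are assumptions, so check the dataset card for the actual schema.

from datasets import load_dataset

# Dataset ID from the training details above; the split name is an assumption.
ds = load_dataset("taksa1990/azb-en-translation-large", split="train")
print(ds)        # number of rows and column names
print(ds[0])     # inspect one English-Azerbaijani pair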

Recommended Use Cases

βœ… Excellent for:

  • πŸ“° News articles and journalism
  • πŸ“š Technical documentation and manuals
  • 🌐 Website and app localization
  • πŸ“§ Professional emails and correspondence
  • πŸ“„ Business documents and reports
  • πŸŽ“ Academic and educational content
  • πŸ›’ E-commerce product descriptions
  • πŸ“± Social media content

⚠️ Requires review for:

  • βš–οΈ Legal contracts (use Maximum quality + human review)
  • πŸ₯ Medical documents (professional translation required)
  • πŸ’° Financial statements (certified translation needed)

❌ Not recommended for:

  • 🎡 Poetry and creative writing (artistic nuance required)
  • πŸ—£οΈ Regional dialects (trained on standard Azerbaijani)
  • 🎭 Idiomatic expressions (may be too literal)

Code Examples with RTL Processing

Maximum Quality Translation with RTL

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import re

model_id = "taksa1990/nllb-en-azb-finetuned-23M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def post_process_rtl(text):
    # Remove zero-width characters
    text = text.replace('\u200c', '').replace('\u200d', '')
    # Fix spacing around punctuation
    text = re.sub(r'\s*،\s*', '، ', text)
    text = re.sub(r'\s*؟\s*', '؟ ', text)
    # Add RTL markers
    return f"\u202B{text}\u202C"

text = "Hello, how are you today?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Maximum quality settings
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("azb_Arab"),  # NLLB target language
    max_length=256,
    num_beams=8,
    num_beam_groups=4,
    diversity_penalty=0.5,
    length_penalty=1.2,
    early_stopping=True,
    no_repeat_ngram_size=4,
    repetition_penalty=1.3
)

translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
translation = post_process_rtl(translation)

print(f"EN: {text}")
print(f"AZB: {translation}")

Citation

@misc{nllb-en-azb-23m-rtl,
  title={Quality-Optimized NLLB English to South Azerbaijani with RTL Processing},
  author={tayden1990},
  year={2025},
  url={https://huggingface.co/taksa1990/nllb-en-azb-finetuned-23M},
  base_model={facebook/nllb-200-distilled-600M},
  dataset={taksa1990/azb-en-translation-large},
  training_samples={23499996},
  optimization={quality-first-rtl-enhanced}
}