AI Chatbots That Actually Understand Arabic
What it takes to build Arabic-language AI assistants that handle dialects, retrieval, tone, and business context reliably — not just in demos.

A client launched an Arabic customer support chatbot for their e-commerce platform. The demo was impressive: the assistant answered product questions in formal Modern Standard Arabic, correctly understood complex queries, and cited policy pages accurately. Three weeks into production, the CSAT scores were worse than the old email form. The problem was not the model. It was that real Egyptian shoppers were typing in Egyptian dialect, code-switching between Arabic and English mid-sentence, and asking questions the formal demo corpus had never seen.
Arabic AI that works in the lab is not the same as Arabic AI that works for your users.
Arabic is not one language
Arabic exists on a spectrum from Modern Standard Arabic (MSA) — used in writing, news, and formal contexts — to regional dialects that differ significantly in vocabulary, grammar, and pronunciation. Egyptian Arabic, Gulf Arabic, Levantine Arabic, and Moroccan Darija are not interchangeable. A chatbot tuned on MSA text will produce formal, slightly robotic responses and misunderstand casual dialect input.
Add to this the widespread MENA habit of Arabizi (Arabic written in Latin characters and numbers: "3ayez eh?" instead of "عايز إيه؟") and the heavy use of English technical terms even within Arabic sentences, and you have a vocabulary distribution that most off-the-shelf embedding models were not trained to handle.
Concretely, this means:
- Retrieval will fail if your knowledge base uses MSA text but users type in dialect. The semantic distance between query and relevant chunk may be high enough that the right document never surfaces.
- Generation will sound wrong if the model defaults to formal MSA when users expect a conversational, regional tone.
- Intent detection will miss slang, brand names in Arabic script, and mixed-language requests.
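One practical mitigation is to normalize both knowledge-base text and user queries before embedding, so common spelling variants collapse to a single form. A minimal sketch — the exact set of characters to normalize is an assumption; tune it against your own corpus:

```typescript
// Normalize Arabic text so spelling variants map to the same form before
// embedding. Covers the most common sources of retrieval mismatch:
// diacritics, tatweel, alef variants, taa marbuta, and alef maqsura.
function normalizeArabic(text: string): string {
  return text
    .replace(/[\u064B-\u0652]/g, "") // strip tashkeel (diacritics)
    .replace(/\u0640/g, "") // strip tatweel (kashida)
    .replace(/[\u0622\u0623\u0625]/g, "\u0627") // آ / أ / إ → ا
    .replace(/\u0629/g, "\u0647") // ة → ه
    .replace(/\u0649/g, "\u064A") // ى → ي
    .trim();
}
```

Apply the same normalization to documents at indexing time and to queries at search time; normalizing only one side reintroduces the mismatch.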
Retrieval is more important than prompt engineering for business chatbots
Most business chatbots answer from a specific body of knowledge: return policies, a product catalog, service procedures, medical protocols. The quality of retrieval matters far more than the system prompt.
// Retrieve only Arabic-language, policy-category chunks for this business.
const retriever = vectorStore.asRetriever({
  k: 5,
  filter: { locale: "ar", businessId, category: "policy" },
});
const relevantDocs = await retriever.invoke(userQuery);

const response = await model.invoke([
  {
    role: "system",
    // "Answer in Arabic appropriate to the user's context.
    //  Use only the information provided in the context.
    //  If you cannot find the answer, ask a helpful clarifying
    //  question instead of guessing."
    content: `أجب بالعربية المناسبة لسياق المستخدم.
استخدم فقط المعلومات المقدمة في السياق.
إذا لم تجد الإجابة، اطرح سؤالاً توضيحياً مفيداً بدلاً من التخمين.`,
  },
  {
    role: "user",
    content: `السياق:\n${relevantDocs.map((d) => d.pageContent).join("\n\n")}\n\nالسؤال:\n${userQuery}`,
  },
]);
Good retrieval makes the assistant specific and reduces hallucination. The model cannot invent policy details if you tell it to answer only from the provided context.
Embedding model choices for Arabic
Not all embedding models handle Arabic well. English benchmark scores do not predict Arabic retrieval quality. Options worth evaluating:
OpenAI text-embedding-3-large — strong multilingual coverage including MSA and mixed Arabic/English. The default when you are already in the OpenAI ecosystem.
multilingual-e5-large (Microsoft) — open-source, strong Arabic performance across dialects, runs locally. Good when data residency matters or API costs are a constraint.
Cohere embed-multilingual-v3.0 — consistently strong on Arabic retrieval tasks across dialects, available through their API.
Build a small evaluation set of 50–100 question/expected-document pairs that reflect your actual user queries — including dialect, Arabizi, and mixed-language examples — and measure Recall@5 for each model before committing.
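The Recall@5 measurement itself is a few lines of code. A sketch, assuming your vector store exposes a search function that returns document IDs (the interfaces here are illustrative, not any specific library's API):

```typescript
// Hypothetical shapes — adapt to your embedding client and vector store.
interface EvalCase {
  query: string; // real user phrasing: dialect, Arabizi, mixed-language
  expectedDocId: string; // the document that should surface
}

type SearchFn = (query: string, k: number) => Promise<string[]>; // returns doc IDs

// Recall@k: the fraction of eval cases where the expected document
// appears in the top-k retrieved results.
async function recallAtK(
  cases: EvalCase[],
  search: SearchFn,
  k = 5,
): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const ids = await search(c.query, k);
    if (ids.includes(c.expectedDocId)) hits++;
  }
  return hits / cases.length;
}
```

Run this once per candidate embedding model against the same eval set; the model with the highest recall on your real queries wins, regardless of its English benchmark numbers.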
Chunking strategy for Arabic text
Arabic has longer sentences than English at similar information density, and its morphological richness means splitting mid-sentence loses important context.
- Chunk by paragraph rather than by character count for prose documents. Paragraphs tend to be semantically coherent units.
- Avoid splitting mid-sentence. Arabic sentences carry essential grammatical information throughout; a fragment of the end without the beginning is meaningless.
- Add overlap of one or two sentences when paragraphs are long, to help when a query concept spans a boundary.
- Include section headers in each chunk. Arabic readers navigate documents by titles; including the header helps the model cite context accurately.
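The rules above can be sketched as a simple paragraph chunker. This version assumes headers are marked with "### " and paragraphs are separated by blank lines — adjust the markers to your document format:

```typescript
interface Chunk {
  header: string;
  text: string;
}

// Split a document into paragraph-based chunks: each chunk keeps its
// section header, and one trailing sentence from the previous paragraph
// is carried over as overlap. Sentence boundaries include the Arabic
// question mark (؟).
function chunkByParagraph(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  let header = "";
  let prevTail = "";
  for (const block of doc.split(/\n\s*\n/)) {
    const trimmed = block.trim();
    if (!trimmed) continue;
    if (trimmed.startsWith("### ")) {
      header = trimmed.slice(4);
      prevTail = ""; // do not overlap across section boundaries
      continue;
    }
    const body = prevTail ? `${prevTail} ${trimmed}` : trimmed;
    chunks.push({ header, text: header ? `${header}\n${body}` : body });
    // Keep the last sentence of this paragraph as overlap for the next.
    const sentences = trimmed.split(/(?<=[.!?؟])\s+/);
    prevTail = sentences[sentences.length - 1];
  }
  return chunks;
}
```

Note that overlap resets at section boundaries: carrying a sentence from one section into the next would attach it to the wrong header.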
Dialect and tone are product decisions, not technical ones
Deciding how your assistant speaks matters more than which model you pick. An assistant for a Gulf luxury retailer should not sound like an Egyptian call center. A healthcare assistant should be warm but measured. An enterprise B2B assistant may legitimately need formal MSA with English technical terms.
Define target tone with three inputs:
- Market — which country and dialect is most represented in your users?
- Use case — support deflection, sales assistance, appointment booking?
- Brand voice — how does the company speak in its human-written communications?
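These three inputs can be captured in a small tone profile that feeds the system prompt. A sketch with illustrative names — the fields and wording are assumptions to adapt, not a standard schema:

```typescript
// Illustrative tone profile derived from market, use case, and brand voice.
interface ToneProfile {
  market: "EG" | "SA" | "AE" | "MA"; // dominant user market
  dialect: string; // e.g. "Egyptian Arabic", "Gulf Arabic"
  formality: "casual" | "neutral" | "formal";
  allowEnglishTerms: boolean; // keep code-switched technical terms?
}

// Turn the profile into a system-prompt fragment.
function toneInstruction(p: ToneProfile): string {
  const register =
    p.formality === "formal"
      ? "Use Modern Standard Arabic."
      : `Use conversational ${p.dialect}.`;
  const mixing = p.allowEnglishTerms
    ? "Keep common English technical terms as-is."
    : "Prefer Arabic equivalents over English terms.";
  return `${register} ${mixing}`;
}
```

Keeping tone in a profile rather than hard-coded prose makes it a reviewable product decision: the team can change dialect or formality per market without touching retrieval or model code.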
Evaluate outputs with native speakers from that market, not with automated metrics alone. BLEU scores and cosine similarity tell you nothing about whether the response sounds right to a real user.
Escalation paths are a safety requirement
A business chatbot without escalation is a liability. Define clearly:
- When to transfer to a human — complaints, sensitive topics, repeated failed attempts.
- When to ask for clarification instead of guessing.
- What the assistant cannot answer — medical diagnosis, legal conclusions, financial commitments beyond its scope.
// Coarse first-pass routing: escalate on low model confidence or when the
// message mentions a sensitive topic (keyword match in both languages).
async function routeMessage(message: string, response: AssistantResponse) {
  const sensitiveTopics = ["complaint", "refund", "medical", "legal", "شكوى", "استرداد"];
  const isSensitive = sensitiveTopics.some((t) =>
    message.toLowerCase().includes(t),
  );
  if (response.confidence < 0.7 || isSensitive) {
    return { action: "escalate", reason: "low_confidence_or_sensitive_topic" };
  }
  return { action: "respond", content: response.text };
}
For high-stakes domains — healthcare, finance, legal — a guided assistant that narrows choices is safer than an open-ended question-answering machine.
Evaluation that reflects real users
Arabic AI quality degrades in non-obvious ways. What looks fine in English A/B testing may be noticeably worse in Arabic. Build an evaluation set from real production conversations (anonymized), covering:
- Dialect variations of the same question.
- Code-switched queries: "عايز اعرف الـ delivery fee بتاعت order رقم 5580" ("I want to know the delivery fee for order number 5580").
- Questions with typos and informal spelling.
- Questions the chatbot should refuse or escalate.
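A small typed structure keeps this eval set reviewable by both engineers and native-speaker reviewers. The cases below are illustrative examples of each category, not real production data:

```typescript
type ExpectedBehavior = "answer" | "clarify" | "escalate" | "refuse";

interface ChatEvalCase {
  input: string;
  category: "dialect" | "code_switch" | "typo" | "out_of_scope";
  expected: ExpectedBehavior;
}

// Illustrative cases — replace with anonymized real conversations.
const evalSet: ChatEvalCase[] = [
  // Dialect phrasing of a question the knowledge base answers in MSA.
  { input: "عايز ارجع الاوردر", category: "dialect", expected: "answer" },
  // Code-switched Arabic/English query.
  { input: "عايز اعرف الـ delivery fee", category: "code_switch", expected: "answer" },
  // Typos and informal spelling.
  { input: "ازاى اغير العنووان", category: "typo", expected: "answer" },
  // Out of scope (medical): the bot should escalate, not guess.
  { input: "هل الدواء ده آمن للحامل؟", category: "out_of_scope", expected: "escalate" },
];
```

Each case records what the assistant should *do*, not just what it should say — which lets the same set check both answer quality and escalation behavior.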
Run this evaluation on every model and prompt change before shipping to production. Subjective native-speaker review is not optional — automated metrics alone will not catch tone and dialect failures.
Arabic AI products succeed when retrieval is grounded in real user language, tone matches the market, escalation is built in from the start, and evaluation is done by people who speak the language your users actually use.
For the broader challenge of building bilingual interfaces that house these assistants, see Building Bilingual Next.js Apps.