Oxford University Press Text Capture Instructions

 

Non-Latin based languages

Capture text in non-Latin-based languages (e.g. Greek, Hebrew, Russian, Chinese, Arabic) using an xml:lang attribute whose value is the ISO 639-2 language code.

For phrases, sentences, or single characters embedded within a Latin-based text, wrap the non-Latin text in a span element that carries the xml:lang attribute.

If Greek characters form all or part of a mathematical expression that has been identified through the MathType/Equation editor, equation typecodes, or the manifest, use MathML markup instead.

For entire paragraphs or sections, the xml:lang attribute may go on the p or div element or, in the case of legal extracts, on the extract element (see section Extracts of legal documents).

Take care when capturing Greek capital letters, because several of them look identical to Latin letters but have different code points.

For example, Latin A (U+0041) looks like Greek Α (U+0391, capital letter alpha).
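
One way to catch this kind of mix-up is to scan the content of a Greek span for Latin letters. The following is a minimal sketch, not part of the OUP tooling: the function name latin_letters_in and the sample strings are hypothetical, and the check relies only on the Unicode character names provided by Python's standard unicodedata module.

```python
import unicodedata


def latin_letters_in(text):
    """Return (index, character, Unicode name) for each Latin letter in text."""
    hits = []
    for i, ch in enumerate(text):
        # unicodedata.name raises ValueError for unnamed characters,
        # so an empty default is passed instead.
        name = unicodedata.name(ch, "")
        if ch.isalpha() and name.startswith("LATIN"):
            hits.append((i, ch, name))
    return hits


# Hypothetical span content. Both strings display alike, but the first
# starts with Latin A (U+0041) and the second with Greek alpha (U+0391).
suspect = "\u0041\u03b8\u03ae\u03bd\u03b1"
correct = "\u0391\u03b8\u03ae\u03bd\u03b1"

print(latin_letters_in(suspect))
print(latin_letters_in(correct))
```

A non-empty result for text inside a span marked as Greek suggests that a lookalike Latin letter was keyed by mistake. Legitimate Latin text inside the span would also be reported, so each hit still needs a visual check.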

For scripts that are read right to left (e.g. Hebrew), capture the characters in the XML file in the order in which they are meant to be read. (An editor such as Oxygen will render the characters in the correct right-to-left order.)
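
To confirm that right-to-left text has been stored in reading order, it can help to list the characters in the order they appear in the file, independent of how an editor displays them. This is a minimal sketch under stated assumptions: the function name describe_stored_order is hypothetical, and the sample is an unpointed three-letter Hebrew word (bet, resh, alef) chosen only for illustration.

```python
import unicodedata


def describe_stored_order(text):
    """List each character in storage order with its code point,
    Unicode name, and bidirectional class."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), unicodedata.bidirectional(ch))
        for ch in text
    ]


# Hypothetical sample: bet, resh, alef, stored in the order they are read.
word = "\u05d1\u05e8\u05d0"

for row in describe_stored_order(word):
    print(row)
```

If the first row is the letter a reader would read first (here BET), the text is stored in reading order. The bidirectional class R on each letter is what tells a rendering engine to display the run from right to left.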


<p>The most obvious interpretation of Genesis 1.1 from Hebrew (<span xml:lang="heb">בְּרֶאשּית
בָּרָא אֱלֹהים</span>) to Greek (<span xml:lang="ell">βιβλος γενεσεως ιησου χριστου υιου
δαβιδ υιου αβρααμ</span>) makes the case perfectly clear.</p>
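
Markup like the example above can be given a quick sanity check by listing every xml:lang value and flagging any that do not have the shape of a three-letter code. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the function names and the sample fragment are hypothetical. The check looks only at the shape of the value and does not verify it against the ISO 639-2 code list.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
THREE_LETTER = re.compile(r"[a-z]{3}")


def lang_values(xml_text):
    """Return (tag, xml:lang value) for every element that carries xml:lang."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, el.get(XML_LANG))
        for el in root.iter()
        if el.get(XML_LANG) is not None
    ]


def badly_shaped(xml_text):
    """Return the (tag, value) pairs whose value is not three lowercase letters."""
    return [(tag, v) for tag, v in lang_values(xml_text) if not THREE_LETTER.fullmatch(v)]


# Hypothetical fragment: the second span uses a two-letter code by mistake.
sample = (
    '<p>Hebrew (<span xml:lang="heb">\u05d1\u05e8\u05d0</span>) and Greek '
    '(<span xml:lang="el">\u03b2\u03b9\u03b2\u03bb\u03bf\u03c2</span>).</p>'
)

print(lang_values(sample))
print(badly_shaped(sample))
```

Here the first span passes and the second is reported, because "el" is a two-letter code rather than a three-letter one such as "ell".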

Release ID: 20261202
ID: OUP_Structured_Text_TCI_topic_2_5_1_2
Author: dunnm
Last changed: Wed, 04 Jun 2025
Modified by: buckmasm
Revision#: 4400