After typesetting a book using Xe(La)TeX, I was asked why some "character pairs get squeezed into strange binary single characters in the PDF" -- i.e., ligatures for "fi", etc. After explaining, I was asked if it was possible to have the ligatures appear in the "visual layer" and the individual component letters in the "text layer" (I didn't even know there was such a thing!), because the ligatures interfere with searchability.
Anybody have a solution?
XeTeX ⇒ Ligatures and search
Re: Ligatures and search
This should usually "just happen". Can you provide an example where you are seeing an issue?
Joseph Wright
Re: Ligatures and search
From my testing: evince recognizes ff/fi/fl/ffi/ffl ligatures, but no others as far as I can tell (the other common one being Th). xpdf doesn't recognize any of them. I haven't tried Adobe software, but the publisher tells me it's a problem.
Re: Ligatures and search
No, I meant: can you provide the TeX source for something showing the issue, so we can test it ourselves?
Joseph Wright
Ligatures and search
Sure; but anything that uses the "problem" characters will do.
Code: Select all
\documentclass[a4paper]{article}
\usepackage{xltxtra}
\setmainfont[Mapping=tex-text]{Adobe Garamond Pro}
\begin{document}
This is a problem.
\end{document}
Ligatures and search
This seems to be very font-specific. I tested with this:
Code: Select all
\documentclass{article}
\usepackage{fontspec}
\setmainfont[Mapping=tex-text]{Adobe Caslon Pro}
\begin{document}
% Test sentence exercising the Th, ffl, ffi, fi, ff, fl and fj ligatures
The affluent affinity for fishing off the flank of the fjord \ldots
\end{document}
Evince could "see through" (when searching) the ffl and ffi ligatures, but not the Th, fi, fl or fj ligatures.
I then switched to Sorts Mill Goudy, and Evince couldn't see any of the ligatures.
(I chose Sorts Mill Goudy since it's one of the few freely available OpenType fonts out there that has ligatures. I figured some people reading this might want to test for themselves even if they don't have access to the expensive Pro fonts. It does not have a Th ligature, though.)
(P.S. Testing some more, Okular seems to have more problems than Evince, so this is viewer-specific too.)
Last edited by frabjous on Fri May 07, 2010 3:14 pm, edited 1 time in total.
Re: Ligatures and search
Also, if you have old-style digits, you can't search for "normal" digits.
Re: Ligatures and search
OK, so what's happening is that when the ligature exists in Unicode (that is: IJ, ij, LJ, lj, ff, fi, fl, ffi, ffl, st -- but not OE or oe), everything works OK. Similarly for the Unicode double-letters (somehow distinguished from ligatures by Unicode naming) LJ, Lj, lj, NJ, Nj, nj, DZ, Dz, dz, DŽ, Dž, dž -- but, again, not AE or ae. All other printed ligatures are "broken" because they're at some private-use codepoint lacking Unicode decomposition data -- thus the same problem occurs with alternate forms.
Is it possible to invisibly associate the components with the ligated characters in the same way OCR text can be associated with scanned documents?
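One possible direction, offered only as a sketch: the PDF format has an "ActualText" attribute that attaches replacement text to a piece of marked content, and the accsupp package exposes it through \BeginAccSupp and \EndAccSupp. The example below wraps the Th ligature by hand. The macro name \ligTh is hypothetical, and it is an assumption that accsupp picks a working driver automatically under XeLaTeX and that the viewer in use honours ActualText when searching; neither has been tested against the fonts and viewers discussed in this thread.

```latex
\documentclass[a4paper]{article}
\usepackage{fontspec}
\usepackage{accsupp}
\setmainfont[Mapping=tex-text]{Adobe Garamond Pro}
% \ligTh is a hypothetical helper: it typesets "Th" (so the font can
% still form the ligature) and tags it with ActualText "Th", which a
% viewer may use for search and copy instead of the glyph's own code.
\newcommand*{\ligTh}{%
  \BeginAccSupp{method=plain,ActualText=Th}Th\EndAccSupp{}%
}
\begin{document}
\ligTh{}is is a problem.
\end{document}
```

Marking every ligature by hand like this does not scale to a whole book, and the tagging sits between "Th" and "is", so kerning and hyphenation across that boundary may be affected. It is best treated as a way to check whether a given viewer respects ActualText at all.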