XeTeX: Ligatures and search

Information and discussion about XeTeX, an alternative to pdfTeX based on e-TeX

Post by paul »

After typesetting a book using Xe(La)TeX, I was asked why some "character pairs get squeezed into strange binary single characters in the PDF" -- i.e., ligatures for "fi", etc. After explaining, I was asked if it was possible to have the ligatures appear in the "visual layer" and the individual component letters in the "text layer" (I didn't even know there was such a thing!), because the ligatures interfere with searchability.

Anybody have a solution?

Post by josephwright »

This should usually "just happen". Can you provide an example where you are seeing an issue?
Joseph Wright

Post by paul »

From my testing: Evince recognizes the ff/fi/fl/ffi/ffl ligatures but, as far as I can tell, no others (the other common one being Th). xpdf doesn't recognize any of them. I haven't tried Adobe software, but the publisher tells me it's a problem.

Post by josephwright »

No, I meant: can you provide the TeX source for something showing the issue, so we can test it ourselves?
Joseph Wright

Post by paul »

Sure; but anything that uses the "problem" characters will do.

Code:

\documentclass[a4paper]{article}
\usepackage{xltxtra}
\setmainfont[Mapping=tex-text]{Adobe Garamond Pro}

\begin{document}
This is a problem.
\end{document}

Post by frabjous »

This seems to be very font specific. I tested with this:

Code:

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Mapping=tex-text]{Adobe Caslon Pro}
\begin{document}
The affluent affinity for fishing off the flank of the fjord \ldots
\end{document}

When searching, Evince could "see through" the ffl and ffi ligatures, but not the Th, fi, fl, or fj ligatures.

I then switched to Sorts Mill Goudy, and Evince couldn't see through any of the ligatures.

(I chose Sorts Mill Goudy since it's one of the few freely available OpenType fonts out there that has ligatures. I figured some people reading this might want to test for themselves even if they don't have access to the expensive Pro fonts. It does not have a Th ligature, though.)

(P.S. After some more testing, Okular seems to have more problems than Evince, so this is viewer-specific too.)
Last edited by frabjous on Fri May 07, 2010 3:14 pm, edited 1 time in total.

Post by paul »

Also, if the text is set with old-style digits, you can't search for "normal" digits.

Post by paul »

OK, so what's happening is that when the ligature exists in Unicode (that is: IJ, ij, LJ, lj, ff, fi, fl, ffi, ffl, st -- but not OE or oe), everything works OK. Similarly for the Unicode double-letters (somehow distinguished from ligatures by Unicode naming) LJ, Lj, lj, NJ, Nj, nj, DZ, Dz, dz, DŽ, Dž, dž -- but, again, not AE or ae. All other printed ligatures are "broken" because they're at some private-use codepoint lacking Unicode decomposition data -- thus the same problem occurs with alternate forms.

Is it possible to invisibly associate the components with the ligated characters in the same way OCR text can be associated with scanned documents?
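
For what it's worth, the PDF feature that matches this description is the /ActualText entry, which attaches replacement text to a run of glyphs. Below is a minimal sketch, assuming a XeTeX release recent enough to provide the \XeTeXgenerateactualtext primitive (it is missing from older releases). The default font and the sample sentence are placeholders, and whether searching then works still depends on the viewer honoring /ActualText:

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
% The fontspec default (Latin Modern) is used here so the file compiles anywhere.
% To test the fonts discussed above, add for example (requires the font to be installed):
%   \setmainfont[Mapping=tex-text]{Adobe Garamond Pro}
\XeTeXgenerateactualtext=1 % assumption: primitive available; asks XeTeX to emit /ActualText
\begin{document}
The affluent affinity for fishing off the flank of the fjord \ldots
\end{document}
```

The quickest check is to compile the file twice, once with the primitive set to 0 and once with 1, and search each PDF for "fi" and "Th" in the viewers mentioned above.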