[Poppler-bugs] [Bug 21590] New: Unchecked code space ranges cause excessive memory allocations

Wed May 6 03:05:29 PDT 2009

http://bugs.freedesktop.org/show_bug.cgi?id=21590

           Summary: Unchecked code space ranges cause excessive memory
                    allocations
           Product: poppler
           Version: unspecified
          Platform: x86-64 (AMD64)
        OS/Version: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: nick.jones at network-box.com

In a corrupted or improperly generated PDF document, code space range boundary
values can contain too many hex digits and thus describe ranges that need an
excessive number of CMapVectorEntry arrays just to represent them.

Some other PDF reader implementations limit the mapping hierarchy and lookup
functions to two-byte (four hex digit) representations of range boundaries,
and much of the documentation of PDF internals mentions only one- and two-byte
ranges. It therefore seemed sensible to check that each boundary value is at
most four hex digits long.

Hexadecimal numbers from the PDF document are consumed using sscanf and stored
in unsigned ints. Two hex numbers of six digits have the potential to cause
huge allocations, and two hex numbers of eight digits will usually OOM the
process (but see Note 1 below).
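To give a feel for the scale, here is a rough sketch that counts how many 256-entry arrays a per-byte range would need. The function name countTables and the byte-vector representation are made up for illustration and are not part of poppler. It assumes one child array is allocated for every byte value in the range at every level except the last, which is how the recursive addCodeSpace2 below behaves.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts the 256-entry child arrays needed for a code
// space range. lo[k] and hi[k] are the k-th bytes of the two boundaries,
// most significant byte first. The last byte needs no array of its own.
std::uint64_t countTables(const std::vector<unsigned>& lo,
                          const std::vector<unsigned>& hi) {
  std::uint64_t total = 0;
  std::uint64_t parents = 1;  // the root array already exists
  for (std::size_t k = 0; k + 1 < lo.size() && k + 1 < hi.size(); ++k) {
    // Order the two bytes so the span is never negative.
    const unsigned a = lo[k] < hi[k] ? lo[k] : hi[k];
    const unsigned b = lo[k] < hi[k] ? hi[k] : lo[k];
    parents *= (b - a + 1);  // one child array per byte value, per parent
    total += parents;
  }
  return total;
}
```

For example, countTables({0x00, 0x00, 0x00}, {0xFF, 0xFF, 0xFF}) gives 65792 arrays, and a full four-byte range gives 16843008. If each entry occupied 16 bytes (an assumption about the x86-64 layout, not checked against the poppler headers), that would be roughly 257 MiB and 64 GiB respectively.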

The attached patch contains a simple additional validation and common sense
check.
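The patch itself is attached to the bug rather than quoted here, so the following is only a sketch of the kind of check described, not the patch. The helper name codeSpaceTokensOk and the maxHexDigits parameter are hypothetical. It assumes the tokens arrive as strings of the form <hex digits>, as in the example in Note 3.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical validation helper (not the attached patch). Accepts a pair of
// boundary tokens such as "<8130>" "<FE39>" only if both are well formed,
// have the same length, and contain at most maxHexDigits hex digits.
bool codeSpaceTokensOk(const char* tok1, const char* tok2,
                       std::size_t maxHexDigits = 4) {
  const std::size_t n1 = std::strlen(tok1);
  const std::size_t n2 = std::strlen(tok2);

  // Need "<", a whole number of bytes (two hex digits each), and ">".
  if (n1 != n2 || n1 < 4 || (n1 & 1) != 0) return false;
  if (tok1[0] != '<' || tok1[n1 - 1] != '>') return false;
  if (tok2[0] != '<' || tok2[n2 - 1] != '>') return false;

  // Reject boundaries with more digits than we are willing to map.
  if (n1 - 2 > maxHexDigits) return false;

  // Every character between the angle brackets must be a hex digit.
  for (std::size_t i = 1; i + 1 < n1; ++i) {
    if (!std::isxdigit(static_cast<unsigned char>(tok1[i])) ||
        !std::isxdigit(static_cast<unsigned char>(tok2[i]))) {
      return false;
    }
  }
  return true;
}
```

With the default limit of four digits, the pair <81308130> <FE39FE39> is rejected while <8130> <FE39> is accepted. Raising maxHexDigits to 8 would allow four-byte ranges again, at the memory cost described above.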

Note 1:
The PDF spec states that the numbers used to represent code space range
boundaries must be representable by an Integer, as defined in Appendix C of
the same specification.  The range of this Integer type is -2^31 -> 2^31 - 1.
Assuming that negative values are not allowed, valid values for the boundary
should be: <00000000> -> <7fffffff>

This represents a lookup hierarchy of four levels, which is valid according to
the PDF specification but could cause poppler to allocate huge amounts of
memory.

Note 2:
While playing around with code in CMap.cc, I found the addCodeSpace function a
little unclear.  Also, using sscanf to parse hex numbers had the potential to
overflow the Guint type if the hex number had more than eight digits.  I came
up with a revised version of this function in case you are interested,
something like:
----
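void addCodeSpace2(CMapVectorEntry *vec, char* tok1, char* tok2)
  {
  // Recurse only while both tokens still have a byte beyond this one;
  // the last byte of a code needs no sub-table.
  if (strlen(tok1) > 2 && strlen(tok2) > 2)
    {
    unsigned int start = 0, end = 0;

    // Copy the leading byte (two hex digits) of each boundary.
    char startByte[] = {tok1[0], tok1[1], '\0'};
    char endByte[] = {tok2[0], tok2[1], '\0'};

    sscanf(startByte, "%x", &start);
    sscanf(endByte, "%x", &end);

    // Walk the byte range in ascending order even if the bounds are swapped.
    unsigned int lo = (start <= end) ? start : end;
    unsigned int hi = (start > end) ? start : end;

    for (unsigned int i = lo; i <= hi; ++i)
      {
      if (!vec[i].isVector)
        {
        vec[i].isVector = true;
        // "()" value-initializes the new entries (zeroes a plain struct).
        vec[i].vector = new CMapVectorEntry[256]();
        }
      addCodeSpace2(vec[i].vector, tok1 + 2, tok2 + 2);
      }
    }
  }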
----
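    // At the call site: drop the trailing '>' of each token (assumes both
    // tokens have length n1), then skip the leading '<' when calling.
    tok1[n1 - 1] = tok2[n1 - 1] = '\0';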

    addCodeSpace2(cmap->vector, tok1 + 1, tok2 + 1);
----
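The version above still goes through sscanf for each byte. One possible alternative, sketched here under the assumption that a token has already been stripped down to its hex digits, is to decode the two digits by hand so that nothing can overflow and malformed input is reported. The names hexDigitValue and parseHexByte are invented for this example and do not exist in poppler.

```cpp
// Hypothetical helper: value of a single hex digit, or -1 if c is not one.
int hexDigitValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical helper: decodes the first two hex digits of s as one byte.
// Returns 0..255 on success, or -1 if fewer than two characters remain or
// either character is not a hex digit.
int parseHexByte(const char* s) {
  // Test s[0] first so that s[1] is only read when it is inside the string.
  if (s[0] == '\0' || s[1] == '\0') return -1;
  const int hi = hexDigitValue(s[0]);
  const int lo = hexDigitValue(s[1]);
  if (hi < 0 || lo < 0) return -1;
  return hi * 16 + lo;
}
```

A caller could then advance through a token two characters at a time and stop at the first -1, instead of trusting the token length alone.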

Note 3:
The corrupted PDF document had a codespacerange definition that looked like:
<81308130> <FE39FE39>
The repetition of the digits smacks of a bug in the PDF generator software.
The document seemed to be a scan of a printed document with a handwritten
signature, output as PDF. The metadata in the document simply mentioned: Canon
