[poppler] How to make text extracted from tables more readable

jose.aliste at gmail.com jose.aliste at gmail.com
Tue Dec 3 02:43:33 PST 2013

Hi, I don't think what you want to do is easy, or even feasible. Poppler uses
a heuristic, and when using -layout it tries to preserve the physical
layout. While poppler does detect columns, I don't think it detects tables
(all the code for that is in the TextOutputDev.cc file). That being
said, if I run your file through my pdftotext (0.24.3; I don't know if the
version matters), I DO get the arrows, so you could use the arrows to
postprocess your file and add the lines you want.
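
As a rough illustration of that postprocessing idea, here is a minimal Python
sketch. It is only one possible reading of "add the lines you want": it puts an
empty line before every output line that contains an arrow bullet, so each
bulleted entry starts its own block. The function name and the marker
characters are assumptions, since I don't know which glyph your pdftotext
actually emits; adjust MARKERS to whatever shows up in table.txt.

```python
# Hypothetical helper, not part of poppler. MARKERS is a guess at the
# arrow/bullet glyphs that may appear in the pdftotext -layout output.
MARKERS = ("→", "➢", "►")


def separate_bulleted_rows(lines, markers=MARKERS):
    """Return a copy of `lines` with a blank line before each bulleted row.

    A row counts as bulleted if any marker occurs anywhere in it, so a
    bullet in either column of the layout output starts a new block.
    """
    out = []
    for line in lines:
        text = line.rstrip("\n")
        starts_item = any(marker in text for marker in markers)
        # Avoid a leading blank line and avoid doubling existing blank lines.
        if starts_item and out and out[-1].strip() != "":
            out.append("")
        out.append(text)
    return out
```

You would read table.txt as UTF-8, pass its lines to the function, and write
the result back out joined with newlines. Handling the two columns separately
would additionally need a guess at the column offset, which this sketch does
not attempt.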



On Mon, Dec 2, 2013 at 12:21 PM, Nishanth Lawrence <r.nishanth.cse at gmail.com> wrote:

> Hi,
> Sorry, my previous mail was not formatted correctly because of the tables,
> so I have given links to Google Docs.
> I am using pdftotext version 0.24.2. Here is my case:
> https://drive.google.com/file/d/0Bwj-LRZNYWXvTXVZNHNyQnNNd00/edit?usp=sharing
> When I extract the text with the following command:
> pdftotext table.pdf table.txt  -layout -nopgbrk -q
> I get the following output:
> https://docs.google.com/file/d/0Bwj-LRZNYWXvSGdwa2FXemtydDQ/edit
> What I want is this: if there is no bullet on a line, there should be an
> empty line in the opposite column. Could you please tell me what to
> change in the code so that I get output similar to this:
> https://docs.google.com/file/d/0Bwj-LRZNYWXvck9jMmQtWFU1VkU/edit
> Or at least, which part of the code has to be modified to achieve the
> above?
> Thanks in advance
> --
> With Regards
> Nishanth R Lawrence