[poppler] How to make text extracted from tables more readable

jose.aliste at gmail.com jose.aliste at gmail.com
Tue Dec 3 02:43:33 PST 2013


Hi, I don't think what you want to do is easy, or perhaps even feasible.
Poppler uses a heuristic, and in layout mode it tries to preserve the
physical layout. While poppler does detect columns, I don't think it
detects tables (all the code for that is in the TextOutputDev.cc file).
That said, if I run your file through my pdftotext (0.24.3, I don't know
if that matters), I do get the arrows, so you could use the arrows to
postprocess your file and add the lines you want.
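
A rough sketch of that kind of postprocessing, in Python, might look like
the following. It is not something poppler provides, and it rests on
several assumptions that need checking against the real output: that the
arrow glyph is U+27A2 (ARROW), that the right column starts at a fixed
character offset (SPLIT_COL), that the page has exactly two columns, and
that the n-th bulleted item on the left belongs next to the n-th bulleted
item on the right. All names below are made up for illustration.

```python
from itertools import zip_longest

ARROW = "\u27a2"  # assumed bullet glyph; use whatever appears in your output
SPLIT_COL = 40    # assumed character offset where the right column starts


def split_columns(line, split_col=SPLIT_COL):
    """Cut one layout line into (left, right) cell text at a fixed offset."""
    return line[:split_col].rstrip(), line[split_col:].strip()


def group_items(cells, marker=ARROW):
    """Group one column's cells into items; a marker starts a new item."""
    items = []
    for cell in cells:
        if not cell:
            continue  # empty cell: nothing in this column on this line
        if cell.lstrip().startswith(marker) or not items:
            items.append([cell])    # bulleted line (or first text seen)
        else:
            items[-1].append(cell)  # continuation of the current item
    return items


def realign(text, split_col=SPLIT_COL, marker=ARROW):
    """Pair left and right items and pad the shorter one with empty cells."""
    left_cells, right_cells = [], []
    for line in text.splitlines():
        left, right = split_columns(line, split_col)
        left_cells.append(left)
        right_cells.append(right)

    out = []
    paired = zip_longest(group_items(left_cells, marker),
                         group_items(right_cells, marker), fillvalue=[])
    for left_item, right_item in paired:
        # Emit the two items side by side; the shorter one gets blank cells.
        for left, right in zip_longest(left_item, right_item, fillvalue=""):
            out.append(f"{left:<{split_col}}{right}".rstrip())
    return "\n".join(out)
```

You would feed it the text written by pdftotext -layout, for example
realign(open("table.txt", encoding="utf-8").read()). A fixed split offset
is the weakest assumption here; if the column position varies from page to
page, it would have to be detected per page instead.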

Greetings

José



On Mon, Dec 2, 2013 at 12:21 PM, Nishanth Lawrence
<r.nishanth.cse at gmail.com> wrote:

> Hi,
> Sorry, my previous mail was not formatted correctly because of the
> tables, so I have given links to Google Docs instead.
>
> I am using pdftotext version 0.24.2. Here is my case:
>
>
> https://drive.google.com/file/d/0Bwj-LRZNYWXvTXVZNHNyQnNNd00/edit?usp=sharing
>
> When I extract the text with the following command
>
> pdftotext table.pdf table.txt  -layout -nopgbrk -q
>
> I get the following output:
>
> https://docs.google.com/file/d/0Bwj-LRZNYWXvSGdwa2FXemtydDQ/edit
>
> What I want is this: if a line has no bullet, there should be an empty
> line in the opposite column. Could you please tell me what to change in
> the code so that I get output similar to this:
>
> https://docs.google.com/file/d/0Bwj-LRZNYWXvck9jMmQtWFU1VkU/edit
>
> Or at least tell me which part of the code has to be modified to achieve
> the above.
>
> Thanks in advance
>
>
>
> --
> With Regards
> Nishanth R Lawrence