[poppler] pdftomarkdown?

William Bader williambader at hotmail.com
Mon Jul 13 12:49:57 PDT 2015



> From: aacid at kde.org
> To: poppler at lists.freedesktop.org
> Date: Mon, 13 Jul 2015 20:11:49 +0200
> Subject: Re: [poppler] pdftomarkdown?
> 
> On Monday, 13 July 2015, at 12:47:30, William Bader wrote:
> > Hi,
> > We switched to a help desk that uses "markdown"
> > https://en.wikipedia.org/wiki/Markdown and I would like to convert a number
> > of PDF documents to markdown. Is it worth modifying pdftohtml or is there a
> > better way?
> 
> Are html and markdown related in any way other than "they are somehow a markup 
> language"?
> 
> Cheers,
>   Albert
> 
> > My documents have simple formatting with sequences of text and screen
> > captures, and "pdftohtml -s -noframes test.pdf test" does a reasonable job.
> > Regards,
> > William

I was thinking of using HtmlOutputDev as a more generic output device for markup, with a further layer that dumps the markup as either HTML or Markdown.  Collecting the text and images from a PDF without scrambling them is a hard problem, and pdftohtml does a relatively good job.
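
To make the "further layer" idea concrete, here is a rough sketch in Python rather than in poppler's C++.  It is not how HtmlOutputDev works today; the Text and Image classes and the to_html/to_markdown functions are hypothetical names invented for illustration.  It only shows one list of collected text and images being dumped in either format.

```python
import html
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    """Hypothetical block: one paragraph of text collected from a page."""
    content: str


@dataclass
class Image:
    """Hypothetical block: one extracted image (e.g. a screen capture)."""
    path: str
    alt: str = ""


Block = Union[Text, Image]


def to_html(blocks: List[Block]) -> str:
    """Dump the collected blocks as simple HTML."""
    parts = []
    for block in blocks:
        if isinstance(block, Text):
            parts.append(f"<p>{html.escape(block.content)}</p>")
        else:
            src = html.escape(block.path, quote=True)
            alt = html.escape(block.alt, quote=True)
            parts.append(f'<img src="{src}" alt="{alt}">')
    return "\n".join(parts)


def to_markdown(blocks: List[Block]) -> str:
    """Dump the same blocks as Markdown (no escaping of Markdown syntax)."""
    parts = []
    for block in blocks:
        if isinstance(block, Text):
            parts.append(block.content)
        else:
            parts.append(f"![{block.alt}]({block.path})")
    return "\n\n".join(parts)
```

The hard part (getting the text and images out in the right order) stays in the output device; only the last, easy step differs between the two formats.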
I looked at a few other options.
For PDF to Markdown,
https://github.com/johnlinp/pdf-to-markdown wrote invalid image files.
For HTML to Markdown,
https://domchristie.github.io/to-markdown/ and https://github.com/hgilani/html2markdown seem to need a lot of infrastructure and may run only inside web pages rather than from a command line.
http://pandoc.org/ (suggested by Victor Ivrii in another message) seems OK.
https://github.com/aaronsw/html2text seems OK.
I also saw web sites like http://devotter.com/converter, but I have concerns about uploading either the PDF or the HTML to a web site, and I am not sure how such a site would treat embedded images whose URLs point to our intranet.  The documents are manuals of around 100 to 200 pages each, and each page has at least one screen capture.
For now, I am going to stick with the PDFs and write a one-line Markdown entry that links to each PDF on our intranet, but thank you for the replies.  I think that pdftohtml + pandoc will work.
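
A minimal sketch of the pdftohtml + pandoc route, written as a small Python wrapper.  The pdftohtml options are the ones from my earlier message; the pandoc options (-f html -t markdown -o) are standard.  The function names are made up, and I am assuming that pdftohtml with -s -noframes writes its output to <stem>.html, so please check that on your version.

```python
import subprocess
from typing import List


def build_commands(pdf: str, stem: str) -> List[List[str]]:
    """Return the two command lines: PDF -> HTML, then HTML -> Markdown."""
    html_file = f"{stem}.html"  # assumption: where pdftohtml puts its output
    md_file = f"{stem}.md"
    pdftohtml_cmd = ["pdftohtml", "-s", "-noframes", pdf, stem]
    pandoc_cmd = ["pandoc", "-f", "html", "-t", "markdown",
                  "-o", md_file, html_file]
    return [pdftohtml_cmd, pandoc_cmd]


def convert(pdf: str, stem: str) -> None:
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in build_commands(pdf, stem):
        subprocess.run(cmd, check=True)
```

Calling convert("test.pdf", "test") should leave test.md next to the extracted images, provided both tools are installed.  Whether the image links in the Markdown come out usable for the help desk is something I have not checked.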
Regards,
William