<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 12pt;
font-family:Calibri
}
--></style></head>
<body class='hmmessage'><div dir='ltr'><br><br><div>> From: aacid@kde.org<br>> To: poppler@lists.freedesktop.org<br>> Date: Mon, 13 Jul 2015 20:11:49 +0200<br>> Subject: Re: [poppler] pdftomarkdown?<br>> <br>> On Monday, 13 July 2015, at 12:47:30, William Bader wrote:<br>> > Hi,<br>> > We switched to a help desk that uses "markdown"<br>> > https://en.wikipedia.org/wiki/Markdown and I would like to convert a number<br>> > of PDF documents to markdown. Is it worth modifying pdftohtml or is there a<br>> > better way?<br>> <br>> Are html and markdown related in any way other than "they are somehow a markup <br>> language"?<br>> <br>> Cheers,<br>> Albert<br>> <br>> > My documents have simple formatting with sequences of text and screen<br>> > captures, and "pdftohtml -s -noframes test.pdf test" does a reasonable job.<br>> > Regards,<br>> > William<br></div><div><br></div><div>I was thinking of using HtmlOutputDev as a more generic output device for markup, with a further layer on top that dumps the markup as HTML or as Markdown. Collecting the text and images from a PDF without scrambling them is a hard problem, and pdftohtml does a relatively good job.</div><div><br></div><div>I looked at a few other options. 
For PDF to Markdown,</div><div><br></div><div><a href="https://github.com/johnlinp/pdf-to-markdown" target="_blank">https://github.com/johnlinp/pdf-to-markdown</a> wrote invalid image files.</div><div><br></div><div>For HTML to Markdown,</div><div><br></div><div><a href="https://domchristie.github.io/to-markdown/" target="_blank">https://domchristie.github.io/to-markdown/</a> and <a href="https://github.com/hgilani/html2markdown" target="_blank" style="font-size: 12pt;">https://github.com/hgilani/html2markdown</a> seem to need a lot of infrastructure and possibly run from inside web pages instead of from a command line.</div><div><br></div><div><a href="http://pandoc.org/" target="_blank">http://pandoc.org/</a><span style="font-size: 12pt;"> (suggested by Victor Ivrii in another message) seems OK.</span></div><div><br></div><div><a href="https://github.com/aaronsw/html2text" target="_blank">https://github.com/aaronsw/html2text</a> seems OK.</div><div><br></div><div>I saw web sites like <a href="http://devotter.com/converter" target="_blank" style="font-size: 12pt;">http://devotter.com/converter</a> but I have concerns about uploading either the PDF or the HTML to a web site, and I am not sure how it would treat embedded images with a URL on our intranet. The documents are manuals with around 100 to 200 pages each, and each page has at least one screen capture.</div><div><br></div><div>For now, I am going to just stick with the PDFs and write a one-line Markdown entry that links to the PDF on our intranet, but thank you for the replies. I think that pdftohtml + pandoc will work.</div><div><br></div><div>Regards,</div><div>William</div><div><br></div> </div></body>
</html>