How to convert HTML to PDF using iTextSharp

一尘不染

I want to convert the following HTML to PDF using iTextSharp, but I don't know where to start:

<style>
.headline{font-size:200%}
</style>
<p>
  This <em>is </em>
  <span class="headline" style="text-decoration: underline;">some</span>
  <strong>sample<em> text</em></strong>
  <span style="color: red;">!!!</span>
</p>

2020-05-19

1 answer

一尘不染

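First, although HTML and PDF were created at around the same time, they are not related. HTML is intended to convey higher-level information such as paragraphs and tables. There are ways to control the result, but ultimately it is up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and those documents must "look" the same wherever they are rendered.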

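In an HTML document, a paragraph might be 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when printed it might take 7 lines, and on a phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of screen size it must always render exactly the same.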
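To make the reflow point concrete, here is a minimal sketch of what a browser-like renderer does: it wraps the same paragraph to whatever width is available, so the line count changes with the device. The class and method names (`ReflowDemo`, `Wrap`) are invented for this illustration, and a fixed character budget per line is an assumption standing in for real font metrics.

```csharp
using System;
using System.Collections.Generic;

static class ReflowDemo
{
    // Greedy word wrap: packs words into lines of at most maxChars characters.
    // A character budget is a simplification; real renderers measure glyph widths.
    public static List<string> Wrap(string text, int maxChars)
    {
        var lines = new List<string>();
        var current = "";
        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (current.Length == 0)
            {
                // First word on a line is always accepted, even if it is too long
                current = word;
            }
            else if (current.Length + 1 + word.Length <= maxChars)
            {
                current += " " + word;
            }
            else
            {
                // The word does not fit: finish this line and start a new one
                lines.Add(current);
                current = word;
            }
        }
        if (current.Length > 0) lines.Add(current);
        return lines;
    }
}
```

Calling `ReflowDemo.Wrap(paragraph, 60)` and `ReflowDemo.Wrap(paragraph, 15)` on the same string yields different line counts, which is exactly the device dependence that a PDF is not allowed to have.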

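Because of the "musts" above, PDF doesn't support abstract things like "tables" or "paragraphs". PDF supports three basic things: text, lines/shapes, and images. (There are other things, such as annotations and movies, but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser, do your thing!". Instead you say "draw this text at this exact X,Y location using this exact font, and don't worry, I've previously calculated the width of the text, so I know it will all fit on this line". You also don't say "here's a table"; instead you say "draw this text at this exact location, and then draw a rectangle at this other exact location".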
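The "exact X,Y" style of drawing can be sketched with iTextSharp's low-level `PdfContentByte` API. This is a minimal illustration, assuming iTextSharp 5.x is referenced; the coordinates, font size, padding, and the class name `AbsoluteDrawingSketch` are arbitrary choices made for the example, not values from the original answer.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

static class AbsoluteDrawingSketch
{
    public static byte[] Build()
    {
        using (var ms = new MemoryStream())
        {
            using (var doc = new Document())
            using (var writer = PdfWriter.GetInstance(doc, ms))
            {
                doc.Open();

                // Arbitrary sample values, in points, measured from the bottom-left corner
                var text = "Hello from an exact position";
                float size = 12f, x = 72f, y = 700f, padding = 4f;

                // Measure the text up front so we know how wide the box must be
                var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
                float width = font.GetWidthPoint(text, size);

                // "Draw this text at this exact X,Y location using this exact font"
                PdfContentByte cb = writer.DirectContent;
                cb.BeginText();
                cb.SetFontAndSize(font, size);
                cb.SetTextMatrix(x, y);
                cb.ShowText(text);
                cb.EndText();

                // "Then draw a rectangle at this other exact location"
                cb.Rectangle(x - padding, y - padding, width + 2 * padding, size + 2 * padding);
                cb.Stroke();

                doc.Close();
            }
            // The PDF objects are closed; copy the finished bytes out of the stream
            return ms.ToArray();
        }
    }
}
```

Nothing in this sketch knows what a "paragraph" or a "table" is: the text and the box are related only because their coordinates were computed to line up.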

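Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc. are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML out of your framework of choice; iText won't help you. If you get an exception saying The document has no pages, or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.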
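One way to catch the "you only think you have HTML" case early is to inspect the string before it reaches any parser. The helper below is hypothetical (`HtmlGuard.RequireHtml` is not part of iTextSharp), and the tag check is a crude heuristic assumed for illustration; it only confirms that the framework handed over something that looks like markup.

```csharp
using System;

static class HtmlGuard
{
    // Hypothetical helper, not an iTextSharp API.
    // Throws a descriptive error instead of letting an empty or non-HTML
    // string turn into a confusing failure later in the PDF pipeline.
    public static string RequireHtml(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
        {
            throw new ArgumentException("No HTML was produced; there is nothing to convert.", "html");
        }

        // Heuristic only: real HTML contains at least one tag
        if (html.IndexOf('<') < 0 || html.IndexOf('>') < 0)
        {
            throw new ArgumentException("The input does not look like HTML: " + Preview(html), "html");
        }

        return html;
    }

    // Shortens long input so the error message stays readable
    private static string Preview(string s)
    {
        return s.Length <= 40 ? s : s.Substring(0, 40) + "...";
    }
}
```

A typical use would be to wrap whatever the framework rendered, for example `new StringReader(HtmlGuard.RequireHtml(renderedHtml))`, where `renderedHtml` is an assumed variable holding the framework's output.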

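Third, the built-in class that has been around for years, HTMLWorker, has been replaced by XMLWorker (Java / .Net). Zero work is being done on HTMLWorker: it doesn't support CSS files, it supports only the most basic CSS properties, and it actually breaks on certain tags. If you don't see an HTML attribute or a CSS property and value in this file, then it probably isn't supported by HTMLWorker. XMLWorker can sometimes be more complicated, but those complications also make it more extensible.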

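Below is C# code that shows how to parse HTML tags into iText abstractions, which are automatically added to the document you are working on. C# and Java are very similar, so converting it should be relatively easy. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, class="headline" gets ignored, but everything else should work. Example #2 is the same as the first except that it uses XMLWorker. Example #3 also parses the simple CSS example.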

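//Namespaces used by the code below
//(XMLWorker, iTextSharp.tool.xml, ships in the separate itextsharp.xmlworker assembly)
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;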

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017 update

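There is good news for HTML-to-PDF needs. As this answer shows, the W3C standard css-break-3 will solve the problem. It is a Candidate Recommendation that is planned to become a definitive Recommendation this year, after testing.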

As a non-standard solution, there are plugins for C#, as shown by print-css.rocks.

2020-05-19