A Thousand Ways to Deal with PDF

R0
Day 3, 14:35‑15:20
Chinese talk w. English slides
Python Libraries
Experienced

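PDF exists to make cross-platform document presentation and printing convenient: a single, unified PDF file specification makes a document look exactly the same wherever it is viewed. Because of this property and the way the format is designed, we generally do not write programs to extract data from PDFs. Sometimes, though, there is no choice but to pull information out of one, and that requires tools for parsing PDF. This talk introduces poppler, PDFMiner, and other tools, and shows a thousand ways to deal with PDF.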

Note: this talk is not about image recognition. It focuses on content stored directly in the PDF, such as text, images, and vector graphics.

Talk Detail

Tools used:
- poppler: https://github.com/danigm/poppler
  - pdftocairo: converts PDF to SVG
  - pdftohtml: converts PDF to XML (not to HTML)
- svgpathtools: https://github.com/mathandy/svgpathtools
- PDFMiner: https://github.com/euske/pdfminer
  - with six: https://github.com/pdfminer/pdfminer.six

If time permits, I will briefly cover a worked example: parsing the content, images, and tables out of a PDF with many pages and generating structured HTML from them.
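As a rough illustration of how the two poppler utilities listed above might be driven from Python, the sketch below builds the command lines for `pdftocairo -svg` and `pdftohtml -xml`, and then reads positioned text boxes out of pdftohtml-style XML using only the standard library. This is not the speaker's code. The helper names (`pdftocairo_svg_cmd`, `pdftohtml_xml_cmd`, `text_boxes`) and the `SAMPLE_XML` string are hypothetical: the sample is hand-written to resemble pdftohtml output rather than produced from a real PDF, and the exact attributes may differ between poppler versions.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdftocairo_svg_cmd(pdf_path, svg_path, page=1):
    """Build a command that renders a single page of a PDF to SVG."""
    # -f / -l select the first and last page to convert.
    return ["pdftocairo", "-svg", "-f", str(page), "-l", str(page),
            pdf_path, svg_path]


def pdftohtml_xml_cmd(pdf_path, out_stem):
    """Build a command that dumps a PDF to XML (expected at out_stem + '.xml')."""
    return ["pdftohtml", "-xml", pdf_path, out_stem]


def run(cmd):
    """Run one of the commands above; requires the poppler utilities on PATH."""
    subprocess.run(cmd, check=True)


def text_boxes(xml_string):
    """Collect text elements with their page number and position."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also picks up text nested inside <b> or <i>.
                "text": "".join(node.itertext()),
            })
    return boxes


# Hand-written sample in the style of pdftohtml -xml output (an assumption,
# not real tool output).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="200" height="20" font="0"><b>Title</b></text>
<text top="140" left="72" width="300" height="14" font="1">First paragraph line.</text>
</page>
</pdf2xml>"""

boxes = text_boxes(SAMPLE_XML)
```

With poppler installed, `run(pdftohtml_xml_cmd("input.pdf", "out"))` would be one way to produce the XML for a real file; sorting the resulting boxes by page, `top`, and `left` is a plausible first step toward rebuilding reading order before emitting structured HTML.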

Slides Link

https://www.slideshare.net/AdrianLiaw/pdf-thousand-ways-to-deal-with-pdf

Speaker Information

廖偉涵 Adrian Liaw

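A third-year high school student who has been homeschooled for all three years of high school and still managed to delay graduation. He is proving by example that even an undereducated brat can learn to write Python, and he is often on the receiving end of "kids these days" remarks.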

Adrian is a guy who doesn't really know what he's doing and seriously needs some life advice. He's currently a high school student, but not actually in a physical high school, because no school would accept him. Oh, by the way, he does Python.