Local PDF Tools – Powered by WebAssembly (localpdf.tech)
227 points by twapi on March 3, 2021 | 59 comments



Last year I wrote a couple of similar "local" PDF tools that run in the browser with no network requests. Each is just a single HTML file that will work offline:

- https://shreevatsa.net/pdf-pages/ is for extracting pages, inserting blank pages, duplicating or reversing pages, etc.

- https://shreevatsa.net/pdf-unspread/ is for splitting a PDF's "wide" pages (consisting of two-page spreads) in the middle.

- https://shreevatsa.net/mobius-print/ is the earliest of these, and written for a niche use-case: "Möbius printing" of pages, which is printing out an article/paper two-sided in a really interesting order. (I've tried it and love it.)

These don't use WebAssembly, but just use the excellent "pdf-lib" JS library. To keep the file self-contained, I put the whole minified source into a <script> tag at the bottom of the (otherwise hand-written) HTML file.


I hope to never again print more than 2 pages in a go in my life, but if I do, I'm definitely going to use your Möbius printing tool, it's genius.


Very cool!! Thanks for sharing :)


Very useful pdf tools


Hey, thanks for posting it here! I built this tool; hope you like it. Feel free to look at my source and contribute: https://github.com/jufabeck2202/localpdfmerger


Is there a .pdf tool which allows compression to a defined file size? Tools like ghostscript can compress a .pdf to different levels of quality using different settings, but not to a defined file size. I understand that this has to do with the compression algorithm itself, and that data can only be compressed to a certain limit; but what if the target file size is within that limit?

I'm asking this because a user of my problem validation platform wanted a solution for this[1]: websites requiring document upload have a file size limit, and often the compressed file ends up either above the limit or well below it, losing out on quality unnecessarily.

[1]'Reduce document file size to specific size' (I have added the link to it on my profile, since it's my own platform).


It's a bit more complicated than it sounds. Text streams are generally already compressed as well as they can be using whatever scheme is available; the bulk of the space usage is often fonts and images. A PDF itself is not compressed so much as each part of it is compressed individually.

There's not much you can do with fonts except not embed them unless you need to, and avoid duplicate/overlapping subsets; if you do have these, they are very tricky to untangle. I'm not aware of any good tool to do it automatically.

For images, it depends on the format. If your PDF has JPEG (DCTDecode) images, any resampling has to follow the JPEG spec; likewise for TIFF. You can change the number of colour bits, downsample the DPI (a simple gs command-line option), or change the compression scheme within the JPEG itself and then replace it. There are so many avenues of approach that I'm not sure it's something easily achieved while still obtaining a good result across all possible PDFs.

Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.


I appreciate your detailed comment.

>Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.

I presumed any solution for this problem would involve multiple passes through the compression routine to hit the target size. Having to deal with text, fonts, and images separately, as you said, does make that complex.

Right now, I'm just waiting to see if there's really a need gap for this. It's usually the Govt. websites which have very low limits for document upload size, like 100KB. I personally take the image out of the .pdf and compress it to minimum JPEG quality, but documents with multiple pages make this tricky.


100KB sounds very limiting, you'll easily go above that with just fonts in some poorly constructed PDFs.

The first rule of PDFs is to always reproduce from the source if you can. They can be modified/edited, but are better considered an append-only format because of how they are made. There are so many choices in their construction that are hard to undo later, such as per-character placement instead of per-word or per-line with spaces, each consuming more stream data because of the extra overhead in offsets etc. (which compresses well as a text stream but still adds up).
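To make the placement point concrete, here is a hand-written fragment of a PDF content stream (not the output of any particular tool): per-word placement is a single text-showing operator, while per-character placement spells out an offset for every glyph:

```
BT /F1 12 Tf 72 720 Td
% per-word: one string, spacing left to the font
(Hello) Tj
% per-character: the TJ operator interleaves positioning
% adjustments, costing extra bytes for every glyph
[ (H) -10 (e) -8 (l) -4 (l) -4 (o) ] TJ
ET
```

Both draw roughly the same text, but the second form grows with every character and is much harder to collapse back into the first.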

Taking out or resampling images like you said is probably the best starting point unless you've found there is a lot of overhead (metadata/unused objects) to trim.


I'd certainly be curious why wasm ends up being 15x slower than native binary in this case, but it's not insurmountable. All of the major commercial PDF editing suites use wasm + their own C++ based pdf engine to great effect.

The article this is based on is here, and it's a good read. It seems like it's at least non-trivial to get working, and I wonder how the process looks for other compiled binaries, not having tried that implementation from scratch: https://dev.to/wcchoi/browser-side-pdf-processing-with-go-an...


Looks like this thread is all about sharing our own related local browser-based PDF tools. Here’s mine: https://pdftotext.github.io


There are a few versions of tools like this, or similar, available. Here's mine:

https://kc0bfv.github.io/WASM-PDF-Combiner/

I used existing wasm compiles of PDF tools. This use of wasm is pretty awesome to me - I often end up working on very restricted desktop clients with little customization possible, but they always let me run a browser.


Seems to be stuck at “Loading” for me on iOS safari.


That website loads WASM by embedding base64 in the HTML, which is good for saving it as a single file but horrible for WebKit support ("The operation is insecure", it complains), transfer size, and speed.


Thanks for that - I haven't tried it there...

Yup - single file operation was a design goal, but I bet there's a better way to get it working than the hack I used.


Looks cool


Convenient if you are on a machine where you can’t install software. (Corporate computer, school computer, library computer etc.)

For Linux and macOS computers that you are allowed to install software on I recommend the pdftk command line tool.

Ubuntu family:

  sudo apt install pdftk
macOS with Homebrew:

  brew install pdftk-java


> For Linux and macOS computers that you are allowed to install software on I recommend the pdftk command line tool.

If you're on a Mac, the built in Preview tool has had the ability to merge and manipulate PDF documents for years.


Yea, true. But pdftk is really handy if you are assembling very many pages. And it can also help you do stuff involving all odd-numbered pages or all even-numbered pages, for example. So pdftk goes quite a bit beyond what you can do with Preview.


Can one easily install such apps as a Chrome app/PWA and deactivate their internet access, since they don't need it and one may be merging personal PDFs?


I created a PDF table extractor tool last year with the same idea that it should be local only. Try it here: https://pdftableutil.possiblenull.com/app/ Also as a Google Docs addon (still local only) https://workspace.google.com/marketplace/app/pdf_table_impor...

I had a bad case of scope creep, so the tool can also extract tables from scanned/image PDFs using OpenCV.js and tesseract OCR wasm build!


Wow, that looks awesome! What did you use to display the PDF in the browser? It all feels really responsive!


I used Mozilla's PDF.js: https://mozilla.github.io/pdf.js/ It is what Firefox uses on desktop to show PDFs!


Thanks, really great work!


This is interesting. How accurate would you say it is?


I haven't seen anything better. It started as a PoC, and I decided not to include table detection on the page, instead requiring the user to draw a box around the table.

I use Tabula under the hood for the cell/row detection and it is really good given the correct mode is selected for the type of table. The modes are stream (find cells by spacing) or lattice (find cells by ruling lines).

The OCR/OpenCV seemed to be fine as well as long as the text isn't too blurry. Here is a GIF of the OCR/OpenCV running on an example Image PDF: https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...


I’d love a tool (that’s not Acrobat) to manage comments on PDFs.


Okular has support for comments, although I don't know if they're compatible with Acrobat's.


This is helpful. Generally if I need to do any pdf manipulation when I'm away from my own machine I use an android app - PDF Utils [1].

[1] https://play.google.com/store/apps/details?id=pdf.shash.com....


Does anyone know some good tutorials/explanations for understanding the PDF format at the byte level?


The official PDF Reference is very readable; see maest's link. Just start with the section on file structure and data types, then the basics of the commands in the Contents stream, the graphics and text states, then whatever takes your fancy. Later versions added some extra complications, such as compression of the xref table and object streams; you don't need them unless you encounter them. Don't delve too deep into the section about fonts unless you have to; it might bend your brain. A tool like qpdf to deconstruct a PDF to an uncompressed form is very handy.


> A tool like qpdf to deconstruct a PDF to an uncompressed form is very handy.

Forgot to say this in my reply: likewise, pdftk's uncompress option is my #1 stop for learning how a PDF is built when we have issues. You can poke around and hex-edit for learning too.

In an uncompressed PDF it is easy to comment out a line with % to remove objects that I think are causing issues, without messing up the xref offsets. Using a fully-fledged tool to edit may have side effects in how things are rewritten, which ruins the point of the investigation.

Acrobat Pro has some very useful features too, but not everyone has access to that.


The spec is really well written and will take you far if you just want the basics (i.e. not the scripting stuff they added after 1.4, which is around where you should usually stop if you just care about printing). My only issue is that when it comes to fonts or images, you'll have to break out to additional specs to understand those formats, since PDF is more of a container.

I do like this minimal example as a way to get started and see how a very basic PDF is built.

https://brendanzagaeski.appspot.com/0004.html
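In the same byte-level spirit, here is a hand-rolled sketch (pure Python, no PDF library; it builds only the skeleton, with no Contents stream or fonts) that writes a one-page PDF and computes the xref offsets the file-structure section describes:

```python
def minimal_pdf() -> bytes:
    """Build a minimal one-page PDF: header, three objects,
    xref table with computed byte offsets, trailer."""
    objects = [
        b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n",
        b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n",
        b"3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >> endobj\n",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []  # byte offset of each object, needed for the xref table
    for obj in objects:
        offsets.append(len(out))
        out += obj
    xref_pos = len(out)
    # xref: one 20-byte entry per object, offsets zero-padded to 10 digits.
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # the mandatory free-list head, object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer << /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

pdf = minimal_pdf()
print(pdf[:8], b"xref" in pdf)
```

Opening the result in a text editor shows every piece the Reference's file-structure chapter talks about, in order.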



Does anybody know a simple tool I can use to turn an academic two-column paper into a single-column PDF (so I can read it easily on e-paper like a reMarkable)?

(Ideally I'd like to be able to run such a tool from browser/phone)


Maybe try k2pdfopt (https://www.willus.com/k2pdfopt/). It is a Windows/Linux application, though.


Looking forward to using this tool! Are there plans to make this open source?


It is based on an existing open source project:

https://github.com/pdfcpu/pdfcpu


Here’s the open source repo: https://github.com/jufabeck2202/localpdfmerger


It's great to see more PDF tools.

Many times I just want to clip white margins from PDFs so that they are easier to view on tablets or phones. Most viewers don't have a way to force the clipping of pages, so when you change pages the zoom is lost and suddenly all the content is squished to the center.

Last time I looked for CLI programs to do this, approx. five years ago, it was really difficult to find good tools to edit PDFs like that.

It's actually not a trivial task, as sometimes pages have different margins; e.g. odd and even pages have different margins on the binding side of the page.


For something like this, how do we know that the files are not sent to a server? Am I just trusting the web app? Is there any way to be sure other than having and reading the source?


You can load the website and then disable the network, either by turning off the connection in your OS or via File -> Work Offline in Firefox.

I just did this and it worked.


Good tip. Though it would be nice to be able to disable network activity per tab.


Open dev tool and monitor network?


For someone who knows that these tools exist, yes, this is a way.

For an ordinary user, the only 'easy' way to verify the claimed behaviour is to literally go offline.

Browsers do not currently have a badge to verify that an app is not sending any data. I'm thinking of how we were brought to trust the padlock icon browsers display for TLS-supporting sites.


Something like that would be great; seeing that you can do more and more with wasm locally, it would be really useful.


Disconnect from the network and try it? Or always disconnect from the network when using this?


Something I still miss is a free and easy PDF tool which lets you delete, reorder, and add pages from multiple PDFs. On Windows there is just Xodo, but its UX is unfortunately subpar; on macOS you have Preview, where the UI is better, but once you are pulling pages from multiple PDFs it can get confusing.


On Linux there is https://github.com/pdfarranger/pdfarranger It has a nice UI and is pretty easy to use.


PdfTk does this. I use the CLI version, but I think there is one with a GUI as well.


pdftk is by far my favourite for this too. It's quite fast but it does run into some file size problems when merging files as it doesn't deduplicate resources, and can crash outright on some bigger files.


On Windows there is: https://www.pdf24.org/


I have this same idea on my to-do list. Great that people are experimenting with webapps that don't send any data!


JavaScript APIs for browsers can do a lot now. It's great! I built something similar recently with Mozilla's PDF library. It's for diffing PDFs, but everything happens locally. https://parepdf.com


Sweet. When I discovered Mozilla's PDF.js, I thought client-side manipulation of PDFs would be a breeze.

I built a tool that required counting the number of pages of a PDF (ca. 2014-2015). At the time, server-side counting was the 'sure' way, per my brief research.


Hmm, I think I would just compile the pdfcpu Go source to native code; that might be faster than WebAssembly?


Definitely


This is a nice use case for pdfcpu. If you are a pdftk user, give the pdfcpu CLI a spin. It is multi-platform and has some nice features baked in. https://pdfcpu.io/


"There was an error merging PDFs." Not very helpful; can you tell me what the error was or how to avoid it?



