Last year I wrote a couple of similar "local" PDF tools that run in the browser with no network requests. Each is just a single HTML file that will work offline:
- https://shreevatsa.net/pdf-pages/ is for extracting pages, inserting blank pages, duplicating or reversing pages, etc.
- https://shreevatsa.net/pdf-unspread/ is for splitting a PDF's "wide" pages (consisting of two-page spreads) in the middle.
- https://shreevatsa.net/mobius-print/ is the earliest of these, and written for a niche use case: "Möbius printing" of pages, which is printing out an article/paper two-sided in a really interesting order. (I've tried it and love it.)
These don't use WebAssembly, but just use the excellent "pdf-lib" JS library. To keep the file self-contained, I put the whole minified source into a <script> tag at the bottom of the (otherwise hand-written) HTML file.
Is there a PDF tool that allows compression to a defined file size? Tools like Ghostscript can compress a PDF to different levels of quality using different settings, but not to a defined file size. I understand that this has to do with the compression algorithm itself, and that data can only be compressed to a certain limit; but what if the target file size is within that limit?
I'm asking this because a user of my problem validation platform wanted a solution for this[1]: websites requiring document uploads have a file size limit, and the compressed file often ends up either above or below the prescribed limit, thereby losing out on quality unnecessarily.
[1]'Reduce document file size to specific size' (I have added the link to it on my profile, since it's my own platform).
It's a bit more complicated than it sounds. Text streams are generally just compressed as well as they can be using whatever scheme is available; the bulk of the space usage is often fonts and images. A PDF itself is not compressed so much as each part of it is compressed individually.
There's not much to do about fonts except don't embed them unless you need to, and don't have duplicate/overlapping subsets; if you do have these, they are very tricky to untangle, and I'm not aware of any good tool to do it automatically.
For images, it depends on the format. If your PDF has JPEG (DCTDecode) images, then any resampling has to stay within the JPEG spec; likewise for TIFF. You can reduce the number of colour bits, downsample the DPI (a simple gs command-line option), or change the compression settings within the JPEG itself and then replace it. There are so many avenues to approach this that I'm not sure it's something easily achieved with a good result across all possible PDFs.
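For the DPI route, the gs invocation looks something like this (from memory, so double-check the flags against the pdfwrite documentation; 150 is just a placeholder resolution):

    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
       -dDownsampleColorImages=true -dColorImageDownsampleType=/Bicubic \
       -dColorImageResolution=150 \
       -sOutputFile=out.pdf in.pdf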
Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.
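A crude sketch of that iteration, reusing the gs command above with a hypothetical 100KB target:

    target=100000  # bytes; hypothetical upload limit
    for dpi in 300 200 150 100 72; do
        gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
           -dDownsampleColorImages=true -dColorImageResolution=$dpi \
           -sOutputFile=out.pdf in.pdf
        # stop at the first resolution that fits under the target
        [ "$(wc -c < out.pdf)" -le "$target" ] && break
    done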
>Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.
I presumed any solution to this problem would involve multiple passes through the compression routine to hit the target size. Having to deal with text, fonts, and images separately, as you said, does make that complex.
Right now, I'm just waiting to see if there's really a need gap for this. It's usually the Govt. websites which have very low limits for document upload size, like 100KB. I personally take the image out of the PDF and compress it to minimum JPEG quality, but documents with multiple pages make it tricky.
100KB sounds very limiting; you'll easily go above that with just the fonts in some poorly constructed PDFs.
The first rule of PDFs is to always reproduce from the source if you can. They can be modified/edited, but are better considered an append-only format because of how they are made. There are so many choices that can be made in their construction that are hard to undo later, such as per-character placement instead of per-word or per-line with spaces, each consuming more stream data because of the extra overhead in offsets etc. (which compresses well as a text stream but still adds up).
Taking out or resampling images like you said is probably the best starting point, unless you've found there is a lot of overhead (metadata/unused objects) to trim.
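If it is overhead, simply letting qpdf rewrite the file drops unreferenced objects by default, and it can repack objects into object streams too (flags from memory; check the manual):

    qpdf --compress-streams=y --object-streams=generate in.pdf out.pdf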
I'd certainly be curious why wasm ends up being 15x slower than native binary in this case, but it's not insurmountable. All of the major commercial PDF editing suites use wasm + their own C++ based pdf engine to great effect.
The article this is based on is here, and it's a good read. It seems at least non-trivial to get working, and I wonder how the process looks for other compiled binaries, having not tried to do that implementation from scratch.
https://dev.to/wcchoi/browser-side-pdf-processing-with-go-an...
I used existing wasm compiles of PDF tools. This use of wasm is pretty awesome to me - I often end up working on very restricted desktop clients with little customization possible, but they always let me run a browser.
That website loads WASM by embedding base64 in the HTML, which is good for saving it as a single file but horrible for WebKit support ("The operation is insecure", it complains), transfer size, and speed.
Yea, true. But pdftk is really handy if you are assembling very many pages. It can also help you do stuff involving all odd-numbered or all even-numbered pages, for example. So pdftk goes quite a bit beyond what you can do with Preview.
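For example (from memory; the odd/even qualifiers and shuffle need a reasonably recent pdftk):

    pdftk A=in.pdf cat A1-endodd output odd.pdf
    pdftk A=in.pdf cat A1-endeven output even.pdf
    # interleave the two, e.g. to re-collate separately scanned sides
    pdftk O=odd.pdf E=even.pdf shuffle O E output collated.pdf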
I haven't seen anything better. It started as a PoC, and I decided not to include table detection on the page, instead requiring the user to draw a box around the table.
I use Tabula under the hood for the cell/row detection and it is really good given the correct mode is selected for the type of table. The modes are stream (find cells by spacing) or lattice (find cells by ruling lines).
The official PDF Reference is very readable; see maest's link. Just start at the bit about the file structure and data types, then the basics of the commands in the Contents stream, the graphics and text states, then whatever takes your fancy. Later versions added some extra complications, such as compression of the xref table and object streams; you don't need them unless you encounter them. Don't delve too deep into the bit about fonts unless you have to; it might bend your brain. A tool like qpdf to deconstruct a PDF to an uncompressed form is very handy.
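Concretely, something like this gives you a human-readable file, and fix-qdf repairs the offsets if you hand-edit it (invocations from memory):

    qpdf --qdf --object-streams=disable in.pdf readable.pdf
    # ...edit readable.pdf by hand...
    fix-qdf readable.pdf > repaired.pdf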
> A tool like qpdf to deconstruct a PDF to an uncompressed form is very handy.
Forgot to say this in my reply: likewise, pdftk's uncompress option is my #1 stop for learning how a PDF is built when we have issues. You can poke around and hex edit for learning too.
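That's just (from memory):

    pdftk in.pdf output out.pdf uncompress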
In an uncompressed PDF, it is easy to comment out a line with % to remove objects that I think are causing issues, without messing up the xref offsets. Using a fully-fledged tool to edit may have side effects in how things are rewritten, which ruins the point of the investigation.
Acrobat Pro has some very useful features too, but not everyone has access to that.
The spec is really well written, and will take you far if you just want the basics (i.e. not the scripting stuff they added after 1.4, which is around where you should usually stop if you just care about printing). My only issue with it is that when it comes to fonts or images, you'll have to break out to additional specs to understand those formats, since PDF is more of a container.
I do like this minimal example as a way to get started and see how a very basic PDF is built.
Anybody know a simple tool I can use to turn an academic two-column paper into a single-column PDF (so I can read it easily on e-paper like a reMarkable)?
(Ideally I'd like to be able to run such a tool from browser/phone)
Many times I just want to clip the white margins from PDFs so that they are easier to view on tablets or phones. Most viewers don't have a way to force the clipping of pages, so when you change page the zoom is lost and suddenly all the content is squished to the center.
Last time I looked for CLI programs to do it, approx. five years ago, it was really difficult to find good tools to edit PDFs like that.
It's actually not a trivial task, as sometimes pages have different margins, e.g. odd and even pages have different margins on the folding side of the page.
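One partial solution, if you have a TeX installation around: pdfcrop recomputes a tight bounding box per page, which sidesteps the differing-margin problem (invocation from memory; the margin value is in points):

    pdfcrop --margins 5 in.pdf out.pdf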
For something like this, how do we know that the files are not sent to a server? Am I just trusting the web app? Is there any way to be sure other than having and reading the source?
For someone who knows that these tools exist, yes, this is a way.
For an ordinary user, the only 'easy' way to verify the claimed behaviour is to literally go offline.
Browsers do not currently have a badge to verify that an app is not sending any data. I'm thinking of how we were brought to trust the padlock icon browsers display for TLS-supporting sites.
Something I still miss is a free and easy PDF tool which lets you delete, reorder, and add pages from multiple PDFs. On Windows there is just Xodo, but its UX is unfortunately subpar; on macOS you have Preview, where the UI is better, but once you have multiple PDFs to take pages from, it can get confusing.
pdftk is by far my favourite for this too. It's quite fast, but it does run into some file size problems when merging files, as it doesn't deduplicate resources, and it can crash outright on some bigger files.
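A typical merge/reorder, for reference (hypothetical files and page ranges):

    # pages 1-3 of one.pdf, all of two.pdf, then page 5 to the end of one.pdf
    pdftk A=one.pdf B=two.pdf cat A1-3 B A5-end output merged.pdf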
JavaScript APIs for browsers can do a lot now. It's great! I built something similar recently with Mozilla's PDF library. It's for diffing PDFs, but everything happens locally. https://parepdf.com
Sweet. When I discovered Mozilla's PDF.js, I thought client-side manipulation of PDFs would be a breeze.
I built a tool that needed to count the number of pages of a PDF (ca. 2014-2015). At the time, server-side counting was the 'sure' way, per my brief research.
This is a nice use case for pdfcpu.
If you are a pdftk user, give the pdfcpu CLI a spin.
It is multi-platform and has some nice features baked in.
https://pdfcpu.io/
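For instance (commands from memory; check the docs above against your version):

    pdfcpu merge merged.pdf a.pdf b.pdf   # merge the inputs into merged.pdf
    pdfcpu validate merged.pdf            # sanity-check the result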