When to extract images from a PDF

Image extraction is one of those operations that sounds simple until you actually need to do it on something private. The use cases cluster into a few clear groups, and they almost all share the same property: the PDF you're holding contains data you don't want a third-party service to see.

Medical records. Hospitals routinely send imaging results, lab reports, and discharge summaries as PDFs that bundle scanned X-rays, MRI slices, ultrasound stills, and signed prescription images alongside the typed report. To file an insurance claim or consult a second-opinion specialist, you often need just the images, not the surrounding paperwork. Uploading a discharge summary to an online image extractor to pull the X-rays out is exactly the wrong privacy posture, and many jurisdictions treat health records as a special category requiring extra care. A browser-local extractor never has that problem; the PDF and every image inside it stay in the tab until you close it.

Scanned identity documents. Aadhaar copies, passport scans, driver's licence images, PAN cards, and visa stamps frequently come embedded inside larger PDFs: KYC packets, employer onboarding bundles, university applications. When you need to re-submit just the photo or just the address page, you have to extract the underlying image. Doing that with an upload-based tool means handing the entire ID image to a server you don't control. Pull it out locally, and the image leaves your device only when and where you decide.

Image-heavy reports. Real-estate listings, product catalogues, research papers with figures, and construction-site progress reports all bundle dozens or hundreds of images into a single PDF. When you no longer have the original image files, recovering them means starting from the embedded image streams. A browser-local extractor packages them into one ZIP in a single click; an upload-based extractor adds a network round trip and an outside party to the operation, with no technical benefit.

How the extraction actually works

A PDF stores images as image XObjects: independent stream objects that the page's resources map to short names like /Im0 or /Image1. The page's content stream paints each one with the Do operator, which pdf.js surfaces as paintImageXObject in its operator list. When a page is rendered, the renderer looks up the named object, decodes its bytes, and draws the result on the canvas at the requested position.

This tool short-circuits the rendering step. Instead of rasterising the entire page, it walks the page's operator list with pdf.js, watches for image-paint operators, and uses page.objs.get to resolve each image into either a modern ImageBitmap (which we draw to an OffscreenCanvas) or a raw byte buffer with width / height / kind metadata (which we reconstruct into RGBA pixel data and blit onto a canvas). The canvas is then encoded to PNG or JPG via canvas.convertToBlob, or — when the embedded stream is already a JPEG or PNG byte sequence and you've chosen Original mode — written into the ZIP untouched.
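
The raw-buffer branch described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the RawImage shape and the toRgba helper are hypothetical, the kind values are assumed to match pdf.js's ImageKind enum, and the 1-bit branch assumes that a set bit means white. Check all three against the pdf.js version you ship.

```typescript
// Assumed to mirror pdf.js's ImageKind enum; verify against your pdf.js version.
const ImageKind = { GRAYSCALE_1BPP: 1, RGB_24BPP: 2, RGBA_32BPP: 3 } as const;

// Hypothetical shape of the raw (non-ImageBitmap) result from page.objs.get.
interface RawImage {
  width: number;
  height: number;
  kind: number;
  data: Uint8Array | Uint8ClampedArray;
}

// Expand a raw image into tightly packed RGBA bytes. In the browser the
// result can be wrapped in `new ImageData(rgba, width, height)` and drawn.
function toRgba(img: RawImage): Uint8ClampedArray {
  const { width, height, kind, data } = img;
  const out = new Uint8ClampedArray(width * height * 4);

  if (kind === ImageKind.RGBA_32BPP) {
    // Already RGBA: copy as-is.
    out.set(data.subarray(0, out.length));
  } else if (kind === ImageKind.RGB_24BPP) {
    // Three source bytes per pixel; add an opaque alpha channel.
    for (let src = 0, dst = 0; dst < out.length; src += 3, dst += 4) {
      out[dst] = data[src];
      out[dst + 1] = data[src + 1];
      out[dst + 2] = data[src + 2];
      out[dst + 3] = 255;
    }
  } else if (kind === ImageKind.GRAYSCALE_1BPP) {
    // One bit per pixel, most significant bit first, rows padded to a byte.
    const rowBytes = (width + 7) >> 3;
    for (let y = 0; y < height; y++) {
      for (let x = 0; x < width; x++) {
        const byte = data[y * rowBytes + (x >> 3)];
        const value = byte & (0x80 >> (x & 7)) ? 255 : 0; // assumption: set bit = white
        const dst = (y * width + x) * 4;
        out[dst] = value;
        out[dst + 1] = value;
        out[dst + 2] = value;
        out[dst + 3] = 255;
      }
    }
  } else {
    throw new Error(`Unsupported image kind: ${kind}`);
  }
  return out;
}
```

Throwing on an unknown kind, rather than guessing a layout, lets the caller count the image as skipped instead of writing garbage pixels into the ZIP.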

JSZip handles the archive build. We use DEFLATE compression at level 6 (the default DEFLATE level; without the DEFLATE option JSZip stores files uncompressed), which trades a small amount of compression ratio for noticeably faster archive generation. For PNG and JPEG files, which are already compressed, level 9 costs extra seconds over level 6 for a negligible size difference; for raw bitmap content the difference can matter, but raw bitmaps rarely appear in real-world PDFs.
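
As a rough sketch of that configuration: the option names below (type, compression, compressionOptions.level) follow JSZip's documented generateAsync options, while the buildZipOptions helper and the example file name are hypothetical.

```typescript
// Option names follow JSZip's generateAsync(); the helper itself is hypothetical.
interface ZipOptions {
  type: "blob";
  compression: "DEFLATE";
  compressionOptions: { level: number };
}

function buildZipOptions(level: number = 6): ZipOptions {
  // JSZip accepts levels 1 (fastest) to 9 (smallest); clamp anything else.
  const clamped = Math.min(9, Math.max(1, Math.round(level)));
  return {
    type: "blob",
    compression: "DEFLATE",
    compressionOptions: { level: clamped },
  };
}

// In the browser, with JSZip loaded:
//   const zip = new JSZip();
//   zip.file("report-page1-img1.png", pngBlob);
//   const archive = await zip.generateAsync(buildZipOptions());
```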

Privacy posture

The tool runs entirely in your browser. The PDF you select is read into memory using File.arrayBuffer() — a browser-native API, no upload — and pdf.js parses it locally. Image decoding happens on your device. The ZIP is constructed in your browser. The resulting download URL is a blob: URL that lives in your tab's memory and gets revoked when you reset the tool or close the page.
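
The blob: URL lifecycle can be sketched like this. It is an illustration under assumptions, not our shipped code: DownloadUrl and ObjectUrlApi are hypothetical names, and the URL API is injected only so the sketch can run outside a browser. In a page you would pass the global URL object.

```typescript
// Minimal slice of the browser's URL API, injected for testability.
interface ObjectUrlApi {
  createObjectURL(blob: unknown): string;
  revokeObjectURL(url: string): void;
}

// Hypothetical holder for the single live download link.
class DownloadUrl {
  private current: string | null = null;
  private readonly api: ObjectUrlApi;

  constructor(api: ObjectUrlApi) {
    this.api = api;
  }

  // Point the download at a new archive, revoking any previous URL first.
  set(zipBlob: unknown): string {
    this.reset();
    this.current = this.api.createObjectURL(zipBlob);
    return this.current;
  }

  // Called on reset: releases the blob so the tab can free its memory.
  reset(): void {
    if (this.current !== null) {
      this.api.revokeObjectURL(this.current);
      this.current = null;
    }
  }
}

// Browser usage:
//   const bytes = await file.arrayBuffer(); // local read, no network request
//   const link = new DownloadUrl(URL);
//   anchor.href = link.set(zipBlob);
//   window.addEventListener("pagehide", () => link.reset());
```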

What we do log: tool-started events with your chosen output format and minimum-size setting; tool-completed events with image counts and timing buckets; tool-failed events with a short error message when something doesn't decode. None of those events contain the file's bytes, the file's name, or the contents of any extracted image. They exist so we can tell, in aggregate, whether the tool is succeeding on real user files. You can verify this by opening DevTools, going to the Network tab, and watching what gets sent during a run — you'll see PostHog events with metadata, no PDF, no images.
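
To make "metadata only" concrete, here is a sketch of what a tool-completed payload could look like. The event name comes from the description above; the property names, the OutputFormat values, and the bucket boundaries are illustrative assumptions, not the actual schema.

```typescript
// Property names and bucket edges are hypothetical.
type OutputFormat = "png" | "jpg" | "original";

interface CompletedEvent {
  event: "tool-completed";
  outputFormat: OutputFormat;
  minSizePx: number;
  imageCount: number;
  skippedCount: number;
  durationBucket: string;
}

// Coarse buckets instead of exact timings.
function durationBucket(ms: number): string {
  if (ms < 1000) return "<1s";
  if (ms < 5000) return "1-5s";
  if (ms < 30000) return "5-30s";
  return ">=30s";
}

function completedEvent(
  outputFormat: OutputFormat,
  minSizePx: number,
  imageCount: number,
  skippedCount: number,
  elapsedMs: number,
): CompletedEvent {
  // Deliberately no file name, no file bytes, no image content.
  return {
    event: "tool-completed",
    outputFormat,
    minSizePx,
    imageCount,
    skippedCount,
    durationBucket: durationBucket(elapsedMs),
  };
}
```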

Comparison with upload-based competitors

Most popular PDF-image-extraction tools — including the major "ilovepdf-style" services — require you to upload your PDF to their servers, run extraction there, and download the ZIP back. That model has three problems for the use cases above.

First, network latency. A 50 MB medical-record PDF takes meaningful time to upload over an average home connection (at a 10 Mbps uplink, for example, about 40 seconds before extraction can even begin); browser-local extraction starts the moment you drop the file. Second, a stranger's server now has a copy of your PDF. Most services document a retention window and a deletion policy, but those are policies, not technical impossibilities; the file existed on someone else's disk for some period of time. Third, the upload buys you no quality. The embedded image data decodes to the same pixels wherever the code runs, and open-source libraries such as pdf.js are as available in a browser tab as on a server. The only thing the upload model changes is where the code runs.

Common failure modes and how we handle them

Encrypted PDFs. If the PDF needs a password to open, pdf.js cannot decrypt it and rejects the document before any image stream is read. The tool surfaces that as an error rather than pretending to succeed. Use the unlock-pdf tool first (also browser-local) to strip the password, then come back here.
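
One way to turn that failure into a clear message is sketched below. It assumes pdf.js rejects the load with an error whose name is "PasswordException"; the describeOpenError helper and its wording are hypothetical.

```typescript
// Assumption: pdf.js rejects with an error named "PasswordException" when a
// password is required. Verify against the pdf.js version in use.
function describeOpenError(err: unknown): string {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  if (name === "PasswordException") {
    return "This PDF is password-protected. Unlock it first, then try again.";
  }
  return "This PDF could not be opened.";
}

// Browser usage:
//   try {
//     pdf = await pdfjsLib.getDocument({ data: bytes }).promise;
//   } catch (err) {
//     showError(describeOpenError(err)); // showError is a placeholder
//   }
```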

JBIG2 / CCITT scans. Old fax-style scan pipelines often produce PDFs with images encoded in JBIG2 or CCITT Group 4. pdf.js bundles decoders for both, but they don't handle every variant in the wild. When decoding fails, we count the image as skipped rather than crashing the run — you'll see "X images skipped (too small or unsupported encoding)" in the result, and the ZIP contains everything that did decode.
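
The skip-and-continue behaviour amounts to a loop like the one below. This is a simplified, synchronous sketch: the decode callback stands in for whatever resolves and decodes an image, and treating an image as "too small" when either dimension falls below the threshold is an assumption.

```typescript
interface Decoded {
  width: number;
  height: number;
}

interface RunResult<T> {
  images: T[];
  skipped: number;
}

// `decode` is a hypothetical callback that returns a decoded image or throws
// when the encoding is unsupported.
function extractAll<T extends Decoded>(
  names: string[],
  decode: (name: string) => T,
  minSizePx: number,
): RunResult<T> {
  const images: T[] = [];
  let skipped = 0;
  for (const name of names) {
    try {
      const img = decode(name);
      if (img.width < minSizePx || img.height < minSizePx) {
        skipped++; // below the minimum-size setting
        continue;
      }
      images.push(img);
    } catch {
      skipped++; // unsupported encoding: count it and keep going
    }
  }
  return { images, skipped };
}

function skippedMessage(skipped: number): string {
  return `${skipped} images skipped (too small or unsupported encoding)`;
}
```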

Inline images. A PDF can embed tiny images directly inside a content stream, between the BI and EI operators, without creating a referable object; pdf.js reports these as paintInlineImageXObject. These typically come from old fonts or signature stamps and are unusual; we count them in the pre-scan total but skip them during extraction because pdf.js exposes them differently across versions and the data isn't reliably reachable.
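
A pre-scan along these lines might look like the sketch below. It works on the fnArray / argsArray pair that pdf.js's page.getOperatorList() resolves to and takes the operator codes as a parameter (in the browser you would pass pdfjsLib.OPS). The scanPage name and the ScanResult shape are hypothetical, and the sketch assumes the first argument of a paintImageXObject entry is the object id.

```typescript
// Structural slice of what page.getOperatorList() resolves to.
interface OperatorList {
  fnArray: ArrayLike<number>;
  argsArray: ArrayLike<unknown>;
}

// The two operator codes we care about; pass pdfjsLib.OPS in the browser.
interface ImageOps {
  paintImageXObject: number;
  paintInlineImageXObject: number;
}

interface ScanResult {
  total: number;         // everything counted in the pre-scan
  objectIds: string[];   // ids that can be resolved via page.objs.get
  inlineSkipped: number; // inline images: counted, not extracted
}

function scanPage(list: OperatorList, ops: ImageOps): ScanResult {
  const objectIds: string[] = [];
  let inlineSkipped = 0;
  for (let i = 0; i < list.fnArray.length; i++) {
    const fn = list.fnArray[i];
    if (fn === ops.paintImageXObject) {
      // Assumption: args[0] is the object id string.
      const args = list.argsArray[i];
      const id = Array.isArray(args) ? args[0] : undefined;
      if (typeof id === "string") objectIds.push(id);
    } else if (fn === ops.paintInlineImageXObject) {
      inlineSkipped++;
    }
  }
  return { total: objectIds.length + inlineSkipped, objectIds, inlineSkipped };
}
```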

Vector graphics that look like images. A chart in a financial report, an icon in a marketing PDF, or a logo built from Bezier paths is not an image — it's a series of moveTo / lineTo / fill operators. The image extractor will report 0 images for a page that's all vector. That's not a bug; you need a PDF → SVG converter to capture vector content, or a page-rasterise step to convert the whole rendered page into a single image.

Frequently asked usage questions

Can I batch-process several PDFs? Not in this version — one PDF at a time. Most extraction sessions are a single document anyway, and batching gets messy when the resulting ZIPs need to be kept separate (per-document) versus merged. If batch is the dominant workflow for you, drop a note via the contact page and we'll prioritise it.

Can I extract images from a specific page only? Not yet. The current run extracts images from every page. If you only want one page's images, the workaround is to use the extract-pages tool to pull that single page into a one-page PDF, then run that one-pager through this extractor.

What naming scheme do the files use? <pdf-name>-page<N>-img<M>.<ext> — e.g. medical-discharge-page3-img7.jpg. Page numbers are 1-indexed; image numbers are sequential across the entire PDF, not reset per page, so you can sort the ZIP and follow the extraction order.
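
As a small illustration of that scheme (the helper names are hypothetical, and stripping a trailing .pdf from the source file name is an assumption):

```typescript
// Drop a trailing ".pdf" (any case) from the source file name.
function baseName(pdfFileName: string): string {
  return pdfFileName.replace(/\.pdf$/i, "");
}

function imageFileName(pdfFileName: string, page: number, imageNo: number, ext: string): string {
  return `${baseName(pdfFileName)}-page${page}-img${imageNo}.${ext}`;
}

// Image numbers run across the whole document, so the counter lives outside
// the per-page loop. imagesPerPage[i] is the image count on page i + 1.
function nameAll(pdfFileName: string, imagesPerPage: number[], ext: string): string[] {
  const names: string[] = [];
  let imageNo = 0;
  imagesPerPage.forEach((count, pageIndex) => {
    for (let k = 0; k < count; k++) {
      imageNo++;
      names.push(imageFileName(pdfFileName, pageIndex + 1, imageNo, ext));
    }
  });
  return names;
}
```

For a three-page PDF with two images on page 1 and one on page 3, nameAll("a.pdf", [2, 0, 1], "png") yields a-page1-img1.png, a-page1-img2.png, and a-page3-img3.png.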

Related tools on PDF Mavericks

Once you have the images extracted, you may want to keep processing them. We have other browser-local tools that pair naturally with this one:

  • Redact PDF — burn-in redaction boxes before you share a PDF that still contains sensitive images.
  • Delete PDF Pages — drop pages you don't want in the source PDF before extracting images.
  • Aadhaar Mask — mask the first 8 digits on a scanned Aadhaar before sharing the resulting image.
  • Unlock PDF — strip a password so you can run image extraction on a previously locked file.
  • PDF to CSV — when what you actually wanted was tabular data rather than raster images.

All of these run in your browser the same way this one does. None of them upload the file to a server. That's the whole product line.