
Extract Images from PDF Online Free: The Browser Approach

Extract images from PDF online free without uploading: how XObject and InlineImage embeddings differ, why JBIG2 scans get skipped, Original vs PNG vs JPG output, and ZIP packaging trade-offs.

PDF Mavericks

You opened a 200-page hospital discharge summary, scrolled to the page with the chest X-ray, and tried to right-click the image to save it. PDF viewers do not allow that — the X-ray is an embedded XObject inside the page content stream, not a separate file. You want to extract images from the PDF online, free, and without uploading the whole discharge summary to a stranger's server, because the PDF carries patient identifiers, lab results, and the kind of data that triggers HIPAA exposure if it leaks.

This guide walks through how PDF image extraction actually works under the hood, why some images cannot be extracted at all, the trade-offs between server, desktop, and browser tools, and how to recover when an extraction returns "0 images detected" on a PDF that obviously contains pictures. The tool itself runs in your browser; this guide explains the pieces so you can pick the right output mode and diagnose issues without trial-and-error.

Why image extraction is harder than it sounds

A PDF is not a folder of pages with images in it. The PDF 1.7 / ISO 32000-1 specification stores images as XObject streams — independent objects in the file's object table, referenced from page content via short names like /Im0 or /Image1. When a page renders, the content stream issues a Do operator naming that XObject (pdf.js surfaces it as a paintImageXObject entry in the page's operator list); the renderer fetches the corresponding object, decodes its bytes according to the stream's declared filter chain, and draws the result on the canvas at the requested position and scale.
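To make that concrete, here is a hand-written fragment of the kind of content stream being described (illustrative, not taken from any real file):

```
q                      % save graphics state
300 0 0 200 72 500 cm  % scale the 1x1 image space to 300x200 pt, place it at (72, 500)
/Im0 Do                % paint the image XObject registered as /Im0 in the page's /Resources
Q                      % restore graphics state
```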

That sounds simple until you read the filter chain. PDF supports at least seven image-encoding paths:

  • DCTDecode (JPEG). The bytes are a standard JPEG file. Easy to extract — write the bytes to a .jpg.
  • FlateDecode (zlib-compressed bitmap). Raw RGB or RGBA pixels with zlib compression. Decode and re-encode as PNG, or write as a raw bitmap.
  • JPXDecode (JPEG 2000). Modern, less common, partial decoder support.
  • CCITTFaxDecode. Group 3 or Group 4 fax compression, used in old scanned PDFs. Decoder support varies.
  • JBIG2Decode. Mostly used for monochrome scanned text. The pdf.js bundled decoder handles common variants but not every wild PDF.
  • RunLengthDecode and LZWDecode. Legacy encodings still seen in older PDFs.
  • InlineImage. Tiny images embedded directly in a page content stream using BI/ID/EI markers, without an XObject reference. Often signature stamps and old fonts.

Each of these has to be detected, decoded, and re-encoded for the user. There is also a separate concept called MaskedImage — an image plus a separate mask object that tells the renderer which pixels are transparent. Naive extractors will pull the image and miss the mask, which shows up as a JPG with a black background where transparency was supposed to be. Doing this correctly means resolving both objects and compositing on a canvas, which is what a browser-based extractor that uses pdf.js does internally.
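As a rough illustration of how an extractor routes each stream, here is a minimal JavaScript sketch. The function name and return shape are invented for this example, and real code also has to look at the /ColorSpace, /BitsPerComponent, and /SMask entries, plus chained filters, before choosing a path:

```js
// Illustrative only: map an image stream's /Filter name to how the extractor
// should treat it. Real streams can chain filters, so this is the happy path.
function handlingForFilter(filterName) {
  switch (filterName) {
    case 'DCTDecode':       return { ext: 'jpg', passthrough: true };  // bytes are already a JPEG file
    case 'JPXDecode':       return { ext: 'jp2', passthrough: true };  // JPEG 2000, decoder support varies
    case 'FlateDecode':     return { ext: 'png', passthrough: false }; // raw pixels: decode, re-encode as PNG
    case 'CCITTFaxDecode':
    case 'JBIG2Decode':     return { ext: 'png', passthrough: false }; // decode if the decoder copes, else skip
    case 'RunLengthDecode':
    case 'LZWDecode':       return { ext: 'png', passthrough: false }; // legacy encodings, same treatment
    default:                return { ext: 'bin', passthrough: true };  // unknown: dump the raw bytes
  }
}
```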

Three approaches: server, desktop, browser

Three tool categories handle PDF image extraction. They share the same core logic — most tools run pdf.js (shipped on npm as pdfjs-dist) or a wrapper around Poppler's pdfimages utility under the hood. The difference is where the code runs and what that means for your file.

1. Server-based tools

iLovePDF, Smallpdf, and similar online services upload the PDF to their server, run extraction with Poppler or a pdf.js backend, and return a ZIP. The extraction quality is comparable to anything else; the cost is privacy. A 50 MB medical-record PDF has to upload over your home connection and sit on someone else's disk for the duration of processing, and retention is governed by the vendor's written policy rather than by any technical barrier. Most services document a deletion window, but a logging misconfiguration, a request retry that hits a cache, or a data breach at the provider can still expose the file. For HIPAA-regulated content, this is not the default you want.

2. Desktop tools

Poppler's pdfimages -all input.pdf out-prefix command-line utility is the most reliable extractor in the ecosystem; it handles every encoding pdf.js does plus a few obscure ones. The downside is the install step. On macOS you need Homebrew (brew install poppler); on Windows you need to download the static binaries and add them to PATH; on Linux it is one apt-get away. For a one-off extraction this is friction; for repeat use by someone comfortable in a terminal, it is the gold standard. ImageMagick and pdftk are adjacent options with weaker image-extraction support.

3. Browser tools

The browser path uses pdf.js inside the page itself. The PDF is read into memory using File.arrayBuffer(), an HTML5 API that gives JavaScript access to the file's bytes without an HTTP upload. pdf.js parses the document, walks each page's operator list, and emits image objects via page.objs.get. Modern browsers expose those as ImageBitmap objects that can be drawn directly to an OffscreenCanvas, then encoded with canvas.convertToBlob. JSZip handles the archive on the same thread.
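A condensed sketch of that pipeline, assuming a pdf.js build exposed as pdfjsLib and a JSZip instance named zip; variable names are illustrative, only the first page is processed, and error handling is omitted:

```js
// Read the file locally, walk one page's operator list, and re-encode each
// image XObject to PNG. No bytes leave the browser at any point.
const data = await file.arrayBuffer();                        // File from an <input type="file">
const pdfDoc = await pdfjsLib.getDocument({ data }).promise;
const page = await pdfDoc.getPage(1);
const ops = await page.getOperatorList();

for (let i = 0; i < ops.fnArray.length; i++) {
  if (ops.fnArray[i] !== pdfjsLib.OPS.paintImageXObject) continue;
  const name = ops.argsArray[i][0];                           // internal object name, e.g. "img_p0_1"
  const img = await new Promise((resolve) => page.objs.get(name, resolve));
  if (!img) continue;                                         // unresolved object: count as skipped

  const canvas = new OffscreenCanvas(img.width, img.height);
  const ctx = canvas.getContext('2d');
  if (img.bitmap) {
    ctx.drawImage(img.bitmap, 0, 0);                          // newer pdf.js builds hand back an ImageBitmap
  } else if (img.data) {
    // older builds hand back raw pixels; RGBA is assumed here, real code checks img.kind
    ctx.putImageData(new ImageData(new Uint8ClampedArray(img.data), img.width, img.height), 0, 0);
  }
  const blob = await canvas.convertToBlob({ type: 'image/png' });
  zip.file(`page1-img${i}.png`, blob);
}
```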

The privacy posture is straightforward: nothing leaves your tab. You can verify it in DevTools — open the Network tab, run an extraction, and the only outbound traffic should be PostHog metadata events with image counts and timing. The file itself, the embedded image bytes, and the resulting ZIP never cross a network boundary.

How to extract images in your browser

  1. Open the tool. Go to the extract images from PDF page and drop a single PDF into the upload zone. The file stays on your laptop.
  2. Pick the output format. Original (preserves the embedded JPEG or PNG bytes byte-for-byte, zero re-encoding loss), PNG (uniform lossless format, larger files for photographic content), or JPG (uniform lossy format, smallest ZIP at quality 0.92). Original is the right default for medical records and archival.
  3. Set the minimum-size filter. Default is 100 pixels — anything smaller (icons, signature stamps, page decorations) gets skipped. Lower it to 0 if you want everything; raise it to 500 if you only want content-bearing images.
  4. Run extraction. Click extract. The pre-scan walks every page, the decode pass resolves each image, and the ZIP build packages the result. Progress shows page-by-page. Total time is a few seconds for small PDFs, up to a minute or two for large image-heavy ones.
  5. Download the ZIP. Single click. The result panel shows total images extracted, count by format, and any skipped items with reasons (encoding limit, below size threshold, decoder failure).

Original vs PNG vs JPG output

The output mode determines whether your extracted images go through a re-encoding pass. The choice matters more than it might seem.

Original is byte-for-byte preservation. When an embedded JPEG or PNG stream is found, the bytes are written into the ZIP unchanged. The file you download is the file the PDF carried internally, with the same compression artifacts, color profile, and metadata. This is the right choice for medical imaging (radiology requires lossless transfer for re-read), forensic evidence (chain-of-custody requires no re-encoding), and archival (the original byte sequence is the artifact). The only downside is non-uniformity — a PDF with mixed JPEG and PNG embeddings produces a ZIP with mixed extensions, which some downstream pipelines have to handle.

PNG is uniform lossless. Every image is decoded into an RGBA canvas, then re-encoded as PNG via canvas.convertToBlob('image/png'). The re-encode is lossless, but PNG's compression is worse than JPEG for photographic content — a 200 KB embedded JPEG becomes a 600 KB to 1.5 MB PNG depending on the content. Pick this when you need a single format for downstream tooling and the originals are a mix.

JPG is uniform lossy. Every image is re-encoded at quality 0.92 — visually indistinguishable for photographic content but a real generation-loss step for text-heavy scans, line art, or any content with sharp edges. Pick this when ZIP size matters more than fidelity and the content is photographic. Avoid for medical records, scanned text, or anything with hard edges.
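The two re-encode paths differ only in the arguments handed to convertToBlob. A minimal sketch, assuming canvas is an OffscreenCanvas that already holds the decoded image:

```js
// Original mode never touches a canvas; it writes the embedded bytes as-is.
const pngBlob = await canvas.convertToBlob({ type: 'image/png' });                  // lossless, larger files
const jpgBlob = await canvas.convertToBlob({ type: 'image/jpeg', quality: 0.92 });  // one lossy generation
```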

Real scenarios: records, journalism, research

1. Medical records

Hospitals send imaging results, lab reports, and discharge summaries as PDFs that bundle scanned X-rays, MRI slices, ultrasound stills, and signed prescription images alongside the typed report. To file an insurance claim or share with a second-opinion specialist, you often need just the images — not the surrounding paperwork. Original output mode is the correct choice; the radiology-grade fidelity has to survive the extraction step. Use the browser tool, not an upload-based extractor — the surrounding report contains patient identifiers and HIPAA-protected health information, and most jurisdictions treat health records as a special category requiring extra care.

2. Journalism with scanned IDs in evidence

Investigative reporting often involves leaked or FOIA-released documents that bundle scanned IDs, contracts, and photographs into multi-hundred-page PDFs. Reporters need to extract specific images for verification (reverse image search) and publication (with appropriate redaction). The entire workflow must stay on the reporter's machine — sources are protected only as well as the document handling is, and any upload step is a leak vector. Browser extraction is the only path that preserves the source-protection posture; pair with redact-pdf for any image you need to publish with identifying details blacked out.

3. Research with embedded figures

A 200-page environmental impact report or epidemiology preprint usually has 30 to 80 figures embedded as images — charts, maps, scanned field notes, photographs. To cite or reuse those figures (with attribution) in a literature review, you need them as separate files. Original mode preserves the publication-quality embedded raster; PNG mode normalizes a mixed batch for LaTeX or reference-management ingestion. If the report is co-authored and not yet published, the privacy of the unpublished figures is also an issue — browser extraction keeps them on your laptop.

Why extractions fail and how to recover

A few specific failure modes recur. Each has a clear recovery path.

0 images detected on a page that obviously has pictures. The page content is vector — Bézier paths and moveTo / lineTo / fill operators that the renderer turns into pixels at draw time. Charts in financial reports, icons in marketing PDFs, and logos built from vector shapes all show up this way. There is nothing for an image extractor to pull. To capture vector content, use a PDF-to-SVG converter, or rasterize the entire page to PNG and accept that you have lost the vector resolution. A second possibility: your minimum-size filter is excluding everything — drop it from 100 to 0 and re-run.
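If you do need the vector page as pixels, a minimal pdf.js rasterization sketch looks like this, assuming pdfDoc is an already-loaded document and pageNumber is the page you want (scale 2 is an arbitrary choice for extra resolution):

```js
// Render the whole page to a canvas and save it as a PNG. The vector content
// is rasterized at draw time, so resolution depends on the scale you choose.
const page = await pdfDoc.getPage(pageNumber);
const viewport = page.getViewport({ scale: 2 });
const canvas = document.createElement('canvas');
canvas.width = viewport.width;
canvas.height = viewport.height;
await page.render({ canvasContext: canvas.getContext('2d'), viewport }).promise;
canvas.toBlob((blob) => {
  // blob is a PNG of the full page; hand it to a download link or further processing
}, 'image/png');
```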

X images skipped (unsupported encoding). Old fax-pipeline scans use JBIG2 or CCITT Group 4 with variants the bundled decoder cannot handle. The extractor counts those as skipped rather than crashing. Recovery options: try the same PDF in Poppler's pdfimages utility (broader decoder coverage), or open the PDF in a modern reader, render each affected page to a high-DPI image, and use the rendered raster instead of the original embedded stream.

Encrypted PDF rejected. Password-protected PDFs hide the image streams under encryption; pdf.js refuses to read them until the password is supplied. Use a browser-local unlock-pdf tool to strip the password first, then run image extraction on the unlocked file. The whole workflow stays on your laptop because both tools are client-side.
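For completeness, pdf.js itself can open a protected file if you hand it the password when loading. A sketch, assuming pdfjsLib is the pdf.js build in use and fileBytes holds the file's ArrayBuffer:

```js
// Without the password, loading an encrypted file fails; with it, the image
// streams decode normally and extraction proceeds as usual.
const loadingTask = pdfjsLib.getDocument({ data: fileBytes, password: 'your-password' });
const pdfDoc = await loadingTask.promise;
```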

Run stalls or browser tab freezes. File size above the 100 MB browser-memory soft cap. Recovery: use the extract-pages tool to pull a smaller range out of the source PDF, then run image extraction on the smaller file. Most large PDFs are scanned reports where every page is one image, so a 500-page scan can be split into five 100-page chunks and extracted sequentially.

Duplicate images, page after page. Not a failure — a logo, watermark, or signature stamp is reused across pages, and the extractor writes one copy per page that uses it. Within a page we deduplicate; across pages we preserve the duplicates because the user goal is often to recover every visible image position. If you only want unique source images, deduplicate the ZIP afterward by file hash (any duplicate-finder utility works on the unzipped output).
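If you only want one copy of each repeated asset, a content-hash pass over the ZIP does it. A sketch using JSZip and the Web Crypto API (the function name is illustrative):

```js
// Load the extraction ZIP, hash every entry, and keep the first filename seen
// for each distinct content hash.
async function uniqueImages(zipBlob) {
  const zip = await JSZip.loadAsync(zipBlob);
  const seen = new Map(); // hash -> first filename with that content
  for (const name of Object.keys(zip.files)) {
    if (zip.files[name].dir) continue;
    const bytes = await zip.files[name].async('uint8array');
    const digest = await crypto.subtle.digest('SHA-256', bytes);
    const hash = Array.from(new Uint8Array(digest))
      .map((b) => b.toString(16).padStart(2, '0'))
      .join('');
    if (!seen.has(hash)) seen.set(hash, name);
  }
  return [...seen.values()]; // filenames of the unique images
}
```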

Wanted tabular data, not images. The PDF looks like it has data tables embedded as images, but actually contains real text in a table layout. An image extractor returns nothing useful; what you want is pdf-to-csv for tabular extraction. Confirm by trying to select the table text in a PDF reader; if it highlights, it is text, not an image.

Your PDF stays in your browser tab

Image extraction on PDF Mavericks runs on pdf.js inside your browser. The PDF is read with File.arrayBuffer, parsed locally, decoded image-by-image, and ZIP-packaged on your device. Medical records, scanned IDs, and unpublished research figures never leave your laptop.

Frequently asked questions

Why are some images skipped during extraction?

Two reasons cover almost every skipped image. First, encoding limits — old fax-pipeline scans use JBIG2 or CCITT Group 4 compression, and the bundled decoder in pdf.js does not handle every variant in the wild. The extractor counts those as skipped rather than crashing the run. Second, inline images. A PDF can embed tiny rasters directly inside a page content stream using BI/ID/EI markers, without creating a separate referable object; pdf.js surfaces these as paintInlineImageXObject entries but exposes the underlying bytes differently across versions, and they are not reliably reachable, so we count them in the pre-scan total and skip them at write time.

Why is image extraction slow on big PDFs?

The slow part is not the extraction itself — it is the pre-scan that walks every page operator list to count and locate image XObjects. A 500-page report with one image per page is fine; a 200-page report with 30 images per page means 6,000 operator-list entries to enumerate, then 6,000 page.objs.get calls to resolve each image, then a canvas draw and encode for each one. JSZip's DEFLATE pass is the cheap step. Browser memory caps around 100 MB on most laptops; above that the canvas re-encode pass starts to dominate runtime, especially on mobile devices.
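A sketch of what that pre-scan amounts to, assuming pdfDoc is a loaded pdf.js document (pdfjsLib.OPS holds the operator constants):

```js
// Walk every page's operator list and count image paint operations without
// decoding anything yet. This is the pass that dominates on image-dense PDFs.
async function countImages(pdfDoc) {
  let total = 0;
  for (let p = 1; p <= pdfDoc.numPages; p++) {
    const page = await pdfDoc.getPage(p);
    const ops = await page.getOperatorList();
    for (const fn of ops.fnArray) {
      if (fn === pdfjsLib.OPS.paintImageXObject ||
          fn === pdfjsLib.OPS.paintInlineImageXObject) {
        total += 1;
      }
    }
  }
  return total;
}
```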

Can I extract vector graphics like charts and logos?

No — and the reason matters. A vector chart in a financial report or a logo built from Bezier paths is not stored as an image inside the PDF. It is a series of moveTo, lineTo, curveTo, and fill operators in the page content stream, which the renderer turns into pixels at draw time. An image extractor walks the operator list looking for paintImageXObject calls, finds none on a vector page, and reports zero images. To capture vector content you need a PDF-to-SVG converter, or you have to rasterize the entire page to an image (PDF to PNG) and lose the vector resolution.

What about scanned PDFs — can I get the original page images?

Yes, in the simplest possible way: in a scanned PDF, the page IS the image. The scanner produced one big raster (usually JPEG or CCITT) per page and wrapped it in a thin PDF container. The image extractor finds one image per page, pulls the raw bytes out, and writes them to the ZIP. In Original output mode you get the scanner's original JPEG bytes with no re-encoding loss, which is the right choice for medical records, archival use, or anything where pixel fidelity matters. The downside is file size — scanned-page JPEGs are typically 200 KB to 2 MB each, so a 200-page scan unzips to several hundred megabytes.

What naming pattern do the extracted files use?

The naming scheme is <pdf-name>-page<N>-img<M>.<ext> — for example, medical-discharge-page3-img7.jpg or research-report-page12-img4.png. Page numbers are 1-indexed. Image numbers are sequential across the entire PDF, not reset per page, so you can sort the ZIP by name and follow the extraction order across page boundaries. The extension reflects the actual encoding written to disk: .jpg for JPEG bytes, .png for PNG bytes, plus a few less common cases (.bin for raw bitmap data when a stream cannot be classified safely).
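Expressed as a tiny helper (illustrative only, mirroring the pattern above):

```js
// Builds names like "medical-discharge-page3-img7.jpg"; page numbers are
// 1-indexed and image numbers run sequentially across the whole PDF.
function imageFileName(pdfName, pageNum, imgNum, ext) {
  return `${pdfName}-page${pageNum}-img${imgNum}.${ext}`;
}
```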

Why a single ZIP instead of individual file downloads?

Browsers throttle automatic multi-file downloads as a security measure. A 30-image extraction would either trigger a confirmation dialog for every file or silently fail after the first few — Chrome blocks more than 10 sequential downloads from a single page action by default, Safari is stricter, and Firefox depends on user preference settings. Packaging into one ZIP sidesteps the throttle, gives you a single click to save, and keeps the extracted images grouped under their source PDF's name so they are easy to find later. Most operating systems unzip in place with a double-click, so the friction is minimal.
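A sketch of the packaging step, assuming JSZip is available and images is an array of { name, blob } pairs produced by the extraction pass:

```js
// Bundle every extracted image into one archive, then trigger a single download.
const zip = new JSZip();
for (const { name, blob } of images) {
  zip.file(name, blob);
}
const zipBlob = await zip.generateAsync({ type: 'blob', compression: 'DEFLATE' });
const url = URL.createObjectURL(zipBlob);
const a = document.createElement('a');
a.href = url;
a.download = 'extracted-images.zip';
a.click();
URL.revokeObjectURL(url);
```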

Should I pick Original, PNG, or JPG output?

Original is the right default for archival, medical records, and any case where pixel fidelity matters — it writes the embedded JPEG or PNG bytes byte-for-byte into the ZIP with zero re-encoding. PNG is the right choice when you need a uniform format for downstream processing and the originals are a mix of JPEG and PNG; the canvas re-encode is lossless but the file sizes get larger because PNG is worse at compressing photographic content. JPG output uses quality 0.92 and produces the smallest ZIP, but every extracted image goes through one generation of lossy re-encoding even if the original was already JPEG, which is a measurable quality drop on text-heavy scans.

Is the PDF uploaded to a server during extraction?

No. The PDF is read into browser memory using File.arrayBuffer, parsed locally with pdf.js, decoded image-by-image on your device, and packaged into a ZIP locally with JSZip. Nothing leaves your machine — no upload step, no temporary cloud copy, no third-party processor. We log a tool-started and tool-completed event for our own diagnostics; those events contain image counts and timing, but no file bytes, no file name, and no extracted image data. You can verify this in DevTools by opening the Network tab during a run and confirming only PostHog metadata events go out.

Related guides