A scanned PDF tilted 2° will tank OCR by 10×
That sounds dramatic, but it's the actual production-OCR experience. Tesseract and most OCR engines train on text rendered horizontally. They handle a tiny bit of skew — maybe ±0.5° before you measure any drop. Past that, character segmentation starts to fail. Letters that should sit on a baseline get split across two scan-lines, descenders disappear into the line below, the engine guesses "rn" where there's an "m," misreads "1" as "l," and ends up returning a paragraph that doesn't pass a basic spellcheck. By 3-5° of skew you're losing roughly 30-50% of words on a typical bank statement, and full lines on dense pages.
ocrmypdf — the standard open-source OCR pipeline — ships a dedicated --deskew step for exactly this reason, and Adobe Acrobat's scan workflow, FineReader, and every commercial document-processing pipeline that has to hit production accuracy deskew before recognition. The pre-OCR deskew step is not optional infrastructure; it's how the rest of the pipeline gets its accuracy numbers.
This tool runs the same deskew step in your browser. Drop a scanned PDF, watch the per-page detected angles populate, click Apply. The output is a deskewed PDF you then feed to /ocr-pdf — our browser-local OCR tool — for a clean, searchable output with text recognition that actually works.
India context — why ADF skew shows up everywhere
Indian small businesses, lawyers, CAs, and individuals run more PDF scans than almost any other audience. The reasons are familiar: bank account opening, GST registration, property mutation, court submissions, education board verifications, RTI filings, FRRO/visa paperwork, NRI document attestations. Each one demands scanned copies of original documents — usually fed through an ADF scanner at a Xerox shop, a small office multifunction printer, or a smartphone in a hurry.
ADF scanners feed paper through rollers. The mechanical reality is that paper enters the rollers slightly off-axis on most passes. The skew that results is small — typically 0.5° to 3° — but it's persistent across batches, and it's exactly enough to break OCR accuracy. Smartphone scans (any of the post-CamScanner-ban India apps, or just a plain camera capture) skew worse: hand-held photography routinely produces 5°-10° tilt depending on the user.
The documents that need OCR after scanning are almost exactly the documents you'd least like to upload to a third-party server: bank statements with account numbers and transaction history, sale deeds with PAN and Aadhaar references, divorce decrees, court orders, GST returns with revenue figures, property tax bills, school transcripts, employment letters with salary breakdowns. Sending those through an upload-based deskew or OCR service hands the data to whoever runs the service and whoever runs their cloud provider — usually US or EU servers, with whatever logging and retention policies they happen to follow that quarter.
Browser-local deskew sidesteps that entirely. The PDF is read into the page's memory, processed by WebAssembly running in your browser tab, and the deskewed output is offered as a download. The only network call is a one-time fetch of opencv.js itself from a public CDN — your PDF, your data, never leaves the browser. For Indian users handling regulated documents, this is the only honest privacy posture.
How the skew detection works
The pipeline is the standard text-document Hough transform, adapted to run in the browser via OpenCV.js. For each PDF page:
- Render to canvas. pdf.js rasterizes the page at 1.5× device-pixel-ratio scale — high enough to expose text-edge detail without blowing memory on a 30-page document.
- Grayscale + adaptive threshold. Convert to single-channel, then binarize with an adaptive threshold (15-pixel block, mean adjustment 10) so uneven scan lighting doesn't wipe out text.
- Invert. Hough transforms find lines on bright pixels; text on a binarized scan is dark on white, so we invert.
- Canny edge detection. Standard 50/150 thresholds. The output is an edge map — the boundary of every letter.
- Probabilistic Hough lines. cv.HoughLinesP with rho=1, theta=π/180, threshold=100, min line length=100, max gap=10. This finds straight line segments at all angles.
- Filter to text-line range. For each detected line, atan2(dy, dx) gives its angle. We keep only lines in [-15°, 15°] — the realistic range for text-line skew. Vertical lines (table borders, page edges), steep diagonals (signature strokes, stamps), and decorative elements get rejected.
- Take the median. The median of remaining angles is more robust than the mean against outliers. Even in tables and forms, the dominant horizontal direction usually wins.
That median is reported per-page. You can override it by unchecking specific pages or raising the threshold — useful if a particular page has odd content that confuses the detector (large signatures, half-blank pages, decorative borders).
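The filter-and-median step above can be sketched in plain JavaScript. This is illustrative code, not the tool's actual source — the function name is made up, and it assumes the flat [x1, y1, x2, y2, …] layout you get from the data32S array of cv.HoughLinesP's output Mat.

```javascript
// Median skew angle from Hough line segments, filtered to ±15°.
// Illustrative sketch; `medianSkewAngle` is not the tool's real API.
function medianSkewAngle(lines, maxAbsDeg = 15) {
  const angles = [];
  for (let i = 0; i < lines.length; i += 4) {
    const dx = lines[i + 2] - lines[i];      // x2 - x1
    const dy = lines[i + 3] - lines[i + 1];  // y2 - y1
    const deg = (Math.atan2(dy, dx) * 180) / Math.PI;
    // Keep only near-horizontal segments — text-line skew lives in ±15°.
    if (Math.abs(deg) <= maxAbsDeg) angles.push(deg);
  }
  if (angles.length === 0) return 0; // safe failure mode: no rotation
  angles.sort((a, b) => a - b);
  const mid = Math.floor(angles.length / 2);
  return angles.length % 2
    ? angles[mid]
    : (angles[mid - 1] + angles[mid]) / 2;
}
```

Note the empty-case return of 0 — when no near-horizontal line survives the filter, the safest answer is "don't rotate," which matches the behavior described for mixed-content pages below.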
Why the output is raster, not vector
pdf-lib — the JavaScript library that lets us write PDFs from the browser — supports page rotation only at 90° increments via setRotation(). PDFs encode arbitrary-angle rotation as a transformation matrix on the page's content stream, and rewriting that matrix in pure JavaScript without breaking text positioning, font references, and annotation coordinates is non-trivial work that no in-browser library handles correctly today.
So the workable path is: render the page to a canvas, rotate the canvas, embed the rotated canvas as a JPEG in the output PDF. The cost is rasterization — vector text becomes pixels. For scanned PDFs (where the input is already raster) this is invisible. For born-digital vector PDFs that happen to be slightly skewed, the deskewed output loses selectable text and crisp vector graphics.
That's usually fine for the actual use case. Born-digital PDFs aren't skewed in the first place — they came out of Word, LaTeX, or a reporting tool with the text axis aligned to the page. The PDFs that need deskewing are scanned ones, where the source is already a raster image. If your input is a vector PDF and you genuinely need arbitrary-angle rotation while preserving vector content, the right tool is Ghostscript on the desktop with a custom PostScript snippet — we explain the exact command in the FAQ below.
We embed the rasterized pages at JPEG quality 0.92 and at 2× device-pixel-ratio render scale — high enough to keep text crisp at normal viewing zoom on a Retina display. The output is bigger than the input (a 5MB scanned PDF might come out at 8-12MB) because re-encoded JPEGs don't hit the same compression as the original scanner output. If size matters, run /compress on the deskewed file afterward.
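One practical detail of the canvas-rotation step: a rotated page no longer fits its original bounding box, so the destination canvas has to be sized up before the warp. A minimal sketch of that size calculation — an illustrative helper under our own naming, not the tool's code:

```javascript
// Bounding box of a w×h canvas after rotation by `deg` degrees.
// Illustrative helper; the actual warp happens in OpenCV, this only
// shows why the output canvas must grow slightly.
function rotatedBounds(w, h, deg) {
  const rad = (Math.abs(deg) * Math.PI) / 180;
  return {
    width: Math.round(w * Math.cos(rad) + h * Math.sin(rad)),
    height: Math.round(w * Math.sin(rad) + h * Math.cos(rad)),
  };
}
```

At the typical 1°-3° of ADF skew the growth is small — a 1000×500px canvas rotated 2° needs roughly 1017×535px — but skipping it clips the page corners.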
When to use this vs. just OCR-ing
If you only need to make a scanned PDF searchable, you can skip directly to /ocr-pdf — Tesseract.js handles modest skew (under 0.5°) without help, and the output PDF will be searchable even if the visual layout stays slightly tilted. Use deskew first when:
- The visual skew is annoying (you're going to look at the document, not just search it).
- You need the OCR output to be unusually accurate — extracting structured data, parsing tables, running a downstream RAG pipeline.
- The skew is over ~1° (where OCR accuracy starts measurably degrading).
- The document came from a smartphone camera capture, where skew is typically 3-10°.
- You're processing many documents in batch and want consistent OCR accuracy across the set.
If you don't have a scanned PDF yet and you're working from phone camera shots, start at /scan-to-pdf first — capture the document with edge detection, get a clean scan, then run deskew on top. And if your bank-statement PDF is password-protected (most Indian banks ship statements with a DDMM-PAN password), /unlock-pdf removes the password browser-locally first so deskew can read it.
Privacy callout
Three categories of network traffic happen on this page. We list them all so the privacy claim is verifiable, not just stated.
- The page itself loads from pdfmavericks.com. Standard CDN traffic — HTML, CSS, the React client bundle, and a small icon set.
- opencv.js loads from docs.opencv.org once per session. This is a public OpenCV CDN. The fetch happens only when you click Detect skew, not on page load. After the fetch your PDF is processed against this in-memory engine — the engine itself doesn't make outbound calls.
- pdf.js worker loads from Cloudflare's CDN. The PDF rendering library uses a CDN-hosted Web Worker for parsing. Same property — it loads code, doesn't upload your PDF.
Your PDF bytes never appear in an outgoing request. Open dev tools, switch to the Network tab, and you'll see the three resource fetches above plus zero PDF uploads. This is the only honest privacy posture for a tool that handles bank statements, sale deeds, court orders, and salary slips.
Frequently asked questions
Why does deskewing matter before OCR?
Tesseract and most OCR engines are trained on horizontally-aligned text. A 1° tilt is barely visible to a human reader but it cuts character recognition accuracy noticeably; by 3-5° the engine starts breaking words apart and missing punctuation entirely. Scanned bank statements, property documents, and court orders coming off automatic-document-feeder (ADF) scanners almost always sit at 0.5°-3° because the paper feeds slightly off the rollers. Deskewing first turns a noisy OCR pass into a clean one — usually the difference between a searchable PDF you can grep and one full of OCR garbage. The pre-OCR deskew step is standard in production document pipelines (ocrmypdf ships it as the --deskew flag).
What skew threshold should I use?
The default 0.5° works for most scanned documents. Real ADF skew typically shows up at 0.5°-3°. Anything below 0.5° is usually noise from the Hough transform — the algorithm finds spurious lines in halftone patterns, header logos, and signature areas. Setting the threshold below 0.5° tends to over-correct and rotate clean pages by accident. If your file came from a smartphone camera (not a flatbed scanner), bump the threshold to 1° because phone scans have more random skew variance. If your file came from a high-end production scanner with auto-deskew, leave the threshold at 0.5° and let most pages skip — you only want to fix the few that slipped through.
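The threshold is just a per-page gate on the detected angle. A hypothetical helper showing those semantics — the name and shape are ours, not the tool's API:

```javascript
// Returns the zero-based indices of pages whose detected skew meets the
// threshold — only these get rotated; the rest pass through untouched.
// Hypothetical helper illustrating threshold semantics, not the tool's API.
function pagesToDeskew(detectedAngles, thresholdDeg = 0.5) {
  return detectedAngles
    .map((deg, page) => ({ page, deg }))
    .filter(({ deg }) => Math.abs(deg) >= thresholdDeg)
    .map(({ page }) => page);
}
```

With the default 0.5°, detected angles of [0.1, 1.2, -0.7, 0.3] would rotate only the second and third pages; raising the threshold to 1° for a cleaner batch leaves only the second.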
Why do I lose vector quality after deskew?
pdf-lib (the in-browser library we use to write the output PDF) can only rotate by 90°, 180°, or 270° at the page level. Arbitrary-angle rotation has to happen on the rendered image, not the PDF page object. So the pipeline is: render the page to a canvas via pdf.js, rotate that canvas using OpenCV's affine warp, embed the rotated canvas back into the PDF as a JPEG. If your input was a vector PDF (text + vector graphics), the deskewed output is a raster JPEG of that page — selectable text and crisp vector lines become pixels. For scanned PDFs this is invisible because the input is already raster. For vector PDFs that are barely skewed, consider whether you actually need deskew — most vector PDFs aren't skewed in the first place. If you must preserve vectors with arbitrary-angle rotation, use Ghostscript on the desktop: gs -o out.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -c "<</Rotate 90>> setpagedevice" -f in.pdf, with a custom PostScript snippet for non-90-degree angles.
Are my files uploaded anywhere?
No. The PDF is read into your browser's memory, processed by JavaScript and WebAssembly running on this page, and offered as a download. It never crosses a network. You can verify this — open browser dev tools, switch to the Network tab, run a deskew, and confirm there are zero PDF requests. The only network call is a one-time fetch of opencv.js (~8MB) from docs.opencv.org, which loads the edge-detection engine. After that one fetch, your PDF stays on your machine. Bank statements, salary slips, Aadhaar copies, court orders, and property documents — none of it touches our servers.
How accurate is the skew detection?
On documents with horizontal text lines (printed letters, bank statements, contracts), the median Hough-line angle is accurate to within ~0.1° in our testing. Accuracy degrades on pages with lots of geometric content (tables, charts, signatures, stamps) because the transform finds non-text horizontal lines too. The median filter helps but isn't perfect — if you spot a page with an obviously wrong angle in the per-page list, uncheck it before clicking Apply. For mixed-content pages the tool errs toward zero (no rotation) when no clear text-line direction emerges, which is the safe failure mode.
Why does OpenCV.js take so long to load?
It's an ~8MB WebAssembly bundle. We load it from docs.opencv.org's CDN to avoid bloating our static site. The first time you click Detect skew in a session, the browser fetches and compiles it — usually 3-8 seconds on a decent connection. Subsequent runs in the same tab are instant because the WASM module stays in memory. We deliberately do NOT bundle opencv.js into the page itself or load it on the homepage — that would slow down every visitor on the site, even those who never use deskew. The trade-off: a one-time 8MB fetch when you actually use the tool, instead of an 8MB fetch every time anyone loads pdfmavericks.com.
Will small angle detections (0.05°, 0.1°) actually improve OCR?
No. OCR engines tolerate up to ~0.5° of skew without measurable accuracy loss. Below that, the Hough transform itself has too much measurement noise to be reliable — what looks like 0.1° detected skew is often an artifact of which edge pixels happened to round which way. The 0.5° threshold default exists for this reason. If you're chasing perfection on a clean scan, you're optimizing past where it matters; spend the effort on the actual OCR step (using /ocr-pdf with both English and Hindi traineddata for India docs) instead.