How accurately does PDF to Markdown render equations from an arXiv paper?

Equation accuracy depends on how the paper was compiled. arXiv submissions are almost always compiled from LaTeX with embedded font metadata that PDF parsers can read, so display equations come out as clean LaTeX inside $$...$$ fences, ready for MathJax or KaTeX rendering. Inline math, such as variable names with subscripts, is preserved as $...$ fragments. Accuracy is highest on display equations built from standard commands and lowest on equations that use custom commands defined by the paper's authors, since those commands have no standalone definition in the extracted markdown.

Why not just run OCR on the PDF for the math?

OCR is the right tool only when the equations exist as rasterized images — for example, in a scanned paper from the 1990s or a screenshot embedded into a slide deck. Modern arXiv PDFs render equations as vector text using LaTeX math fonts, which means the equation's underlying tokens are already in the PDF as machine-readable text. Text extraction reads those tokens directly. OCR would re-rasterize and re-recognize them, introducing transcription errors that don't exist in the original PDF.

Does the extracted markdown preserve raw LaTeX, or does it convert to Unicode math symbols?

Our /pdf-to-markdown tool preserves equations as raw LaTeX inside math fences ($...$ for inline, $$...$$ for display). That's the format Obsidian, Notion, and most static-site generators expect for math rendering via MathJax or KaTeX. Converting to Unicode symbols loses information — \frac becomes a slash, \sum becomes Σ but loses its bounds, integrals lose their limits. LaTeX preservation keeps the equation re-typesettable.

How are code blocks and pseudocode detected in the extracted markdown?

Code blocks are detected through a combination of font-family heuristics (monospaced text typically indicates code) and indentation patterns. A paragraph of monospaced text with consistent left indentation gets fenced as a code block in the markdown output. Pseudocode formatted with LaTeX's algorithm or algorithmic packages is more variable — sometimes it extracts as a code block, sometimes as a numbered list, depending on how the paper's authors formatted it. A quick visual review after extraction usually catches any miscategorized blocks.

What's the best workflow for moving extracted markdown into Obsidian or Notion?

For Obsidian: save the extracted markdown as a .md file in your vault under a Papers folder. Obsidian renders math with its built-in MathJax support, so equations display without extra setup. Obsidian's native search will index the equation text along with everything else, so you can find a specific equation across hundreds of papers later. For Notion: import the markdown into a new page, then convert each equation to a Notion equation block by hand, because Notion uses its own inline-equation and block-equation primitives rather than rendering $$...$$ fences directly.

Can I extract equations from a paper that uses non-standard LaTeX packages?

Yes, with caveats. Standard math packages (amsmath, amssymb, mathtools) extract cleanly because their commands map to widely-supported LaTeX. Custom commands defined in the paper's preamble — like \norm or \indicator — will extract as their literal command name, and you'll need to define them in your downstream renderer or replace them with the underlying expression. For most arXiv papers in ML, NLP, and theorem-proving, the standard packages cover 95%+ of the math.

Does any data from my PDF leave my browser when I use the converter?

No. The /pdf-to-markdown tool runs entirely as WebAssembly in your browser tab. The PDF is read into memory, parsed, converted to markdown, and the output is rendered back into your tab — no network round trips, no upload, no server-side copy. Close the tab and there's no residue. The first time you open the page the WebAssembly module downloads from our CDN; after that, the extraction itself is local and works offline.

Extract Equations From the DeepMind Proof-Search Paper PDF

DeepMind released a formal proof-search paper on arXiv. It's roughly 30 pages of LaTeX-rendered equations, algorithm blocks, and benchmark tables. Researchers and ML engineers reading it on HN and Twitter want to copy specific equations or code samples into their own notes. Naive copy-paste from a PDF mangles equations and indentation. Here's the fix: extract equations from any arXiv PDF to clean markdown with LaTeX preserved, in your browser, in under a minute.

May 24, 2026

PDF Mavericks Team

Quick path: Download the PDF from arxiv.org, drop it into our PDF to Markdown converter, and paste the result into Obsidian, Notion, or any markdown editor. Equations come out as LaTeX inside math fences, ready for MathJax.

What this guide covers

Why this paper, why now
Step 1: Get the arXiv PDF
Step 2: Why naive copy-paste from a PDF breaks math
Step 3: Extract clean markdown in your browser
Step 4: Use the markdown in Obsidian, Notion, or research notes
Known limits and edge cases

Why this paper, why now

DeepMind's recent formal proof-search paper is the kind of release that lights up Hacker News and ML Twitter for 72 hours, then becomes a reference document people cite for months. The paper describes the approach, the architecture, the benchmark numbers — read the paper itself on arXiv for the substance. This post is not a paper summary. It's a workflow for getting the equations and pseudocode out of the PDF and into your own notes without losing math fidelity.

The same workflow applies to any arXiv PDF: the AlphaFold updates, the Llama technical reports, every new RLHF and scaling-laws paper that drops. arXiv PDFs share a predictable structure because nearly all of them are compiled from LaTeX. Once you have a clean extraction pipeline, every paper that lands becomes a 60-second job to file into your notes.

Step 1: Get the arXiv PDF

arXiv URLs follow a fixed pattern. For a paper with the abstract page at:

https://arxiv.org/abs/XXXX.XXXXX

The PDF lives at:

https://arxiv.org/pdf/XXXX.XXXXX

Same identifier, swap abs for pdf. The PDF is the version arXiv builds from the authors' submission, and it's the version everyone else cites. Optionally, you can grab the LaTeX source by switching pdf to e-print, but for most extraction work the compiled PDF is what you want, because that's what the equation numbering and figure references align to.

Version matters: arXiv papers get revised. The first version, v1, may differ from the camera-ready or current version. URLs with v2, v3 appended pin you to a specific revision. If you're citing an equation by number, pin the version — the numbering can shift between revisions.
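The URL pattern is simple enough to script. A minimal sketch, assuming the new-style identifier format shown above; the function name arxiv_urls and the dictionary keys are made up for illustration:

```python
def arxiv_urls(arxiv_id, version=None):
    """Build abstract, PDF, and source URLs for an arXiv identifier."""
    # Appending vN pins a specific revision; without it, arXiv serves the latest one.
    pinned = f"{arxiv_id}v{version}" if version is not None else arxiv_id
    base = "https://arxiv.org"
    return {
        "abs": f"{base}/abs/{pinned}",
        "pdf": f"{base}/pdf/{pinned}",
        "source": f"{base}/e-print/{pinned}",
    }


# The placeholder identifier from above, pinned to its second revision.
urls = arxiv_urls("XXXX.XXXXX", version=2)
print(urls["pdf"])  # https://arxiv.org/pdf/XXXX.XXXXXv2
```

Storing the pinned URL next to your notes means an equation number you cite today still points at the same revision later.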

Step 2: Why naive copy-paste from a PDF breaks math

Try this with any arXiv PDF: open it in your browser's built-in PDF viewer, select an equation, copy, paste into a text editor. What lands in your clipboard is usually one of these three failure modes:

A jumbled string of Greek letters and operators with no spacing — "Lθ=Es,a∼πθrs,a−βDKLπθ∥πref" instead of a parseable equation.
The visible Unicode characters with subscripts and superscripts flattened onto the baseline, losing the structural relationships entirely.
The right characters in the right order, but with PDF's internal ligature codes mixed in — invisible to humans, fatal to any downstream renderer.

The root cause: PDFs encode equations as glyph sequences positioned in 2D, not as LaTeX source. The PDF viewer reconstructs the visual layout from those positions; the clipboard copy gets only the glyph sequence, with no semantic structure.

A proper markdown extraction pipeline does the reverse — it reads the glyph sequence, recognizes math regions by font and positioning, and reconstructs LaTeX. The output looks like this:

$$
\mathcal{L}(\theta) = \mathbb{E}_{(s,a)\sim\pi_\theta}\big[r(s,a) - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}})\big]
$$

That's a math fence ready to drop into a MathJax-enabled markdown viewer. Same equation, but now it's re-typesettable, searchable as LaTeX tokens, and roundtrips back to a rendered display without loss.
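To make the reconstruction idea concrete, here is a toy sketch in Python. It is not the converter's actual algorithm: the Glyph fields, the tolerance value, and the rule "a smaller glyph below the main baseline is a subscript, above it a superscript" are simplifying assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float         # horizontal position on the page
    baseline: float  # vertical position of the baseline (larger = higher)
    size: float      # font size


def glyphs_to_latex(glyphs, tol=0.5):
    """Rebuild sub/superscript structure from positioned glyphs (toy heuristic)."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    # Assumption: the largest glyph sits on the main baseline.
    main = max(ordered, key=lambda g: g.size)
    out = []
    mode = None  # None = main line, "_" = subscript, "^" = superscript
    for g in ordered:
        offset = g.baseline - main.baseline
        if g.size < main.size and offset < -tol:
            new_mode = "_"
        elif g.size < main.size and offset > tol:
            new_mode = "^"
        else:
            new_mode = None
        if new_mode != mode:
            if mode is not None:
                out.append("}")             # close the previous script group
            if new_mode is not None:
                out.append(new_mode + "{")  # open a new script group
            mode = new_mode
        out.append(g.char)
    if mode is not None:
        out.append("}")
    return "".join(out)


glyphs = [Glyph("x", 0, 100, 10), Glyph("i", 5, 98, 7), Glyph("2", 8, 104, 7)]
print("".join(g.char for g in glyphs))  # xi2        (what the clipboard sees)
print(glyphs_to_latex(glyphs))          # x_{i}^{2}  (structure recovered)
```

The first print is the flattened failure mode from the list above; the second uses the same glyphs plus their positions, which is the information a plain copy throws away.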

Step 3: Extract clean markdown in your browser

Open our PDF to Markdown converter. The page loads a WebAssembly build of a PDF parser plus a markdown serializer. Drop the arXiv PDF onto the upload zone. Three things happen inside the browser tab:

The PDF is parsed page by page. Text, font metadata, and positional information come out together.
Math regions are identified by font family (papers compiled from LaTeX use Computer Modern, Latin Modern, or stix-math fonts) and converted back to LaTeX tokens inside $...$ for inline and $$...$$ for display.
Code blocks are identified by font family (typically Computer Modern Typewriter or stix-mono) and fenced with triple backticks.
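The font-family step above can be sketched roughly as follows. This is an illustration, not the converter's implementation: the prefix lists are assumptions based on common embedded font names (CMMI, CMSY, and CMEX are Computer Modern math fonts, CMTT is Computer Modern Typewriter), and the Latin Modern and STIX entries are guesses at typical names.

```python
# Assumed prefixes for illustration; real PDFs may embed other names.
MATH_PREFIXES = ("CMMI", "CMSY", "CMEX", "LMMath", "STIXMath", "STIXTwoMath")
MONO_PREFIXES = ("CMTT", "LMMono")


def classify_font(font_name):
    """Guess whether a text run is math, code, or prose from its font name."""
    # Embedded subset fonts carry a random tag such as "ABCDEF+", so strip it.
    base = font_name.split("+", 1)[-1]
    if base.startswith(MATH_PREFIXES):
        return "math"
    if base.startswith(MONO_PREFIXES):
        return "code"
    return "prose"


for name in ("ABCDEF+CMMI10", "GHIJKL+CMTT9", "MNOPQR+CMR10"):
    print(name, "->", classify_font(name))
```

A font-only rule is not enough by itself, since digits and parentheses inside equations are often set in the ordinary roman font. That is why positioning is used alongside font family.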

A typical 30-page paper finishes in 3–5 seconds on a modern laptop. The output is a single markdown document with sections, equations, algorithm blocks, and tables preserved. Code blocks come out like this:

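def search(state, depth):
    # Negamax search: score a state from the point of view of the player to move.
    if depth == 0 or is_terminal(state):
        return value(state)
    best = float("-inf")
    for action in legal_actions(state):
        # The opponent's best outcome is our worst, hence the sign flip.
        v = -search(apply(state, action), depth - 1)
        best = max(best, v)
    return best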

Nothing about the file leaves your browser. The PDF is read into memory, processed locally, and the markdown is rendered into the same tab. No upload, no server-side copy, no logged request body. Useful when you're reading a preprint that's still under embargo, or when you just don't want a third party tracking which papers you read.

Step 4: Use the markdown in Obsidian, Notion, or research notes

Obsidian

Save the extracted markdown as a .md file inside your vault, ideally under a Papers/ folder organized by year or by topic. Obsidian renders math with MathJax out of the box, so equations display automatically. Obsidian's full-text search indexes the raw LaTeX, so searching for D_{KL} surfaces every filed paper that uses KL divergence.
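Because the notes are plain .md files, the same token search also works outside Obsidian. A minimal sketch, assuming the papers live under one folder; the function name papers_mentioning and the example vault path are hypothetical:

```python
from pathlib import Path


def papers_mentioning(vault_dir, token):
    """Return the names of .md files under vault_dir whose text contains token."""
    hits = []
    for path in sorted(Path(vault_dir).rglob("*.md")):
        if token in path.read_text(encoding="utf-8"):
            hits.append(path.name)
    return hits


# Example call (hypothetical vault location):
# papers_mentioning("MyVault/Papers", "D_{KL}")
```

This is a literal substring match, so it finds D_{KL} but not a differently typed variant such as D_{\mathrm{KL}}.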

Tag the file with the arXiv ID in YAML frontmatter so backlinks stay stable when you rename. A minimal frontmatter block:

---
arxiv: 2401.XXXXX
authors: [DeepMind]
topic: proof-search
read: 2026-05-24
---
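If you file papers regularly, the frontmatter can be prepended by a short script. A minimal sketch that reproduces the block above; the function name with_frontmatter and its parameters are illustrative, not part of any tool:

```python
def with_frontmatter(markdown, arxiv_id, authors, topic, read_date):
    """Prepend a YAML frontmatter block to extracted markdown."""
    # Assumes plain values; names containing commas or colons would need quoting.
    author_list = ", ".join(authors)
    header = (
        "---\n"
        f"arxiv: {arxiv_id}\n"
        f"authors: [{author_list}]\n"
        f"topic: {topic}\n"
        f"read: {read_date}\n"
        "---\n\n"
    )
    return header + markdown


note = with_frontmatter("# Paper title\n", "2401.XXXXX", ["DeepMind"],
                        "proof-search", "2026-05-24")
print(note)
```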

Notion

Notion's markdown importer accepts most fenced blocks cleanly — code blocks, headings, lists, and tables transfer without surgery. Math is the rough edge: Notion uses its own inline-equation and block-equation primitives rather than rendering $$...$$ directly. The cleanest workflow is to import the markdown for the prose and structure, then convert each math block to a Notion equation block manually. Slow on a 30-page paper, but only needs to be done once per file.
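The manual step goes faster with a list of every display equation to work through. A small sketch that pulls the $$...$$ blocks out of the extracted markdown; it assumes the fences are balanced and is not tied to any Notion API:

```python
import re

# Lazily match everything between a pair of $$ fences, across line breaks.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def display_equations(markdown):
    """Return the LaTeX body of each $$...$$ block, in document order."""
    return [body.strip() for body in DISPLAY_MATH.findall(markdown)]


sample = "Intro\n\n$$\nx^2\n$$\n\nInline $a$ stays put.\n\n$$ y = 1 $$\n"
for number, latex in enumerate(display_equations(sample), start=1):
    print(number, latex)
```

Each printed body can be pasted into a Notion equation block as-is, since the fences are already stripped.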

Plain markdown + static site

For a personal research blog or a Hugo/Jekyll/Astro site, the extracted markdown drops in as-is. Enable MathJax or KaTeX in the site template, and the equations render in the browser at view time. The same .md file becomes the source of truth for your local notes and the published version on your site — no separate copies, no divergence.

If text-only is enough

When you don't need equations or code-block structure preserved — for example, when you're feeding the paper into a summarization workflow that only cares about prose — our PDF to Text converter is faster and produces a smaller output. Use markdown extraction when structure matters, plain text when it doesn't.

Known limits and edge cases

Custom LaTeX macros from the paper's preamble

Author-defined commands like \norm{x} or \indicator[A] extract as their literal command name. Your downstream renderer will need the same macro definitions, or you'll need to substitute the underlying expression. Most arXiv ML papers stick to amsmath, amssymb, and mathtools — those are universally supported.
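Substituting the underlying expression can be scripted for simple cases. The sketch below handles only one-argument macros whose argument contains no nested braces; the replacement template for \norm is an assumption, so check the paper's preamble for the real definition.

```python
import re


def expand_macro(text, name, template):
    """Replace \\name{arg} with template, where #1 stands for the argument."""
    # [^{}]* deliberately refuses nested braces; those need a real parser.
    pattern = re.compile(r"\\" + re.escape(name) + r"\{([^{}]*)\}")
    return pattern.sub(lambda m: template.replace("#1", m.group(1)), text)


line = r"$\norm{x} \le 1$"
print(expand_macro(line, "norm", r"\lVert #1 \rVert"))
# $\lVert x \rVert \le 1$
```

The alternative is to leave the markdown untouched and register the macro in the renderer instead, which keeps the extracted text closer to the paper's source.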

Multi-line equation alignments

align and align* environments come out as a single $$...$$ block with internal \\ line breaks. MathJax and KaTeX render this correctly, but some lightweight markdown renderers ignore the line breaks. If you see equations collapsing onto one line, your renderer is the issue, not the extraction.
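If switching renderers isn't an option, one possible workaround is to wrap multi-line blocks in an aligned environment, which both MathJax and KaTeX support. This is a sketch under the assumption that the affected blocks contain \\ row breaks and no environment of their own; whether it helps depends on your renderer.

```python
import re

DISPLAY_BLOCK = re.compile(r"\$\$(.*?)\$\$", re.DOTALL)


def wrap_multiline_blocks(markdown):
    """Wrap display blocks that contain row breaks in an aligned environment."""
    def fix(match):
        body = match.group(1).strip()
        has_row_break = r"\\" in body        # a literal double backslash
        already_wrapped = r"\begin{" in body
        if has_row_break and not already_wrapped:
            return "$$\n\\begin{aligned}\n" + body + "\n\\end{aligned}\n$$"
        return match.group(0)                # leave other blocks untouched
    return DISPLAY_BLOCK.sub(fix, markdown)


source = "$$\na &= b \\\\\nc &= d\n$$"
print(wrap_multiline_blocks(source))
```

Single-line blocks and blocks that already contain an environment pass through unchanged, so running the function twice is harmless.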

Figures and figure captions

Figures don't extract as images — PDFs encode them as vector graphics or rasterized blobs that aren't usable as standalone files. Captions are extracted as italic text below a [Figure N: omitted] placeholder. To get the figures themselves, snapshot them from the PDF viewer separately.

Tables with multi-row cells or complex spans

Simple tables transfer cleanly to markdown table syntax. Tables with rowspans, multi-line cells, or footnotes inside cells degrade — markdown's table syntax can't represent those structures. For benchmark tables in ML papers (typically simple rectangular grids), extraction is usually clean.

Algorithm blocks

Papers that use the algorithm or algorithmic LaTeX packages render the pseudocode in a way that's structurally close to code but not identical. Extraction usually returns either a fenced code block or a numbered list. Both are usable; pick the one that reads better in your notes.

Bibliography

References come out as a bulleted or numbered list at the end of the markdown. They're not linked back to in-text citations — that's a structural limitation of converting from rendered PDF rather than from the source .bib file.

Extract Any arXiv PDF in Your Browser

Convert any arXiv paper to clean markdown with LaTeX equations preserved. No upload, no account, no retention — your PDF stays on your device.
