
Extract Equations From the DeepMind Proof-Search Paper PDF

DeepMind released a formal proof-search paper on arXiv. It's roughly 30 pages of LaTeX-rendered equations, algorithm blocks, and benchmark tables. Researchers and ML engineers reading it on HN and Twitter want to copy specific equations or code samples into their own notes. Naive copy-paste from a PDF mangles equations and indentation. Here's the fix: extract equations from any arXiv PDF to clean markdown with LaTeX preserved, in your browser, in under a minute.

May 24, 2026
8 min read
PDF Mavericks Team

Why this paper, why now

DeepMind's recent formal proof-search paper is the kind of release that lights up Hacker News and ML Twitter for 72 hours, then becomes a reference document people cite for months. The paper describes the approach, the architecture, the benchmark numbers — read the paper itself on arXiv for the substance. This post is not a paper summary. It's a workflow for getting the equations and pseudocode out of the PDF and into your own notes without losing math fidelity.

The same workflow applies to any arXiv PDF: the AlphaFold updates, the Llama technical reports, every new RLHF and scaling-laws paper that drops. Most arXiv PDFs are compiled from LaTeX, so they share predictable fonts and layout. Once you have a clean extraction pipeline, every paper that lands becomes a 60-second job to file into your notes.

Step 1: Get the arXiv PDF

arXiv URLs follow a fixed pattern. For a paper with the abstract page at:

https://arxiv.org/abs/XXXX.XXXXX

The PDF lives at:

https://arxiv.org/pdf/XXXX.XXXXX

Same numeric identifier, with abs swapped for pdf. Without a version suffix the link serves the latest version of the paper; append v1, v2, and so on to pin a specific one. The PDF is the rendering everyone else reads and cites. Optionally, you can grab the LaTeX source by switching pdf to e-print, but for most extraction work the compiled PDF is what you want, because that's what the equation numbering and figure references align to.
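The URL swap above is easy to script. Here is a minimal sketch in Python; the function name arxiv_urls is our own, and the pattern assumes the modern numeric identifier format (four digits, a dot, four or five digits, optional version suffix), so older category-style identifiers are not handled.

```python
import re

# Modern arXiv identifiers: YYMM.NNNNN (4 or 5 digits after the dot),
# optionally followed by a version suffix such as v2.
_ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_urls(abs_url: str) -> dict[str, str]:
    """Derive the abstract, PDF, and source URLs from any arXiv link."""
    match = _ARXIV_ID.search(abs_url)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {abs_url!r}")
    paper_id = match.group(0)  # keeps the version suffix if one was given
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
        "source": f"https://arxiv.org/e-print/{paper_id}",
    }
```

Calling arxiv_urls on an abstract link returns all three variants, so a download script only ever needs the one URL you copied from the browser.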

Step 2: Why naive copy-paste from a PDF breaks math

Try this with any arXiv PDF: open it in your browser's built-in PDF viewer, select an equation, copy, paste into a text editor. What lands in your clipboard is usually one of these three failure modes:

  • A jumbled string of Greek letters and operators with no spacing — "Lθ=Es,a∼πθrs,a−βDKLπθ∥πref" instead of a parseable equation.
  • The visible Unicode characters with subscripts and superscripts flattened onto the baseline, losing the structural relationships entirely.
  • The right characters in the right order, but with PDF's internal ligature codes mixed in — invisible to humans, fatal to any downstream renderer.

The root cause: PDFs encode equations as glyph sequences positioned in 2D, not as LaTeX source. The PDF viewer reconstructs the visual layout from those positions; the clipboard copy gets the glyph sequence with no semantic structure.

A proper markdown extraction pipeline does the reverse — it reads the glyph sequence, recognizes math regions by font and positioning, and reconstructs LaTeX. The output looks like this:

$$
\mathcal{L}(\theta) = \mathbb{E}_{(s,a)\sim\pi_\theta}\big[r(s,a) - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}})\big]
$$

That's a math fence ready to drop into a MathJax-enabled markdown viewer. Same equation, but now it's re-typesettable, searchable as LaTeX tokens, and roundtrips back to a rendered display without loss.
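To make the difference concrete, here is a toy sketch in Python of both behaviours. It is not the converter's actual algorithm: the Glyph class, the dy baseline offset, and the tolerance are all assumptions chosen for illustration, and real extractors also have to handle fractions, nested scripts, and font changes.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float   # horizontal position on the line
    dy: float  # offset from the baseline (positive = raised, negative = lowered)


def clipboard_text(glyphs: list[Glyph]) -> str:
    """What copy-paste sees: characters left to right, structure discarded."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))


def to_latex(glyphs: list[Glyph], tol: float = 1.0) -> str:
    """Rebuild sub/superscripts from each glyph's vertical offset."""
    parts = []
    for g in sorted(glyphs, key=lambda g: g.x):
        if g.dy > tol:
            parts.append(f"^{{{g.char}}}")   # raised glyph becomes a superscript
        elif g.dy < -tol:
            parts.append(f"_{{{g.char}}}")   # lowered glyph becomes a subscript
        else:
            parts.append(g.char)
    return "".join(parts)


# Hypothetical glyph positions for "x squared plus y sub i".
glyphs = [
    Glyph("x", 0.0, 0.0), Glyph("2", 5.0, 3.0), Glyph("+", 10.0, 0.0),
    Glyph("y", 16.0, 0.0), Glyph("i", 21.0, -2.0),
]
print(clipboard_text(glyphs))  # x2+yi
print(to_latex(glyphs))        # x^{2}+y_{i}
```

The same five glyphs produce both strings; the only difference is whether the vertical offset is used or thrown away.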

Step 3: Extract clean markdown in your browser

Open our PDF to Markdown converter. The page loads a WebAssembly build of a PDF parser plus a markdown serializer. Drop the arXiv PDF onto the upload zone. Three things happen inside the browser tab:

  1. The PDF is parsed page by page. Text, font metadata, and positional information come out together.
  2. Math regions are identified by font family (papers compiled from LaTeX use Computer Modern, Latin Modern, or stix-math fonts) and converted back to LaTeX tokens inside $...$ for inline and $$...$$ for display.
  3. Code blocks are identified by font family (typically Computer Modern Typewriter or stix-mono) and fenced with triple backticks.
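The font-based classification in steps 2 and 3 can be sketched in a few lines of Python. The hint substrings below are assumptions based on common font names in LaTeX-compiled PDFs (CMMI, CMSY, and CMEX for Computer Modern math, CMTT for Computer Modern Typewriter, LMMono for Latin Modern Mono); the converter's real rules are likely more involved.

```python
# Assumed substrings of PDF font names; not an exhaustive list.
MATH_FONT_HINTS = ("cmmi", "cmsy", "cmex", "math")
CODE_FONT_HINTS = ("cmtt", "lmmono")


def classify_span(font_name: str) -> str:
    """Label a text span as 'math', 'code', or 'text' from its font name."""
    # Embedded fonts often carry a subset prefix such as "ABCDEF+".
    name = font_name.split("+", 1)[-1].lower()
    if any(hint in name for hint in CODE_FONT_HINTS):
        return "code"
    if any(hint in name for hint in MATH_FONT_HINTS):
        return "math"
    return "text"
```

One limitation of a pure font-name check: digits and function names inside equations are often set in the roman text font, so a practical pipeline would combine this with the positional information from step 1.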

A typical 30-page paper finishes in 3–5 seconds on a modern laptop. The output is a single markdown document with sections, equations, algorithm blocks, and tables preserved. Code blocks come out like this:

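```python
from math import inf

def search(state, depth):
    # Negamax: a position's value is the negation of the opponent's best reply.
    # is_terminal, value, legal_actions, and apply are defined elsewhere.
    if depth == 0 or is_terminal(state):
        return value(state)
    best = -inf
    for action in legal_actions(state):
        v = -search(apply(state, action), depth - 1)
        best = max(best, v)
    return best
```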

Nothing about the file leaves your browser. The PDF is read into memory, processed locally, and the markdown is rendered into the same tab. No upload, no server-side copy, no logged request body. Useful when you're reading a preprint that's still under embargo, or when you just don't want a third party tracking which papers you read.

Step 4: Use the markdown in Obsidian, Notion, or research notes

Obsidian

Save the extracted markdown as a .md file inside your vault, ideally under a Papers/ folder organized by year or by topic. Obsidian renders LaTeX math with MathJax out of the box, so equations display without extra setup. Obsidian's full-text search indexes the raw LaTeX, so searching for D_{KL} across your vault surfaces every filed paper that writes KL divergence in that notation.

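Record the arXiv ID in YAML frontmatter so the note stays identifiable when you rename the file. Quote the ID: an unquoted value shaped like 2401.10000 is valid YAML for a floating-point number, and a parser may drop the trailing zeros. A minimal frontmatter block:

---
arxiv: "2401.XXXXX"
authors: [DeepMind]
topic: proof-search
read: 2026-05-24
---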
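If you file papers often, prepending the frontmatter can be scripted. A minimal sketch in Python follows; the function name with_frontmatter and its parameters are our own, and the field names simply mirror the block above. The arXiv ID is written in quotes so that YAML parsers read it as a string rather than a number.

```python
from datetime import date


def with_frontmatter(markdown: str, arxiv_id: str, authors: list[str],
                     topic: str, read_on: date) -> str:
    """Prepend a YAML frontmatter block to extracted markdown."""
    header = "\n".join([
        "---",
        f'arxiv: "{arxiv_id}"',               # quoted: keep the ID a string
        f"authors: [{', '.join(authors)}]",
        f"topic: {topic}",
        f"read: {read_on.isoformat()}",
        "---",
    ])
    return header + "\n\n" + markdown
```

Write the returned string to a .md file in your vault; the rest of the extracted document is left untouched.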

Notion

Notion's markdown importer accepts most fenced blocks cleanly — code blocks, headings, lists, and tables transfer without surgery. Math is the rough edge: Notion uses its own inline-equation and block-equation primitives rather than rendering $$...$$ directly. The cleanest workflow is to import the markdown for the prose and structure, then convert each math block to a Notion equation block manually. Slow on a 30-page paper, but only needs to be done once per file.
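To make the manual pass less tedious, you can list every display-math block up front and paste each one into a Notion equation block in order. A minimal sketch in Python, assuming the extracted markdown puts each $$ delimiter on its own line as in the example earlier in this post:

```python
import re

# Matches a display-math block whose $$ delimiters sit on their own lines.
DISPLAY_MATH = re.compile(
    r"^\$\$[ \t]*\n(.*?)\n\$\$[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def display_math_blocks(markdown: str) -> list[str]:
    """Return the LaTeX body of every $$...$$ block, in document order."""
    return [m.group(1).strip() for m in DISPLAY_MATH.finditer(markdown)]
```

Inline $...$ math is deliberately ignored here; it is usually short enough to retype as a Notion inline equation while proofreading.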

Plain markdown + static site

For a personal research blog or a Hugo/Jekyll/Astro site, the extracted markdown drops in as-is. Enable MathJax or KaTeX in the site template, and the equations render in the browser at view time. The same .md file becomes the source of truth for your local notes and the published version on your site — no separate copies, no divergence.

If text-only is enough

When you don't need equations or code-block structure preserved — for example, when you're feeding the paper into a summarization workflow that only cares about prose — our PDF to Text converter is faster and produces a smaller output. Use markdown extraction when structure matters, plain text when it doesn't.

Known limits and edge cases

Custom LaTeX macros from the paper's preamble

Author-defined commands like \norm{x} or \indicator[A] extract as their literal command name. Your downstream renderer will need the same macro definitions, or you'll need to substitute the underlying expression. Most arXiv ML papers stick to commands from amsmath, amssymb, and mathtools, which MathJax and KaTeX largely support without extra configuration.
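For the two example commands, the missing definitions might look like the following. These are hypothetical: the paper's own preamble (available in the e-print source) is the authority on what each macro expands to, and support for optional arguments in \newcommand varies between renderers.

```latex
% Hypothetical definitions; copy the real ones from the paper's preamble.
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\indicator}[1][]{\mathbf{1}_{#1}}
```

With these in place, \norm{x} renders as a double-bar norm and \indicator[A] as a bold 1 with subscript A.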

Multi-line equation alignments

align and align* environments come out as a single $$...$$ block with internal \\ line breaks. MathJax and KaTeX render this correctly, but some lightweight markdown renderers ignore the line breaks. If you see equations collapsing onto one line, your renderer is the issue, not the extraction.
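A quick way to check your renderer is to paste in a small two-line block and see whether the line break survives. The snippet below is a test case of our own, not output from the converter: it reuses the objective shown earlier, splits it by linearity of expectation (valid because the KL term as written does not depend on (s,a)), and wraps it in aligned, which is an assumption about what your renderer accepts.

```latex
$$
\begin{aligned}
\mathcal{L}(\theta) &= \mathbb{E}_{(s,a)\sim\pi_\theta}\big[r(s,a) - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}})\big] \\
&= \mathbb{E}_{(s,a)\sim\pi_\theta}\big[r(s,a)\big] - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}})
\end{aligned}
$$
```

If both lines render stacked with the equals signs aligned, multi-line equations from extracted papers should display correctly too.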

Figures and figure captions

Figures are not exported as images. Inside the PDF a figure is either a set of vector drawing commands or an embedded raster, and this workflow does not pull either out as a standalone file. Captions are extracted as italic text below a [Figure N: omitted] placeholder. To get the figures themselves, snapshot them from the PDF viewer separately.
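Since the placeholders follow a fixed pattern, a short script can tell you which figures still need a manual snapshot. A minimal sketch in Python, assuming the placeholder text appears exactly as [Figure N: omitted] with N a plain integer:

```python
import re

FIGURE_PLACEHOLDER = re.compile(r"\[Figure (\d+): omitted\]")


def omitted_figures(markdown: str) -> list[int]:
    """Return the sorted, de-duplicated figure numbers left as placeholders."""
    return sorted({int(n) for n in FIGURE_PLACEHOLDER.findall(markdown)})
```

Run it over the extracted markdown and work through the returned numbers in the PDF viewer.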

Tables with multi-row cells or complex spans

Simple tables transfer cleanly to markdown table syntax. Tables with rowspans, multi-line cells, or footnotes inside cells degrade — markdown's table syntax can't represent those structures. For benchmark tables in ML papers (typically simple rectangular grids), extraction is usually clean.

Algorithm blocks

Papers that use the algorithm or algorithmic LaTeX packages render the pseudocode in a way that's structurally close to code but not identical. Extraction usually returns either a fenced code block or a numbered list. Both are usable; pick the one that reads better in your notes.

Bibliography

References come out as a bulleted or numbered list at the end of the markdown. They're not linked back to in-text citations — that's a structural limitation of converting from rendered PDF rather than from the source .bib file.

Extract Any arXiv PDF in Your Browser

Convert any arXiv paper to clean markdown with LaTeX equations preserved. No upload, no account, no retention — your PDF stays on your device.