Extract Equations From the DeepMind Proof-Search Paper PDF
DeepMind released a formal proof-search paper on arXiv. It's roughly 30 pages of LaTeX-rendered equations, algorithm blocks, and benchmark tables. Researchers and ML engineers reading it on HN and Twitter want to copy specific equations or code samples into their own notes. Naive copy-paste from a PDF mangles equations and indentation. Here's the fix: extract equations from any arXiv PDF to clean markdown with LaTeX preserved, in your browser, in under a minute.
Why this paper, why now
DeepMind's recent formal proof-search paper is the kind of release that lights up Hacker News and ML Twitter for 72 hours, then becomes a reference document people cite for months. The paper describes the approach, the architecture, the benchmark numbers — read the paper itself on arXiv for the substance. This post is not a paper summary. It's a workflow for getting the equations and pseudocode out of the PDF and into your own notes without losing math fidelity.
The same workflow applies to any arXiv PDF: the AlphaFold updates, the Llama technical reports, every new RLHF and scaling-laws paper that drops. Most arXiv PDFs are compiled from LaTeX, so their fonts and layout are predictable. Once you have a clean extraction pipeline, every paper that lands becomes a 60-second job to file into your notes.
Step 1: Get the arXiv PDF
arXiv URLs follow a fixed pattern. For a paper with the abstract page at:
https://arxiv.org/abs/2401.XXXXX
The PDF lives at:
https://arxiv.org/pdf/2401.XXXXX
Same numeric identifier, swap abs for pdf. The PDF download is the version the authors compiled and submitted, which is also the version cited by everyone else. Optionally, you can grab the LaTeX source by switching pdf to e-print — but for most extraction work the compiled PDF is what you want, because that's what the equation numbering and figure references align to.
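The URL swap is easy to script. The sketch below is a minimal illustration, not an official arXiv client: the function name `arxiv_urls` is made up for this post, the identifier `2401.01234v2` is a placeholder rather than the paper discussed here, and the regex only covers new-style identifiers (four digits, a dot, four or five digits, optional version suffix).

```python
import re

# New-style arXiv identifiers such as 2401.01234 or 2401.01234v2.
# Old-style identifiers (e.g. with a subject prefix) are not handled here.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_urls(abs_url: str) -> dict:
    """Derive the PDF and source URLs from an arXiv abstract URL."""
    prefix = "https://arxiv.org/abs/"
    if not abs_url.startswith(prefix):
        raise ValueError(f"not an arXiv abstract URL: {abs_url}")
    paper_id = abs_url[len(prefix):]
    if not ARXIV_ID.match(paper_id):
        raise ValueError(f"unrecognized arXiv identifier: {paper_id}")
    return {
        "abs": abs_url,
        "pdf": f"https://arxiv.org/pdf/{paper_id}",          # swap abs for pdf
        "source": f"https://arxiv.org/e-print/{paper_id}",   # swap pdf for e-print
    }


urls = arxiv_urls("https://arxiv.org/abs/2401.01234v2")  # placeholder ID
print(urls["pdf"])
```

Any version suffix on the abstract URL carries through to the derived URLs unchanged.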
Appending a version suffix such as v2 or v3 to the identifier pins you to a specific revision. If you're citing an equation by number, pin the version, because the numbering can shift between revisions.
Step 2: Why naive copy-paste from a PDF breaks math
Try this with any arXiv PDF: open it in your browser's built-in PDF viewer, select an equation, copy, paste into a text editor. What lands in your clipboard is usually one of these three failure modes:
- A jumbled string of Greek letters and operators with no spacing — "Lθ=Es,a∼πθrs,a−βDKLπθ∥πref" instead of a parseable equation.
- The visible Unicode characters with subscripts and superscripts flattened onto the baseline, losing the structural relationships entirely.
- The right characters in the right order, but with PDF's internal ligature codes mixed in — invisible to humans, fatal to any downstream renderer.
The root cause: PDFs encode equations as glyph sequences positioned in 2D, not as LaTeX source. The PDF viewer reconstructs the visual layout from those positions; the clipboard copy gets only the glyph sequence, with no semantic structure.
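A toy model makes the flattening concrete. The `Glyph` class and the coordinates below are invented for illustration and are not real PDF output; they stand in for the glyphs of the expression x_i^2 + y, with the subscript and superscript shifted off the baseline.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float  # horizontal position on the page
    y: float  # vertical position; subscripts sit lower, superscripts higher


# Hypothetical placement of the glyphs in "x_i^2 + y".
glyphs = [
    Glyph("x", 0.0, 10.0),
    Glyph("i", 5.0, 8.0),    # subscript, shifted down
    Glyph("2", 5.0, 13.0),   # superscript, shifted up
    Glyph("+", 10.0, 10.0),
    Glyph("y", 15.0, 10.0),
]


def naive_copy(glyphs):
    """Concatenate characters left to right, ignoring vertical offsets."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    return "".join(g.char for g in ordered)


print(naive_copy(glyphs))  # xi2+y
```

The vertical offsets are exactly the information that said "subscript" and "superscript", and the left-to-right concatenation throws them away.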
A proper markdown extraction pipeline does the reverse — it reads the glyph sequence, recognizes math regions by font and positioning, and reconstructs LaTeX. The output looks like this:
$$
\mathcal{L}(\theta) = \mathbb{E}_{(s,a)\sim\pi_\theta}\big[r(s,a) - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}})\big]
$$
That's a math fence ready to drop into a MathJax-enabled markdown viewer. Same equation, but now it's re-typesettable, searchable as LaTeX tokens, and roundtrips back to a rendered display without loss.
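The reconstruction direction can be sketched on a toy example. This is a deliberately simplified heuristic, not how any particular extractor works: the `Glyph` class, the coordinates, and the `tolerance` threshold are all assumptions, and real pipelines also have to handle fractions, nested scripts, operators, and font information.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    char: str
    x: float  # horizontal position
    y: float  # vertical position; lower means subscript, higher means superscript


def to_latex(glyphs, tolerance=1.0):
    """Rebuild sub/superscripts from each glyph's offset to the baseline."""
    baseline = median(g.y for g in glyphs)  # assume most glyphs sit on the baseline
    out = []
    for g in sorted(glyphs, key=lambda g: (g.x, g.y)):
        offset = g.y - baseline
        if offset < -tolerance:
            out.append("_{" + g.char + "}")   # below the baseline: subscript
        elif offset > tolerance:
            out.append("^{" + g.char + "}")   # above the baseline: superscript
        else:
            out.append(g.char)
    return "".join(out)


# Hypothetical glyph positions for "x_i^2 + y".
sample = [
    Glyph("x", 0.0, 10.0),
    Glyph("i", 5.0, 8.0),
    Glyph("2", 5.0, 13.0),
    Glyph("+", 10.0, 10.0),
    Glyph("y", 15.0, 10.0),
]
print(to_latex(sample))  # x_{i}^{2}+y
```

Using the median as the baseline is a shortcut that only works when script glyphs are the minority on the line.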
Step 3: Extract clean markdown in your browser
Open our PDF to Markdown converter. The page loads a WebAssembly build of a PDF parser plus a markdown serializer. Drop the arXiv PDF onto the upload zone. Three things happen inside the browser tab:
- The PDF is parsed page by page. Text, font metadata, and positional information come out together.
- Math regions are identified by font family (papers compiled from LaTeX use Computer Modern, Latin Modern, or stix-math fonts) and converted back to LaTeX tokens inside $...$ for inline and $$...$$ for display.
- Code blocks are identified by font family (typically Computer Modern Typewriter or stix-mono) and fenced with triple backticks.
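Font-based classification can be sketched in a few lines. The rules below are illustrative assumptions, not the converter's actual logic: `classify_span` is a hypothetical helper, and the prefixes cover only common Computer Modern names (CMMI, CMSY, CMEX for math, CMTT for typewriter) and Latin Modern names, plus any font with "math" in its name.

```python
def classify_span(font_name: str) -> str:
    """Guess whether a text span is math, code, or prose from its font name."""
    name = font_name.lower()
    # Embedded fonts often carry a subset prefix such as "ABCDEF+CMMI10".
    if "+" in name:
        name = name.split("+", 1)[1]
    if name.startswith(("cmtt", "lmmono")):
        return "code"
    if name.startswith(("cmmi", "cmsy", "cmex", "lmmath")) or "math" in name:
        return "math"
    return "text"


for font in ["ABCDEF+CMMI10", "CMTT10", "CMR10", "LatinModernMath-Regular"]:
    print(font, "->", classify_span(font))
```

Font name alone is not enough in practice: digits and parentheses inside equations are often set in the roman text font, which is why positional information matters as well.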
A typical 30-page paper finishes in 3–5 seconds on a modern laptop. The output is a single markdown document with sections, equations, algorithm blocks, and tables preserved. Code blocks come out like this:
```python
def search(state, depth):
    # Negamax-style search: a state's value is the negation of the
    # opponent's best reply. is_terminal, value, legal_actions, and apply
    # are helpers the surrounding code is expected to define.
    if depth == 0 or is_terminal(state):
        return value(state)
    best = float("-inf")
    for action in legal_actions(state):
        v = -search(apply(state, action), depth - 1)
        best = max(best, v)
    return best
```
Nothing about the file leaves your browser. The PDF is read into memory, processed locally, and the markdown is rendered into the same tab. No upload, no server-side copy, no logged request body. Useful when you're reading a preprint that's still under embargo, or when you just don't want a third party tracking which papers you read.
Step 4: Use the markdown in Obsidian, Notion, or research notes
Obsidian
Save the extracted markdown as a .md file inside your vault, ideally under a Papers/ folder organized by year or by topic. Obsidian renders $...$ and $$...$$ math with its built-in MathJax, so equations display in reading view without extra setup. Obsidian's full-text search indexes the raw LaTeX, so you can search for D_{KL} across your vault and find every filed paper that writes KL divergence that way.
Tag the file with the arXiv ID in YAML frontmatter so backlinks stay stable when you rename. A minimal frontmatter block:
---
arxiv: 2401.XXXXX
authors: [DeepMind]
topic: proof-search
read: 2026-05-24
---
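If you file papers regularly, the frontmatter can be generated rather than typed. The helper below is a sketch: `with_frontmatter` is a made-up name, the field names mirror the block above, and the identifier `2401.01234` is a placeholder.

```python
from datetime import date


def with_frontmatter(body: str, arxiv_id: str, authors: list, topic: str, read_on: date) -> str:
    """Prepend a YAML frontmatter block to extracted markdown."""
    lines = [
        "---",
        # Quoted on purpose: an unquoted 2401.01230 parses as a YAML float
        # and would lose its trailing zero.
        f'arxiv: "{arxiv_id}"',
        f"authors: [{', '.join(authors)}]",
        f"topic: {topic}",
        f"read: {read_on.isoformat()}",
        "---",
        "",
    ]
    return "\n".join(lines) + body


note = with_frontmatter(
    "# Paper title\n",
    arxiv_id="2401.01234",  # placeholder identifier
    authors=["DeepMind"],
    topic="proof-search",
    read_on=date(2026, 5, 24),
)
print(note)
```

Quoting the identifier is the one design choice worth keeping even if you write the block by hand, since a real numeric arXiv ID is otherwise read as a number.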
Notion
Notion's markdown importer accepts most fenced blocks cleanly — code blocks, headings, lists, and tables transfer without surgery. Math is the rough edge: Notion uses its own inline-equation and block-equation primitives rather than rendering $$...$$ directly. The cleanest workflow is to import the markdown for the prose and structure, then convert each math block to a Notion equation block manually. Slow on a 30-page paper, but only needs to be done once per file.
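To make the manual conversion less tedious, you can pull every display equation out of the markdown first and paste them one by one. This is a rough sketch: `display_math_blocks` is a hypothetical helper, and the regex assumes `$$` only ever appears as a math delimiter, so a `$$` inside a code block would confuse it.

```python
import re

# Non-greedy match of everything between a pair of $$ delimiters.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def display_math_blocks(markdown: str) -> list:
    """Return the LaTeX body of each $$...$$ block, in document order."""
    return [body.strip() for body in DISPLAY_MATH.findall(markdown)]


sample = "Intro\n$$\na = b\n$$\nSome inline $x$ text\n$$\nc = d\n$$\n"
for i, block in enumerate(display_math_blocks(sample), start=1):
    print(i, block)
```

Inline `$...$` spans are left alone, since a single dollar sign never matches the two-character delimiter.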
Plain markdown + static site
For a personal research blog or a Hugo/Jekyll/Astro site, the extracted markdown drops in as-is. Enable MathJax or KaTeX in the site template, and the equations render in the browser at view time. The same .md file becomes the source of truth for your local notes and the published version on your site — no separate copies, no divergence.
If text-only is enough
When you don't need equations or code-block structure preserved — for example, when you're feeding the paper into a summarization workflow that only cares about prose — our PDF to Text converter is faster and produces a smaller output. Use markdown extraction when structure matters, plain text when it doesn't.
Known limits and edge cases
Custom LaTeX macros from the paper's preamble
Author-defined commands like \norm{x} or \indicator[A] extract as their literal command name. Your downstream renderer will need the same macro definitions, or you'll need to substitute the underlying expression. Most arXiv ML papers stick to amsmath, amssymb, and mathtools — those are universally supported.
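Substituting the underlying expression can be done with a small regex pass. The sketch below assumes the paper defines `\norm` as `\newcommand{\norm}[1]{\lVert #1 \rVert}`, which is a guess; check the paper's actual preamble before relying on it. The helper name `expand_norm` is also invented here, and arguments containing nested braces are deliberately left untouched.

```python
import re

# Assumed definition (verify against the paper's source):
#   \newcommand{\norm}[1]{\lVert #1 \rVert}
# Only matches arguments without nested braces.
NORM = re.compile(r"\\norm\{([^{}]*)\}")


def expand_norm(latex: str) -> str:
    """Replace \\norm{...} with its assumed expansion."""
    return NORM.sub(lambda m: r"\lVert " + m.group(1) + r" \rVert", latex)


print(expand_norm(r"\norm{x} \le \norm{y}"))
# \lVert x \rVert \le \lVert y \rVert
```

Leaving nested cases such as `\norm{x_{i}}` unexpanded is the safer failure mode: they stay visible as literal macros instead of being rewritten incorrectly.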
Multi-line equation alignments
align and align* environments come out as a single $$...$$ block with internal \\ line breaks. MathJax and KaTeX render this correctly, but some lightweight markdown renderers ignore the line breaks. If you see equations collapsing onto one line, your renderer is the issue, not the extraction.
Figures and figure captions
Figures don't extract as images — PDFs encode them as vector graphics or rasterized blobs that aren't usable as standalone files. Captions are extracted as italic text below a [Figure N: omitted] placeholder. To get the figures themselves, snapshot them from the PDF viewer separately.
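A quick scan of the placeholders gives you a checklist of figures to snapshot. The sketch assumes the placeholder text is exactly `[Figure N: omitted]` with an integer N, as described above; `omitted_figures` is a hypothetical helper name.

```python
import re

FIGURE = re.compile(r"\[Figure (\d+): omitted\]")


def omitted_figures(markdown: str) -> list:
    """List the figure numbers that were replaced by placeholders."""
    return [int(n) for n in FIGURE.findall(markdown)]


sample = "[Figure 1: omitted]\n*Caption one.*\n\nText.\n\n[Figure 2: omitted]\n*Caption two.*\n"
print(omitted_figures(sample))  # [1, 2]
```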
Tables with multi-row cells or complex spans
Simple tables transfer cleanly to markdown table syntax. Tables with rowspans, multi-line cells, or footnotes inside cells degrade — markdown's table syntax can't represent those structures. For benchmark tables in ML papers (typically simple rectangular grids), extraction is usually clean.
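One cheap way to spot a degraded table is to check that every row has the same number of cells. The function below is a rough heuristic with an invented name, `table_is_rectangular`; it does not handle escaped pipes inside cells, and the table contents are made-up placeholders, not numbers from any paper.

```python
def table_is_rectangular(table: str) -> bool:
    """Return True if every row of a markdown table has the same cell count."""
    rows = [line.strip() for line in table.strip().splitlines() if line.strip()]
    # Drop the outer pipes, then count the cells between the inner ones.
    widths = {len(row.strip("|").split("|")) for row in rows}
    return len(widths) == 1


good = "| model | score |\n|---|---|\n| model-a | 0.5 |\n"   # placeholder data
bad = "| model | score |\n|---|---|\n| model-a | 0.5 | stray |\n"
print(table_is_rectangular(good), table_is_rectangular(bad))  # True False
```

A table that fails the check is worth comparing against the PDF by eye, since a spanned or multi-line cell usually shows up as an extra or missing column.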
Algorithm blocks
Papers that use the algorithm or algorithmic LaTeX packages render the pseudocode in a way that's structurally close to code but not identical. Extraction usually returns either a fenced code block or a numbered list. Both are usable; pick the one that reads better in your notes.
Bibliography
References come out as a bulleted or numbered list at the end of the markdown. They're not linked back to in-text citations — that's a structural limitation of converting from rendered PDF rather than from the source .bib file.
Extract Any arXiv PDF in Your Browser
Convert any arXiv paper to clean markdown with LaTeX equations preserved. No upload, no account, no retention — your PDF stays on your device.