How I Debugged an 85MB Astro Dist Down to 25MB

The Starting Point

The dist folder for this blog was sitting at 85MB. That is too large for a static site with no video and no complex interactivity. The site runs on Astro with MDX posts, KaTeX math rendering, and a command palette for search. Nothing in that stack should produce 85MB of build output.

The question was: where is it coming from?

Step 1: Top-Level Breakdown

The first command is always the same. Get a sorted view of what is taking space:

du -sh dist/* | sort -rh | head -20

Output:

19M   dist/images
1.5M  dist/pagefind
1.2M  dist/assets
976K  dist/index.html
840K  dist/posts
625K  dist/post-file-1
625K  dist/post-file-2
...

Two things stood out immediately:

Images: 19MB. Too large for a handful of images.
Every post directory: ~625KB. There are around 110 posts. That is 110 × 625KB = ~68MB from post HTML alone.

The 19MB was explainable: I had exported PDF pages as PNGs using pdftoppm at 200dpi and dropped them directly into public/images/ without compression. A4 pages at 200dpi produce 6-7MB PNGs each.

The 625KB per post was not explainable. A post is text and math. It should be 50-100KB at most.
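The estimate above is simple enough to sanity-check in a few lines of Python. This is only a sketch: the post count and per-page size are the rounded figures quoted in this post, the function name is made up for illustration, and 1MB is treated as 1000KB.

```python
def estimate_total_mb(page_count: int, page_kb: float) -> float:
    """Total size in MB (1 MB = 1000 KB) for page_count pages of page_kb each."""
    return page_count * page_kb / 1000


# Rounded figures from this post: about 110 posts at about 625 KB each.
post_html_mb = estimate_total_mb(110, 625)
print(f"post HTML: ~{post_html_mb:.1f} MB")
print(f"share of an 85 MB dist: {post_html_mb / 85:.0%}")
```

If post HTML accounts for most of the dist, the per-page size is the first thing worth investigating.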

Step 2: Anatomy of a Post Page

du -sh tells you which directories are large but not why. A built Astro page is a single index.html inside a directory named after the post slug. That file can be bloated by three things: markup, inline styles, or inline scripts. The browser receives all three as one file on the first request, so any one of them being large affects every page load.

To find out which was responsible, I wrote a small Python script that reads the compiled HTML for one post and measures each <script> tag individually:

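import re

# Read the built page as text (the encoding is assumed to be UTF-8).
with open('dist/your-post-slug/index.html', 'r', encoding='utf-8') as f:
    content = f.read()

# Capture each script tag's attributes and body; DOTALL lets . span newlines.
scripts = re.findall(r'<script([^>]*)>(.*?)</script>', content, re.DOTALL)

# Sizes are character counts, which are close to bytes for mostly-ASCII pages.
print(f'Total: {len(content)/1024:.1f} KB')
for attrs, body in scripts:
    print(f'  {len(body)/1024:.1f}KB  {attrs.strip()!r}')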

The goal is to get a per-script size breakdown of the page. The script reads the HTML as a string, then extracts every <script>...</script> block along with its attributes. It prints the total file size first, then one line per script showing its size in KB and its attributes. The attributes tell you what kind of script it is: type="module" is a deferred ES module, while an empty string is a plain inline script that runs synchronously. This immediately tells you which script is oversized and whether it is framework code, a module, or something else.

The regex re.findall(r'<script([^>]*)>(.*?)</script>', content, re.DOTALL) captures two groups per match: everything inside the opening tag (the attributes), and everything between the tags (the body). re.DOTALL is needed because script bodies span multiple lines and, without it, the . in a Python regex does not match newlines. len(content)/1024 converts the length to kilobytes for readable output. Because the file is read as text, this counts characters rather than bytes, which is close enough for mostly-ASCII HTML.

Output:

Total: 624.6 KB
  0.3KB   ''
  0.5KB   'type="module"'
  509.5KB ''
  0.4KB   'type="module"'
  0.8KB   'type="module"'
  2.8KB   'type="module"'
  1.5KB   'type="module"'

509KB in a single inline script with no attributes. All the type="module" scripts are small and expected. The plain inline script had no business being 509KB.
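The same idea extends to the other two suspects, inline styles and markup. The sketch below is one possible way to split a page into all three. The function name and sample string are illustrative, sizes are UTF-8 bytes, and matching HTML with regular expressions is a heuristic that is good enough for a size estimate but not a real parser.

```python
import re

SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
STYLE_RE = re.compile(r"<style\b[^>]*>(.*?)</style>", re.DOTALL | re.IGNORECASE)


def page_breakdown(html: str) -> dict[str, int]:
    """Split a page into script body, style body, and remaining bytes (UTF-8)."""

    def size(text: str) -> int:
        return len(text.encode("utf-8"))

    scripts = sum(size(body) for body in SCRIPT_RE.findall(html))
    styles = sum(size(body) for body in STYLE_RE.findall(html))
    total = size(html)
    # "markup" is everything else, including the script and style tags themselves.
    return {
        "total": total,
        "scripts": scripts,
        "styles": styles,
        "markup": total - scripts - styles,
    }


sample = (
    "<html><head><style>p{margin:0}</style></head>"
    "<body><p>hi</p><script>var x = 1;</script></body></html>"
)
for name, n in page_breakdown(sample).items():
    print(f"{name:8} {n / 1024:.2f} KB")
```

To use it on a real page, pass in the text of one built index.html instead of the sample string.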

Step 3: What Was Inside the Script

To read the content without printing half a megabyte to the terminal, I filtered to scripts over 100KB and printed just the first 300 characters:

big = [body for attrs, body in scripts if len(body) > 100000][0]
print(big[:300])

The goal is to identify what the large script actually contains without flooding the terminal. Taking only the first 300 characters is enough to see the variable names and data structure at the top of the script. The list comprehension walks the scripts list from step 2, keeps only entries where the body exceeds 100,000 characters, and [0] grabs the first result. big[:300] slices just the opening of that script.

The output started with something like:

(function(){const postData = [{"title":"Post Title One","slug":"post-slug-one",
"description":"...","body":"the full stripped text of the post content..."},
{"title":"Post Title Two", ...

Every post page contained an inline JSON blob with the full text content of every post on the site. Title, slug, tags, description, and the full stripped body of all 110 posts, serialised and embedded in a <script> tag.
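Once the blob is identified, it helps to know which field is responsible for the weight. The sketch below parses the array out of the script body and totals the serialised size of each field. It assumes the array literal is valid JSON and is assigned to a variable called postData, as in the output above. The function name and sample data are hypothetical.

```python
import json
from collections import Counter


def field_sizes(script_body: str, var_name: str = "postData") -> Counter:
    """Sum the serialised size (in characters) of each field in the inlined array."""
    # Find the first "[" after the variable name and decode JSON from there.
    start = script_body.index("[", script_body.index(var_name))
    records, _end = json.JSONDecoder().raw_decode(script_body, start)
    sizes = Counter()
    for record in records:
        for key, value in record.items():
            sizes[key] += len(json.dumps(value, ensure_ascii=False))
    return sizes


# Made-up sample with the same shape as the inlined script.
sample = (
    '(function(){const postData = [{"title":"A","body":"long text here"},'
    '{"title":"B","body":"more text"}];})();'
)
for key, n in field_sizes(sample).most_common():
    print(f"{key:12} {n} chars")
```

Run against the real script body (the big variable from the snippet above), the heaviest field is listed first.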

Step 4: Finding the Source

I opened src/components/CommandPalette.astro. The command palette is a Cmd+K search overlay that lets you search across all posts. It was built by passing all posts as a prop and serialising them via Astro’s define:vars:

const postData = posts.map((p) => ({
  title: p.data.title,
  slug: p.data.slug,
  status: p.data.status,
  type: p.data.type,
  tags: p.data.tags,
  description: p.data.description ?? '',
  body: (p.body ?? '')
    .replace(/---[\s\S]*?---/, '')
    .replace(/```[\s\S]*?```/g, '')
    .replace(/[#*_~`>\[\]]/g, '')
    .replace(/\s+/g, ' ')
    .trim(),
  date: `...`,
}));

The body field was stripping frontmatter, code blocks, and markdown syntax from every post, then embedding the result inline. This data was used for full-text search in the palette’s input handler:

const matched = postData.filter((p) =>
  p.title.toLowerCase().includes(q) ||
  p.body.toLowerCase().includes(q) ||   // ← this line
  ...
);
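To gauge how much text the body cleanup in postData leaves per post, the same chain can be ported to Python and run over the source files. This is a rough port rather than the component's actual code: the function name is made up, `{3} is just another way of writing three backticks in the pattern, and JavaScript and Python differ slightly in what \s matches.

```python
import re


def strip_body(raw: str) -> str:
    """Approximate the JavaScript cleanup chain applied to each post body."""
    text = re.sub(r"---[\s\S]*?---", "", raw, count=1)  # first frontmatter block only
    text = re.sub(r"`{3}[\s\S]*?`{3}", "", text)        # fenced code blocks
    text = re.sub(r"[#*_~`>\[\]]", "", text)            # markdown punctuation
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()


sample = "---\ntitle: Demo\n---\n# Heading\n\nSome *bold* text.\n"
stripped = strip_body(sample)
print(stripped)
print(f"{len(stripped.encode('utf-8'))} bytes after stripping")
```

Averaging the stripped size over all posts gives the per-post figure that ends up inlined on every page.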

Astro’s define:vars serialises the variable as JSON and inlines it into a <script> tag on every page the component is used on. Since CommandPalette was mounted in the layout, it appeared on all 110 pages.

110 posts × average body ~4.5KB stripped = ~495KB of JSON, inlined 110 times = ~54MB from this one field alone.
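The important property of this arithmetic is that it is quadratic: every page carries every post, so doubling the number of posts quadruples the cost. A minimal sketch, using the rounded figures above and treating 1MB as 1000KB (the function name is illustrative):

```python
def inlined_total_mb(post_count: int, avg_body_kb: float) -> float:
    """MB spent across the site when every page inlines every post body."""
    per_page_kb = post_count * avg_body_kb   # one copy of the whole corpus
    return per_page_kb * post_count / 1000   # repeated once per page


for n in (50, 110, 220):
    print(f"{n:4} posts -> ~{inlined_total_mb(n, 4.5):.1f} MB")
```

This is why the problem was invisible when the blog was small and only became obvious once the post count grew.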

The Fix

Remove body from postData. The command palette already searches on title, tags, type, status, description, and date. Full body search is useful, but inlining the entire corpus into every page is the wrong way to do it. The right approach is a dedicated search index fetched once on demand, which is exactly what the pagefind index already provides. Duplicating that data as a per-page inline script adds cost with no benefit.

const postData = posts.map((p) => ({
  title: p.data.title,
  slug: p.data.slug,
  status: p.data.status,
  type: p.data.type,
  tags: p.data.tags,
  description: p.data.description ?? '',
  date: `...`,
}));

Remove the p.body line from the filter. Rebuild.

Result: 625KB per page → 172KB per page. 85MB dist → 36MB dist.
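To confirm the fix across the whole site, and to catch a regression later, the per-page check can be run over every built page. The sketch below is one way to do that. The 50KB threshold, the function name, and the dist path are assumptions to adjust for your own build.

```python
import re
from pathlib import Path

SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def oversized_pages(dist: Path, limit_kb: float = 50.0) -> list[tuple[str, float]]:
    """Return (page path, largest inline script in KB) for pages over limit_kb."""
    hits = []
    for page in sorted(dist.rglob("*.html")):
        html = page.read_text(encoding="utf-8", errors="replace")
        largest = max(
            (len(body.encode("utf-8")) for body in SCRIPT_RE.findall(html)),
            default=0,
        )
        if largest / 1024 > limit_kb:
            hits.append((str(page.relative_to(dist)), largest / 1024))
    return hits


dist_dir = Path("dist")  # assumed build output directory
if dist_dir.is_dir():
    for path, kb in oversized_pages(dist_dir):
        print(f"{kb:8.1f} KB  {path}")
```

An empty result means no page has an inline script above the threshold.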

Step 5: Images

The remaining 36MB still had 19MB in images. The cheatsheet PNGs were the main offenders, at 6.5MB and 6.7MB.

du -sh public/images/* | sort -rh

6.7M  image-a-2.png
6.5M  image-a-1.png
3.1M  image-b-1.png
3.1M  image-b-2.png

These were A4 page scans exported from PDF at 200dpi with no compression. The fix: convert to JPEG at 65% quality using sips, the macOS built-in image processing tool:

sips -s format jpeg -s formatOptions 65 image-a-1.png --out image-a-1.jpg
# 6.5M → 1.4M

sips -s format jpeg -s formatOptions 65 image-a-2.png --out image-a-2.jpg
# 6.7M → 1.2M

Update the image references in the post from .png to .jpg, delete the originals, rebuild.

Result: 36MB → 28MB. Before deleting the PNGs from public/, I kept the originals locally in a folder outside it.
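Finding heavy images can also be scripted, which is handy when public/ has nested folders. The sketch below lists image files above a size threshold, largest first. The 500KB default, the suffix list, and the function name are assumptions, not part of the original workflow.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def large_images(root: Path, min_kb: float = 500.0) -> list[tuple[float, str]]:
    """List (size in KB, relative path) for image files above min_kb, largest first."""
    found = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            kb = path.stat().st_size / 1024
            if kb > min_kb:
                found.append((kb, str(path.relative_to(root))))
    return sorted(found, reverse=True)


public_dir = Path("public")  # assumed location of unprocessed static files
if public_dir.is_dir():
    for kb, name in large_images(public_dir):
        print(f"{kb / 1024:6.1f} MB  {name}")
```

Anything this reports is a candidate for conversion or compression before the next build.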

Final Numbers

Stage                               Dist size   Per-page size
Start                               85MB        625KB
After removing body from postData   36MB        172KB
After JPEG conversion               28MB        172KB

What to Check on Any Static Site

When a static site build output is unexpectedly large, the debugging order is:

1. Run du -sh dist/* | sort -rh to find the largest directories and files.
2. If post pages are large, analyse the HTML with a script: measure script tag sizes, style tag sizes, and HTML markup separately.
3. If an inline script is large, print its first 300 characters to identify what it is.
4. Search the source for define:vars. Any variable passed this way gets serialised and inlined on every page where the component is rendered.
5. Check public/ for uncompressed images. Astro copies everything in public/ to dist with no optimisation. Either move images into src/ and use the <Image /> component from astro:assets, which optimises them at build time (WebP by default), or compress them manually before putting them in public/.
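The search for define:vars is easy to automate. The sketch below walks the source tree and reports every .astro file that uses it, with line numbers. The src path and the function name are assumptions, and a plain substring match will also flag occurrences inside comments.

```python
from pathlib import Path


def find_define_vars(src: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for each define:vars usage."""
    matches = []
    for path in sorted(src.rglob("*.astro")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if "define:vars" in line:
                matches.append((str(path.relative_to(src)), number, line.strip()))
    return matches


src_dir = Path("src")  # assumed source directory
if src_dir.is_dir():
    for name, number, line in find_define_vars(src_dir):
        print(f"{name}:{number}: {line}")
```

Each hit is worth a look: check what is being passed and whether the component is mounted in a shared layout.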

The pagefind search index (1.5MB) and KaTeX fonts in assets/ (1.2MB) are both expected and shared across pages. They are not the problem on a well-built Astro site. In my experience, the problem is usually either images or something being inlined per page that should not be.