Convert Existing Code to Specs

codespeak takeover reads existing source code and extracts a spec from it. From that point on, you maintain the spec instead of the code: you edit the spec and rebuild to propagate changes.

This tutorial takes over a real file from Microsoft MarkItDown, then fixes a real GitHub issue (#1468) by editing the generated spec.

Prerequisites: Complete the Installation steps first.

Set up the project

Clone MarkItDown and initialize CodeSpeak:

git clone git@github.com:microsoft/markitdown.git
cd markitdown
codespeak init

Run takeover

The Outlook MSG converter is at packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py. Extract a spec from it:

codespeak takeover packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py

This creates _outlook_msg_converter.cs.md next to the original file and registers it in codespeak.json.

Review the generated spec

Here's a fragment of the generated spec:

# Outlook MSG Converter Specification

The Outlook MSG converter transforms Microsoft Outlook `.msg` email
files into structured markdown documents by parsing the OLE file format
and extracting email metadata and content.

## Conversion Process

### Email Metadata Extraction

Extracts standard email headers from specific OLE streams:
- **From**: Stream `__substg1.0_0C1F001F`
- **To**: Stream `__substg1.0_0E04001F`
- **Subject**: Stream `__substg1.0_0037001F`

### Email Body Content

Extracts message body from stream `__substg1.0_1000001F`.

...

The spec captures what the converter does (which streams it reads, how it validates files, how it formats output) without reproducing the Python code.
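
To make the stream names in the fragment concrete, here is a minimal sketch of what header extraction along these lines might look like. It assumes the `001F` suffix marks a UTF-16LE string property, and it uses a plain dict as a hypothetical stand-in for the opened OLE file. `HEADER_STREAMS`, `decode_unicode_stream`, and `read_headers` are illustrative names, not MarkItDown's or CodeSpeak's.

```python
from typing import Dict, Optional

# Stream names copied from the spec fragment above.
HEADER_STREAMS = {
    "From": "__substg1.0_0C1F001F",
    "To": "__substg1.0_0E04001F",
    "Subject": "__substg1.0_0037001F",
}


def decode_unicode_stream(raw: bytes) -> str:
    """Decode a UTF-16LE string stream and drop trailing NULs and whitespace."""
    return raw.decode("utf-16-le", errors="replace").rstrip("\x00").strip()


def read_headers(streams: Dict[str, bytes]) -> Dict[str, Optional[str]]:
    """Map header names to decoded values; missing streams become None.

    `streams` is a hypothetical stand-in for an opened OLE file:
    a dict from stream name to raw bytes.
    """
    headers: Dict[str, Optional[str]] = {}
    for name, stream in HEADER_STREAMS.items():
        raw = streams.get(stream)
        headers[name] = decode_unicode_stream(raw) if raw is not None else None
    return headers


# Made-up sample data, not taken from a real .msg file.
fake_msg = {
    "__substg1.0_0C1F001F": "alice@example.com".encode("utf-16-le"),
    "__substg1.0_0037001F": "Hello\x00".encode("utf-16-le"),
}
headers = read_headers(fake_msg)
```

Keeping the name-to-stream mapping in one table mirrors the spec's bullet list, so adding a header is a one-line change in either place.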

Fix a bug by editing the spec

Issue #1468 reports that Cc, Bcc, Date, and attachments are missing from the converter's output. Instead of editing code, add them to the spec:

 Extracts standard email headers from specific OLE streams:
 - **From**: Stream `__substg1.0_0C1F001F`
 - **To**: Stream `__substg1.0_0E04001F`
+- **Cc**: Stream `__substg1.0_0E03001F`
+- **Bcc**: Stream `__substg1.0_0E02001F`
 - **Subject**: Stream `__substg1.0_0037001F`
+- **Date**: Stream `__substg1.0_0039001F` (client submit time, human-readable format)
+
+### Attachments
+
+Enumerates attachment sub-storages (`__attach_version1.0_#XXXXXXXX`)
+and extracts metadata for each:
+- **Filename**: Stream `__substg1.0_3707001F` (display name),
+  falling back to `__substg1.0_3704001F` (filename)
+- **Size**: Determined from the length of the attachment data
+  stream `__substg1.0_37010102`
+
+Attachment content is not decoded or included; only metadata is listed.
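
The attachment rules added to the spec (enumerate sub-storages, prefer one filename stream, fall back to the other, take the size from the data stream) can be sketched as follows. This is an illustration under stated assumptions, not the code CodeSpeak generates: the container is modeled as a dict keyed by "storage/stream" paths, and `list_attachments` and the `"(unnamed)"` placeholder are hypothetical.

```python
from typing import Dict, List, Tuple

# Names copied from the spec text above.
ATTACH_PREFIX = "__attach_version1.0_#"
NAME_STREAM = "__substg1.0_3707001F"
FALLBACK_NAME_STREAM = "__substg1.0_3704001F"
DATA_STREAM = "__substg1.0_37010102"


def list_attachments(streams: Dict[str, bytes]) -> List[Tuple[str, int]]:
    """Return (filename, size_in_bytes) for each attachment sub-storage.

    `streams` is a hypothetical stand-in for the OLE container: keys are
    "storage/stream" paths, values are raw stream bytes.
    """
    storages = sorted(
        {path.split("/", 1)[0] for path in streams if path.startswith(ATTACH_PREFIX)}
    )
    result = []
    for storage in storages:
        # Prefer the first filename stream, then fall back to the second.
        raw_name = streams.get(f"{storage}/{NAME_STREAM}") or streams.get(
            f"{storage}/{FALLBACK_NAME_STREAM}"
        )
        # "(unnamed)" is an assumption; the spec does not say what to do here.
        name = raw_name.decode("utf-16-le").rstrip("\x00") if raw_name else "(unnamed)"
        # Only the length of the data stream is used; content is never decoded.
        size = len(streams.get(f"{storage}/{DATA_STREAM}", b""))
        result.append((name, size))
    return result


# Made-up sample data, not taken from a real .msg file.
fake_msg = {
    "__attach_version1.0_#00000000/__substg1.0_3707001F": "scan.tif".encode("utf-16-le"),
    "__attach_version1.0_#00000000/__substg1.0_37010102": b"\x00" * 2048,
    "__attach_version1.0_#00000001/__substg1.0_3704001F": "NOTES~1.TXT".encode("utf-16-le"),
    "__attach_version1.0_#00000001/__substg1.0_37010102": b"hi",
    "__substg1.0_0037001F": "Subject".encode("utf-16-le"),
}
attachments = list_attachments(fake_msg)
```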

Update the output format section:

 **From:** [sender address]
 **To:** [recipient address]
+**Cc:** [cc addresses]
+**Bcc:** [bcc addresses]
 **Subject:** [email subject]
+**Date:** [sent date/time]

 ## Content

 [email body content]
+
+## Attachments
+
+- [filename] (size in human-readable format)
+- ...
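
As a rough illustration of the output format above, the sketch below renders headers, body, and an attachment list into markdown. Several details are assumptions not stated in the spec: sizes use 1024-based units with one decimal place, empty headers are skipped, and the function names `human_size` and `render` are hypothetical.

```python
from typing import Dict, List, Tuple


def human_size(num_bytes: int) -> str:
    """Format a byte count using 1024-based units (an assumption)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    size = num_bytes / 1024
    for unit in ("KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


def render(headers: Dict[str, str], body: str, attachments: List[Tuple[str, int]]) -> str:
    """Render the markdown layout described in the output format section."""
    # Skipping empty headers is an assumption, not something the spec states.
    lines = [f"**{name}:** {value}" for name, value in headers.items() if value]
    lines += ["", "## Content", "", body]
    if attachments:
        lines += ["", "## Attachments", ""]
        lines += [f"- {name} ({human_size(size)})" for name, size in attachments]
    return "\n".join(lines)


# Made-up sample data.
markdown = render(
    {"From": "alice@example.com", "To": "bob@example.com", "Cc": "", "Subject": "Hi"},
    "Body text",
    [("scan.tif", 2048)],
)
```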

Add a test input so CodeSpeak has a concrete example to verify against:

+## Test inputs
+
+- test_files/unicode.msg is expected to have a TIF attachment

Build

codespeak build

The build added Cc, Bcc, and Date extraction, plus a full attachments section, to the converter:

 headers = {
     "From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
     "To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
+    "Cc": self._get_stream_data(msg, "__substg1.0_0E03001F"),
+    "Bcc": self._get_stream_data(msg, "__substg1.0_0E02001F"),
     "Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
+    "Date": self._get_stream_data(msg, "__substg1.0_0039001F"),
 }

It also generated test vectors verifying the output:

FileTestVector(
    filename="unicode.msg",
    must_include=[
        "**From:** brizhou@gmail.com",
        "**Cc:** Brian Zhou",
        "**Subject:** Test for TIF files",
        "## Attachments",
        ".tif",
        "(946.9 KB)",
    ],
),
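
A `must_include` list like the one above amounts to a set of substring checks against the converted markdown. The sketch below shows that idea with a simplified, hypothetical stand-in class (`SimpleVector`) and helper (`missing_strings`); the real `FileTestVector` and the test harness that consumes it may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleVector:
    """Hypothetical, simplified stand-in for the FileTestVector shown above."""

    filename: str
    must_include: List[str] = field(default_factory=list)


def missing_strings(vector: SimpleVector, markdown: str) -> List[str]:
    """Return every expected substring that does not appear in the output."""
    return [text for text in vector.must_include if text not in markdown]


# Made-up vector and output, for illustration only.
vector = SimpleVector(
    filename="example.msg",
    must_include=["**Cc:** Carol", "## Attachments", ".tif", "(2.0 KB)"],
)
output = "**From:** alice@example.com\n\n## Attachments\n\n- scan.tif (2.0 KB)"
missing = missing_strings(vector, output)
```

Substring checks are deliberately loose: they pin down the facts that matter (a Cc line exists, an attachment with a `.tif` name is listed) without freezing the exact layout of the output.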

The spec change (+23/-3 lines) produced a code change of +221/-25 lines, roughly ten times as many added lines.

Next steps