Convert Existing Code to Specs
codespeak takeover reads existing source code and extracts a spec from it. From that point on, you maintain the spec instead of the code: you edit the spec and rebuild to propagate changes.
This tutorial takes over a real file from Microsoft MarkItDown, then fixes a real GitHub issue (#1468) by editing the generated spec.
Prerequisites: Complete the Installation steps first.
Set up the project
Clone MarkItDown and initialize CodeSpeak:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
codespeak init
Run takeover
The Outlook MSG converter is at packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py. Extract a spec from it:
codespeak takeover packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
This creates _outlook_msg_converter.cs.md next to the original file and registers it in codespeak.json.
Review the generated spec
Here's a fragment of the generated spec:
# Outlook MSG Converter Specification
The Outlook MSG converter transforms Microsoft Outlook `.msg` email
files into structured markdown documents by parsing the OLE file format
and extracting email metadata and content.
## Conversion Process
### Email Metadata Extraction
Extracts standard email headers from specific OLE streams:
- **From**: Stream `__substg1.0_0C1F001F`
- **To**: Stream `__substg1.0_0E04001F`
- **Subject**: Stream `__substg1.0_0037001F`
### Email Body Content
Extracts message body from stream `__substg1.0_1000001F`.
...
The spec captures what the converter does (which streams it reads, how it validates files, how it formats output) without reproducing the Python code.
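The stream names in the spec follow a regular pattern: a four-digit hex property ID followed by a four-digit hex property type (`001F` for UTF-16 text, `0102` for binary). The sketch below illustrates that naming scheme only; `substg_name` and `decode_unicode_stream` are hypothetical helper names, not functions from MarkItDown or CodeSpeak, and the decoding step assumes the payload is UTF-16LE text.

```python
# Illustrative sketch: how the stream names listed in the spec are composed.
# Helper names are hypothetical and not part of MarkItDown or CodeSpeak.

PT_UNICODE = 0x001F  # property type suffix for UTF-16LE strings
PT_BINARY = 0x0102   # property type suffix for raw bytes

# Property IDs taken from the spec fragment above.
HEADER_PROPS = {"From": 0x0C1F, "To": 0x0E04, "Subject": 0x0037}


def substg_name(prop_id: int, prop_type: int = PT_UNICODE) -> str:
    """Build a stream name such as __substg1.0_0037001F."""
    return f"__substg1.0_{prop_id:04X}{prop_type:04X}"


def decode_unicode_stream(data: bytes) -> str:
    """Decode a text stream payload (assumed UTF-16LE) and trim padding."""
    return data.decode("utf-16-le", errors="replace").rstrip("\x00").strip()


if __name__ == "__main__":
    for label, prop_id in HEADER_PROPS.items():
        print(f"{label}: {substg_name(prop_id)}")
```

Seeing the pattern makes it easier to check a spec edit: a new header only needs its property ID, and the suffix tells you whether the stream holds text or bytes.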
Fix a bug by editing the spec
Issue #1468 reports that Cc, Bcc, Date, and attachments are missing from the converted output. Instead of editing the code, add them to the spec:
Extracts standard email headers from specific OLE streams:
- **From**: Stream `__substg1.0_0C1F001F`
- **To**: Stream `__substg1.0_0E04001F`
+- **Cc**: Stream `__substg1.0_0E03001F`
+- **Bcc**: Stream `__substg1.0_0E02001F`
- **Subject**: Stream `__substg1.0_0037001F`
+- **Date**: Stream `__substg1.0_0039001F` (client submit time, human-readable format)
+
+### Attachments
+
+Enumerates attachment sub-storages (`__attach_version1.0_#XXXXXXXX`)
+and extracts metadata for each:
+- **Filename**: Stream `__substg1.0_3707001F` (display name),
+ falling back to `__substg1.0_3704001F` (filename)
+- **Size**: Determined from the length of the attachment data
+ stream `__substg1.0_37010102`
+
+Attachment content is not decoded or included; only metadata is listed.
Update the output format section:
**From:** [sender address]
**To:** [recipient address]
+**Cc:** [cc addresses]
+**Bcc:** [bcc addresses]
**Subject:** [email subject]
+**Date:** [sent date/time]
## Content
[email body content]
+
+## Attachments
+
+- [filename] (size in human-readable format)
+- ...
Add a test input so CodeSpeak has a concrete example to verify against:
+## Test inputs
+
+- test_files/unicode.msg is expected to have a TIF attachment
Build
codespeak build
CodeSpeak added Cc, Bcc, Date extraction and a full attachments section to the converter:
headers = {
"From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
"To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
+ "Cc": self._get_stream_data(msg, "__substg1.0_0E03001F"),
+ "Bcc": self._get_stream_data(msg, "__substg1.0_0E02001F"),
"Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
+ "Date": self._get_stream_data(msg, "__substg1.0_0039001F"),
}
It also generated test vectors verifying the output:
FileTestVector(
filename="unicode.msg",
must_include=[
"**From:** brizhou@gmail.com",
"**Cc:** Brian Zhou",
"**Subject:** Test for TIF files",
"## Attachments",
".tif",
"(946.9 KB)",
],
),
The spec change of +23/-3 lines generated +221/-25 lines of code (~10x).
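To make the Attachments section of the spec concrete, here is a minimal sketch of the behavior it describes: filename with fallback, size taken from the data stream length, and a human-readable size like the `(946.9 KB)` in the test vector. This is not the code CodeSpeak generated. `human_size` and `list_attachments` are hypothetical names, the 1024-based units are an assumption, and a plain dict of `"storage/stream"` paths stands in for the real OLE file.

```python
# Illustrative sketch of the spec's Attachments section.
# Not the generated converter code; all names here are hypothetical.
from typing import Dict, List

ATTACH_PREFIX = "__attach_version1.0_#"
DISPLAY_NAME = "__substg1.0_3707001F"   # preferred filename stream
FALLBACK_NAME = "__substg1.0_3704001F"  # fallback filename stream
DATA_STREAM = "__substg1.0_37010102"    # attachment bytes


def human_size(num_bytes: int) -> str:
    """Format a byte count using 1024-based units (an assumption)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    size = num_bytes / 1024
    for unit in ("KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


def list_attachments(streams: Dict[str, bytes]) -> List[str]:
    """Return one markdown bullet per attachment sub-storage."""
    storages = sorted(
        {path.split("/", 1)[0] for path in streams
         if path.startswith(ATTACH_PREFIX) and "/" in path}
    )
    lines = []
    for storage in storages:
        # Prefer the display name, fall back to the short filename.
        raw = (streams.get(f"{storage}/{DISPLAY_NAME}")
               or streams.get(f"{storage}/{FALLBACK_NAME}"))
        name = raw.decode("utf-16-le").rstrip("\x00") if raw else "(unnamed)"
        # Size is the length of the data stream; content is not decoded.
        size = len(streams.get(f"{storage}/{DATA_STREAM}", b""))
        lines.append(f"- {name} ({human_size(size)})")
    return lines


if __name__ == "__main__":
    demo = {
        f"{ATTACH_PREFIX}00000000/{FALLBACK_NAME}": "scan.tif".encode("utf-16-le"),
        f"{ATTACH_PREFIX}00000000/{DATA_STREAM}": b"\x00" * 2048,
    }
    print("\n".join(list_attachments(demo)))
```

The demo entry has no display-name stream, so it exercises the fallback path described in the spec.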
Next steps
- Improve test coverage: use codespeak coverage to automatically add missing tests
- Imports / Dependencies: split taken-over code into multiple specs
- codespeak takeover reference: full command reference