About MarkItDown
On December 14, 2024, Microsoft released something interesting: a tool called MarkItDown.
It converts various kinds of files into the Markdown format.
Quoting from the official repository, the formats that can be converted are:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
As you can see, it handles a wide variety of files.
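To give a sense of how that list might be used in practice, here is a minimal sketch that filters file names down to those whose extension appears in it. The names `CONVERTIBLE_SUFFIXES`, `is_convertible`, and `pick_convertible` are hypothetical and are not part of MarkItDown. The `.html` entry is an assumption, since the list only says "HTML", and image and audio extensions are left out because the list does not name them.

```python
from pathlib import Path

# Extensions taken from the list above. This set is an illustration only,
# not something MarkItDown exposes; ".html" is an assumed extension for "HTML".
CONVERTIBLE_SUFFIXES = {
    ".pdf", ".pptx", ".docx", ".xlsx",
    ".html", ".csv", ".json", ".xml", ".zip",
}


def is_convertible(path: str) -> bool:
    """Return True if the file's extension is in the set above (case-insensitive)."""
    return Path(path).suffix.lower() in CONVERTIBLE_SUFFIXES


def pick_convertible(paths: list[str]) -> list[str]:
    """Keep only the paths worth handing to a converter, preserving order."""
    return [p for p in paths if is_convertible(p)]


if __name__ == "__main__":
    files = ["report.PDF", "slides.pptx", "notes.txt", "data.csv"]
    print(pick_convertible(files))
```

A pre-filter like this is optional. It just avoids handing the converter files it was never meant to handle.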
What Seems Useful…
There’s a passage in the Bible: “God said, ‘Let there be light,’ and there was light.”
Nowadays, there is something that is not God but is treated like one: AI.
“AI said, ‘Let there be data,’ and there was information”—that’s the situation now.
When preparing data for AI, not everything is saved as text.
We need to have it in a format that AI can read, and that’s where MarkItDown comes in.
Converting documents saved as PDFs into Markdown will make AI happy too.
I tried using it a bit for work.
Practical Use
I have published the code in this repository.
It uses a tool called uv so that environment setup is as easy as possible.
Here’s the content of the program I have published:
from pathlib import Path

from markitdown import MarkItDown


def convert_pdfs_to_markdown(src: str, dest: str):
    """
    Converts PDF files in the specified directory to Markdown format and saves them in the output directory.

    Args:
        src (str): Directory where the input PDF files are stored.
        dest (str): Output directory for the converted Markdown files.
    """
    # Set the paths for the input and output directories
    src_dir = Path(src)
    dest_dir = Path(dest)

    # Create the output directory if it doesn't exist
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Create the converter once and reuse it for every file
    mid = MarkItDown()

    # Search for PDF files and convert them to Markdown
    for pdf_file in src_dir.glob("*.pdf"):
        result = mid.convert(str(pdf_file))

        # Set the output file name
        md_file = dest_dir / (pdf_file.stem + ".md")

        # Save as a Markdown file
        with open(md_file, "w", encoding="utf8") as f:
            f.write(result.text_content)

        print(f"Conversion complete: {pdf_file.name} -> {md_file.name}")


# Example usage
if __name__ == "__main__":
    src_dir = "./src"  # Input directory
    dest_dir = "./dest"  # Output directory
    convert_pdfs_to_markdown(src=src_dir, dest=dest_dir)
It’s not a very complicated program.
The important parts are these two points:
    result = mid.convert(str(pdf_file))
which performs the conversion, and
    result.text_content
which holds the converted text.
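Those two steps can be isolated into a small helper, sketched below. The helper name `to_markdown_text` is hypothetical. It takes the converter as an argument, so it relies only on what the program above already uses: a `convert()` method that returns an object with a `text_content` attribute. `FakeResult` and `FakeConverter` are stand-ins so the sketch runs without MarkItDown installed; in real use you would pass a `MarkItDown()` instance instead.

```python
from pathlib import Path


def to_markdown_text(converter, path: Path) -> str:
    """Run the two key steps: convert the file, then pull out the text."""
    result = converter.convert(str(path))  # step 1: conversion
    return result.text_content             # step 2: extract the Markdown text


class FakeResult:
    """Stand-in for a conversion result; it only carries text_content."""

    def __init__(self, text_content: str):
        self.text_content = text_content


class FakeConverter:
    """Stand-in converter for trying the helper without MarkItDown."""

    def convert(self, source: str) -> FakeResult:
        # Pretend the file converts to a single Markdown heading.
        return FakeResult(f"# {Path(source).stem}\n")


if __name__ == "__main__":
    print(to_markdown_text(FakeConverter(), Path("src/sample.pdf")))
```

Passing the converter in also makes the helper easy to test, since nothing in it depends on a real PDF.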
Conclusion
It may not be a program that I use frequently, but using new tools is always an enjoyable experience.
Writing articles like this one bit by bit might also be fun.