About MarkItDown

On December 14, 2024, Microsoft released something interesting.

It’s a tool called MarkItDown that converts various kinds of files into the Markdown format.

Quoting from the official repository, the formats that can be converted are:

- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)

As you can see, it handles a wide variety of files.
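Before converting anything, it can help to see which files in a folder are even candidates. The following is a minimal sketch using only the standard library. The extension set is my own reading of the list above, not something taken from MarkItDown's API, and the name find_candidates is hypothetical. Images and audio are left out because the list does not name their extensions.

```python
from pathlib import Path

# Extensions read off the list above. This is an illustrative assumption,
# not the library's official registry of supported formats.
CANDIDATE_EXTENSIONS = {
    ".pdf", ".pptx", ".docx", ".xlsx",
    ".html", ".csv", ".json", ".xml", ".zip",
}


def find_candidates(root: str) -> list[Path]:
    """Return files under root whose extension is in CANDIDATE_EXTENSIONS, sorted by path."""
    return sorted(
        p
        for p in Path(root).rglob("*")  # walk subdirectories too
        if p.is_file() and p.suffix.lower() in CANDIDATE_EXTENSIONS
    )
```

The comparison is case-insensitive, so a file named REPORT.PDF is picked up as well.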

Where It Seems Useful

There’s a passage in the Bible: “God said, ‘Let there be light,’ and there was light.”

Nowadays, there is something that is not God but is treated like one: AI.

“AI said, ‘Let there be data,’ and there was information.” That’s the situation now.

The data we want to hand to AI is not always saved as plain text.

We need to have it in a format that AI can read, and that’s where MarkItDown comes in.

Converting documents saved as PDFs into Markdown will make AI happy too.

I tried using it a bit for work.

Practical Use

I have published the code in this repository.

It uses a tool called uv to make environment setup as easy as possible.

Here’s the content of the program I have published:

from pathlib import Path
from markitdown import MarkItDown


def convert_pdfs_to_markdown(src: str, dest: str):
    """
    Converts PDF files in the specified directory to Markdown format and saves them in the output directory.

    Args:
        src (str): Directory where the input PDF files are stored.
        dest (str): Output directory for the converted Markdown files.
    """
    # Set the paths for the input and output directories
    src_dir = Path(src)
    dest_dir = Path(dest)

    # Create the output directory if it doesn't exist
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Create the converter once and reuse it for every file
    mid = MarkItDown()

    # Search for PDF files and convert them to Markdown
    for pdf_file in src_dir.glob("*.pdf"):
        result = mid.convert(str(pdf_file))

        # Set the output file name
        md_file = dest_dir / (pdf_file.stem + ".md")

        # Save as a Markdown file
        with open(md_file, "w", encoding="utf-8") as f:
            f.write(result.text_content)
        print(f"Conversion complete: {pdf_file.name} -> {md_file.name}")


# Example usage
if __name__ == "__main__":
    src_dir = "./src"  # Input directory
    dest_dir = "./dest"  # Output directory
    convert_pdfs_to_markdown(src=src_dir, dest=dest_dir)

It’s not a very complicated program.

Only two parts really matter:

  1. result = mid.convert(str(pdf_file)), which runs the conversion
  2. result.text_content, which holds the converted text
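To show those two calls on their own, here is a small sketch of a single-file helper. The function name convert_one and the file name report.pdf are hypothetical. The helper only assumes that the object passed in behaves like the MarkItDown instance in the program above: convert() takes a path string and returns something with a text_content string.

```python
from pathlib import Path
from typing import Any


def convert_one(converter: Any, src: Path, dest_dir: Path) -> Path:
    """Convert one file and write the result as Markdown into dest_dir.

    `converter` is assumed to behave like a MarkItDown instance:
    converter.convert(path) returns an object with a text_content string.
    """
    # Step 1: run the conversion
    result = converter.convert(str(src))

    # Create the output directory if it doesn't exist
    dest_dir.mkdir(parents=True, exist_ok=True)
    md_file = dest_dir / (src.stem + ".md")

    # Step 2: pull out the converted text and save it
    md_file.write_text(result.text_content, encoding="utf-8")
    return md_file


# With the real library (assuming markitdown is installed):
#   from markitdown import MarkItDown
#   convert_one(MarkItDown(), Path("./src/report.pdf"), Path("./dest"))
```

Passing the converter in as an argument means one instance can be shared across many files, and the helper can be tried out with a stand-in object before the real library is installed.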

Conclusion

It may not be a program that I use frequently, but using new tools is always an enjoyable experience.

Writing short articles like this one, a little at a time, might be fun too.