* fix: handle deeply nested HTML that triggers RecursionError (#1636)
Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.
This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.
Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML
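The fallback described above can be sketched with the standard library alone. This is a minimal illustration, not the code in HtmlConverter: `convert_html`, `plain_text_fallback`, and `toy_recursive_converter` are hypothetical names, the toy converter stands in for markdownify's recursive traversal, and `html.parser` stands in for BeautifulSoup's `get_text()` so the example has no third-party dependencies.

```python
import warnings
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Event-driven text extraction; parsing does not recurse per nesting level."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def plain_text_fallback(html: str) -> str:
    """Return the document's visible text (stand-in for soup.get_text())."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def toy_recursive_converter(html: str) -> str:
    """Hypothetical recursive converter: one stack frame per nested <div>."""

    def walk(fragment: str) -> str:
        if fragment.startswith("<div>") and fragment.endswith("</div>"):
            return walk(fragment[5:-6])
        return fragment

    return walk(html)


def convert_html(html: str, to_markdown=toy_recursive_converter) -> str:
    """Try the recursive converter; on RecursionError, warn and return plain text."""
    try:
        return to_markdown(html)
    except RecursionError:
        warnings.warn(
            "HTML is too deeply nested for Markdown conversion; "
            "returning plain text instead.",
            RuntimeWarning,
            stacklevel=2,
        )
        return plain_text_fallback(html)
```

Catching the error at the converter level, rather than in the top-level dispatcher, is what keeps the raw HTML from being passed on to a later converter unnoticed.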
* refactor: address review feedback on RecursionError fallback
- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
sys.setrecursionlimit(200) instead of relying on depth=500 being
sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
fallback and let RecursionError propagate to the caller
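A rough sketch of the testing approach and the strict opt-out, under stated assumptions: `lowered_recursion_limit`, `descend`, and `convert` are hypothetical helpers written for this illustration, and `descend` merely simulates a recursive DOM walk with one frame per nesting level. The actual test and converter signature may differ.

```python
import sys
import warnings
from contextlib import contextmanager


@contextmanager
def lowered_recursion_limit(limit: int = 200):
    """Temporarily lower the recursion limit; always restore the original."""
    original = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(original)


def descend(depth: int) -> int:
    """Stand-in for a recursive DOM traversal: one frame per level."""
    return 0 if depth == 0 else 1 + descend(depth - 1)


def convert(depth: int, strict: bool = False) -> str:
    """Simulated conversion with a plain-text fallback unless strict=True."""
    try:
        return f"markdown({descend(depth)})"
    except RecursionError:
        if strict:
            raise  # caller opted out of the fallback
        warnings.warn("Falling back to plain text", RuntimeWarning, stacklevel=2)
        return "plain text"
```

Lowering the limit inside the test means the failure is triggered by the limit itself, not by a nesting depth that happens to be large enough on one platform.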
* test: use result.markdown instead of deprecated result.text_content
---------
Co-authored-by: jigangz <jigangz@github.com>
* Fix O(n) memory growth in PDF conversion by calling page.close() after each page
* Refactor PDF memory optimization tests for improved readability and consistency
* Add memory benchmarking tests for PDF conversion with page.close() fix
* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code
* Bump version to 0.1.6b2 in __about__.py
* Update PDF conversion tests to include mimetype in StreamInfo
* Add OCR test data and implement tests for various document formats
- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
* Enhance OCR functionality and validation in document converters
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
* Add support for scanned PDFs with full-page OCR fallback and implement tests
* Bump version to 0.1.6b1 in __about__.py
* Refactor OCR services to support LLM Vision, update README and tests accordingly
* Add OCR-enabled converters and ensure consistent OCR format across document types
* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters
* Refactor exception imports for consistency across converters and tests
* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling
* Bump version to 0.1.6b1 in __about__.py
* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
* Add comprehensive OCR test suite for various document formats
- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.
* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency
- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs
* Revert
* Revert
* Update READMEs
* Refactor import statements for consistency and improve formatting in converter and test files
* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests
* Fix: support partially numbered lists in PDF parsing
* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file
* Refactor: Improve assertion formatting in partial numbering tests
* Added PDF table extraction feature with aligned Markdown (#1419)
* Add PDF test files and enhance extraction tests
- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.
* fix: update dependencies for PDF processing and improve table extraction logic
* Bumped version of pdfminer.six
---------
Authored-by: Ashok <ashh010101@gmail.com>
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
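As an illustration of the mapping only (the actual change presumably lives in the markdownify-based HTML converter), here is a standard-library sketch. `CheckboxToMarkdown` and `checkboxes_to_markdown` are hypothetical names, and the trailing space after each marker is an assumption.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Replace checkbox inputs with [ ] or [x]; pass text through unchanged."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        input_type = (attr_map.get("type") or "").lower()
        if tag == "input" and input_type == "checkbox":
            # A bare `checked` attribute is reported with value None.
            self.parts.append("[x] " if "checked" in attr_map else "[ ] ")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html: str) -> str:
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```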
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
* Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ
* Update the Dockerfile to enable plugins. No plugins are installed by default.
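One way such a flag could be read is sketched below. The variable name comes from the commit message; the helper `plugins_enabled`, the default of "false", and the set of accepted truthy strings are assumptions made for this example.

```python
import os


def plugins_enabled(environ=None) -> bool:
    """Interpret MARKITDOWN_ENABLE_PLUGINS as a boolean flag (assumed semantics)."""
    env = os.environ if environ is None else environ
    value = env.get("MARKITDOWN_ENABLE_PLUGINS", "false")
    return value.strip().lower() in ("true", "1", "yes")
```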
* refactor: remove unused imports
* fix: replace NotImplemented with NotImplementedError
* refactor: resolve E722 (do not use bare 'except')
* refactor: remove unused variable
* refactor: remove unused imports
* refactor: ignore unused imports that will be used in the future
* refactor: resolve W293 (blank line contains whitespace)
* refactor: resolve F541 (f-string is missing placeholders)
---------
Co-authored-by: afourney <adamfo@microsoft.com>
* feat: Add CSV to Markdown table converter
- Add new CsvConverter class to convert CSV files to Markdown tables
- Support text/csv and application/csv MIME types
- Preserve table structure with headers and data rows
- Handle edge cases like empty cells and mismatched columns
- Fix Azure Document Intelligence dependency handling
- Register CsvConverter in MarkItDown class
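A minimal sketch of the CSV-to-table idea, assuming the first row is the header and that mismatched rows are padded or truncated to the header width. `csv_to_markdown` is a hypothetical function, not the CsvConverter implementation.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in body:
        # Pad short rows with empty cells and drop cells beyond the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```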
----
Thanks also to @benny123tw who submitted a very similar PR in #1171
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop
* Pin mcp version.
* Optionally preserve base64 data URIs in Markdown output from _CustomMarkdownify and the PPTX converter
* Add keep_data_uri parameter support to the other converters
* Fix linter errors
* Use kwargs to pass the keep_data_uri parameter
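The effect of keep_data_uri might look roughly like the sketch below. Only the parameter name comes from the commits; `render_image`, `convert_image`, and the truncation format are assumptions for illustration.

```python
def render_image(alt: str, src: str, keep_data_uri: bool = False) -> str:
    """Emit a Markdown image, dropping the base64 payload unless asked to keep it."""
    if src.startswith("data:") and not keep_data_uri:
        # Keep only the media-type prefix, e.g. "data:image/png;base64...".
        src = src.split(",", 1)[0] + "..."
    return f"![{alt}]({src})"


def convert_image(alt: str, src: str, **kwargs) -> str:
    """Hypothetical converter entry point that forwards the option via kwargs."""
    return render_image(alt, src, keep_data_uri=kwargs.get("keep_data_uri", False))
```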
* Add module cli vector tests
* Fixed formatting, and adjusted tests.
Adjusts warning filters to be more contextual
Updates dependencies for magika and youtube-transcript-api
Updates the version to 0.1.0a5 in __about__.py