markitdown

Author	SHA1	Message	Date
jigangz	604bba13da	fix: handle deeply nested HTML that triggers RecursionError (#1644 ) * fix: handle deeply nested HTML that triggers RecursionError (#1636) Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter — silently returning the raw HTML as 'markdown' with no warning. This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown. Root cause chain: 1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive) 2. Deeply nested HTML (>~400 levels) triggers RecursionError 3. _convert() catches all Exceptions, stores in failed_attempts 4. PlainTextConverter.accepts() matches text/html via 'text/' prefix 5. PlainTextConverter.convert() returns raw HTML bytes as text 6. Caller receives 'markdown' that is actually unconverted HTML * refactor: address review feedback on RecursionError fallback - Move 'import warnings' to module top level (was inside except block) - Make test environment-independent by temporarily lowering sys.setrecursionlimit(200) instead of relying on depth=500 being sufficient on all platforms; original limit restored in finally block - Add strict=True keyword argument to opt out of the plain-text fallback and let RecursionError propagate to the caller * test: use result.markdown instead of deprecated result.text_content --------- Co-authored-by: jigangz <jigangz@github.com>	2026-04-15 15:26:44 -07:00
afourney	63cbbd9de6	Updated warning about binding to non-local interfaces. (#1653 )	2026-03-30 10:17:52 -07:00
lesyk	a6c8ac46a6	Fix O(n) memory growth in PDF conversion by calling page.close() afte… (#1612 ) * Fix O(n) memory growth in PDF conversion by calling page.close() after each page * Refactor PDF memory optimization tests for improved readability and consistency * Add memory benchmarking tests for PDF conversion with page.close() fix * Remove unnecessary blank lines in PDF memory optimization tests for cleaner code * Bump version to 0.1.6b2 in __about__.py * Update PDF conversion tests to include mimetype in StreamInfo	2026-03-16 10:35:24 -07:00
lesyk	c6308dc822	[MS] Add OCR layer service for embedded images and PDF scans (#1541 ) * Add OCR test data and implement tests for various document formats - Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy. * Enhance OCR functionality and validation in document converters - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction. * Add support for scanned PDFs with full-page OCR fallback and implement tests * Bump version to 0.1.6b1 in __about__.py * Refactor OCR services to support LLM Vision, update README and tests accordingly * Add OCR-enabled converters and ensure consistent OCR format across document types * Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters * Refactor exception imports for consistency across converters and tests * Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling * Bump version to 0.1.6b1 in __about__.py * Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing * Add comprehensive OCR test suite for various document formats - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms. * Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework. * Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs * Revert * Revert * Update READMEs * Refactor import statements for consistency and improve formatting in converter and test files	2026-03-10 09:17:17 -07:00
afourney	4a5340f93b	Bump version for release. (#1564 )	2026-02-20 11:40:57 -08:00
Bas Nijholt	6b0fd15e60	Remove onnxruntime<=1.20.1 Windows pin (#1551 )	2026-02-16 15:05:37 -08:00
afourney	2b6ec9f315	Add text/markdown to Accept header (#1554 )	2026-02-13 11:53:01 -08:00
lesyk	c83de14a9c	[MS] Extend table support for wide tables (#1552 ) * feat: enhance PDF table extraction to support complex forms and add new test cases * feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases * fix: correct formatting and improve assertions in PDF table tests	2026-02-13 10:45:39 -08:00
lesyk	7fdaefb724	Fix: PDF parsing doesn't support partially numbered lists (#1525 ) * Fix: PDF parsing doesn't support partially numbered lists * Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file * Refactor: Improve assertion formatting in partial numbering tests	2026-01-08 15:15:22 -08:00
lesyk	251dddcf0c	[MS] Update PDF table extraction to support aligned Markdown (#1499 ) * Added PDF table extraction feature with aligned Markdown (#1419) * Add PDF test files and enhance extraction tests - Added a medical report scan PDF for testing scanned PDF handling. - Included a retail purchase receipt PDF to validate receipt extraction functionality. - Introduced a multipage invoice PDF to test extraction of complex invoice structures. - Added a borderless table PDF for testing inventory reconciliation report extraction. - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity. - Enhanced existing tests to validate the order and presence of extracted content across various PDF types. * fix: update dependencies for PDF processing and improve table extraction logic * Bumped version of pdfminer.six --------- Authored-by: Ashok <ashh010101@gmail.com>	2026-01-07 16:38:45 -08:00
afourney	dde250a456	Bump versions of mammoth and pdfminer.six (#1492 ) * Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.	2025-12-01 10:11:24 -08:00
afourney	3d4fe3cdcc	Upgrade mammoth to 1.11.0 (#1452 )	2025-10-20 16:07:39 -07:00
afourney	447c047731	Test if mammoth resolves rlinks. (#1451 )	2025-10-20 15:54:05 -07:00
Meirna	8a9d8f1593	feat: add checkbox support to Markdown converter (#1208 ) This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]). Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>	2025-08-26 15:30:47 -07:00
Richard Ye	17365654c9	Handle PPTX shapes where position is None (#1161 ) * Handle shapes where position is None * Fixed recursion error, and place no-coord shapes at front	2025-08-26 15:28:17 -07:00
Yuzhong Zhang	59eb60f8cb	fix docx parse error(\n in alt) (#1163 )	2025-08-26 15:20:17 -07:00
Dmitry	459d462f29	docs: correct minor typos (#1173 )	2025-08-26 15:15:23 -07:00
Noah Zhu	c3f6cb356c	Adding support for data-src Attribute (#1226 ) * support for data-src	2025-08-26 15:11:53 -07:00
Ebrahim Tayabali	0c4d3945a0	Update README.md (#1191 ) Fix: Subtle spelling mistake fixed.	2025-08-26 15:07:27 -07:00
Utkarsh kumar	f8b60b5403	Update README.md (#1350 ) ISSUE #1339	2025-08-26 15:02:56 -07:00
Stefan Rink	b81a387616	fix: correctly pass custom llm prompt parameter (#1319 ) * fix: correctly pass custom llm prompt parameter	2025-08-26 14:51:10 -07:00
safen0s	ea1a3dfb60	Add HTML support to DocumentIntelligenceConverter (#1352 )	2025-08-26 14:34:43 -07:00
t3tra	fb1ad24833	Ensure safe ExifTool usage: require >= 12.24 (#1399 ) * feat: add version verification for ExifTool to ensure security compliance * fix: improve ExifTool version verification ---------	2025-08-26 14:25:13 -07:00
JonahDelman	1178c2e211	Fixed documentation typos in _base_converter.py (#1393 )	2025-08-26 14:23:10 -07:00
afourney	9278119bb3	Resolved an issue with linked images in docx [mammoth] (#1405 )	2025-08-26 14:20:29 -07:00
afourney	3bfb821c09	Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273 ) * Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ * Update the Dockerfile to enable plugins. No plugins are installed by default.	2025-06-03 09:35:33 -07:00
Tomasz Kalinowski	62b72284fe	pin onnxruntime on Windows (#1274 ) closes #1266	2025-05-28 13:13:51 -07:00
afourney	1dd3c83339	Promoting 0.1.2a1 to 0.1.2 (#1272 )	2025-05-28 10:04:42 -07:00
afourney	9dc982a3b1	Small changes to favor streamable HTTP over deprecated SSE (#1264 )	2025-05-23 11:39:41 -07:00
afourney	effde4767b	Preparing a pre-release of 0.1.2 (#1260 )	2025-05-21 15:24:56 -07:00
rtpacks	04bf831209	docs: fix typos (#1201 )	2025-05-21 15:12:22 -07:00
Betula-L	9fd680c366	support streamable http mcp (#1245 ) Co-authored-by: luhualin	2025-05-21 14:34:50 -07:00
Yi-Cheng Wang	131f0c7739	feat: add Document Intelligence API version selection via kwargs (#1253 ) Co-authored-by: Yi-Cheng Wang <yicheng.wang@heph-ai.com> Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:22:08 -07:00
JoshClark-git	56f7579ce2	FIX YouTube transcript errors (#1241 ) * FIX YouTube transcript errors * Fixed formatting. --------- Co-authored-by: Josh <jca351@sfu.ca> Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:17:57 -07:00
t3tra	cb421cf9ea	Chore: Make linter happy (#1256 ) * refactor: remove unused imports * fix: replace NotImplemented with NotImplementedError * refactor: resolve E722 (do not use bare 'except') * refactor: remove unused variable * refactor: remove unused imports * refactor: ignore unused imports that will be used in the future * refactor: resolve W293 (blank line contains whitespace) * refactor: resolve F541 (f-string is missing placeholders) --------- Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:02:16 -07:00
kira-offgrid	39e7252940	fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse-packages-markitdown-src-markitdown-converter_utils-docx-math-omml.py (#1251 )	2025-05-21 09:57:21 -07:00
afourney	bbcf876b18	Switched from the stdlib minidom parser to defusedxml. (#1259 )	2025-05-21 09:47:14 -07:00
createcentury	041be54471	Update README.md (#1187 ) updated subtle misspelling.	2025-04-13 09:31:40 -07:00
Turdıbek	8576f1d915	Add CSV to Markdown table conversion - fixes #1144 (#1176 ) * feat: Add CSV to Markdown table converter - Add new CsvConverter class to convert CSV files to Markdown tables - Support text/csv and application/csv MIME types - Preserve table structure with headers and data rows - Handle edge cases like empty cells and mismatched columns - Fix Azure Document Intelligence dependency handling - Register CsvConverter in MarkItDown class ---- Thanks also to @benny123tw who submitted a very similar PR in #1171	2025-04-13 09:19:00 -07:00
Sathindu	3fcd48cdfc	feat: render math equations in .docx documents (#1160 ) * feat: math equation rendering in .docx files * fix: import fix on .docx pre processing * test: add test cases for docx equation rendering * docs: add ThirdPartyNotices.md * refactor: reformatted with black	2025-03-28 15:36:38 -07:00
afourney	9e067c42b6	Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151 ) * Make it easier to use AzureKeyCredentials with Azure Doc Intelligence * Fixed mypy type error. * Added more fine-grained options over types. * Pass doc intel options further up the stack.	2025-03-26 10:44:11 -07:00
afourney	73b9d57312	Update badges (#1157 ) * Update badges in subpackages.	2025-03-25 14:52:24 -07:00
afourney	3ca57986ef	Basic SSE MCP Server for MarkItDown (#1155 ) * Added an initial minimal MCP server for MarkItDown * Added STDIO default option. * Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop * Pin mcp version.	2025-03-25 14:38:22 -07:00
afourney	c1f9a323ee	Bump version. (#1154 )	2025-03-24 23:26:30 -07:00
afourney	e928b43afb	convert_url renamed to convert_uri, and now handles data and file URIs (#1153 )	2025-03-24 21:43:04 -07:00
afourney	2ffe6ea591	Bump version. (#1150 )	2025-03-22 11:21:32 -07:00
afourney	efc55b260d	Bump version and resolve a console encoding error. (#1149 )	2025-03-21 09:27:25 -07:00
Yuzhong Zhang	52432bd228	Add support for preserving base64 encoded images (#1140 ) * optional reserve base64 string in markdown _CustomMarkdownify and pptx * add other converter para support * fix linter * Use kwarg to pass keep_data_uri para. Add module cli vector tests * Fixed formatting, and adjusted tests.	2025-03-20 18:50:23 -07:00
afourney	c0a511ecff	Updated docx file to include an image. (#1146 )	2025-03-20 12:25:56 -07:00
afourney	cd6aa41361	Adjust warning filters and update dependencies (#1143 ) Adjusts warning filters to be more contextual Updates dependencies for magika and youtube-transcript-api Updates the version to 0.1.0a5 in __about__.py	2025-03-19 22:09:14 -07:00
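
The RecursionError fix in 604bba13da describes a general pattern: try the recursive conversion first, and on RecursionError fall back to an iterative traversal that yields plain text, warning the caller unless `strict=True` is set. The sketch below illustrates that pattern with the standard library only. It is not the code from the commit: the tuple-based tree and the names `render_recursive`, `render_iterative`, `convert`, and `make_nested` are hypothetical stand-ins for the markdownify and BeautifulSoup calls mentioned in the message.

```python
import warnings


def render_recursive(node):
    """Full rendering. Recurses once per nesting level, so very deep trees overflow."""
    tag, text, children = node
    body = text + "".join(render_recursive(child) for child in children)
    # Hypothetical markup rule: bold nodes are wrapped in ** markers.
    return f"**{body}**" if tag == "b" else body


def render_iterative(node):
    """Plain-text fallback. Uses an explicit stack, so depth is not tied to the recursion limit."""
    parts = []
    stack = [node]
    while stack:
        _tag, text, children = stack.pop()
        parts.append(text)
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(children))
    return "".join(parts)


def convert(node, strict=False):
    """Try the recursive renderer; on RecursionError fall back unless strict is set."""
    try:
        return render_recursive(node)
    except RecursionError:
        if strict:
            raise
        warnings.warn(
            "Input is too deeply nested for full conversion; returning plain text.",
            RuntimeWarning,
        )
        return render_iterative(node)


def make_nested(depth):
    """Build a bold leaf wrapped in `depth` container nodes (test helper)."""
    node = ("b", "leaf", [])
    for _ in range(depth):
        node = ("div", "", [node])
    return node
```

With a shallow tree, `convert(make_nested(10))` keeps the markup. With a tree far deeper than the recursion limit, it emits a warning and returns only the text, and `strict=True` lets the RecursionError reach the caller instead.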
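
Commit 8a9d8f1593 converts HTML checkbox inputs (`<input type=checkbox>`) into Markdown checkbox syntax (`[ ]` or `[x]`). As a rough illustration of that mapping, here is a minimal sketch built on the standard library's `html.parser`. The commit itself changes the project's Markdown converter, so the class `CheckboxToMarkdown` and the function `checkboxes_to_markdown` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Collects text and replaces checkbox inputs with [ ] or [x] markers."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        is_checkbox = (attr_map.get("type") or "").lower() == "checkbox"
        if tag == "input" and is_checkbox:
            # A bare `checked` attribute is reported with the value None.
            self.parts.append("[x] " if "checked" in attr_map else "[ ] ")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html):
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Other tags are simply ignored here, so this only shows the checkbox rule, not a full HTML to Markdown conversion.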
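
Commit 8576f1d915 adds a CsvConverter that turns CSV input into a Markdown table, preserving the header and data rows and handling empty cells and mismatched column counts. The following is a minimal sketch of that idea using the standard `csv` module. The function name `csv_to_markdown` is hypothetical, and padding short rows while truncating long ones is an assumption about how mismatched columns might be handled, not a description of the actual converter.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Assumption: pad short rows with empty cells and drop extra cells.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Calling `csv_to_markdown("a,b\n1\n2,3,4\n")` produces a two-column table in which the short row gains an empty cell and the long row loses its third value.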


96 Commits