307 Commits

Author SHA1 Message Date
afourney a51f725d7f Clarify security posture in READMEs (#1807) 2026-04-20 10:52:17 -07:00
jigangz 604bba13da fix: handle deeply nested HTML that triggers RecursionError (#1644)
* fix: handle deeply nested HTML that triggers RecursionError (#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

* refactor: address review feedback on RecursionError fallback

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
  sys.setrecursionlimit(200) instead of relying on depth=500 being
  sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller

* test: use result.markdown instead of deprecated result.text_content

---------

Co-authored-by: jigangz <jigangz@github.com>
2026-04-15 15:26:44 -07:00
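
The fallback described in #1644 can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `render_recursive`, `extract_text_iterative`, `convert`, and `make_nested` are hypothetical names, and a nested tuple tree stands in for the DOM that the real fix walks with markdownify and BeautifulSoup's get_text().

```python
import warnings


def render_recursive(node):
    """Recursive renderer, standing in for a recursive DOM traversal."""
    text, children = node
    return text + "".join(render_recursive(child) for child in children)


def extract_text_iterative(node):
    """Iterative fallback: an explicit stack copes with any nesting depth."""
    parts, stack = [], [node]
    while stack:
        text, children = stack.pop()
        parts.append(text)
        # Push children reversed so they are visited in document order.
        stack.extend(reversed(children))
    return "".join(parts)


def convert(node, strict=False):
    """Try the recursive path; on RecursionError warn and fall back.

    With strict=True the RecursionError propagates to the caller instead.
    """
    try:
        return render_recursive(node)
    except RecursionError:
        if strict:
            raise
        warnings.warn(
            "Input nested too deeply; returning plain text instead of markdown.",
            RuntimeWarning,
        )
        return extract_text_iterative(node)


def make_nested(depth):
    """Build a tree `depth` levels deep without using recursion."""
    node = ("leaf", [])
    for _ in range(depth):
        node = ("", [node])
    return node
```

Catching the error inside the converter, rather than in a generic dispatcher, is what lets the caller learn that the output was degraded.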
afourney 63cbbd9de6 Updated warning about binding to non-local interfaces. (#1653) 2026-03-30 10:17:52 -07:00
lesyk a6c8ac46a6 Fix O(n) memory growth in PDF conversion by calling page.close() afte… (#1612)
* Fix O(n) memory growth in PDF conversion by calling page.close() after each page

* Refactor PDF memory optimization tests for improved readability and consistency

* Add memory benchmarking tests for PDF conversion with page.close() fix

* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code

* Bump version to 0.1.6b2 in __about__.py

* Update PDF conversion tests to include mimetype in StreamInfo
2026-03-16 10:35:24 -07:00
lesyk c6308dc822 [MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update READMEs

* Refactor import statements for consistency and improve formatting in converter and test files
2026-03-10 09:17:17 -07:00
afourney 4a5340f93b Bump version for release. (#1564) v0.1.5 2026-02-20 11:40:57 -08:00
Bas Nijholt 6b0fd15e60 Remove onnxruntime<=1.20.1 Windows pin (#1551) 2026-02-16 15:05:37 -08:00
afourney 2b6ec9f315 Add text/markdown to Accept header (#1554) 2026-02-13 11:53:01 -08:00
lesyk c83de14a9c [MS] Extend table support for wide tables (#1552)
* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests
2026-02-13 10:45:39 -08:00
lesyk 7fdaefb724 Fix: PDF parsing doesn't support partially numbered lists (#1525)
* Fix: PDF parsing doesn't support partially numbered lists

* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file

* Refactor: Improve assertion formatting in partial numbering tests
v0.1.5b1
2026-01-08 15:15:22 -08:00
lesyk 251dddcf0c [MS] Update PDF table extraction to support aligned Markdown (#1499)
* Added PDF table extraction feature with aligned Markdown (#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>
2026-01-07 16:38:45 -08:00
afourney dde250a456 Bump versions of mammoth and pdfminer.six (#1492)
* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.
v0.1.4
2025-12-01 10:11:24 -08:00
afourney 3d4fe3cdcc Upgrade mammoth to 1.11.0 (#1452) 2025-10-20 16:07:39 -07:00
afourney 447c047731 Test if mammoth resolves rlinks. (#1451) 2025-10-20 15:54:05 -07:00
Meirna 8a9d8f1593 feat: add checkbox support to Markdown converter (#1208)
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
v0.1.3
2025-08-26 15:30:47 -07:00
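
As a rough illustration of the checkbox mapping in #1208, the sketch below uses only the standard library's `html.parser`. The class and function names are hypothetical, and the actual converter may be structured quite differently.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Replace <input type=checkbox> elements with [ ] or [x]; keep the text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "checkbox":
            # A bare `checked` attribute is reported with the value None,
            # so test for the key rather than for a truthy value.
            self.parts.append("[x]" if "checked" in attributes else "[ ]")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html):
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```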
Richard Ye 17365654c9 Handle PPTX shapes where position is None (#1161)
* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front
2025-08-26 15:28:17 -07:00
Yuzhong Zhang 59eb60f8cb fix docx parse error(\n in alt) (#1163) 2025-08-26 15:20:17 -07:00
Dmitry 459d462f29 docs: correct minor typos (#1173) 2025-08-26 15:15:23 -07:00
Noah Zhu c3f6cb356c Adding support for data-src Attribute (#1226)
* support for data-src
2025-08-26 15:11:53 -07:00
Ebrahim Tayabali 0c4d3945a0 Update README.md (#1191)
Fix: Subtle spelling mistake fixed.
2025-08-26 15:07:27 -07:00
Utkarsh kumar f8b60b5403 Update README.md (#1350)
ISSUE #1339
2025-08-26 15:02:56 -07:00
[W]DOS_ 16ca285d30 Update README.md (#1335)
Fix typo in README.md
2025-08-26 14:55:58 -07:00
Stefan Rink b81a387616 fix: correctly pass custom llm prompt parameter (#1319)
* fix: correctly pass custom llm prompt parameter
2025-08-26 14:51:10 -07:00
safen0s ea1a3dfb60 Add HTML support to DocumentIntelligenceConverter (#1352) 2025-08-26 14:34:43 -07:00
dependabot[bot] b6e5da8874 Bump actions/checkout from 4 to 5 (#1394)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
2025-08-26 14:27:38 -07:00
t3tra fb1ad24833 Ensure safe ExifTool usage: require >= 12.24 (#1399)
* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------
2025-08-26 14:25:13 -07:00
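
A version gate like the one in #1399 might look roughly like this. It is a sketch under assumptions: the helper names and the `MIN_EXIFTOOL` constant are hypothetical, and the input is assumed to be a dotted numeric string such as the one printed by `exiftool -ver`.

```python
MIN_EXIFTOOL = (12, 24)  # minimum version named in the commit title


def parse_version(text):
    """Turn a dotted version string such as '12.24' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_supported(version_text):
    """Return True when the reported version is at least MIN_EXIFTOOL."""
    return parse_version(version_text) >= MIN_EXIFTOOL
```

Comparing integer tuples avoids the pitfalls of comparing version strings or floats directly.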
JonahDelman 1178c2e211 Fixed documentation typos in _base_converter.py (#1393) 2025-08-26 14:23:10 -07:00
afourney 9278119bb3 Resolved an issue with linked images in docx [mammoth] (#1405) 2025-08-26 14:20:29 -07:00
onefloid da7bcea527 docs: rephrase sentence (#1278) 2025-06-03 21:09:25 -07:00
afourney 3bfb821c09 Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273)
* Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ

* Update the Dockerfile to enable plugins. No plugins are installed by default.
2025-06-03 09:35:33 -07:00
Tomasz Kalinowski 62b72284fe pin onnxruntime on Windows (#1274)
closes #1266
2025-05-28 13:13:51 -07:00
afourney 1dd3c83339 Promoting 0.1.2a1 to 0.1.2 (#1272) v0.1.2 2025-05-28 10:04:42 -07:00
afourney 9dc982a3b1 Small changes to favor streamable HTTP over deprecated SSE (#1264) 2025-05-23 11:39:41 -07:00
afourney effde4767b Preparing a pre-release of 0.1.2 (#1260) v0.1.2a1 2025-05-21 15:24:56 -07:00
rtpacks 04bf831209 docs: fix typos (#1201) 2025-05-21 15:12:22 -07:00
Betula-L 9fd680c366 support streamable http mcp (#1245)
Co-authored-by: luhualin
2025-05-21 14:34:50 -07:00
一I 38261fd31c Update Python version requirement and add .cursorrules to .gitignore (#1249)
* update markdown
* Update and install Python version suggestions
* Update README with prerequisites.
---------

Co-authored-by: Lucas Liu <lucas@LucasdeMacBook-Pro.local>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:47:29 -07:00
Yi-Cheng Wang 131f0c7739 feat: add Document Intelligence API version selection via kwargs (#1253)
Co-authored-by: Yi-Cheng Wang <yicheng.wang@heph-ai.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:22:08 -07:00
JoshClark-git 56f7579ce2 FIX YouTube transcript errors (#1241)
* FIX YouTube transcript errors

* Fixed formatting.

---------

Co-authored-by: Josh <jca351@sfu.ca>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:17:57 -07:00
t3tra cb421cf9ea Chore: Make linter happy (#1256)
* refactor: remove unused imports

* fix: replace NotImplemented with NotImplementedError

* refactor: resolve E722 (do not use bare 'except')

* refactor: remove unused variable

* refactor: remove unused imports

* refactor: ignore unused imports that will be used in the future

* refactor: resolve W293 (blank line contains whitespace)

* refactor: resolve F541 (f-string is missing placeholders)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:02:16 -07:00
kira-offgrid 39e7252940 fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse-packages-markitdown-src-markitdown-converter_utils-docx-math-omml.py (#1251) 2025-05-21 09:57:21 -07:00
afourney bbcf876b18 Switched from the stdlib minidom parser to defusedxml. (#1259) 2025-05-21 09:47:14 -07:00
createcentury 041be54471 Update README.md (#1187)
updated subtle misspelling.
2025-04-13 09:31:40 -07:00
lentil32 ebe2684b3d chore: fix typo in README.md (#1175)
* chore: fix typo in README.md
2025-04-13 09:29:16 -07:00
Turdıbek 8576f1d915 Add CSV to Markdown table conversion - fixes #1144 (#1176)
* feat: Add CSV to Markdown table converter

- Add new CsvConverter class to convert CSV files to Markdown tables
- Support text/csv and application/csv MIME types
- Preserve table structure with headers and data rows
- Handle edge cases like empty cells and mismatched columns
- Fix Azure Document Intelligence dependency handling
- Register CsvConverter in MarkItDown class

----

Thanks also to @benny123tw who submitted a very similar PR in #1171
2025-04-13 09:19:00 -07:00
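
The CSV-to-table conversion from #1176 can be approximated in a few lines. This is an illustrative sketch, not the CsvConverter class itself: `csv_to_markdown` is a hypothetical helper that treats the first row as the header and pads or trims the other rows to the header width.

```python
import csv
import io


def csv_to_markdown(text):
    """Convert CSV text to a Markdown table, using the first row as header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])

    def format_row(row):
        # Pad short rows with empty cells and drop any extra cells.
        cells = (row + [""] * width)[:width]
        return "| " + " | ".join(cells) + " |"

    lines = [format_row(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(format_row(row) for row in rows[1:])
    return "\n".join(lines)
```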
Sathindu 3fcd48cdfc feat: render math equations in .docx documents (#1160)
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
2025-03-28 15:36:38 -07:00
afourney 9e067c42b6 Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151)
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
2025-03-26 10:44:11 -07:00
afourney 9a951055f0 Update readme to point to the mcp package. (#1158)
* Updated readme with link to the MCP package.
2025-03-25 15:00:04 -07:00
afourney 73b9d57312 Update badges (#1157)
* Update badges in subpackages.
2025-03-25 14:52:24 -07:00
afourney 3ca57986ef Basic SSE MCP Server for MarkItDown (#1155)
* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop
* Pin mcp version.
2025-03-25 14:38:22 -07:00