Bug
_walk_linear() in mspowerpoint_backend.py crashes when processing PPTX files containing <p:sp> elements that python-pptx cannot classify.
handle_groups() and handle_shapes() access shape.shape_type without catching NotImplementedError.
python-pptx's Shape.shape_type only recognizes 4 subtypes (placeholder, freeform, autoshape, textbox). For any <p:sp> with an empty <p:spPr> — no geometry definition, no placeholder, no txBox attribute — it raises:
NotImplementedError("Shape instance of unrecognized shape type")
This kills the entire PPTX conversion, even though the shape's text, tables, etc. would be extractable through the rest of the pipeline (which doesn't use shape_type).
Traceback
  File "docling/backend/mspowerpoint_backend.py", line 684, in handle_shapes
    handle_groups(shape, parent_slide, slide_ind, doc, slide_size)
  File "docling/backend/mspowerpoint_backend.py", line 711, in handle_groups
    if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
       ^^^^^^^^^^^^^^^^
  File "pptx/shapes/autoshape.py", line 325, in shape_type
    raise NotImplementedError("Shape instance of unrecognized shape type")
Steps to reproduce
from lxml import etree
from pptx import Presentation
from docling.document_converter import DocumentConverter
from pathlib import Path
# Create a PPTX with a shape python-pptx can't classify
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[6])
# <p:sp> with empty <p:spPr> — no geometry, no placeholder, no txBox
sp_xml = """
<p:sp xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
      xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
      xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:nvSpPr>
    <p:cNvPr id="100" name="BadShape"/>
    <p:cNvSpPr/>
    <p:nvPr/>
  </p:nvSpPr>
  <p:spPr/>
  <p:txBody>
    <a:bodyPr/>
    <a:p><a:r><a:t>Text in an unrecognized shape</a:t></a:r></a:p>
  </p:txBody>
</p:sp>
"""
slide._element.spTree.append(etree.fromstring(sp_xml))
prs.save("/tmp/bad_shape.pptx")
# This crashes
converter = DocumentConverter()
converter.convert(Path("/tmp/bad_shape.pptx"))
Docling version
2.84.0
Python version
3.11.7