mspowerpoint_backend: handle_groups/handle_shapes crash on shapes with unrecognized shape_type #3308

@pateltejas

Bug

_walk_linear() in mspowerpoint_backend.py crashes when processing PPTX files containing <p:sp> elements that python-pptx cannot classify.
handle_groups() and handle_shapes() access shape.shape_type without catching NotImplementedError.
python-pptx's Shape.shape_type recognizes only four subtypes (placeholder, freeform, autoshape, textbox). For any <p:sp> with an empty <p:spPr> (no geometry definition, no placeholder, no txBox attribute), it raises:

NotImplementedError("Shape instance of unrecognized shape type")

This aborts the entire PPTX conversion, even though the shape's text, tables, etc. would still be extractable through the rest of the pipeline, which does not use shape_type.

Traceback

File "docling/backend/mspowerpoint_backend.py", line 684, in handle_shapes
   handle_groups(shape, parent_slide, slide_ind, doc, slide_size)
File "docling/backend/mspowerpoint_backend.py", line 711, in handle_groups
   if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
      ^^^^^^^^^^^^^^^^
File "pptx/shapes/autoshape.py", line 325, in shape_type
   raise NotImplementedError("Shape instance of unrecognized shape type")

Steps to reproduce

from pathlib import Path

from lxml import etree
from pptx import Presentation
from docling.document_converter import DocumentConverter

# Create a PPTX with a shape python-pptx can't classify
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[6])

# <p:sp> with empty <p:spPr>: no geometry, no placeholder, no txBox
sp_xml = """
<p:sp xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
      xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
      xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:nvSpPr>
    <p:cNvPr id="100" name="BadShape"/>
    <p:cNvSpPr/>
    <p:nvPr/>
  </p:nvSpPr>
  <p:spPr/>
  <p:txBody>
    <a:bodyPr/>
    <a:p><a:r><a:t>Text in an unrecognized shape</a:t></a:r></a:p>
  </p:txBody>
</p:sp>
"""
slide._element.spTree.append(etree.fromstring(sp_xml))
prs.save("/tmp/bad_shape.pptx")

# This crashes
converter = DocumentConverter()
converter.convert(Path("/tmp/bad_shape.pptx"))

Docling version

2.84.0

Python version

3.11.7

Labels

bug
