Extracting Text From MS Word Document Uploaded Through FileUpload From ipywidgets In Jupyter Notebook

July 24, 2024

I am trying to allow a user to upload an MS Word file and then run a certain function that takes a string as its input argument. I am uploading the Word file through FileUpload, however I am g

Solution 1:

Modern MS Word files (.docx) are actually zip files.

The text (but not the page headers) is stored inside an XML document called word/document.xml in the zip file.
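To make the zip structure concrete, here is a minimal sketch that reads word/document.xml using only the standard library. The function name paragraphs_from_docx_bytes is made up for illustration, and the sketch assumes the text sits in ordinary w:t runs inside w:p paragraphs; it ignores headers, footnotes and other parts of the archive.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_docx_bytes(data):
    """Return the text of each w:p paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # The visible text lives in w:t elements; join the runs of one paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Because this walks every w:p element in the file, paragraphs that sit inside table cells are included as well.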

The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.

>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...

Note that this only extracts the text from paragraphs, not, for example, the text from tables.
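If the table text is needed too, something along these lines might work. It is a sketch that relies on the tables, rows, cells and text attributes that python-docx exposes on an opened document; table_texts is a hypothetical helper name, and only top-level tables are visited (nested tables are not).

```python
def table_texts(document):
    """Collect cell text from every top-level table in the document.

    Returns one list per table, each holding one list of strings per row.
    """
    tables = []
    for table in document.tables:
        rows = []
        for row in table.rows:
            rows.append([cell.text for cell in row.cells])
        tables.append(rows)
    return tables
```

It would be called with an already opened document, e.g. table_texts(docx.Document('grokonez.docx')).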

Edit:

I want to be able to upload the MS file through the FileUpload widget.

There are a couple of ways you can do that.

First, isolate the actual file data. In the ipywidgets version used here, upload.data holds the raw content of each uploaded file (see here), so the first file is at index 0. So do something like:

rawdata = upload.data[0]

(Note that this format has changed between versions of ipywidgets. The above example is from the documentation of the latest version at the time of writing. Read the documentation for the version you have installed, or inspect the data in IPython, and adjust accordingly.)
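As an illustration of adjusting to the installed version, a small helper could accept either shape of upload.value. This is only a sketch: it assumes ipywidgets 7.x exposes value as a dict keyed by filename and 8.x exposes it as a tuple of dicts, each entry carrying a 'content' item. The name first_upload_bytes is hypothetical, and the layout should be checked against the version actually in use.

```python
def first_upload_bytes(value):
    """Return the content of the first uploaded file as bytes.

    `value` is expected to be the FileUpload widget's `value` attribute.
    """
    if not value:
        raise ValueError("no file has been uploaded yet")
    if isinstance(value, dict):
        # Assumed 7.x layout: {filename: {'metadata': ..., 'content': bytes}}
        entry = next(iter(value.values()))
    else:
        # Assumed 8.x layout: ({'name': ..., 'content': memoryview, ...}, ...)
        entry = value[0]
    # bytes() leaves bytes unchanged and copies a memoryview into bytes.
    return bytes(entry["content"])
```

With this, rawdata = first_upload_bytes(upload.value) would stand in for the upload.data[0] line.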

One option is to write rawdata to a file, e.g. foo.docx, and open that. That would certainly work, but it seems somewhat inelegant.

Alternatively, docx.Document can work with file-like objects, so you could create an io.BytesIO object and use that.

Like this:

foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
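For comparison, the write-to-disk route mentioned earlier could be sketched as follows. The helper name write_upload_to_tempfile is made up here, and a temporary file is used instead of a fixed name such as foo.docx so that nothing in the working directory is overwritten.

```python
import os
import tempfile


def write_upload_to_tempfile(rawdata, suffix=".docx"):
    """Write the uploaded bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(rawdata)
    return path
```

The returned path can then be passed to docx.Document(path), and the file deleted with os.remove(path) once the text has been read. The io.BytesIO approach avoids this cleanup step entirely.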

Solution 2:

After some tweaking of @Roland Smith's great suggestions, the following code finally worked:

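import io
import docx
from docx import Document
import ipywidgets as widgets  # provides widgets.FileUpload

upload = widgets.FileUpload()
upload  # display the widget, then pick a .docx file before running the rest

rawdata = upload.data[0]  # raw bytes of the first uploaded file
test = io.BytesIO(rawdata)  # wrap the bytes in a file-like object
doc = Document(test)

for p in doc.paragraphs:
    print(p.text)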
