Extracting Text From MS Word Document Uploaded Through FileUpload From Ipywidgets In Jupyter Notebook
Solution 1:
Modern MS Word files (.docx) are actually zip files.
The text (but not the page headers) is stored inside an XML document called word/document.xml in the zip file.
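To make the zip structure concrete, here is a minimal sketch that reads word/document.xml with only the standard library. The function name paragraphs_from_docx_bytes is hypothetical, and the sketch assumes the file is a regular, unencrypted .docx. It walks every paragraph element in the XML, including those nested inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_from_docx_bytes(data):
    """Return the text of each <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of one paragraph is split across one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

This is only meant to show where the text lives; for real documents the python-docx module described next is usually the more convenient route.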
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can also read existing ones. Example from here:
>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs, not, for example, the text from tables.
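If the table text is needed as well, a sketch along these lines may help. It relies on the tables, rows, cells and text attributes that python-docx exposes on a loaded document; the helper name collect_text is hypothetical, and it simply appends cell text after the paragraph text rather than preserving document order.

```python
def collect_text(doc):
    """Gather paragraph text first, then the text of every table cell."""
    lines = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                lines.append(cell.text)
    return lines
```

It would be called as collect_text(docx.Document('grokonez.docx')).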
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data holds the raw contents of the uploaded files (a list of bytes objects in ipywidgets 7.x), see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed across versions of ipywidgets. The above example is from the documentation of the latest version. Read the documentation for the version you have installed, or inspect the data in IPython, and adjust accordingly.)
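As an illustration of adjusting to the version, here is a hedged sketch that takes the widget's value attribute instead of data. It assumes two shapes: in ipywidgets 7.x a dict keyed by filename, and in 8.x a tuple of per-file records whose content is a memoryview. The helper name first_upload_bytes is hypothetical; verify the shapes against your installed version before relying on it.

```python
def first_upload_bytes(value):
    """Return the raw bytes of the first uploaded file from a FileUpload value.

    Assumed shapes (check your installed version):
      ipywidgets 7.x: {filename: {"metadata": {...}, "content": b"..."}}
      ipywidgets 8.x: ({"name": ..., "content": memoryview(...), ...},)
    """
    if not value:
        raise ValueError("no file has been uploaded yet")
    if isinstance(value, dict):
        # 7.x style: dict keyed by filename, take the first entry.
        first = next(iter(value.values()))
    else:
        # 8.x style: tuple of per-file records.
        first = value[0]
    # bytes() accepts both bytes and memoryview content.
    return bytes(first["content"])
```

Usage would be rawdata = first_upload_bytes(upload.value).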
- Write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
- docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
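For completeness, the first option (writing the data to disk) could be sketched as follows, using a temporary file rather than a fixed name like foo.docx. The helper name save_upload is hypothetical.

```python
import tempfile


def save_upload(rawdata, suffix=".docx"):
    """Write uploaded bytes to a new temporary file and return its path."""
    # delete=False keeps the file after it is closed, so it can be reopened by name.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(rawdata)
    return handle.name
```

The returned path can then be passed to docx.Document. Since the file is not deleted automatically, remove it yourself (for example with os.remove) once the document has been read.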
Solution 2:
Building on @Roland Smith's great suggestions, the following code finally worked:
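import io
import docx
import ipywidgets as widgets
from docx import Document

# Display the upload widget, then pick a .docx file in the notebook
upload = widgets.FileUpload()
upload

# Run this in a later cell, after a file has been uploaded
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
    print(p.text)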