Index MS Office documents in Elasticsearch

We can use Elasticsearch to index MS Office documents (.doc, .docx, .xls, .xlsx and .csv files) and store the documents themselves in MongoDB.

The diagram shows four steps: index a document, upload the document to the DB, search for the document by keyword, and download the document via a link.

1. Index a document

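Word and PDF formats

The script below recreates the docx index, maps the file field with the attachment type (which requires the mapper-attachments plugin to be installed in Elasticsearch), and then indexes a Word file and a PDF as base64-encoded content.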

curl -X DELETE http://localhost:9200/docx
curl -X PUT http://localhost:9200/docx
echo;
curl -X POST http://localhost:9200/docx/step/_mapping -d '{
      "step": {
        "properties": {
          "file": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "file": {
                "type": "string",
                "term_vector":"with_positions_offsets",
                "store": true
              }
            }
          }
        }
      }
}'
echo;
echo '>>> Index the document'
curl -XPOST "http://localhost:9200/docx/step/read_doc?pretty=1" -d '
{
    "file" : {
        "_content" : "'`base64 read.doc | perl -pe 's/\n/\\n/g'`'"
    }
}'
curl -XPOST "http://localhost:9200/docx/step/pdf?pretty=1" -d '
{
    "file" : {
        "_content" : "'`base64 auto.pdf | perl -pe 's/\n/\\n/g'`'"
    }
}'
echo;
curl -X POST http://localhost:9200/docx/_refresh
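
The inline base64 substitution in the script above can be hard to read and to reuse. As a minimal sketch, the payload can be built by a small helper instead. The function name build_attachment_json is hypothetical, and unlike the script above it strips line breaks from the base64 output rather than turning them into \n escapes; both approaches should yield a valid JSON string.

```shell
# Hypothetical helper: print the JSON body expected by the attachment mapping.
# $1 = path of the file to index
build_attachment_json() {
  # base64 may wrap its output; remove the line breaks so the JSON string stays valid.
  # The base64 alphabet contains no characters that need JSON escaping.
  printf '{"file":{"_content":"%s"}}\n' "$(base64 < "$1" | tr -d '\n')"
}
```

It could then be piped into curl, which reads the request body from stdin with -d @-, for example: build_attachment_json read.doc | curl -XPOST "http://localhost:9200/docx/step/read_doc?pretty=1" -d @-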

Index Excel files

Excel-2-Elasticsearch (xl2es.pl) is a small, quick Perl script that injects records from MS Excel files (.xlsx as well as .xls) directly into Elasticsearch.

curl -X DELETE http://localhost:9200/x12es
curl -X PUT http://localhost:9200/x12es
echo;
echo '>>> Index the document'
./xl2es.pl -i x12es -t xldata -s localhost:9200 -x example.xls -v

echo;
curl -X POST http://localhost:9200/x12es/_refresh
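
xl2es.pl covers .xls and .xlsx files. For plain .csv files, one possible approach (not what xl2es.pl does internally, as far as this post knows) is to convert each row into a pair of lines for the Elasticsearch bulk API. The sketch below is deliberately naive: the function name csv_to_bulk is hypothetical, the first row is assumed to be a header, every value is emitted as a string, and values are assumed to contain no commas, quotes, or backslashes.

```shell
# Hypothetical helper: turn a simple CSV file into bulk API lines.
# $1 = index name, $2 = type name, $3 = CSV file with a header row
csv_to_bulk() {
  awk -F, -v idx="$1" -v typ="$2" '
    # Remember the column names from the header row.
    NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
    {
      # Action line, followed by the document itself.
      printf "{\"index\":{\"_index\":\"%s\",\"_type\":\"%s\"}}\n", idx, typ
      line = ""
      for (i = 1; i <= n; i++)
        line = line (i > 1 ? "," : "") "\"" name[i] "\":\"" $i "\""
      print "{" line "}"
    }' "$3"
}
```

The output could be sent with: csv_to_bulk x12es xldata example.csv | curl -XPOST http://localhost:9200/_bulk --data-binary @-
Here --data-binary matters because the bulk API needs the line breaks preserved. The _type field matches the older Elasticsearch versions used in this post; newer versions may reject it.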

2. Upload the documents to the DB via a service

Solution 1: mongofiles put

mongofiles put example.xls --local example.xls -h localhost --db test

Solution 2: Store an uploaded file in MongoDB GridFS using mgo

mgo (pronounced "mango") is a MongoDB driver for the Go language that implements a rich and well-tested selection of features under a very simple API following standard Go idioms.

See attached file - khargo.go

Note: if your files are all smaller than the 16 MB BSON document size limit, consider storing each file manually within a single document instead of using GridFS.
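
The size check in that note can be sketched as a small helper. The function name storage_strategy and its two output labels are hypothetical, and 16 MB is taken to mean 16 * 1024 * 1024 bytes. A real document also carries field names and other metadata, so files close to the limit are safer in GridFS.

```shell
# 16 MB BSON document size limit, assumed to be 16 * 1024 * 1024 bytes.
BSON_MAX_BYTES=16777216

# Hypothetical helper: suggest how to store a file based on its size.
# $1 = path of the file
storage_strategy() {
  # The arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$1") ))
  if [ "$size" -lt "$BSON_MAX_BYTES" ]; then
    echo "single-document"
  else
    echo "gridfs"
  fi
}
```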

3. Query by keyword to get a document link

Users can have Elasticsearch perform a full-text search by keyword and get the link to each document that matched.

Query Word files

echo; echo ">>> Search for function 'program'"

curl "http://localhost:9200/docx/_search?pretty=true&q=program&fields=*"

Query the Excel files

echo ">>> Search for Name: 'Anand'"
curl -XGET 'http://localhost:9200/x12es/_search?q=Name:Anand&pretty=true'
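
The queries above work because their search terms contain no spaces or other special characters. For arbitrary keywords, the q parameter should be percent-encoded first. The following is a rough sketch: the function names urlencode and search_url are hypothetical, the host and port are the ones used throughout this post, and the encoder has only been thought through for ASCII input.

```shell
# Hypothetical helper: percent-encode a string for use in a URL query (ASCII input assumed).
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

# Hypothetical helper: build a URI search URL.
# $1 = index name, $2 = query string
search_url() {
  printf 'http://localhost:9200/%s/_search?q=%s&pretty=true\n' "$1" "$(urlencode "$2")"
}
```

A search could then be run with: curl -XGET "$(search_url x12es 'Name:Anand')"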

4. Download the document from MongoDB GridFS via link

GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB.

Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document [2]. By default, GridFS limits chunk size to 255 kB. GridFS uses two collections to store files: one collection stores the file chunks, and the other stores file metadata. When you query a GridFS store for a file, the driver or client will reassemble the chunks as needed.
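
To make the chunking idea concrete, here is a small local illustration that uses split and cat rather than MongoDB itself. It does not talk to GridFS; it only mimics the split and reassemble steps. The function names are hypothetical, and the default chunk size is assumed to mean 255 * 1024 bytes.

```shell
# Default GridFS chunk size, assumed to be 255 * 1024 bytes.
CHUNK_BYTES=261120

# Number of chunks needed for a file of $1 bytes (rounded up).
chunk_count() {
  echo $(( ($1 + CHUNK_BYTES - 1) / CHUNK_BYTES ))
}

# Split file $1 into chunk files named chunk_aa, chunk_ab, ... inside directory $2.
split_into_chunks() {
  split -b "$CHUNK_BYTES" "$1" "$2/chunk_"
}

# Concatenate the chunks in directory $1, in name order, into file $2.
reassemble_chunks() {
  cat "$1"/chunk_* > "$2"
}
```

In real GridFS, each chunk is a document in the chunks collection and the ordering comes from the file metadata and a chunk sequence number rather than from file names.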

References

  1. https://github.com/chetanganatra/Excel-2-Elasticsearch
  2. http://docs.mongodb.org/manual/core/gridfs/
  3. https://labix.org/mgo
  4. https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

About hustbill

2008-today Software System Specialist at Diebold Corporation. 2005-2008 Senior Software engineer at Alcatel-Lucent.