Datasets for Elasticsearch preprocessing

Version 3 - Updated on 05 Nov 2017 at 7:05PM by Joachim Hansen

Description

Script for Elasticsearch preprocessing:

cat /home/search/Downloads/blanLinesDS.json | awk 'NF > 0' | sed 's/"//g' | sed -e 's/^/{"content":"/' | awk 'NF{print $0 " \"}"}' |  awk ' {print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' |head -n -1 | awk 'END { print "";}{print;}'

The script performs the following steps, in order from 1 to 8:

  1. Remove all blank lines
  2. Remove all occurrences of the character "
  3. Insert the prefix {"content":" at the start of each line
  4. Append the suffix "} (preceded by a space) to the end of each line
  5. Insert the line { "index":{} } after every line
  6. Insert { "index":{} } at the start of the file
  7. The last line is now also { "index":{} }. We don't want that, so this instance is removed with head -n -1
  8. Print an empty line at the end of the output
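
The eight steps can also be sketched as a single awk program, which may be easier to read and test on a small sample. This is only a sketch, not the script used above: the function name to_bulk and the sample input are hypothetical, and it assumes each input line should become one document, exactly as in the original pipeline.

```shell
# Hypothetical helper: reads plain text on stdin, writes bulk NDJSON on stdout.
to_bulk() {
  awk '
    NF > 0 {                               # step 1: skip blank lines
      gsub(/"/, "")                        # step 2: remove every double quote
      print "{ \"index\":{} }"             # steps 5-7: one action line before each document
      print "{\"content\":\"" $0 " \"}"    # steps 3-4: add the prefix and the suffix
    }
    END { print "" }                       # step 8: empty line at the end
  '
}

# Small sample: a blank line and a quoted word.
printf 'first line\n\nsecond "quoted" line\n' | to_bulk
```

For the sample input this prints an action line before each of the two documents, {"content":"first line "} and {"content":"second quoted line "}, followed by one empty line. Writing the action line before each document avoids having to add one at the start and remove one at the end.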


curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/index_test3/working/_bulk?pretty --data-binary @/home/search/Downloads/pretest.json >> /home/search/Downloads/indexoutput2.txt

I was able to import the preprocessed dataset. The test dataset contained the characters [ ] { } : , within quotation marks, and that worked fine.

curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/index_test4/working/_bulk?pretty --data-binary @/home/search/Downloads/pretest.json | head -20
{
  "took" : 91,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "index_test4",
        "_type" : "working",
        "_id" : "AV-NVqBEFyZDNdvXeuVa",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },

I added | head -20 to print only the first 20 lines of the response from the indexing process.

As the response shows, the request took 91 milliseconds and reported no errors ("errors" : false).
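
If only the timing and the error flag are of interest, the two top-level fields can be filtered out of the response instead of reading the first 20 lines. This is a minimal sketch, assuming the response is pretty-printed (?pretty) so that each field sits on its own line. The function name bulk_summary is hypothetical, and the sample input below is a shortened copy of the response above rather than a live request.

```shell
# Hypothetical helper: keep only the top-level "took" and "errors" fields
# of a pretty-printed _bulk response read from stdin.
bulk_summary() {
  grep -E '^[[:space:]]*"(took|errors)"[[:space:]]*:' |
    head -n 2 |                                      # the top-level fields come first
    sed -E 's/^[[:space:]]*//; s/,[[:space:]]*$//'   # trim indentation and trailing comma
}

# Shortened sample response; with a real request, pipe the curl output
# into bulk_summary in place of head -20.
printf '{\n  "took" : 91,\n  "errors" : false,\n  "items" : [\n' | bulk_summary
```

For the sample this prints "took" : 91 and "errors" : false on two separate lines.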