Scripts Preparing datasets for import case 1

Version 10 - Updated on 05 Nov 2017 at 6:13PM by Joachim Hansen

Description

  • awk 'NF > 0' filename
    - https://stackoverflow.com/questions/10347653/awk-remove-blank-lines
  • The command above removes blank lines; whitespace inside the remaining lines is kept, which is good
  • sed 's/"//g' - https://stackoverflow.com/questions/8008546/remove-unwanted-character-using-awk
  • The command above removes all occurrences of the character "
  • cat /home/search/Downloads/blanLinesDS.json | awk 'NF > 0' | sed 's/"//g' | sed -e 's/^/{"content":"/'
  • The command above removes blank lines, removes all occurrences of " and adds the prefix {"content":" to every line
  • https://stackoverflow.com/questions/2099471/add-a-prefix-string-to-beginning-of-each-line
  • TODO: Find a command that can add the suffix "} to each line
  • Suffix command: awk 'NF{print $0 " \"}"}' (appends a space followed by "} to every non-blank line)
  • TODO: Test whether a single line works for upload as a JSON document in Elasticsearch: {"content":"{id:969,first_name:Kaile,last_name:MacKinnon,email:kmackinnonqw@example.com,gender:Female,ip_address:88.38.12.64}, "}
    • curl -XPUT 'localhost:9200/twitter/tweet/1?pretty' -H 'Content-Type: application/json' -d'
      {
          "user" : "kimchy",
          "post_date" : "2009-11-15T14:12:12",
          "message" : "trying out Elasticsearch"
      }
      '
  • The -H 'Content-Type: application/json' header, the -d' that opens the body and the closing ' are all essential. This applies only to the single-document import method, so it matters here only for testing purposes.
  • Would the main difference between preparing data for Elasticsearch and for Solr in import case 1 be the prefix and suffix?
  • Maybe I cannot have nested curly braces {} in the JSON document?
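
The cleaning commands in the list above can be tried on a small inline sample before running them against the real file. The following is a minimal sketch, not the final script: the two sample records and the function name clean_lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads lines on stdin and applies,
# in order, the four commands from the notes:
#   1. awk 'NF > 0'               drop blank lines
#   2. sed 's/"//g'               remove every double quote
#   3. sed -e 's/^/{"content":"/' add the prefix {"content":"
#   4. awk 'NF{print $0 " \"}"}'  add the suffix (a space followed by "})
clean_lines() {
    awk 'NF > 0' |
    sed 's/"//g' |
    sed -e 's/^/{"content":"/' |
    awk 'NF{print $0 " \"}"}'
}

# Made-up sample input: two records separated by a blank line.
sample='{"id":1,"first_name":"Ann"},

{"id":2,"first_name":"Bob"},'

cleaned=$(printf '%s\n' "$sample" | clean_lines)
printf '%s\n' "$cleaned"
```

Each output line has the same shape as the test document in the TODO above: the original record, stripped of its quotes, wrapped as the string value of "content".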

curl -s -H "Content Type: application/json" -XPUT localhost:9200/accounts/person/1 {"content":"{id:969,first_name:Kaile,last_name:MacKinnon,email:kmackinnonqw@example.com,gender:Female,ip_address:88.38.12.64}, "}

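The command above does not work for importing a single document yet. Probable causes: the header name is written "Content Type" without the hyphen, and the body is not passed with -d or wrapped in single quotes.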


search@search-virtual-machine:~$ curl -XPUT 'localhost:9200/twitter/tweet/522?pretty' -H 'Content-Type: application/json' -d'
{
"content":"{id:969,first_name:Kaile,last_name:MacKinnon,email:kmackinnonqw@example.com,gender:Female,ip_address:88.38.12.64}, "
}
'
{
  "_index" : "twitter",
  "_type" : "tweet",
  "_id" : "522",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}


This worked. So for Elasticsearch import case 1 it seems I do not need to remove the character : or the curly braces {} when they appear within quotation marks (i.e. "{" or "}")?


Script for elasticsearch preprocessing:

cat /home/search/Downloads/blanLinesDS.json | awk 'NF > 0' | sed 's/"//g' | sed -e 's/^/{"content":"/' | awk 'NF{print $0 " \"}"}' |  awk ' {print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' |head -n -1 | awk 'END { print "";}{print;}'

The script performs the following actions in order, from 1 to 8:

  1. Remove all blank lines
  2. Remove all occurrences of the character "
  3. Insert the prefix {"content":" at the start of each line
  4. Append the suffix "} to the end of each line
  5. Insert the line { "index":{} } after every line
  6. Insert { "index":{} } at the start of the file
  7. After step 5, { "index":{} } is also the last line; we do not want that, so this instance is removed
  8. Print a new empty line at the end
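
The eight steps can be checked end to end on a tiny sample. This is a sketch under stated assumptions: the sample records and the function name to_bulk are hypothetical, and step 7 uses sed '$d' in place of head -n -1, because negative line counts are a GNU head extension.

```shell
#!/bin/sh
# Hypothetical wrapper (name is an assumption) around the pipeline from the notes.
# Reads raw lines on stdin and writes action/document line pairs on stdout.
# One pipeline stage per step, in the order 1 to 8:
#   1. drop blank lines            5. print an index line after every line
#   2. remove every "              6. print an index line at the very start
#   3. add the prefix              7. delete the last line (the extra index line)
#   4. add the suffix              8. append an empty line
to_bulk() {
    awk 'NF > 0' |
    sed 's/"//g' |
    sed -e 's/^/{"content":"/' |
    awk 'NF{print $0 " \"}"}' |
    awk '{print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' |
    awk 'BEGIN{print "{ \"index\":{} }";}{print;}' |
    sed '$d' |
    awk 'END { print "";}{print;}'
}

# Made-up sample input: two records separated by a blank line.
sample='{"id":1,"first_name":"Ann"},

{"id":2,"first_name":"Bob"},'

# Note: $(...) strips trailing newlines, so the empty line from step 8
# is visible only when the function writes straight to a file or pipe.
bulk=$(printf '%s\n' "$sample" | to_bulk)
printf '%s\n' "$bulk"
```

For the two sample records the output alternates one { "index":{} } line and one {"content":"..."} line, starting with an index line, and ends with a single empty line.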