awk 'NF > 0' filename
- https://stackoverflow.com/questions/10347653/awk-remove-blank-lines - The command removes blank lines, including lines that contain only whitespace; whitespace inside other lines is kept, which is good.
- sed 's/"//g' - https://stackoverflow.com/questions/8008546/remove-unwanted-character-using-awk
- The command above removes all occurrences of the character "
- cat /home/search/Downloads/blanLinesDS.json | awk 'NF > 0' | sed 's/"//g' | sed -e 's/^/{"content":"/'
- The command above removes blank lines, removes all occurrences of " and adds the prefix {"content":" to every line
- https://stackoverflow.com/questions/2099471/add-a-prefix-string-to-beginning-of-each-line
- TODO: Find a command that adds the suffix "} to each line
- Suffix command: awk 'NF{print $0 " \"}"}' (appends a space followed by "} to every non-blank line)
- TODO: Test whether one line works for upload as a JSON document in Elasticsearch: {"content":"{id:969,first_name:Kaile,last_name:MacKinnon,email:kmackinnonqw@example.com,gender:Female,ip_address:88.38.12.64}, "}
- curl -XPUT 'localhost:9200/twitter/tweet/1?pretty' -H 'Content-Type: application/json' -d'
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
'
- The -H 'Content-Type: application/json' header, the -d' that opens the body and the closing ' are all essential. This only applies to the single-document import method, but it matters for testing here and now.
- Would the main difference between preparing data for Elasticsearch and for Solr in import case 1 be the prefix and suffix?
- Maybe I cannot have nested curly braces {} in the JSON document?
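The prefix and suffix steps noted above could be folded into one helper, which would also cover the suffix TODO. This is only a sketch: the name wrap_lines is hypothetical, and unlike the awk suffix command it adds "} without a leading space.

```shell
# wrap_lines (hypothetical name): read raw lines on stdin and
# write one {"content":"..."} object per non-blank line.
wrap_lines() {
    # 1. awk drops empty and whitespace-only lines
    # 2. sed removes every double quote FIRST, so the quotes added
    #    by the prefix and suffix below are not stripped again
    # 3. sed adds the prefix at line start (^) and the suffix at line end ($)
    awk 'NF > 0' | sed -e 's/"//g' -e 's/^/{"content":"/' -e 's/$/"}/'
}
```

Usage would look like: wrap_lines < /home/search/Downloads/blanLinesDS.json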
curl -s -H "Content Type: application/json" -XPUT localhost:9200/accounts/person/1 {"content":"{id:969,first_name:Kaile,last_name:MacKinnon,email:kmackinnonqw@example.com,gender:Female,ip_address:88.38.12.64}, "}
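The command above does not work for importing a single document yet: the header name is misspelled (Content Type instead of Content-Type), and the body is passed without -d and without surrounding quotes.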
search@search-virtual-machine:~$ curl -XPUT 'localhost:9200/twitter/tweet/522?pretty' -H 'Content-Type: application/json' -d'
{
"content":"{id:969,first_name:Kaile,last_name:MacKinnon,email:kmackinnonqw@example.com,gender:Female,ip_address:88.38.12.64}, "
}
'
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "522",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"created" : true
}
This worked. So for Elasticsearch import case 1 it seems I do not need to remove the character : or the curly braces { and } when they appear inside quotation marks (i.e. inside a JSON string value).
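This fits the JSON rules: inside a string value only the double quote, the backslash and control characters need special handling, so : { and } can stay as they are. As an alternative to deleting every ", the quotes could be escaped instead. A minimal sketch, assuming a hypothetical helper name json_escape and input without tabs or other control characters (those are not handled here):

```shell
# json_escape (hypothetical name): make each stdin line safe to place
# between the quotes of a JSON string value.
json_escape() {
    # Order matters: double the backslashes first, then escape the quotes,
    # otherwise the backslash added in front of each quote would be doubled too.
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}
```

With this, the sed 's/"//g' step would be replaced by json_escape, and the original quotes would survive in the indexed content.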
Script for elasticsearch preprocessing:
cat /home/search/Downloads/blanLinesDS.json | awk 'NF > 0' | sed 's/"//g' | sed -e 's/^/{"content":"/' | awk 'NF{print $0 " \"}"}' | awk ' {print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' |head -n -1 | awk 'END { print "";}{print;}'
The script performs the following actions, in order from 1 to 8:
- Remove all blank lines
- Remove all occurrences of the character "
- Insert the prefix {"content":" at the start of each line
- Insert the suffix "} (preceded by a space) at the end of each line
- Insert the line { "index":{} } after every line
- Insert { "index":{} } at the start of the file
- Remove the last line, because { "index":{} } now also ends the file and we do not want that
- Append an empty line at the end of the file
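The eight steps above can probably be collapsed into a single awk program that produces the same lines. This is a sketch, not the script that was actually run: the function name to_bulk and the file name bulk.json are hypothetical, and the upload command in the comments assumes the twitter/tweet index and type from the earlier test.

```shell
# to_bulk (hypothetical name): read raw lines on stdin and write
# Elasticsearch bulk-format lines on stdout.
to_bulk() {
    awk '
        NF == 0 { next }                       # step 1: skip blank lines
        {
            gsub(/"/, "")                      # step 2: remove every double quote
            print "{ \"index\":{} }"           # steps 5 to 7: action line before each document
            print "{\"content\":\"" $0 " \"}"  # steps 3 and 4: prefix and suffix
        }
        END { print "" }                       # step 8: trailing empty line
    '
}

# Possible usage (assumed paths and index, not tested against a live server):
#   to_bulk < /home/search/Downloads/blanLinesDS.json > bulk.json
#   curl -s -H 'Content-Type: application/x-ndjson' \
#        -XPOST 'localhost:9200/twitter/tweet/_bulk?pretty' --data-binary @bulk.json
```

Printing the action line before each document avoids the need to add one at the start and then cut the extra one from the end with head -n -1, which is a GNU extension. Note that --data-binary is used in the comment because -d would strip the newlines that the bulk format depends on.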