An awk alternative for substitution: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands
Remember to escape regex metacharacters in sed with \ ; this is very important for characters such as the open bracket [
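A quick sanity check of the escaping rule (sample input is made up): the open bracket [ is a regex metacharacter and must be escaped, while a closing ] outside a bracket expression needs no escape.

```shell
# '\[' matches a literal open bracket; ']' can be used unescaped.
# Both substitutions replace the bracket with a space.
echo '[1,2]' | sed -e 's/\[/ /g' -e 's/]/ /g'
# → " 1,2 "
```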
Remove common JSON structure characters ( : ] [ } { , )
sed -e 's/:/ /g' -e 's/]/ /g' -e 's/{/ /g' -e 's/}/ /g' -e 's/,/ /g' -e 's/\[/ /g'
Remove dataset-specific fields
| sed -e 's/"id"//g' -e 's/"first_name"//g' -e 's/"last_name"//g' -e 's/"email"//g' -e 's/"gender"//g' -e 's/"ip_address"//g'
Remove "
sed 's/"//g'
Remove empty lines
awk 'NF > 0'
Squeeze consecutive spaces down to a single space
sed 's/ \+/ /g'
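Chained together on a made-up sample record (field names mirror the dummy dataset assumed in these notes), the cleanup steps above leave just the field values:

```shell
# Strip JSON structure chars, drop field-name tokens, remove quotes,
# keep non-empty lines, squeeze repeated spaces.
echo '{"id":1,"first_name":"Keen","email":"keen@example.com"}' \
  | sed -e 's/:/ /g' -e 's/]/ /g' -e 's/{/ /g' -e 's/}/ /g' -e 's/,/ /g' -e 's/\[/ /g' \
  | sed -e 's/"id"//g' -e 's/"first_name"//g' -e 's/"email"//g' \
  | sed 's/"//g' \
  | awk 'NF > 0' \
  | sed 's/ \+/ /g'
# → " 1 Keen keen@example.com "
```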
Elasticsearch-specific preprocessing (wrap each line as a bulk-API document)
sed -e 's/^/{"content":"/' | awk 'NF{print $0 " \"}"}' | awk '{print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' | head -n -1 | awk 'END { print "";}{print;}'
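Feeding two made-up lines through this wrapping stage shows the alternating action/document NDJSON the bulk API expects, with the required trailing newline appended at the end (GNU head/sed assumed):

```shell
# Wrap each line as {"content":"... "}, interleave { "index":{} } actions,
# trim the surplus trailing action, then append the final blank line.
printf 'foo\nbar\n' \
  | sed -e 's/^/{"content":"/' \
  | awk 'NF{print $0 " \"}"}' \
  | awk '{print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' \
  | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' \
  | head -n -1 \
  | awk 'END { print "";}{print;}'
# → { "index":{} }
#   {"content":"foo "}
#   { "index":{} }
#   {"content":"bar "}
```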
Elasticsearch preprocessing (full pipeline)
cat /home/search/Downloads/blanLinesDS.json \
  | sed -e 's/:/ /g' -e 's/]/ /g' -e 's/{/ /g' -e 's/}/ /g' -e 's/,/ /g' -e 's/\[/ /g' \
  | sed -e 's/"id"//g' -e 's/"first_name"//g' -e 's/"last_name"//g' -e 's/"email"//g' -e 's/"gender"//g' -e 's/"ip_address"//g' \
  | sed 's/"//g' \
  | awk 'NF > 0' \
  | sed 's/ \+/ /g' \
  | sed -e 's/^/{"content":"/' \
  | awk 'NF{print $0 " \"}"}' \
  | awk '{print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' \
  | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' \
  | head -n -1 \
  | awk 'END { print "";}{print;}'
Import to Elastic
curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/index_x5/clean/_bulk?pretty --data-binary @/home/search/Downloads/cleanelastic.json
Search Elastic
curl -XGET 'localhost:9200/index_x5/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "content": "Keen"
    }
  }
}
'
The search query worked well. It also seems to match only exact terms. Did it work so much better because I cleaned the imported documents of their JSON structure, or because the search query is now correctly formed? Either way, I searched the index index_x5, which contains my cleaned-up JSON documents (a dummy dataset).