Preprocessing 3

Version 6 - Updated on 12 Nov 2017 at 11:38PM by Joachim Hansen

Description

An awk alternative for substitution: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands

Remember to escape special characters in sed patterns with \. This is very important for, for example, the opening bracket [

Remove common JSON characters : ] [ } { , 

sed -e 's/:/ /g' -e 's/]/ /g' -e 's/{/ /g' -e 's/}/ /g' -e 's/,/ /g' -e 's/\[/ /g' 
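
The six -e expressions above can also be collapsed into one bracket expression, which sidesteps the need to escape [ at all. This is only a minimal sketch: the function name strip_json_chars and the sample line are made up for illustration and are not part of the original notes.

```shell
# Hypothetical helper: replace : ] [ } { , with a space in a single expression.
# Inside a bracket expression a ] placed first is literal, and [ needs no escape.
strip_json_chars() {
  sed 's/[][{}:,]/ /g'
}

# Made-up sample line, not taken from the real dataset.
printf '{"id":1,"tags":["a","b"]}\n' | strip_json_chars
```

The behaviour should match the longer command; the quotes are left in place and are removed by the later sed 's/"//g' step.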

Remove dataset-specific fields

sed -e "s/\"id\"//g" -e "s/\"first_name\"//g" -e "s/\"last_name\"//g" -e "s/\"email\"//g" -e "s/\"gender\"//g" -e "s/\"ip_address\"//g"

Remove "

 sed 's/"//g'

Remove empty lines

awk 'NF > 0'

Remove consecutive spaces (have only one space)

sed 's/  \+/ /g'
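
An alternative to the sed expression for squeezing runs of spaces is tr -s, which is POSIX and does not depend on the GNU-style \+. A small sketch, with a hypothetical function name and a made-up input line:

```shell
# Hypothetical helper: collapse each run of spaces into a single space.
squeeze_spaces() {
  tr -s ' '
}

# Made-up sample line with uneven spacing.
printf 'Keen    Smith   Male\n' | squeeze_spaces
# prints: Keen Smith Male
```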

Elastic-specific preprocessing

 sed -e 's/^/{"content":"/' | awk 'NF{print $0 " \"}"}' | awk ' {print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' |head -n -1 | awk 'END { print "";}{print;}' 
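
The chain above ends up emitting one { "index":{} } action line before every document line, which is the layout the bulk API expects. As a rough sketch, the same layout can probably be produced with a single awk call. The name to_bulk is hypothetical, and the sketch assumes the input lines contain no double quotes or backslashes (which should hold after the cleaning steps above). Unlike the original chain, it does not append an extra blank line at the end.

```shell
# Hypothetical helper: for every non-empty input line, print a bulk action
# line followed by a {"content":"..."} document line.
to_bulk() {
  awk 'NF { print "{ \"index\":{} }"; print "{\"content\":\"" $0 " \"}" }'
}

# Made-up sample input; the blank line in the middle is skipped.
printf 'Keen Smith\n\nJane Doe\n' | to_bulk
```

Keeping the action line and the document line in one awk rule makes it harder for the two to get out of step than with separate awk and head stages.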

Elasticsearch preprocessing (full pipeline)

cat /home/search/Downloads/blanLinesDS.json | sed -e 's/:/ /g' -e 's/]/ /g' -e 's/{/ /g' -e 's/}/ /g' -e 's/,/ /g' -e 's/\[/ /g' | sed -e "s/\"id\"//g" -e "s/\"first_name\"//g" -e "s/\"last_name\"//g" -e "s/\"email\"//g" -e "s/\"gender\"//g" -e "s/\"ip_address\"//g" | sed 's/"//g' | awk 'NF > 0' | sed 's/ \+/ /g' | sed -e 's/^/{"content":"/' | awk 'NF{print $0 " \"}"}' | awk ' {print;} NR % 1 == 0 { print "{ \"index\":{} }"; }' | awk 'BEGIN{print "{ \"index\":{} }";}{print;}' | head -n -1 | awk 'END { print "";}{print;}'

Import to Elastic

curl -s -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/index_x5/clean/_bulk?pretty' --data-binary @/home/search/Downloads/cleanelastic.json
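
Before importing, it may help to check that the file really alternates between action lines and document lines. The following is only a sketch of such a check: check_bulk is a hypothetical helper, and it assumes the layout produced by the preprocessing above (action lines containing "index", document lines containing "content").

```shell
# Hypothetical check: odd non-blank lines must be action lines,
# even non-blank lines must be document lines.
check_bulk() {
  awk '
    NF == 0 { next }    # ignore blank lines, such as the trailing one
    { n++ }
    n % 2 == 1 && $0 !~ /"index"/   { print "line " NR ": expected action line"; bad = 1 }
    n % 2 == 0 && $0 !~ /"content"/ { print "line " NR ": expected document line"; bad = 1 }
    END {
      if (n % 2 == 1) { print "action line without a document"; bad = 1 }
      exit (bad ? 1 : 0)
    }
  ' "$@"
}

# Made-up two-line sample read from standard input.
printf '{ "index":{} }\n{"content":"Keen Smith "}\n' | check_bulk && echo "bulk layout looks OK"
```

To check the real file, pass its path as an argument instead of piping, for example check_bulk /home/search/Downloads/cleanelastic.json.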

Search Elastic

curl -XGET 'localhost:9200/index_x5/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "content" : "Keen"
        }
    }
}
'

The search query worked well. It also seems to return only exact matches. Did it work so much better because I cleaned the JSON structure out of the imported documents, or because the search query is now correctly formed? Either way, I searched index_x5, which is the index that contains my cleaned-up JSON documents (dummy dataset).