Version 10 - Updated on 07 Nov 2017 at 1:25AM by Joachim Hansen
Description
Not a big problem to import case 1 into Elasticsearch (the data is fairly unstructured, in one content field).
Problems with how to search this field: _search {"query": {"match_phrase": {"content": "Joachim Hansen"}}} and {"query": {"match": {"content": "york"}}} match dummy JSON documents that are not in any significant way similar to the search phrase.
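For reference, a sketch of the two request bodies (assuming an index named cases on a default local cluster; the index name is a placeholder):

```json
POST /cases/_search
{ "query": { "match_phrase": { "content": "Joachim Hansen" } } }

POST /cases/_search
{ "query": { "match": { "content": "york" } } }
```

Note that match_phrase requires the terms to appear adjacent and in order, while match returns any document whose analyzed content contains the term, ranked by relevance score. That would explain why the match query on "york" also returns documents that do not look similar to the search phrase overall.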
As the imported data is unstructured, maybe the practical research question should be rewritten to reflect this.
Maybe I can use some substring matching? Maybe wildcard or prefix/suffix search, because I think it considers the content field as just one long string.
I think there is a lack of support for suffix searching, but there is support for prefix searching. Maybe try prefix searching for : + word or : + sentence. Or remove all : and then use prefix search.
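A sketch of what a prefix query could look like (again with a placeholder index name). Elasticsearch also has a wildcard query, where a leading * can emulate suffix matching, though leading wildcards are known to be slow:

```json
POST /cases/_search
{ "query": { "prefix": { "content": "york" } } }

POST /cases/_search
{ "query": { "wildcard": { "content": "*hansen" } } }
```

One caveat: the prefix query runs against the analyzed terms stored in the index, not against the raw content string, so whether ": + word" works as a prefix depends on how the field was analyzed.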
Search for the search engine name + "full-text search" to figure out how to formulate the search query.
I should consider turning the JSON into CSV and importing that as bulk JSON (this removes the JSON field names from the string/document).
I believe that the JSON field names might negatively affect the relevance score, as these repeated fields are part of each document.
Converting JSON to CSV cannot easily be done without specifying which fields to write. Therefore I think the best approach is to ditch this idea and instead strip the field structure from the dataset with a sed command: remove all { } [ ] " and replace , and : with spaces.
The difference between remove and replace is that remove means replacing a character with the empty string "", while replace swaps one character for another (e.g. comma , with space).
By removing all JSON structure from the documents, the structure should no longer affect the relevancy score.
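A minimal sketch of that cleanup with sed (the sample input is made up; in practice this would run over the dataset file):

```shell
# Strip JSON structure: delete { } [ ] " and turn , and : into spaces.
# Inside the bracket expression, ] must come first so sed treats it literally.
printf '{"name": "Joachim Hansen", "city": ["York", "NY"]}\n' |
  sed -e 's/[][{}"]//g' -e 's/[,:]/ /g'
# -> name  Joachim Hansen  city  York  NY
```

This keeps the field names as plain words in the text; dropping them entirely would need per-field handling, which is exactly the JSON-to-CSV problem above.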
The script that removes empty lines will fix the problem of lines whose only content was removed.
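That empty-line step can be a one-liner in the same sed pipeline (deleting lines that are empty or whitespace-only after the cleanup):

```shell
# Drop lines that are empty or contain only whitespace.
printf 'Joachim Hansen\n\n   \nYork\n' | sed '/^[[:space:]]*$/d'
# -> Joachim Hansen
#    York
```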