Thoughts

Version 10 - Updated on 07 Nov 2017 at 1:25AM by Joachim Hansen

Description

  • Not a big problem to import case 1 into Elasticsearch (the data is rather unstructured, all in one content field)
  • Problems with how to search this field: _search {"query": {"match_phrase": {"content": "Joachim Hansen"}}} and {"query": {"match": {"content": "york"}}} match dummy JSON documents that are not in any significant way similar to the search phrase.
  • As the import data is unstructured, maybe the practical research question should be rewritten to reflect this
  • Maybe I can use some substring matching? Perhaps wildcard or prefix/suffix search. (Because I think it considers the content field as just one long string.)
  • I think there is a lack of support for suffix searching, but there is support for prefix searching. Maybe try prefix searching for : + word or : + sentence, or remove all : and then use prefix search
  • Search for the search engine name + "full-text search" to figure out how to formulate the search query.
  • I should consider turning the JSON into CSV and importing that as bulk JSON (this removes the JSON field names from the string/document).
  • I believe that the JSON field names might negatively affect the relevance score, as these repeated fields are part of every document.
  • Converting JSON to CSV cannot easily be done without specifying which fields to write. Therefore I think the best approach is to ditch this idea and instead strip the fields from the dataset with a sed command: remove all { } [ ] " and replace , with space and : with space.
  • The difference between remove and replace is that remove replaces a character with the empty string "" while replace swaps one character for another (e.g. comma , with space)
  • By removing all JSON structure from the documents, the structure should no longer affect the relevancy score.
  • The script that removes empty lines will fix the problem of lines whose only content was removed.
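For reference, a minimal sketch of the phrase-search body against the _search endpoint (the top-level key must be spelled "query", and field/index names follow the notes):

```json
{
  "query": {
    "match_phrase": {
      "content": "Joachim Hansen"
    }
  }
}
```

One likely explanation for the seemingly unrelated hits: a match query on "york" scores any document whose analyzed content field contains the token "york" anywhere, so repeated JSON field names and boilerplate tokens shared across documents can dominate the relevance score.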
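On the prefix idea above: Elasticsearch does provide a prefix query (and match_phrase_prefix for phrases), while suffix matching has no direct query type (a common workaround is indexing a reversed copy of the field, not shown here). A sketch of a prefix query on the content field:

```json
{
  "query": {
    "prefix": {
      "content": "york"
    }
  }
}
```

Note that prefix queries run against individual analyzed tokens, not against the field as one long string, which may or may not match the substring-matching intent above.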
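The sed cleanup described above could look like the following sketch (the file names and the sample line are placeholders, not the real dataset):

```shell
# Placeholder sample line standing in for the bulk JSON dataset
printf '%s\n' '{"name":"Joachim","cities":["york","oslo"]}' > dataset.json

# Remove all { } [ ] and double quotes, then replace , and : with spaces
sed -e 's/[][{}"]//g' -e 's/[,:]/ /g' dataset.json > dataset.txt

# Drop lines left empty (or whitespace-only) by the stripping step
sed '/^[[:space:]]*$/d' dataset.txt > dataset.clean.txt

cat dataset.clean.txt
```

A caveat with this approach: colons inside field values (e.g. timestamps or URLs) are also replaced with spaces, which splits those values into separate tokens.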