Version 1 - Updated on 03 Nov 2017 at 12:16AM by Joachim Hansen
Description
Have done:
Elasticsearch, Solr, and Sphinx are installed and running
Used a dummy JSON dataset (1,000 JSON documents), uploaded it in bulk to Elasticsearch, and ran some searches on it; a bulk-upload sketch is below
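For reference, a minimal sketch of the Elasticsearch bulk upload, assuming an index named dummy and the default port 9200 (both assumptions; the real index name is whatever was used above). The _bulk endpoint takes newline-delimited JSON, an action line followed by the document, and the file must end with a newline:

    # bulk.ndjson -- action line, then document, repeated; file must end with a newline:
    #   { "index": { "_index": "dummy", "_type": "doc", "_id": "1" } }
    #   { "title": "first document" }
    curl -s -H "Content-Type: application/x-ndjson" -X POST \
         "http://localhost:9200/_bulk" --data-binary @bulk.ndjson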
Search time for both Solr and Elasticsearch is reported in the search response (in milliseconds)
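For reference, the timing fields in the two default response formats: Elasticsearch reports the time as "took", Solr as "QTime" inside "responseHeader":

    { "took": 4, ... }                                  (Elasticsearch)
    { "responseHeader": { "status": 0, "QTime": 2 } }   (Solr)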
I have found the command I will use to capture the memory use of the search engines while searching (see commands misc notebook for an example output):

    top -bn 10 -p 8254 | grep "^ " | awk '{ printf("%-8s %-8s %-8s %-8s %-8s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12); }' >> /home/search/Downloads/file6.txt

With this command I want to capture the search engine's memory before, during, and after execution of a search query.
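A variation I may use so the PID (8254 above) is not hardcoded, assuming the engine is Elasticsearch and only the resident-memory column is needed (the PID lookup via the JVM main class and the output path are assumptions):

    # hypothetical: look up the Elasticsearch PID instead of hardcoding it
    ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
    # sample 10 times at 2-second intervals; field 6 (RES) is resident memory
    top -bn 10 -d 2 -p "$ES_PID" \
      | awk -v pid="$ES_PID" '$1 == pid { print $6 }' >> /home/search/Downloads/es_mem.txt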
I have created a candidate list of experiment datasets, chosen based on dataset category and data format (JSON datasets were favored over other formats). See the notebook Candidate list of datasets for experiments
Challenges:
Solr and Elasticsearch can add JSON documents in bulk/batch, but the JSON documents have to be formatted in a particular way to be parseable (examples are in the elastic misc notebook and solr misc notebook). The formats differ between Solr and Elasticsearch; Solr's format is sketched below (the Elasticsearch _bulk format is shown above)
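A minimal sketch of Solr's bulk JSON format, assuming a core named dummy on the default port 8983 (both assumptions). Unlike Elasticsearch's newline-delimited action/document pairs, Solr takes a plain JSON array of documents at the /update handler:

    # commit=true makes the documents searchable immediately
    curl -s -X POST -H "Content-Type: application/json" \
         "http://localhost:8983/solr/dummy/update?commit=true" \
         --data-binary '[{"id": "1", "title": "first document"},
                         {"id": "2", "title": "second document"}]'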
Sphinx cannot index JSON. It can index .csv (comma-separated and tab-separated) and various SQL sources. A possible solution is to have a SQL table with two columns (column 1: super key (ID); column 2: raw JSON document); see the sketch below
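A minimal sketch of that workaround, assuming MySQL with a database named experiments and a user named search (all names are assumptions):

    # hypothetical two-column table holding the raw JSON documents
    mysql -u search -p experiments <<'SQL'
    CREATE TABLE docs (
      id       INT UNSIGNED NOT NULL PRIMARY KEY,
      raw_json TEXT NOT NULL
    );
    SQL

sphinx.conf would then point at the table with a directive like sql_query = SELECT id, raw_json FROM docs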
I consider preparing the datasets for import the most challenging step. The preparation process would be different for each of the 3 search engines. Considering how much time this may take, maybe I won't test all the candidate datasets on all the search engines?
TODO:
I should test the dummy JSON dataset on Solr as well (including the bulk import)
Figure out how I will formulate the search queries with respect to each search engine, for example searching the field _source in Elasticsearch or using the match_all query (sketched below)
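For reference, minimal versions of the two match-everything queries, assuming the index/core is named dummy (an assumption):

    # Elasticsearch: match_all query against the dummy index
    curl -s -H "Content-Type: application/json" "http://localhost:9200/dummy/_search" \
         -d '{ "query": { "match_all": {} } }'
    # Solr equivalent: q=*:* matches every document in the core
    curl -s "http://localhost:8983/solr/dummy/select?q=*:*"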
Prepare the candidate datasets and import them in bulk into Solr and Elasticsearch (importing JSON documents one by one is not feasible)