Solve current problems

Version 11 - Updated on 18 Nov 2017 at 11:45PM by Joachim Hansen

Description

  1. Elasticsearch might have stopped working, possibly due to a lack of memory, while processing the large malware dataset
    1. I used --verbose with curl (when indexing) and saw that I did not have access to 127.0.0.1:9200
    2. I also checked localhost:9200 in the Firefox browser and did not have access to it either
      1. I regained access to localhost:9200 by restarting the service (see the first sketch after this list)
        sudo systemctl stop elasticsearch.service
        sudo systemctl start elasticsearch.service
    3. I decided to remove lines shorter than 21 characters and split the dataset into batches of 1000 lines each (see the second sketch after this list)
    4. I have tried indexing one batch file (but I ran into some JSON parser problems)
    5. Most JSON parser errors could be removed by escaping the escape character \ in the test import batch, like so: cat dumperImportBatch00 | sed -e 's/\\/\\\\/g'
    6. I should apply sed -e 's/\\/\\\\/g' to the original file (dumper2NoLinesLessthen21Char2) as well
    7. Some JSON errors still persist, like invalid UTF-8 start byte (remove non-UTF-8 characters)
    8. The JSON parser error illegal unquoted character (CTRL-CHAR, char code) can be fixed by removing the non-printable ASCII characters: https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix
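
A minimal sketch of the access check and restart described above. It assumes Elasticsearch runs as a systemd service on the default port 9200; the cluster health call at the end is an extra convenience and is not mentioned above:

# check whether Elasticsearch answers at all (same check as the curl --verbose test above)
curl --verbose http://127.0.0.1:9200/
# see whether the service is running, and restart it if it has stopped
sudo systemctl status elasticsearch.service
sudo systemctl restart elasticsearch.service
# optional: wait until the cluster reports yellow/green before indexing again
curl 'http://localhost:9200/_cluster/health?pretty'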

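A sketch of the filtering and batching step. The name of the raw dump file (dumper2) is a guess; the filtered file name and the dumperImportBatch prefix are taken from this page, and GNU awk and split are assumed:

# drop every line shorter than 21 characters
awk 'length($0) >= 21' dumper2 > dumper2NoLinesLessthen21Char2
# split into batches of 1000 lines with numeric suffixes: dumperImportBatch00, dumperImportBatch01, ...
split -l 1000 -d dumper2NoLinesLessthen21Char2 dumperImportBatch
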
http://www.theasciicode.com.ar/ascii-control-characters/unit-separator-ascii-code-31.html (shows a list of the non-printable ASCII characters)

Non-printable characters can also be removed like this: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/

Non-ASCII characters can be removed like this: https://stackoverflow.com/questions/3337936/remove-non-ascii-characters-from-csv
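
A small sketch of the two clean-up steps from the links above, applied to one batch file (the input and output file names are just examples):

# delete everything that is not a newline (octal 12) or a printable ASCII character (octal 40-176)
tr -cd '\12\40-\176' < dumperImportBatch00 > dumperImportBatch00.printableOnly
# alternative: strip all non-ASCII characters with perl
perl -pe 's/[^[:ascii:]]//g' dumperImportBatch00 > dumperImportBatch00.asciiOnly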


  1. Remove the non-printable ASCII characters and the non-ASCII characters (look at the file afterwards to see whether this removes a lot of content or not... if not, this is a viable solution)
  2. Make a for loop that processes all the created batches of 1000 lines each and appends the output to ElasticPreproccing (do the same for Solr); a sketch follows the first draft loop below
  3. Make a for loop (with time in front of it) that sends one batch at a time to the Elasticsearch server (through curl), and do the same with Solr; a sketch follows the second draft loop below
  4. Write on the report while waiting.


#!/bin/bash

# GNU bash, version 4.3.46

# Go through all the files (no directories); I don't really care about the order of the batches.
# $1 is an optional prefix that gets printed in front of every file name.
for f in $(ls -p | grep -v '/');
do
    echo "$1$f";
done
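
A sketch of what step 2 from the list above could look like, built on the loop draft: it runs the same clean-up pipeline as the single-file command at the bottom of this page on every batch file and writes the result into an output folder given as $1. The folder name and the idea of reusing ElasticPreproccing as that name are assumptions:

#!/bin/bash
# Run inside the directory that holds the 1000-line batch files.
# $1 = output folder, e.g. ElasticPreproccing
outDir=$1;
mkdir -p "$outDir";
for f in $(ls -p | grep -v '/');
do
    # drop non-ASCII characters, drop non-printable characters (keep newline),
    # then escape backslashes so the lines parse as JSON
    perl -pe 's/[^[:ascii:]]//g;' "$f" | tr -cd '\12\40-\176' | sed -e 's/\\/\\\\/g' > "$outDir/$f";
done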


#!/bin/bash

# GNU bash, version 4.3.46
# Should be called like: time ./bashname.sh indexName/outputFolderName
# Go through all the files (no directories); I don't really care about the order of the batches.
# This .sh file has to be run in the same directory as the batch files to be indexed.
# Using pwd to get the current working directory
# and a command line argument to get the index name.
currDir=$(pwd);
indexName=$1;
echo "index name=$indexName"
for f in $(ls -p | grep -v '/');
do
    echo "index $currDir/$f";
done
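
A sketch of step 3, extending the draft above so it actually sends one batch at a time to Elasticsearch through curl. The script name is just an example, and the _bulk URL and Content-Type header are assumptions based on a default Elasticsearch 5.x setup; whether the batch files are already in bulk (NDJSON with action lines) format is not stated on this page:

#!/bin/bash
# Should be called like: time ./indexBatches.sh indexName
# Run in the same directory as the preprocessed batch files.
indexName=$1;
for f in $(ls -p | grep -v '/');
do
    echo "indexing $f";
    # _bulk expects newline-delimited JSON with an action line before each document;
    # adjust the URL/payload if the batch files contain plain documents instead
    curl -s -H 'Content-Type: application/x-ndjson' \
         -XPOST "http://localhost:9200/$indexName/_bulk" \
         --data-binary "@$f" > /dev/null;
done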


Remove non-ASCII characters, remove non-printable ASCII characters (except newline) and then escape \ with \\:

cat /home/search/Downloads/Datasets/dumper2NoLinesLessthen21Char2 |  perl -pe 's/[^[:ascii:]]//g;' | tr -cd '\12\40-\176' | sed -e 's/\\/\\\\/g' >> /home/search/Downloads/dumpertrLineFeedOnly

This fixed all of the remaining JSON parser errors, at least for the 1st batch (I think it solved it for all batches).