General preprocessing

Version 21 - Updated on 21 Nov 2017 at 1:54AM by Joachim Hansen

Description

______________________________________________________________________________________________________________________________________________

Andrii malware dataset:

I should state in my thesis that I got this dataset by requesting it from my supervisor Andrii.

Downloaded from link in virtual machine

unzipped with command 7z x

unzipped all individual sql files

added all unzipped sql files to a folder


Maybe concatenate all files in the folder... Or I could preprocess and index all the subfiles separately, as one big document might be too big to process. One of the subfiles is already too much for gedit to handle.

when using wc -l on one of the files I got 10626 lines

I wanted to check how many of the lines ("\n") are empty lines with the command grep -cvP '\S' and it showed 0. I assume that entries are on separate lines?

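I also counted with cat doc.sql | dos2unix | grep -cvP '\n,' and I got 10626. Note that grep matches one line at a time, so the pattern '\n,' can never match and -v then counts every line; this number only repeats the wc -l result. It would make sense that every new line starts with , because new entries in an insert are separated with , but grep -c '^,' is the command that actually checks this.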
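The three checks above can be collected in one small helper. This is only a sketch: the function name line_stats is made up for illustration, tr -d '\r' is used as a stand-in for dos2unix, and grep -c -v '[^[:space:]]' is a portable equivalent of grep -cvP '\S'.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption, not from the notes).
# Prints "total empty comma_start" for one dump file:
#   total       = number of lines (as wc -l reports it)
#   empty       = lines that contain no non-whitespace character
#   comma_start = lines whose first character is a comma
line_stats() {
  local file="$1" total empty comma
  # tr -d '\r' is a stand-in for dos2unix (assumption).
  total=$(tr -d '\r' < "$file" | wc -l)
  # grep -c exits non-zero when the count is 0, so "|| true" keeps
  # the helper usable under "set -e".
  empty=$(tr -d '\r' < "$file" | grep -c -v '[^[:space:]]' || true)
  comma=$(tr -d '\r' < "$file" | grep -c '^,' || true)
  echo "$((total)) $empty $comma"
}
```

Usage: line_stats doc.sql prints the three counts on one line, so they can be compared directly for each subfile.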


Maybe do this:

  1. use the dos2unix command
  2. I should substitute character "


  3. Elasticsearch preprocessing or Solr preprocessing


Renamed the schema file to A-Schema.sql to make it come first in alphabetical order.

used mysql -u root -p 

 

Enter password <admin>

then 

mysql > create database dumper2;

then

https://stackoverflow.com/questions/4708013/import-multiple-sql-dump-files-into-mysql-database-from-shell

ls -1 *.sql | awk '{ print "source",$0 }' | mysql --batch -u root -p dumper2

Enter password <admin>
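Before piping into mysql it can help to look at the order in which the files will be sourced, since the schema file was renamed so that it comes first. A minimal sketch, assuming a made-up helper name print_source_statements; it only prints the statements and does not talk to the database:

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption, not from the notes).
# Prints one "source <file>" statement per .sql file in the given
# directory, in the sorted order the shell glob produces.
print_source_statements() {
  local dir="$1" f
  for f in "$dir"/*.sql; do
    [ -e "$f" ] || continue      # the glob matched nothing
    printf 'source %s\n' "$f"
  done
}
```

Usage: print_source_statements . shows the order; if A-Schema.sql is on the first line, the same output can be piped into mysql --batch -u root -p dumper2 instead of the ls | awk pipeline.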


SELECT table_schema "DB Name", 
Round(Sum(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB" 
FROM   information_schema.tables 
GROUP  BY table_schema;

Used this command to calculate the size of the database to see if the import was done correctly.

23.3578 GB was imported which seems about right.

Next I should figure out how to dump this database to a csv file.

SELECT * FROM rawData
INTO OUTFILE '/var/lib/mysql-files/dumper2.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';

Used this command to dump the table content to a .csv file.

/var/lib/mysql-files/ is the only valid place to dump the file when the secure-file-priv option is set to that directory.

https://stackoverflow.com/questions/31951468/error-code-1290-the-mysql-server-is-running-with-the-secure-file-priv-option


I used the command cp dumper2.csv /home/search/Downloads/Datasets/dumper2.csv


Now I have all the malware dataset in one .csv file. 

I still may want to use | sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g' to clean up some junk like \";s:7:\"

Find strings that start with \" followed by ; or : and match any sequence of symbols until the next " (then stop).
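To see what these two expressions do on a small input, they can be wrapped in a function and fed a short string. A sketch only; the function name strip_serialized_junk is made up for illustration, the sed expressions are the ones from the note above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (the name is an assumption, not from the notes).
# Reads stdin and removes every \"; or \": together with the characters
# that follow it, up to (but not including) the next double quote.
strip_serialized_junk() {
  sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g'
}
```

Usage: printf '%s\n' 'name\";s:7:\"value' | strip_serialized_junk prints name"value. The match stops in front of the next ", so that quote stays in the text and has to be handled by a later substitution.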

-e 's/\\n[^a-zA-Z0-9\\]*//g' -e "s/[A-Z]//g" 

sudo -i to go to superuser (maybe useful)




1st sed: Remove everything starting with \ followed by " followed by ; or :, plus any sequence of characters up to (but not including) the next " character.
2nd sed: Substitute
2.1 ],\ with space
2.2 }, with space
2.3 }\ with space
2.4 {\ with space
2.5 ": with space
2.6 \" with space
2.7 ]\ with space
2.8 [\ with space
2.9 ,\ with space
2.10 <space>\ with space
2.11 ]" with space
2.12 ] with space
2.13 [ with space
2.20 " with space
2.21 : with nothing (it is removed)
2.22 { with space
2.23 } with space
3rd sed command: Remove empty lines (had used AWK, but it could not handle the file size)
4th sed: Remove consecutive spaces
 
 
 

| dos2unix | sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g' | sed -e 's/\],\\/ /g' -e 's/\},/ /g' -e 's/\}\\/ /g' -e 's/{\\/ /g' -e 's/\":/ /g' -e 's/\\\"/ /g' -e 's/\]\\/ /g' -e 's/\[\\/ /g' -e 's/,\\/ /g' -e 's/ \\/ /g' -e 's/\]\"/ /g' -e 's/\]/ /g' -e 's/\[/ /g' -e 's/"/  /g' -e 's/://g' -e 's/{/ /g' -e 's/}/ /g' | sed '/^\s*$/d' | sed 's/ \+/ /g'
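The one-liner above is easier to try out on a small sample when it is wrapped in a function. The following is a sketch, not a replacement for the command: the function name clean_dump is made up, and tr -d '\r' is swapped in for dos2unix so the block has no extra dependency. The sed expressions are copied unchanged from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (the name is an assumption, not from the notes).
# Reads the raw dump on stdin and writes the cleaned text to stdout.
clean_dump() {
  # Stand-in for dos2unix (assumption): drop carriage returns.
  tr -d '\r' |
  # 1st sed: remove \"; and \": plus everything up to the next quote.
  sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g' |
  # 2nd sed: replace brackets, braces, quotes and stray backslashes
  # with spaces; the colon is deleted without a replacement.
  sed -e 's/\],\\/ /g' -e 's/\},/ /g' -e 's/\}\\/ /g' -e 's/{\\/ /g' \
      -e 's/\":/ /g' -e 's/\\\"/ /g' -e 's/\]\\/ /g' -e 's/\[\\/ /g' \
      -e 's/,\\/ /g' -e 's/ \\/ /g' -e 's/\]\"/ /g' -e 's/\]/ /g' \
      -e 's/\[/ /g' -e 's/"/  /g' -e 's/://g' -e 's/{/ /g' -e 's/}/ /g' |
  # 3rd sed: delete empty and whitespace-only lines.
  sed '/^\s*$/d' |
  # 4th sed: collapse runs of spaces into one space.
  sed 's/ \+/ /g'
}
```

Usage: head -n 20 dumper2.csv | clean_dump shows the effect on the first lines before running the full file through it. Leading and trailing single spaces are left in place, as in the original command.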


Decided not to use the top command while indexing, as it took a lot of resources (memory, IO) that the indexing itself needs.

______________________________________________________________________________________________________

ID 59 Network IDS logs (snort)

cat /home/search/Downloads/Datasets/snort | dos2unix | xargs | sed 's/ \+/ /g' | sed 's/\[\*\*\] \[1/\n\[\*\*\] \[1/g' >> /home/search/Downloads/preproccing\ general/IDSnetworkLogID59

  1. open file with cat
  2. dos2unix converts Windows (CRLF) line endings to UNIX (LF) line endings
  3. xargs substitutes newlines with spaces (it also interprets quotes and backslashes in the input)
  4. Reduce consecutive spaces to just 1 space
  5. Now we have all file content on 1 line. We find each occurrence of "[**] [1" (which represents the start of a new entry in the log file) and substitute it with the string "\n[**] [1", so that each line in the file represents 1 log entry.
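The five steps can be tried on a few lines before running them on the whole log. This is a sketch with two deliberate swaps that are not in the original command: tr -d '\r' stands in for dos2unix, and tr '\n' ' ' stands in for xargs, because it joins lines without interpreting quotes or backslashes. The function name one_entry_per_line is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption, not from the notes).
# Reads a log on stdin and writes one entry per line, starting a new
# line in front of every "[**] [1" marker.
one_entry_per_line() {
  tr -d '\r' |                        # stand-in for dos2unix (assumption)
  tr '\n' ' ' |                       # swapped in for xargs: join all lines
  sed 's/ \+/ /g' |                   # reduce consecutive spaces to one
  sed 's/\[\*\*\] \[1/\n[**] [1/g'    # newline before each entry marker
}
```

Usage: one_entry_per_line < /home/search/Downloads/Datasets/snort | head. The first output line is empty, because the very first marker also gets a newline in front of it; the empty-line removal later in the pipeline takes care of that.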

_

cat /home/search/Downloads/Datasets/snort |
dos2unix |
xargs |
sed 's/ \+/ /g' |
sed 's/\[\*\*\] \[1/\n\[\*\*\] \[1/g' |
sed -e 's/\],\\/ /g' -e 's/\},/ /g' -e 's/\}\\/ /g' -e 's/{\\/ /g' -e 's/\":/ /g' -e 's/\\\"/ /g' -e 's/\]\\/ /g' -e 's/\[\\/ /g' -e 's/,\\/ /g' -e 's/ \\/ /g' -e 's/\]\"/ /g' -e 's/\]/ /g' -e 's/\[/ /g' -e 's/"/ /g' -e 's/://g' -e 's/{/ /g' -e 's/}/ /g' |
sed '/^\s*$/d' |
sed 's/\*/ /g' |
sed 's/ \+/ /g'|
perl -pe 's/[^[:ascii:]]//g;' |
tr -cd '\12\40-\176' | sed -e 's/\\/\\\\/g' | split -l 1000 -d - /home/search/Downloads/Datasets/batchSnort/snortFile
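The last stages of this pipeline (ASCII filter, backslash doubling, splitting into batches) can be checked on their own. A minimal sketch with made-up function names; the tr, sed and split calls are the ones from the command above, only the output prefix is passed in as a parameter.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (the names are assumptions, not from the notes).

# Keep only newline (octal 12) and printable ASCII (octal 40 to 176),
# then double every backslash.
ascii_only_escaped() {
  tr -cd '\12\40-\176' | sed -e 's/\\/\\\\/g'
}

# Split stdin into files of at most 1000 lines each, named
# <prefix>00, <prefix>01, ... (-d selects numeric suffixes).
split_batches() {
  local prefix="$1"
  split -l 1000 -d - "$prefix"
}
```

Usage: some_command | ascii_only_escaped | split_batches /home/search/Downloads/Datasets/batchSnort/snortFile. The target directory has to exist before split runs, and wc -l on the resulting files is a quick check that no batch has more than 1000 lines.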




______________________________________________________________________________________________________

Hillary Clinton emails:

https://stackoverflow.com/questions/1251999/how-can-i-replace-a-newline-n-using-sed

https://unix.stackexchange.com/questions/196780/is-there-an-alternative-to-sed-that-supports-unicode

https://stackoverflow.com/questions/657846/how-do-i-write-non-ascii-characters-using-echo