______________________________________________________________________________________________________________________________________________
Andrii malware dataset:
I should state in my thesis that I got this dataset by requesting it from my supervisor Andrii.
Downloaded from the link inside the virtual machine.
Unzipped with the command 7z x
Unzipped all individual SQL files.
Added all unzipped SQL files to one folder.
Maybe concatenate all files in the folder... Or I could preprocess and index all the subfiles separately, as one big document might be too big to process. One of the subfiles alone is already too much for gedit to handle.
When using wc -l on one of the files I got 10626 lines.
I wanted to check how many of the lines ("\n") are empty lines with the command grep -cvP '\S' and it showed 0. I assume that entries are on separate lines?
I also counted with cat doc.sql | dos2unix | grep -cvP '\n,' and I got 10626. It seems that every new line starts with a comma, which makes sense because new entries in an INSERT are separated by commas. (Caveat: grep matches one line at a time, so '\n,' can never match and -v simply counts every line; grep -c '^,' would check this directly.)
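The three counts above can be collected in one go. This is only a minimal sketch: line_stats is a hypothetical helper name, and tr -d '\r' stands in for dos2unix so the sketch does not depend on that tool being installed.

```shell
# Hypothetical helper (not part of the original notes): report the total
# number of lines, the number of blank lines, and the number of lines that
# start with a comma, for the file given as $1.
line_stats() {
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  total=$(tr -d '\r' < "$1" | grep -c '' || true)
  blank=$(tr -d '\r' < "$1" | grep -cv '[^[:space:]]' || true)
  comma=$(tr -d '\r' < "$1" | grep -c '^,' || true)
  printf 'total=%s blank=%s comma_first=%s\n' "$total" "$blank" "$comma"
}
```

If comma_first comes out close to total, the assumption that each entry of the INSERT sits on its own line starting with a comma would be supported.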
Maybe do this:
I should substitute character "
Elasticsearch preprocessing or Solr preprocessing
Renamed the schema file to A-Schema.sql so that it comes first in alphabetical order.
used mysql -u root -p
Enter password <admin>
then
mysql > create database dumper2;
then
ls -1 *.sql | awk '{ print "source",$0 }' | mysql --batch -u root -p dumper2
Enter password <admin>
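To see why the rename matters: the ls | awk step just prints one "source <file>" line per .sql file, in sorted order, and mysql executes them top to bottom, so the schema has to sort first. A minimal sketch of the same idea, assuming the file names contain no newlines; make_source_script is a hypothetical helper name, not something from the original setup.

```shell
# Hypothetical helper: print one "source <file>" statement per .sql file
# found in the directory given as $1. The shell expands *.sql in sorted
# order, so A-Schema.sql is listed before the data files.
make_source_script() {
  (
    cd "$1" || exit 1
    for f in *.sql; do
      # Skip the literal pattern when the directory has no .sql files.
      if [ -e "$f" ]; then
        printf 'source %s\n' "$f"
      fi
    done
  )
}
```

Run from inside the folder with the dumps, make_source_script . | mysql --batch -u root -p dumper2 should behave like the ls | awk pipeline; printing the list first is a cheap way to check the order before importing.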
SELECT table_schema "DB Name",
Round(Sum(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"
FROM information_schema.tables
GROUP BY table_schema;
Used this query to calculate the size of each database, to check whether the import completed correctly.
23.3578 GB was imported which seems about right.
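One possible cross-check for "about right" is to compare against the total size of the .sql dump files on disk. This is a rough sketch only: the text dumps and the InnoDB data plus indexes will not match exactly, and dump_size_mb is a hypothetical helper name.

```shell
# Hypothetical helper: total size in MB (bytes / 1024 / 1024, one decimal)
# of all .sql files in the directory given as $1.
dump_size_mb() {
  # With several files, the last line of wc -c is the grand total;
  # with a single file, it is that file's own byte count.
  wc -c -- "$1"/*.sql | tail -n 1 \
    | LC_ALL=C awk '{ printf "%.1f\n", $1 / 1024 / 1024 }'
}
```

The divisor mirrors the / 1024 / 1024 in the size query, so the two numbers are at least in the same unit.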
Next I should figure out how to dump this table to a CSV file.
SELECT * FROM rawData
INTO OUTFILE '/var/lib/mysql-files/dumper2.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
Used this statement to dump the table content to a .csv file.
/var/lib/mysql-files/ is the only valid place to write the file when the secure-file-priv option is set to that directory.
I used the command cp dumper2.csv /home/search/Downloads/Datasets/dumper2.csv
Now I have the whole malware dataset in one .csv file.
I still may want to use | sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g' to clean up some junk like \";s:7:\"
Find strings that start with \ followed by " and then ; or :, and match any sequence of characters up to the next " (then stop).
-e 's/\\n[^a-zA-Z0-9\\]*//g' -e "s/[A-Z]//g"
sudo -i to go to superuser (maybe useful)
1st sed: Remove everything starting with \ followed by " and then ; or :, plus any sequence of characters up to the next " character, then stop.
2nd sed: Substitute each of the following with a space:
2.1 ],\
2.2 },
2.3 }\
2.4 {\
2.5 ":
2.6 \"
2.7 ]\
2.8 [\
2.9 ,\
2.10 <space>\
2.11 ]"
2.12 ]
2.13 [
2.20 "
2.21 : (note: the command as written deletes the colon instead of inserting a space)
2.22 {
2.23 }
3rd sed: Remove empty lines (I had used AWK, but it could not handle the file size).
4th sed: Collapse consecutive spaces into one.
| dos2unix | sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g' | sed -e 's/\],\\/ /g' -e 's/\},/ /g' -e 's/\}\\/ /g' -e 's/{\\/ /g' -e 's/\":/ /g' -e 's/\\\"/ /g' -e 's/\]\\/ /g' -e 's/\[\\/ /g' -e 's/,\\/ /g' -e 's/ \\/ /g' -e 's/\]\"/ /g' -e 's/\]/ /g' -e 's/\[/ /g' -e 's/"/ /g' -e 's/://g' -e 's/{/ /g' -e 's/}/ /g' | sed '/^\s*$/d' | sed 's/ \+/ /g' |
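The four sed stages can be wrapped in a function so they are easy to try on a small sample before running them over the full dump. A minimal sketch, assuming GNU sed (for \s and \+); clean_dump is a hypothetical name, and tr -d '\r' is swapped in for dos2unix so the sketch has no extra dependency. The sed expressions themselves are copied unchanged from the pipeline above.

```shell
# Hypothetical wrapper around the cleanup pipeline: reads stdin, writes stdout.
clean_dump() {
  tr -d '\r' \
    | sed -e 's/\\\";[^"]*//g' -e 's/\\\":[^"]*//g' \
    | sed -e 's/\],\\/ /g' -e 's/\},/ /g' -e 's/\}\\/ /g' -e 's/{\\/ /g' \
          -e 's/\":/ /g' -e 's/\\\"/ /g' -e 's/\]\\/ /g' -e 's/\[\\/ /g' \
          -e 's/,\\/ /g' -e 's/ \\/ /g' -e 's/\]\"/ /g' -e 's/\]/ /g' \
          -e 's/\[/ /g' -e 's/"/ /g' -e 's/://g' -e 's/{/ /g' -e 's/}/ /g' \
    | sed '/^\s*$/d' \
    | sed 's/ \+/ /g'
}

# Example on a made-up line (not real data from the dataset):
#   printf '{\\"md5\\":\\"abc123\\"}\n' | clean_dump
```

Testing on a handful of lines first (for example head -n 20 dumper2.csv | clean_dump) shows quickly whether a pattern removes more than intended, such as a lone comma or a value being swallowed by the first stage.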
______________________________________________________________________________________________________
ID 59 Network IDS logs (snort)
cat /home/search/Downloads/Datasets/snort | dos2unix | xargs | sed 's/ \+/ /g' | sed 's/\[\*\*\] \[1/\n\[\*\*\] \[1/g' >> /home/search/Downloads/preproccing\ general/IDSnetworkLogID59
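What the pipeline above does: join the whole log into one long line, collapse repeated spaces, then start a new line before every "[**] [1" so that each alert ends up on its own line. A minimal sketch of the same steps as a function, assuming GNU sed; reflow_snort is a hypothetical name, tr -d '\r' replaces dos2unix, and tr '\n' ' ' is swapped in for xargs because xargs treats quotes and backslashes in its input specially.

```shell
# Hypothetical helper: reads a snort alert log on stdin and writes one alert
# per line to stdout. The first output line is empty, because a newline is
# inserted before the very first alert as well.
reflow_snort() {
  tr -d '\r' \
    | tr '\n' ' ' \
    | sed 's/ \+/ /g' \
    | sed 's/\[\*\*\] \[1/\n[**] [1/g'
}
```

Counting the lines that start with "[**] [1" in the result (grep -c '^\[\*\*\] \[1') gives the number of alerts, which can be compared against the same count on the raw log.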
_______________________________________________________________________________________________________