Details on datasets

Version 4 - Updated on 16 Nov 2017 at 2:46AM by Joachim Hansen

Description

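I modified the code from https://coding-school.com/awk-line-length-and-average/ to report line-length frequencies, i.e. how many lines have a length of 0-10 characters, 11-100 characters, and so on. Note that the first bucket is labelled "0-9" in the output but actually counts lines of length 0 through 10.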

# Reads from standard input, or from any file names appended after the closing quote.
awk '
BEGIN {
    upTo10 = 0; upTo100 = 0; upTo500 = 0; upTo1000 = 0; upTo5000 = 0
    upTo10000 = 0; upTo50000 = 0; upTo100000 = 0; upTo500000 = 0; upToMore = 0
    totlen = 0
}
{
    # Length of the current line, also added to the running total for the average.
    thislen = length($0); totlen += thislen

    # Each branch is only reached when the earlier ones failed,
    # so checking the upper bound is enough.
    if      (thislen <= 10)     upTo10++
    else if (thislen <= 100)    upTo100++
    else if (thislen <= 500)    upTo500++
    else if (thislen <= 1000)   upTo1000++
    else if (thislen <= 5000)   upTo5000++
    else if (thislen <= 10000)  upTo10000++
    else if (thislen <= 50000)  upTo50000++
    else if (thislen <= 100000) upTo100000++
    else if (thislen <= 500000) upTo500000++
    else                        upToMore++
}
END {
    # %d truncates the average to a whole number; the NR check avoids dividing by zero on empty input.
    printf("average: %d\n", (NR > 0 ? totlen / NR : 0))
    printf("Number of lines: %d\n", NR)
    # Labelled 0-9 to match the recorded output, but this bucket counts lengths 0 through 10.
    printf("Line length 0-9: %d\n", upTo10)
    printf("Line length 11-100: %d\n", upTo100)
    printf("Line length 101-500: %d\n", upTo500)
    printf("Line length 501-1000: %d\n", upTo1000)
    printf("Line length 1001-5000: %d\n", upTo5000)
    printf("Line length 5001-10000: %d\n", upTo10000)
    printf("Line length 10001-50000: %d\n", upTo50000)
    printf("Line length 50001-100000: %d\n", upTo100000)
    printf("Line length 100001-500000: %d\n", upTo500000)
    printf("Longer line lengths: %d\n", upToMore)
}
'
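
The same counting idea can also be written with the bucket limits held in a list instead of one variable per bucket. The following is only a sketch, not the script that produced the numbers on this page: the function name line_length_histogram and the shortened list of bounds (10, 100, 500, 1000) are assumptions chosen for illustration, and the first bucket is labelled 0-10 here because that is what the comparison actually covers. Depending on the awk implementation and locale, length($0) may count characters or bytes.

```shell
# Hypothetical helper (not from the original page): line-length histogram
# driven by a list of upper bounds instead of one counter variable per bucket.
line_length_histogram() {
  awk -v bounds="10 100 500 1000" '
    BEGIN { n = split(bounds, ub, " ") }   # ub[1..n] holds the upper bounds
    {
      len = length($0); total += len
      # Count the line in the first bucket whose upper bound it fits under.
      for (i = 1; i <= n; i++)
        if (len <= ub[i]) { count[i]++; next }
      count[n + 1]++                       # longer than every bound
    }
    END {
      printf("average: %d\n", (NR > 0 ? total / NR : 0))
      printf("Number of lines: %d\n", NR)
      lo = 0
      for (i = 1; i <= n; i++) {
        printf("Line length %d-%d: %d\n", lo, ub[i], count[i])
        lo = ub[i] + 1
      }
      printf("Longer line lengths: %d\n", count[n + 1])
    }
  ' "$@"
}
```

It reads standard input when called without arguments, for example as `line_length_histogram < some_file`, where some_file stands for whatever dataset is being measured. Adding a bucket then only means adding a number to the bounds string.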


Example output below: 

average: 12
Number of lines: 1577862522
Line length 0-9: 1405172471
Line length 11-100: 169532329
Line length 101-500: 2142802
Line length 501-1000: 106838
Line length 1001-5000: 472537
Line length 5001-10000: 34165
Line length 10001-50000: 397470
Line length 50001-100000: 2011
Line length 100001-500000: 1460
Longer line lengths: 439

------------------------------------------------------------------------------------------------------


Treating each line as one document, this gives me the number of documents as well as the distribution of document sizes.

Both figures are of interest when estimating the index size.
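
When comparing reports like the one above, it can help to see each bucket as a share of all lines rather than as a raw count. The sketch below is an assumption about how one might do that, not part of the original workflow: the function name bucket_shares is hypothetical, and it expects input in exactly the "label: count" layout printed by the script.

```shell
# Hypothetical helper (not from the original page): reads a report in the
# "label: count" layout and prints each bucket with its percentage of all lines.
bucket_shares() {
  awk -F': ' '
    /^[ \t]*$/ { next }                      # skip blank lines
    { sub(/^[ \t]+/, "") }                   # drop leading indentation
    $1 == "Number of lines" { total = $2 }
    $1 ~ /^(Line length|Longer line lengths)/ { label[++n] = $1; cnt[n] = $2 }
    END {
      for (i = 1; i <= n; i++)
        printf("%s: %d (%.4f%%)\n", label[i], cnt[i],
               (total > 0 ? 100 * cnt[i] / total : 0))
    }
  ' "$@"
}
```

Saving a report to a file and running `bucket_shares report.txt` (report.txt being a placeholder name) prints one line per bucket with the percentage appended.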


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Dumper2 preprocessed:


average: 10
Number of lines: 1561880670
Line length 0-9: 1407349617
Line length 11-100: 152165026
Line length 101-500: 1369876
Line length 501-1000: 252884
Line length 1001-5000: 330397
Line length 5001-10000: 340029
Line length 10001-50000: 70341
Line length 50001-100000: 1087
Line length 100001-500000: 989
Longer line lengths: 424