I modified the code from https://coding-school.com/awk-line-length-and-average/ to get line-length frequency counts, i.e. how many lines fall into each length range (up to 10 characters, 11-100, and so on).
awk '
# Count lines per length bucket; every upper bound is inclusive.
# Uninitialised awk variables start at 0, so no BEGIN block is needed.
{
    thislen = length($0)
    totlen += thislen
    # Each else-branch already implies the previous lower bound.
    if      (thislen <= 10)     upTo10++
    else if (thislen <= 100)    upTo100++
    else if (thislen <= 500)    upTo500++
    else if (thislen <= 1000)   upTo1000++
    else if (thislen <= 5000)   upTo5000++
    else if (thislen <= 10000)  upTo10000++
    else if (thislen <= 50000)  upTo50000++
    else if (thislen <= 100000) upTo100000++
    else if (thislen <= 500000) upTo500000++
    else                        upToMore++
}
END {
    # Guard against empty input, where NR is 0.
    printf("average: %d\n", NR ? totlen / NR : 0)
    printf("Number of lines: %d\n", NR)
    # Note: the first bucket also counts lines of exactly 10 characters,
    # although its label says 0-9.
    printf("Line length 0-9: %d\n", upTo10)
    printf("Line length 11-100: %d\n", upTo100)
    printf("Line length 101-500: %d\n", upTo500)
    printf("Line length 501-1000: %d\n", upTo1000)
    printf("Line length 1001-5000: %d\n", upTo5000)
    printf("Line length 5001-10000: %d\n", upTo10000)
    printf("Line length 10001-50000: %d\n", upTo50000)
    printf("Line length 50001-100000: %d\n", upTo100000)
    printf("Line length 100001-500000: %d\n", upTo500000)
    printf("Longer line lengths: %d\n", upToMore)
}
'
Example output below:
average: 12
Number of lines: 1577862522
Line length 0-9: 1405172471
Line length 11-100: 169532329
Line length 101-500: 2142802
Line length 501-1000: 106838
Line length 1001-5000: 472537
Line length 5001-10000: 34165
Line length 10001-50000: 397470
Line length 50001-100000: 2011
Line length 100001-500000: 1460
Longer line lengths: 439
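
As a possible variation on the script above, the bucket limits could be passed in as a list, so that the labels are generated from the same numbers as the comparisons and cannot drift apart. This is only a sketch: the function name line_length_histogram, the bounds variable and the tiny sample input are made up for illustration, and the average is printed with two decimals instead of being truncated.

```shell
# Hypothetical helper: $1 is a space-separated list of inclusive upper bounds.
line_length_histogram() {
    awk -v bounds="$1" '
    BEGIN { n = split(bounds, limit, " ") }   # limit[1..n] = upper bounds
    {
        len = length($0)
        total += len
        for (i = 1; i <= n; i++)              # first bucket whose bound fits
            if (len <= limit[i]) break
        count[i]++                            # i == n + 1 means "longer"
    }
    END {
        printf("average: %.2f\n", NR ? total / NR : 0)
        printf("Number of lines: %d\n", NR)
        lower = 0
        for (i = 1; i <= n; i++) {
            printf("Line length %d-%d: %d\n", lower, limit[i], count[i])
            lower = limit[i] + 1
        }
        printf("Longer line lengths: %d\n", count[n + 1])
    }'
}

# Sample input (made up): four lines of 0, 3, 10 and 11 characters.
printf '\nabc\n0123456789\n0123456789a\n' | line_length_histogram "10 100 500"
# average: 6.00
# Number of lines: 4
# Line length 0-10: 3
# Line length 11-100: 1
# Line length 101-500: 0
# Longer line lengths: 0
```

With the limits in one place, changing the buckets only means changing the argument, e.g. "10 100 500 1000 5000 10000 50000 100000 500000" to reproduce the ranges used above.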
------------------------------------------------------------------------------------------------------
Now I have the number of documents as well as the distribution of their sizes.
This data is interesting when it comes to the index size.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dumper2 preprocessed:
average: 10
Number of lines: 1561880670
Line length 0-9: 1407349617
Line length 11-100: 152165026
Line length 101-500: 1369876
Line length 501-1000: 252884
Line length 1001-5000: 330397
Line length 5001-10000: 340029
Line length 10001-50000: 70341
Line length 50001-100000: 1087
Line length 100001-500000: 989
Longer line lengths: 424