| author | drbaggy <js5@sanger.ac.uk> | 2021-04-28 01:11:07 +0100 |
|---|---|---|
| committer | drbaggy <js5@sanger.ac.uk> | 2021-04-28 01:11:07 +0100 |
| commit | eed3560d163299335d7afcd3ed2532d00143db09 (patch) | |
| tree | 5711f91f3ea73cadadce1b7aaea65f03b9993008 | |
| parent | cf96d0a1ab3c94711d2787022e7df1b3a38a8805 (diff) | |
Update README.md
| -rw-r--r-- | challenge-110/james-smith/README.md | 68 |
1 files changed, 37 insertions, 31 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md
index efd8fb6015..433fbf84f7 100644
--- a/challenge-110/james-smith/README.md
+++ b/challenge-110/james-smith/README.md
@@ -217,10 +217,7 @@ sub transpose_seek {
 ``` perl
   while( <$fh> ) {
     push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) &&
-      (
-        ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]),
-        $prev=tell $fh
-      );
+      ( $prev=tell $fh );
   }
 ```
@@ -247,31 +244,34 @@ sub transpose_seek {
   while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
     seek $fh, $_->[0], 0;
     read $fh, $_->[2], $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0], length $_->[2];
-    $_->[0] = tell $fh;
+    $_->[0] += $BYTES;
   }
 ```
 In this loop we see if the row does not contain a comma AND there is data left... If this is the case we have to retrieve more data from the file. We do this by first `seek`ing to the location in the file that we need to get data from. We then retrieve the either $BYTES `bytes` of data (or all the data left for the row {if it is less than `$BYTES` bytes.}
- We then update the location for that particular row (using `tell`).
+ We then update the location for that particular row (by adding `$BYTES` we can ignore the fact
+ that we overshot.
 Note also we use the 4 parameter version of read. `read $fh, $buffer, $bytes, $offset` By adding the offset - we can easily append this content onto the end of our buffer string. We have
- to use length `$_->[2]` as you can use -ve indecies to read into the buffer with an offset from the
+ to use `length $_->[2]` as you can use -ve indecies to read into the buffer with an offset from the
 end - but this only works for -1, -2 etc not "-0".
- * We then use the regex trick to get the first column of the data.
+ * We then use the regex trick in 2b to get the first column of the data.
 * Memory usage:
 * This script does not load the file all in one go - so really needs a lot less memory (vs more disc accesses). It is linear in the number of lines, e.g.
 for the 1000 line file we load in roughly 1Mb of data at a time, and the memory usage is roughly 1.3Mb.
+ * Note this is `O(n)` as well as if the rows get longer then the number of bytes used does not increase.
+ * Having played a bit - the sweet spot of `$BYTES` lies somewhere between 1K and 2K. Smaller makes the regex in the split more efficient, larger reduces the file IO overhead.
@@ -279,25 +279,29 @@ sub transpose_seek {
 The following are timings on a single core, 2G RAM, 4G swap machine:
 
-| Method/size | Time (s) | Kbytes | resident | shared |
-| ----------- | -------: | -----: | -------: | -----: |
-| Seek small | 0.001 | 16016 | 7836 | 5228 |
-| Regex small | 0.000 | 16016 | 7836 | 5228 |
-| Split small | 0.000 | 16016 | 7836 | 5228 |
-| Seek 1000 | 1.346 | 17388 | 9320 | 5228 |
-| Seek 2000 | 5.841 | 18848 | 10636 | 5228 |
-| Seek 5000 | 54.208 | 23044 | 14972 | 5228 |
-| Regex 1000 | 1.293 | 25492 | 17288 | 5228 |
-| Seek 30000 | 3003.220 | 57312 | 43948 | 2720 |
-| Regex 2000 | 9.040 | 63896 | 51376 | 3140 |
-| Split 1000 | 0.934 | 105784 | 93100 | 3204 |
-| Regex 5000 | 130.411 | 260432 | 248016 | 3204 |
-| Split 2000 | 6.780 | 362028 | 349388 | 3204 |
-| Split 5000 | 527.614 | 2153576 | 1423468 | 2764 |
+**Timings:**
+
+We list these in order of "memory consumption"...
+
+| Method/size | Time (s) | Kbytes | resident | shared |
+| ----------- | --------: | --------: | --------: | -----: |
+| Seek small | 0.000 | 16,016 | 7,836 | 5,228 |
+| Regex small | 0.000 | 16,016 | 7,836 | 5,228 |
+| Split small | 0.000 | 16,016 | 7,836 | 5,228 |
+| Seek 1000 | 1.346 | 17,388 | 9,320 | 5,228 |
+| Seek 2000 | 5.841 | 18,848 | 10,636 | 5,228 |
+| Seek 5000 | 54.208 | 23,044 | 14,972 | 5,228 |
+| Regex 1000 | 1.293 | 25,492 | 17,288 | 5,228 |
+| Seek 30000 | 3,003.220 | 57,312 | 43,948 | 2,720 |
+| Regex 2000 | 9.040 | 63,896 | 51,376 | 3,140 |
+| Split 1000 | 0.934 | 105,784 | 93,100 | 3,204 |
+| Regex 5000 | 130.411 | 260,432 | 248,016 | 3,204 |
+| Split 2000 | 6.780 | 362,028 | 349,388 | 3,204 |
+| Split 5000 | 527.614 | 2,153,576 | 1,423,468 | 2,764 |
 
 The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (+row/column labels).
 
-File sizes:
+**File sizes:**
 
 | name | rows | columns | size | row size |
 | ------------ | -----: | ------: | ---------: | -------: |
@@ -311,10 +315,12 @@ If we look at the timings by method we can see that for the smaller files the `s
 the most efficient {but the difference is relatively small}. But as the file size increases then it soon becomes the least efficient:
 
-| Size | Split | Regex | Seek |
-| -----: | ----------: | ----------: | -----------: |
-| small | **0.000** | 0.000 | *0.001* |
-| 1000 | **0.934** | 1.293 | *1.346* |
-| 2000 | 6.890 | *9.040* | **5.841** |
-| 5000 | *527.614* | 130.411 | **54.208** |
-| 30000 | - | - | **3003.220** |
+**Comparisons:**
+
+| Size | Split | Regex | Seek |
+| -----: | ----------: | ----------: | ------------: |
+| small | **0.000** | 0.000 | *0.000* |
+| 1000 | **0.934** | 1.293 | *1.346* |
+| 2000 | 6.890 | *9.040* | **5.841** |
+| 5000 | *527.614* | 130.411 | **54.208** |
+| 30000 | - | - | **3,003.220** |
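
The README text in this diff describes appending to a buffer with the 4-parameter form of `read`, using `length` of the buffer as the offset. The following is a minimal sketch of that idea, not the code from the commit: the helper name `first_field`, the sample data, and the use of an in-memory filehandle are all assumptions made for illustration. Unlike the committed loop, which advances the stored position by `$BYTES`, this sketch advances by the byte count that `read` returns.

``` perl
use strict;
use warnings;

# Hypothetical helper (not part of the commit): return the first
# comma-separated field of the row stored in bytes [$start, $end) of $fh,
# fetching at most $bytes bytes per call to read.
sub first_field {
    my ( $fh, $start, $end, $bytes ) = @_;
    my $buffer = '';
    my $pos    = $start;
    # Keep fetching while no comma has been seen and row data remains.
    while ( $buffer !~ m{,} && $pos < $end ) {
        my $want = $end - $pos > $bytes ? $bytes : $end - $pos;
        seek $fh, $pos, 0 or die "seek failed: $!";
        # 4-parameter read: the offset length($buffer) appends to the buffer.
        my $got = read $fh, $buffer, $want, length $buffer;
        die "read failed: $!" unless defined $got;
        last unless $got;    # stop on unexpected end of file
        $pos += $got;
    }
    # Everything before the first comma (or newline) is the first field.
    my ($field) = $buffer =~ m{\A([^,\n]*)};
    return $field;
}

# Assumed sample data: two rows of 17 bytes each, read from memory.
my $data = "alpha,beta,gamma\nlonger_label,1,2\n";
open my $fh, '<', \$data or die "open failed: $!";
print first_field( $fh, 0,  17, 4 ), "\n";    # prints "alpha"
print first_field( $fh, 17, 34, 4 ), "\n";    # prints "longer_label"
```

With a chunk size of 4 the second row needs four reads before a comma appears, and each read lands at the end of the buffer because the offset is recomputed from `length $buffer` on every pass.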
