| author | James Smith <js5@sanger.ac.uk> | 2021-04-27 12:00:58 +0100 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2021-04-27 12:00:58 +0100 |
| commit | 41b7248730a5f8618a9bd87edd26aecb18d548de (patch) | |
| tree | 9401a08284703620317f36d27b89499ad1a5e66c | |
| parent | 822eb198b1a402d00864f9318ecad7e635593a44 (diff) | |
Update README.md
| -rw-r--r-- | challenge-110/james-smith/README.md | 49 |
1 files changed, 49 insertions, 0 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md
index 6aaba4d212..014a7cd5d7 100644
--- a/challenge-110/james-smith/README.md
+++ b/challenge-110/james-smith/README.md
@@ -228,3 +228,52 @@ sub transpose_seek {
  * We then use the regex trick to get the first column of the data.
+ * Memory usage:
+   * This script does not load the file all in one go - so really needs a lot less memory
+     (vs more disc accesses). It is linear in the number of lines, e.g. for the 1000 line file we load in
+     roughly 1Mb of data at a time, and the memory usage is roughly 1.3Mb.
+   * Note this is `O(n)` as well as if the rows get longer then the number of bytes used does not increase.
+
+### Some information about speed/memory etc...
+
+The following are timings on a single core, 2G RAM, 4G swap machine:
+
+| Method/size | Time (s) | Kbytes | resident | shared |
+| ----------- | -------: | -----: | -------: | -----: |
+| Seek small | 0.001 | 16016| 7836| 5228 |
+| Regex small | 0.000 | 16016| 7836| 5228 |
+| Split small | 0.000 | 16016| 7836| 5228 |
+| Seek 1000 | 1.346 | 17388| 9320| 5228 |
+| Seek 2000 | 5.841 | 18848| 10636| 5228 |
+| Seek 5000 | 54.208 | 23044| 14972| 5228 |
+| Regex 1000 | 1.293 | 25492| 17288| 5228 |
+| Seek 30000 | 3003.220 | 57312| 43948| 2720 |
+| Regex 2000 | 9.040 | 63896| 51376| 3140 |
+| Split 1000 | 0.934 | 105784| 93100| 3204 |
+| Regex 5000 | 130.411 | 260432| 248016| 3204 |
+| Split 2000 | 6.780 | 362028| 349388| 3204 |
+| Split 5000 | 527.614 | 2153576| 1423468| 2764 |
+
+The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (+row/column labels).
+
+File sizes:
+
+| name | size | row size |
+| ----- | -----: | ----: |
+| small | 61 bytes | 12 |
+| 1000 | 6.6 Mbytes | 6.7K |
+| 2000 | 27 Mbytes | 13.5K |
+| 5000 | 165 Mbytes | 33.6K |
+| 30000 | 5.8 Gbytes | 201.0K |
+
+If we look at the timings by method we can see that for the smaller files the `split` is
+the most efficient {but the difference is relatively small}. But as the file size increases
+then it soon becomes the least efficient:
+
+| Size | Split | Regex | Seek |
+| -----: | ----: | ----: | ----: |
+| small | **0.000** | 0.000 | *0.001* |
+| 1000 | **0.934** | 1.293 | *1.346* |
+| 2000 | 6.890 | *9.040* | **5.841** |
+| 5000 | *527.614* | 130.411 | **54.208** |
+| 30000 | - | - | **3003.220** |
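
The timing tables in the diff above can be read as rough growth rates. The following is a minimal sketch, not part of the commit, that estimates the exponent `k` in `time ~ n**k` between consecutive file sizes. The timings are copied from the first table; the `timings` dictionary and the `growth_exponents` helper are names made up for this illustration. Note that the "Split 2000" figure is taken from the first table (6.780), whereas the by-method table lists 6.890.

```python
import math

# Timings in seconds, copied from the first README table in the diff.
# Keys are the number of rows/columns n; the "small" file is left out
# because its timings are too close to zero to compare.
timings = {
    "Seek":  {1000: 1.346, 2000: 5.841, 5000: 54.208, 30000: 3003.220},
    "Regex": {1000: 1.293, 2000: 9.040, 5000: 130.411},
    "Split": {1000: 0.934, 2000: 6.780, 5000: 527.614},
}


def growth_exponents(series):
    """Return [(n1, n2, k)] with k = log(t2/t1) / log(n2/n1) for consecutive sizes."""
    sizes = sorted(series)
    return [
        (a, b, math.log(series[b] / series[a]) / math.log(b / a))
        for a, b in zip(sizes, sizes[1:])
    ]


for method, series in timings.items():
    for a, b, k in growth_exponents(series):
        print(f"{method:5s} {a:>5d} -> {b:<5d} exponent ~ {k:.2f}")
```

Since the file itself holds `n * n` cells, an exponent near 2 means the run time is roughly proportional to the file size. On these figures the seek method stays between 2 and 2.5 across all sizes, while the split method jumps well above 4 between the 2000 and 5000 files, which is also the point where its memory footprint exceeds the 2G of RAM on the test machine.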
