author James Smith <js5@sanger.ac.uk> 2021-04-27 12:00:58 +0100
committer GitHub <noreply@github.com> 2021-04-27 12:00:58 +0100
commit 41b7248730a5f8618a9bd87edd26aecb18d548de
tree 9401a08284703620317f36d27b89499ad1a5e66c
parent 822eb198b1a402d00864f9318ecad7e635593a44
Update README.md
-rw-r--r-- challenge-110/james-smith/README.md 49
1 files changed, 49 insertions, 0 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md
index 6aaba4d212..014a7cd5d7 100644
--- a/challenge-110/james-smith/README.md
+++ b/challenge-110/james-smith/README.md
@@ -228,3 +228,52 @@ sub transpose_seek {
* We then use the regex trick to get the first column of the data.
+ * Memory usage:
+ * This script does not load the whole file in one go, so it needs far less memory
+ (at the cost of more disc accesses). Memory use is linear in the number of lines, e.g. for the 1000 line file we load in
+ roughly 1 Mbyte of data at a time, and the memory usage is roughly 1.3 Mbytes.
+ * Note that this stays `O(n)` in the number of lines even when the rows get longer: the number of bytes used does not increase with row length.
+
+### Some information about speed/memory etc...
+
+The following timings were taken on a machine with a single core, 2G RAM and 4G swap:
+
+| Method/size | Time (s) | Kbytes | resident | shared |
+| ----------- | -------: | -----: | -------: | -----: |
+| Seek small | 0.001 | 16016| 7836| 5228 |
+| Regex small | 0.000 | 16016| 7836| 5228 |
+| Split small | 0.000 | 16016| 7836| 5228 |
+| Seek 1000 | 1.346 | 17388| 9320| 5228 |
+| Seek 2000 | 5.841 | 18848| 10636| 5228 |
+| Seek 5000 | 54.208 | 23044| 14972| 5228 |
+| Regex 1000 | 1.293 | 25492| 17288| 5228 |
+| Seek 30000 | 3003.220 | 57312| 43948| 2720 |
+| Regex 2000 | 9.040 | 63896| 51376| 3140 |
+| Split 1000 | 0.934 | 105784| 93100| 3204 |
+| Regex 5000 | 130.411 | 260432| 248016| 3204 |
+| Split 2000 | 6.780 | 362028| 349388| 3204 |
+| Split 5000 | 527.614 | 2153576| 1423468| 2764 |
+
+The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (+row/column labels).
+
+File sizes:
+
+| name | size | row size (bytes) |
+| ----- | -----: | ----: |
+| small | 61 bytes | 12 |
+| 1000 | 6.6 Mbytes | 6.7K |
+| 2000 | 27 Mbytes | 13.5K |
+| 5000 | 165 Mbytes | 33.6K |
+| 30000 | 5.8 Gbytes | 201.0K |
+
+If we look at the timings by method (fastest in bold, slowest in italics) we can see that for the smaller files `split` is
+the most efficient (but the difference is relatively small). As the file size increases,
+however, it soon becomes the least efficient:
+
+| Size | Split | Regex | Seek |
+| -----: | ----: | ----: | ----: |
+| small | **0.000** | 0.000 | *0.001* |
+| 1000 | **0.934** | 1.293 | *1.346* |
+| 2000 | 6.890 | *9.040* | **5.841** |
+| 5000 | *527.614* | 130.411 | **54.208** |
+| 30000 | - | - | **3003.220** |
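+
+As a rough reading of the `Seek` column above, one can assume (this is an assumption for illustration, not something the measurements establish) that the time grows as a power law $t \propto n^k$ in the size $n$. The exponent between two measurements is then estimated as:
+
+$$
+k \approx \frac{\ln(t_2 / t_1)}{\ln(n_2 / n_1)}
+$$
+
+Plugging in the `Seek` timings from the table:
+
+$$
+\begin{aligned}
+1000 \to 2000 &: \quad k \approx \frac{\ln(5.841 / 1.346)}{\ln 2} \approx 2.1 \\
+2000 \to 5000 &: \quad k \approx \frac{\ln(54.208 / 5.841)}{\ln 2.5} \approx 2.4 \\
+5000 \to 30000 &: \quad k \approx \frac{\ln(3003.220 / 54.208)}{\ln 6} \approx 2.2
+\end{aligned}
+$$
+
+Since the file itself holds $n^2$ cells, an exponent close to 2 would mean the time is roughly proportional to the file size; the estimates above are a little over 2.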