Update README.md

author: drbaggy <js5@sanger.ac.uk> 2021-04-28 00:48:16 +0100
committer: drbaggy <js5@sanger.ac.uk> 2021-04-28 00:48:16 +0100
commit: c4f6f131cf58633c068d07d8888abfe7ffc7000a (patch)
tree: 3521b6bb23a3bcb81204e78c08256b2e67558b8f
parent: 6ab10bd9b8703390dbfb1b058cbed79844b87984 (diff)
1 files changed, 66 insertions, 34 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md
index 8e08f1b553..c9e2e1d24b 100644
--- a/challenge-110/james-smith/README.md
+++ b/challenge-110/james-smith/README.md
@@ -1,3 +1,5 @@
+# Perl Weekly Challenge #110
+
 # Challenge 1 - valid phone numbers...
 
 You are given a text file - Write a script to display all valid phone numbers in the given text file.
@@ -20,11 +22,33 @@ We group the three prefix patterns into a group match with `(`s and `|`s - remem
 we save memory by not storing the match.
 
 We wrap this regex in a function call:
+
 ``` perl
 sub is_valid_phone_number {
   return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
 }
 ```
+
+or "commented" using the "x" modifier...
+
+``` perl
+sub is_valid_phone_number {
+  return m{
+    \A          # Start of line
+    \s*         # Possibly white-space
+    (?:         # Prefix - one of:
+      [+]\d+ |  #   +{digits}
+      00\d+  |  #   00{digits}
+      [(]\d+[)] #   ({digits})
+    )
+    \s+         # Some white-space
+    \d+         # String of numbers
+    \s*         # Possibly white-space
+    \Z          # End of line
+  }x;
+}
+```
+
 We can then just use this to grep over the lines of the file....
 
 ``` perl
@@ -57,7 +81,7 @@ show the pythonistas that actually Perl was still a better language for this sor
 Investigating the problem I realised that the method they were using was a slurp and print model....
 The problem with that for such large files was memory. Once slurped, chopped etc the machine was
 swapping OR running out of memory. So had to come up with a cleaner script... I will outline 3
-methods of performing this
+methods of performing this.
 
 ## Solution 2a - The simplest solution - load in and split into arrays of arrays.
 
@@ -72,11 +96,13 @@ is a simple one liner.....
 
 ``` perl
 sub transpose_split {
+  ## Slurp into array
   open my $fh, '<', $_[0];
-  my @in =  map { chomp;[ split /,/ ] } <$fh>;                 ## Slurp into array
+  my @in =  map { chomp;[ split /,/ ] } <$fh>;
   close $fh;
+  ## Generate transpose;
   open $fh, '>', $_[1];
-  say {$fh} join ',', map {shift @{$_} } @in while @{$in[0]};  ## Generate transpose;
+  say {$fh} join ',', map {shift @{$_} } @in while @{$in[0]};
   close $fh;
 }
 ```
@@ -95,9 +121,11 @@ strings are empty.
 
 ``` perl
 sub transpose_regex {
+  ## Slurp into array
   open my $fh, '<', $_[0];
-  my @in = <$fh>;                                              ## Slurp into array
+  my @in = <$fh>;
   close $fh;
+  ## Generate transpose;
   open $fh, '>', $_[1];
   say {$fh} join ',', map { s{^(.*?)[,\r\n]+}{}; $1 } @in while $in[0];
   close $fh;
@@ -139,10 +167,14 @@ sub transpose_seek {
   my($prev,@pos) = (0);
   open my $fh,  '<', $_[0];
   open my $ofh, '>', $_[1];
-  ## Loop through the file and get the start/end of each line
+  ## Loop through the file and get the start/end position of each line,
+  ## and the first $BYTES characters of each line...
   push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) &&
        ( ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]), $prev=tell $fh) while <$fh>;
 
+  ## While we still have "columns" loop through each row and grab the first
+  ## entry and output results.
+
   while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
     my @line;
     foreach(@pos) {
@@ -241,42 +273,42 @@ sub transpose_seek {
 
 The following are timings on a single core, 2G RAM, 4G swap machine:
 
-| Method/size | Time (s) | Kbytes | resident | shared |
-| ----------- | -------: | -----: | -------: | -----: |
-| Seek small  | 0.001 | 16016| 7836| 5228 |
-| Regex small | 0.000 | 16016| 7836| 5228 |
-| Split small | 0.000 | 16016| 7836| 5228 |
-| Seek 1000   | 1.346 | 17388| 9320| 5228 |
-| Seek 2000   | 5.841 | 18848| 10636| 5228 |
-| Seek 5000   | 54.208 | 23044| 14972| 5228 |
-| Regex 1000  | 1.293 | 25492| 17288| 5228 |
-| Seek 30000  | 3003.220 | 57312| 43948| 2720 |
-| Regex 2000  | 9.040 | 63896| 51376| 3140 |
-| Split 1000  | 0.934 | 105784| 93100| 3204 |
-| Regex 5000  | 130.411 | 260432| 248016| 3204 |
-| Split 2000  | 6.780 | 362028| 349388| 3204 |
-| Split 5000  | 527.614 | 2153576| 1423468| 2764 |
+| Method/size | Time (s) | Kbytes  | resident | shared |
+| ----------- | -------: | -----:  | -------: | -----: |
+| Seek small  |    0.001 |   16016 |     7836 |   5228 |
+| Regex small |    0.000 |   16016 |     7836 |   5228 |
+| Split small |    0.000 |   16016 |     7836 |   5228 |
+| Seek 1000   |    1.346 |   17388 |     9320 |   5228 |
+| Seek 2000   |    5.841 |   18848 |    10636 |   5228 |
+| Seek 5000   |   54.208 |   23044 |    14972 |   5228 |
+| Regex 1000  |    1.293 |   25492 |    17288 |   5228 |
+| Seek 30000  | 3003.220 |   57312 |    43948 |   2720 |
+| Regex 2000  |    9.040 |   63896 |    51376 |   3140 |
+| Split 1000  |    0.934 |  105784 |    93100 |   3204 |
+| Regex 5000  |  130.411 |  260432 |   248016 |   3204 |
+| Split 2000  |    6.780 |  362028 |   349388 |   3204 |
+| Split 5000  |  527.614 | 2153576 |  1423468 |   2764 |
 
 The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (+row/column labels).
 
 File sizes:
 
-| name  | size | row size |
-| ----- | -----: | ----: |
-| small | 61 bytes | 12 |
-|  1000 | 6.6 Mbytes | 6.7K |
-|  2000 | 27 Mbytes | 13.5K |
-|  5000 | 165 Mbytes | 33.6K |
-| 30000 | 5.8 Gbytes | 201.0K |
+| name         | size       | row size |
+| ------------ | ---------: | -------: |
+| in-small.txt |   61 bytes |       12 |
+|  in-1000.txt | 6.6 Mbytes |     6.7K |
+|  in-2000.txt |  27 Mbytes |    13.5K |
+|  in-5000.txt | 165 Mbytes |    33.6K |
+| in-30000.txt | 5.8 Gbytes |   201.0K |
 
 If we look at the timings by method we can see that for the smaller files the `split` is
 the most efficient {but the difference is relatively small}. But as the file size increases
 then it soon becomes the least efficient:
 
-| Size   | Split | Regex | Seek |
-| -----: | ----: | ----: | ----: |
-| small  | **0.000** | 0.000 | *0.001* |
-| 1000   | **0.934** | 1.293 | *1.346* |
-| 2000   | 6.890 | *9.040* | **5.841** |
-| 5000   | *527.614* | 130.411 | **54.208** |
-| 30000  | - | - | **3003.220** |
+| Size   | Split       | Regex       | Seek         |
+| -----: | ----------: | ----------: | -----------: |
+| small  |   **0.000** |     0.000   |     *0.001*  |
+| 1000   |   **0.934** |     1.293   |     *1.346*  |
+| 2000   |     6.890   |    *9.040*  |    **5.841** |
+| 5000   |  *527.614*  |   130.411   |   **54.208** |
+| 30000  |         -   |         -   | **3003.220** |
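
Review note: the patch adds a commented `/x` version of the phone-number regex alongside the compact one. A minimal sketch for checking that the two forms accept the same lines is shown here. The function names `is_valid_compact` and `is_valid_commented` and the sample lines are assumptions made for illustration; they are not part of the commit. Both functions match against `$_`, as the README's version does.

```perl
use strict;
use warnings;

# Compact form, as shown in the README; matches against $_.
sub is_valid_compact {
  return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
}

# Commented form using the /x modifier, as added by this patch.
sub is_valid_commented {
  return m{
    \A          # Start of line
    \s*         # Possibly white-space
    (?:         # Prefix - one of:
      [+]\d+ |  #   +{digits}
      00\d+  |  #   00{digits}
      [(]\d+[)] #   ({digits})
    )
    \s+         # Some white-space
    \d+         # String of numbers
    \s*         # Possibly white-space
    \Z          # End of line
  }x;
}

# Illustrative sample lines (assumed, not read from a file).
my @samples = (
  '0044 1148820341',
  ' +44 1148820341',
  '44-11-4882-0341',
  '(44) 1148820341',
  '00 1148820341',
);

# grep aliases $_ to each element, which is what both functions test.
my @compact   = grep { is_valid_compact()   } @samples;
my @commented = grep { is_valid_commented() } @samples;

print "$_\n" for @commented;
```

With these samples both forms accept the first, second and fourth lines. `44-11-4882-0341` has no recognised prefix, and `00 1148820341` fails because `00\d+` needs at least one digit after the `00`.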