First copy of "blog"

author: James Smith <js5@sanger.ac.uk> 2021-04-27 11:34:46 +0100
committer: GitHub <noreply@github.com> 2021-04-27 11:34:46 +0100
commit: d6e47833e1d56c4e3f698b04af5228b0157106b5 (patch)
tree: a7625d6c25a643c1871fd6b39b5e47c3491bd3f7
parent: 0b20bcb44f62e12d33cc4080e45f93a47b0cddb6 (diff)
download: perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.tar.gz
perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.tar.bz2
perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.zip
1 files changed, 229 insertions, 1 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md
index 3b736748ea..26fdc5d217 100644
--- a/challenge-110/james-smith/README.md
+++ b/challenge-110/james-smith/README.md
@@ -1 +1,229 @@
-Solutions by James Smith.
+# Challenge 1 - valid phone numbers...
+
+You are given a text file - Write a script to display all valid phone numbers in the given text file.
+
+## Solution 1
+
+This is what Perl was designed for - pattern matching strings. The format we accept numbers is one of:
+
+```
+   +nn  nnnnnnnnnn
+   (nn) nnnnnnnnnn
+   nnnn nnnnnnnnnn
+```
+
+We can write this as a single regex - there are two parts to the pattern - the prefix and the number
+itself. The latter is just a simple number match {we don't specify length as in some countries this
+is of different lengths}.
+
+We group the three prefix patterns into a group match with `(`s and `|`s - remembering to add `?:` so
+we save memory by not storing the match.
+
+We wrap this regex in a function call:
+``` perl
+sub is_valid_phone_number {
+  return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
+}
+```
+We can then just use this to grep over the lines of the file....
+
+``` perl
+print grep { is_valid_phone_number($_) } <>;
+```
+
+note we use `print` rather than `say` as we haven't stripped the return character from the line!
+
+# Challenge 2 - transposing a file
+
+You are given a text file. Write a script to transpose the contents of the given file.
+
+## Some background - I've seen this before!
+
+Now this is a problem I have come across before - I work at a genomic institute which handles large
+amounts of data - some of these are in the forms of very large tab or comma separated value files
+containing many 1000s of rows and many 1000s of columns, e.g. gene expression vs samples, genes vs
+variations. These files can be 10G or larger in size....
+
+Often these files are generated by automatic processes - but when you want to perform the next step
+you would rather work down columns rather than across rows...!
+
+The problem as was posed on a Python mailing list is that this was taking for ever to run.... 
+Firstly moving back to python 2 helped, but unless the scripts were run on big memory machine {64G+}
+then it was still taking a long time...
+
+As a non-python developer on the list - I looked to see if this could be easily achieved in Perl, to
+show the pythonistas that actually Perl was still a better language for this sort of thing.
+
+Investigating the problem I realised that the method they were using was a slurp and print model....
+The problem with that for such large files was memory. Once slurped, chopped etc the machine was
+swapping OR running out of memory. So had to come up with a cleaner script... I will outline 3
+methods of performing this
+
+## Solution 2a - The simplest solution - load in and split into arrays of arrays.
+
+The simplest solution is to slurp the data in the file into an array of array(ref)s.
+
+This is just a case of a `map` which splits the line into it's chunks - we use chomp
+to remove any unneeded line endings.
+
+Once we have an array of arrays we take the first entry out of each row and printing
+these in one line. We repeat this until we reach the end of the lines... Again this
+is a simple one liner.....
+
+``` perl
+sub transpose_split {
+  open my $fh, '<', $_[0];
+  my @in =  map { chomp;[ split /,/ ] } <$fh>;                 ## Slurp into array
+  close $fh;
+  open $fh, '>', $_[1];
+  say {$fh} join ',', map {shift @{$_} } @in while @{$in[0]};  ## Generate transpose;
+  close $fh;
+}
+```
+
+For an 1000 x 1000 file (6.75Mb), this uses approximately 90Mb of memory. This is
+approximately 12-15 x space multiplier..
+
+## Solution 2b - Load in but don't split into arrays
+
+This time we want to look at how we can save memory - one of the problems with the
+solution above is the huge overhead of storing data in an array.
+
+To resolve this we can just slurp all the data into the array, and then use a regex
+to strip the characters up to the `,` or "end of line" and output them, until the
+strings are empty.
+
+``` perl
+sub transpose_regex {
+  open my $fh, '<', $_[0];
+  my @in = <$fh>;                                              ## Slurp into array
+  close $fh;
+  open $fh, '>', $_[1];
+  say {$fh} join ',', map { s{^(.*?)[,\r\n]+}{}; $1 } @in while $in[0];
+  close $fh;
+}
+```
+This is more efficient - for the 1000 x 1000 example it only uses approx 9Mb of
+memory this is on only about a 1.33 x space multiplier... This still becomes more
+of an issue as the file size gets larger
+
+## Solution 2c - The ultimate large file solution - using `tell`, `seek` and `read`;
+
+Finally a solution for "very" large-files...
+
+Rather than slurping all the data into memory in one go - we instead pull content of
+each of the lines a few bytes at a time, working with that - and going back to the
+file system to retrieve more data.
+
+Perl has some simple functions for moving the "head" around in the file:
+
+  * `tell` - return the current location in the file
+  * `seek` - set the current location in the file.
+
+Additionally it has another way of accessing the file content rather than `<$fh>`,
+by using the lower level `read` command. This reads contents for the file and
+stores it into a scalar variable.
+
+So this requires a "double" pass solution - first of all we use `<$fh>` to retrieve
+every line of data from the file, noting the start and end of the line {in byte offsets} and
+remebering the first `$BYTES` bytes from that line {in this example we set this to 1000 bytes}.
+
+We then use a similar approach to the previous one - where we use a regex to find the value
+of the first column. But if we can't find a comma OR the end of the line - we retrieve another
+`$BYTES` bytes of data to the string (until we can find a comma or the end of the line).
+
+If we get to the end of the line we only retrieve the number of bytes left in that line.
+
+``` perl
+sub transpose_seek {
+  my($prev,@pos) = (0);
+  open my $fh,  '<', $_[0];
+  open my $ofh, '>', $_[1];
+  ## Loop through the file and get the start/end of each line
+  push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) &&
+       ( ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]), $prev=tell $fh) while <$fh>;
+
+  while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
+    my @line;
+    foreach(@pos) {
+      while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
+        seek $fh, $_->[0], 0;
+        read $fh, $_->[2], $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0], length $_->[2];
+        $_->[0] = tell $fh;
+      }
+      $_->[2] =~ s{^([^,\r\n]+)[,\r\n]*}{};
+      push @line, $1;
+    }
+    say ${$ofh} join q(,), @line;
+  }
+}
+```
+
+### Notes on this code...
+
+ * `@pos` is an array of triples:
+
+   * **`[0]`** - the current offset in the file for the next chunk of this line;
+
+   * **`[1]`** - the end of the current line;
+
+   * **`[2]`** - the buffer for this line (initially the first `$BYTES` bytes)
+
+ * Let's look at the three while loops....
+  
+    * The push/while line slurping in data
+
+      We user `while <$fh>` here rather than `foreach` as `while` doesn't accumulate
+      data into an array, where as `foreach <$fh>` does, saving memory.
+
+``` perl
+while( <$fh> ) {
+  push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) &&
+       (
+         ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]),
+         $prev=tell $fh
+       ) 
+}
+```
+      This uses a lot of ***1-liner*** tricks - first of all we push the start, end and first
+      `$BYTES` bytes into the @pos array. This always evaluates to true, so we run the second
+      bracketed statement. This (a) sets the start of the next block to the end of this current
+      line if we have overshot, and then (b) updates the `$prev` value so we know where the
+      next line starts in the file....
+    
+    * Now let us look at the 2nd while loop (outer)...
+
+``` perl
+  while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
+
+  }
+```
+      This is checking that (a) we haven't run out of data to be retrieved for the first and
+      (b) haven't run out of data we have already retrieved.
+
+      
+    * Finally the 3rd while loop (inner)... 
+
+``` perl
+      while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
+        seek $fh, $_->[0], 0;
+        read $fh, $_->[2], $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0], length $_->[2];
+        $_->[0] = tell $fh;
+      } 
+```
+      In this loop we see if the row does not contain a comma AND there is data left... If this is the
+      case we have to retrieve more data from the file. We do this by first `seek`ing to the location
+      in the file that we need to get data from. We then retrieve the either $BYTES `bytes` of data (or
+      all the data left for the row {if it is less than `$BYTES` bytes.}
+      We then update the location for that particular row (using `tell`).
+
+      Note also we use the 4 parameter version of read.
+
+         `read $fh, $buffer, $bytes, $offset`
+     
+      By adding the offset - we can easily append this content onto the end of our buffer string. We have
+      to use length `$_->[2]` as you can use -ve indecies to read into the buffer with an offset from the
+      end - but this only works for -1, -2 etc not "-0".
+
+    * We then use the regex trick to get the first column of the data.
+
author	James Smith <js5@sanger.ac.uk>	2021-04-27 11:34:46 +0100
committer	GitHub <noreply@github.com>	2021-04-27 11:34:46 +0100
commit	d6e47833e1d56c4e3f698b04af5228b0157106b5 (patch)
tree	a7625d6c25a643c1871fd6b39b5e47c3491bd3f7
parent	0b20bcb44f62e12d33cc4080e45f93a47b0cddb6 (diff)
download	perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.tar.gz perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.tar.bz2 perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.zip