# Perl Weekly Challenge #110

# Challenge 1 - valid phone numbers...

You are given a text file. Write a script to display all valid phone numbers in the given text file.

## Solution 1

This is what Perl was designed for - pattern matching strings. The formats we accept are one of:

```
+nn nnnnnnnnnn
(nn) nnnnnnnnnn
nnnn nnnnnnnnnn
```

We can write this as a single regex. There are two parts to the pattern - the prefix and the number itself. The latter is just a simple digit match {we don't specify a length as this differs between countries}. We group the three prefix patterns into a single alternation with `(`s and `|`s - remembering to add `?:` so we save memory by not storing the match. We wrap this regex in a function call:

``` perl
sub is_valid_phone_number {
  return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
}
```

or "commented" using the `x` modifier...

``` perl
sub is_valid_phone_number {
  return m{
    \A            # Start of line
    \s*           # Possibly white-space
    (?:           # Prefix - one of:
      [+]\d+    | #   +{digits}
      00\d+     | #   00{digits}
      [(]\d+[)]   #   ({digits})
    )
    \s+           # Some white-space
    \d+           # String of digits
    \s*           # Possibly white-space
    \Z            # End of line
  }x;
}
```

We can then just use this to grep over the lines of the file...

``` perl
print grep { is_valid_phone_number($_) } <>;
```

Note we use `print` rather than `say` as we haven't stripped the return character from the line!
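To see this in action, here is a small self-contained test harness - it inlines a slightly more explicit variant of the sub (matching against its argument rather than `$_`), and the sample lines in the `__DATA__` block are invented purely for illustration:

``` perl
use strict;
use warnings;

## Explicit variant of the sub above: match against the argument.
sub is_valid_phone_number {
  return $_[0] =~ m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
}

## Prints the first, second and fourth sample lines only.
print grep { is_valid_phone_number($_) } <DATA>;

__DATA__
 0044 1148820341
   +44 1148820341
 44-11-4882-0341
(44) 1148820341
  00 1148820341
```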
# Challenge 2 - transposing a file

You are given a text file. Write a script to transpose the contents of the given file.

## Some background - I've seen this before!

This is a problem I have come across before - I work at a genomic institute which handles large amounts of data, some of it in the form of very large tab- or comma-separated value files containing many 1000s of rows and many 1000s of columns, e.g. gene expression vs samples, genes vs variations. These files can be 10G or larger in size...

Often these files are generated by automatic processes - but when you want to perform the next step you would rather work down columns than across rows...!

The problem, as posed on a Python mailing list, was that this was taking forever to run... Firstly, moving back to Python 2 helped, but unless the scripts were run on a big-memory machine {64G+} it still took a long time...

As a non-Python developer on the list, I looked to see if this could be easily achieved in Perl, to show the Pythonistas that Perl was still a better language for this sort of thing. Investigating the problem I realised that the method they were using was a slurp-and-print model... The problem with that for such large files was memory. Once slurped, chopped etc. the machine was either swapping or running out of memory. So I had to come up with a cleaner script...

I will outline three methods of performing this.

## Solution 2a - The simplest solution - load in and split into arrays of arrays

The simplest solution is to slurp the data from the file into an array of array(ref)s. This is just a case of a `map` which splits each line into its chunks - we use `chomp` to remove any unneeded line endings. Once we have an array of arrays we take the first entry out of each row and print these as one line, repeating until we run out of columns... Again this is a simple one-liner...

``` perl
sub transpose_split {
  ## Slurp into an array of arrays
  open my $fh, '<', $_[0];
  my @in = map { chomp; [ split /,/ ] } <$fh>;
  close $fh;
  ## Generate transpose
  open $fh, '>', $_[1];
  say {$fh} join ',', map { shift @{$_} } @in while @{$in[0]};
  close $fh;
}
```

For a 1000 x 1000 file (6.75Mb), this uses approximately 90Mb of memory - roughly a 12-15x space multiplier.

## Solution 2b - Load in but don't split into arrays

This time we want to look at how we can save memory - one of the problems with the solution above is the huge overhead of storing the data in arrays. To resolve this we just slurp the lines into an array of strings, and then use a regex to strip off the characters up to the `,` or end of line and output them, until the strings are empty.

``` perl
sub transpose_regex {
  ## Slurp into an array of lines
  open my $fh, '<', $_[0];
  my @in = <$fh>;
  close $fh;
  ## Generate transpose
  open $fh, '>', $_[1];
  say {$fh} join ',', map { s{^(.*?)[,\r\n]+}{}; $1 } @in while $in[0];
  close $fh;
}
```

This is more efficient - for the 1000 x 1000 example it only uses approximately 9Mb of memory, which is only about a 1.33x space multiplier... Memory still becomes more of an issue as the file size gets larger, though.

## Solution 2c - The ultimate large-file solution - using `tell`, `seek` and `read`

Finally a solution for "very" large files... Rather than slurping all the data into memory in one go, we instead pull the content of each line a few bytes at a time, work with that, and go back to the file system to retrieve more data as needed.

Perl has some simple functions for moving the "head" around in a file:

* `tell` - return the current location in the file;
* `seek` - set the current location in the file.

Additionally it has another way of accessing the file content rather than `<$fh>`: the lower-level `read` command. This reads content from the file and stores it in a scalar variable.

So this requires a "double pass" solution. First of all we use `<$fh>` to retrieve every line of the file, noting the start and end of each line {in byte offsets} and remembering the first `$BYTES` bytes of that line {in this example we set `$BYTES` to 1000}.

We then use a similar approach to the previous solution, using a regex to find the value of the first column. But if we can't find a comma OR the end of the line, we append another `$BYTES` bytes of data to the string (until we can find a comma or the end of the line). If we get to the end of the line we only retrieve the number of bytes left in that line.

``` perl
sub transpose_seek {
  my ( $prev, @pos ) = ( 0 );
  open my $fh, '<', $_[0];
  ( push @pos, [ $prev+$BYTES, tell $fh, substr $_, 0, $BYTES ] )
    && ( $prev = tell $fh ) while <$fh>;
  open my $ofh, '>', $_[1];
  while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
    my $j = '';
    foreach( @pos ) {
      while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
        seek $fh, $_->[0], 0;    ## 0 = from start of file!
        read $fh, $_->[2],
          $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0],
          length $_->[2];
        $_->[0] += $BYTES;
      }
      $_->[2] =~ s{^([^,\r\n]+)[,\r\n]*}{};
      print {$ofh} $j, $1;
      $j ||= ',';
    }
    say {$ofh} '';
  }
}
```
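Before walking through the loops in the notes below, here is a tiny standalone illustration of `tell`, `seek` and the four-argument `read` used above - the file name and contents are made up purely for the demonstration:

``` perl
use strict;
use warnings;
use feature 'say';

my $file = 'demo-seek.txt';                ## throwaway demo file
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "alpha,beta,gamma,delta\n";
close $out;

open my $fh, '<', $file or die "Cannot read $file: $!";

## Read the first 6 bytes into a buffer ("alpha,")...
my $buffer = '';
read $fh, $buffer, 6;
say "buffer = '$buffer', position = ", tell $fh;   ## position = 6

## ...jump to byte 11 ("gamma") and append 5 more bytes onto the end
## of the buffer using the 4-argument read (offset = buffer length).
seek $fh, 11, 0;                           ## 0 = from start of file
read $fh, $buffer, 5, length $buffer;
say "buffer = '$buffer'";                  ## 'alpha,gamma'

close $fh;
unlink $file;
```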
### Notes on this code...

* `@pos` is an array of triples:
  * **`[0]`** - the current offset in the file for the next chunk of this line;
  * **`[1]`** - the end of the current line;
  * **`[2]`** - the buffer for this line (initially the first `$BYTES` bytes).
* Let's look at the three while loops...
* The push/while line slurping in the data. We use `while <$fh>` here rather than `foreach`, as `while` doesn't accumulate the data into an array, whereas `foreach <$fh>` does - saving memory.

``` perl
while( <$fh> ) {
  ( push @pos, [ $prev+$BYTES, tell $fh, substr $_, 0, $BYTES ] )
    && ( $prev = tell $fh );
}
```

* This uses a lot of ***one-liner*** tricks - first of all we push the start, the end and the first `$BYTES` bytes of each line into the `@pos` array. The `push` always evaluates to true, so we then run the second bracketed statement, which records the end of the current line in `$prev` - i.e. where the next line starts in the file.
* Now let us look at the 2nd while loop (outer)...

``` perl
while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
  ...
}
```

This checks that (a) we haven't run out of data still to be retrieved for the first row, and (b) we haven't run out of data already retrieved for it.

* Finally the 3rd while loop (inner)...

``` perl
while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
  seek $fh, $_->[0], 0;
  read $fh, $_->[2],
    $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0],
    length $_->[2];
  $_->[0] += $BYTES;
}
```

In this loop we check whether the row's buffer does not contain a comma AND there is still data left in the file for this row. If that is the case we have to retrieve more data from the file. We do this by first `seek`ing to the location in the file that we need to read from. We then retrieve either `$BYTES` bytes of data, or all the data left for the row {if that is less than `$BYTES` bytes}. We then update the offset for that particular row (by always adding `$BYTES` we can ignore the fact that we may have overshot the end of the line).

Note also that we use the 4-parameter version of read: `read $fh, $buffer, $bytes, $offset`. By supplying the offset we can append the new content onto the end of our buffer string. We have to use `length $_->[2]` here: you can use negative indices to write into the buffer at an offset from the end, but that only works for -1, -2 etc., not "-0".

* We then use the regex trick from 2b to get the first column of the data.
* Memory usage:
  * This script does not load the whole file in one go, so it needs a lot less memory (at the cost of more disc accesses). Memory use is linear in the number of lines, e.g. for the 1000-line file we load roughly 1Mb of data at a time, and the memory usage is roughly 1.3Mb.
  * Rather than pushing to an array for the output, we print immediately. We use a prefix `$j` which starts with the value `''` - so we don't print a comma before the first entry - and changes to `','` after the first column (the only time `$j` is "false"). Printing every entry doesn't hurt IO performance as the output is buffered until the print buffer is full, so both methods take the same time - except the array method uses subtly more memory.
  * Note this is `O(n)` in the number of rows: if the rows get longer the number of bytes held per line does not increase.
  * Having played a bit, the sweet spot for `$BYTES` lies somewhere between 1K and 2K. Smaller makes the regex that strips off each column more efficient, larger reduces the (read) file IO overhead.
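The benchmark files used in the next section are plain comma-separated files with a label row and a label column. The exact cell contents aren't shown here, so the random values below are an assumption - but a hypothetical generator along these lines produces files of roughly the right shape (the `in-$n.txt` names follow the file-size table below):

``` perl
use strict;
use warnings;
use feature 'say';

my $n = shift // 1000;    ## number of data rows/columns

open my $fh, '>', "in-$n.txt" or die "Cannot write in-$n.txt: $!";
say {$fh} join ',', 'id', map { "col$_" } 1 .. $n;    ## label row
for my $row ( 1 .. $n ) {
  say {$fh} join ',', "row$row", map { int rand 10_000 } 1 .. $n;
}
close $fh;
```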
### Some information about speed/memory etc...

The following are timings on a single-core, 2G RAM, 4G swap machine.

**Timings:** We list these in order of "memory consumption"...

| Method / size | Time (s)  | Memory (Kbytes) | Resident (Kbytes) | Shared (Kbytes) |
| ------------- | --------: | --------------: | ----------------: | --------------: |
| Seek small    | 0.000     | 16,024          | 7,836             | 5,228           |
| Regex small   | 0.000     | 16,024          | 7,836             | 5,228           |
| Split small   | 0.000     | 16,024          | 7,836             | 5,228           |
| Seek 1000     | 1.277     | 17,196          | 9,320             | 5,228           |
| Seek 2000     | 5.132     | 18,612          | 10,636            | 5,228           |
| Seek 5000     | 35.422    | 22,236          | 14,396            | 5,436           |
| Regex 1000    | 1.181     | 24,868          | 17,288            | 5,228           |
| Seek 30000    | 2,390.686 | 53,228          | 40,332            | 2,968           |
| Regex 2000    | 10.596    | 58,160          | 51,376            | 3,140           |
| Split 1000    | 1.054     | 103,620         | 93,100            | 3,204           |
| Regex 5000    | 128.589   | 258,056         | 248,016           | 3,204           |
| Split 2000    | 4.490     | 360,036         | 349,388           | 3,204           |
| Split 5000    | 598.668   | 2,151,664       | 1,423,468         | 2,764           |

The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (plus row/column labels).

As a guesstimate for the 30,000 x 30,000 case - for which the seek solution uses roughly 50Mb - the regex solution would use around 7GB of memory and the split method about 75GB... Both of these are more than the memory + swap of the machine I'm using!

**File sizes:**

| Name         |   Rows | Columns |       Size | Row size |
| ------------ | -----: | ------: | ---------: | -------: |
| in-small.txt |      5 |       3 |   61 bytes |       12 |
| in-1000.txt  |  1,001 |   1,001 | 6.6 Mbytes |     6.7K |
| in-2000.txt  |  2,001 |   2,001 |  27 Mbytes |    13.5K |
| in-5000.txt  |  5,001 |   5,001 | 165 Mbytes |    33.6K |
| in-30000.txt | 30,001 |  30,001 | 5.8 Gbytes |   201.0K |

If we look at the timings by method we can see that for the smaller files the `split` method is the most efficient {but the difference is relatively small}. As the file size increases it soon becomes the least efficient:

**Timing comparisons:** These are based on a single run of each script - when run multiple times the seek method for 2000 can be faster than the split method. The fastest script is in **bold**, the slowest in *italics*. Only the seek method is used for the largest file, as the split/regex methods cause the script to fail with an out-of-memory error.

| Size   |      Split |      Regex |          Seek |
| -----: | ---------: | ---------: | ------------: |
| small  |  **0.000** |      0.000 |       *0.000* |
| 1000   |  **0.934** |      1.293 |       *1.346* |
| 2000   |  **4.490** |    *9.040* |         5.132 |
| 5000   |  *527.614* |    130.411 |    **35.442** |
| 30000  |          - |          - | **2,390.686** |

The efficiency of the seek method for larger files is almost certainly down to the improved memory management.
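For completeness, a rough sketch of the kind of driver that could produce the wall-clock column above - it assumes the three `transpose_*` subs (and `$BYTES`) defined earlier are in the same file, and it does not capture the memory figures:

``` perl
use strict;
use warnings;
use feature 'say';
use Time::HiRes qw(gettimeofday tv_interval);

## Assumes transpose_split, transpose_regex and transpose_seek (and
## $BYTES) from the solutions above are defined in this file.
my ( $in, $out ) = @ARGV;

my %method = (
  split => \&transpose_split,
  regex => \&transpose_regex,
  seek  => \&transpose_seek,
);

foreach my $name (qw(split regex seek)) {
  my $t0 = [ gettimeofday ];
  $method{$name}->( $in, $out );
  say sprintf '%-5s : %9.3f s', $name, tv_interval $t0;
}
```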