# Perl Weekly Challenge #110

# Challenge 1 - valid phone numbers...

You are given a text file. Write a script to display all valid phone numbers in the given text file.

## Solution 1

This is what Perl was designed for - pattern matching strings. The formats we accept are one of:

```
+nn nnnnnnnnnn
(nn) nnnnnnnnnn
nnnn nnnnnnnnnn
```

We can write this as a single regex. There are two parts to the pattern - the prefix and the number itself. The latter is just a simple digit match {we don't specify a length as this differs between countries}. We group the three prefix patterns into a single alternation with `(`s and `|`s - remembering to add `?:` so we save memory by not storing the match. We wrap this regex in a function call:

``` perl
sub is_valid_phone_number {
  return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
}
```

or "commented" using the `x` modifier...

``` perl
sub is_valid_phone_number {
  return m{
    \A            # Start of line
    \s*           # Possibly white-space
    (?:           # Prefix - one of:
      [+]\d+    | #   +{digits}
      00\d+     | #   00{digits}
      [(]\d+[)]   #   ({digits})
    )
    \s+           # Some white-space
    \d+           # String of digits
    \s*           # Possibly white-space
    \Z            # End of line
  }x;
}
```

We can then just use this to grep over the lines of the file...

``` perl
print grep { is_valid_phone_number($_) } <>;
```

Note we use `print` rather than `say` as we haven't stripped the return character from the line!
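To see this in action, here is a small self-contained test harness - it inlines a slightly more explicit variant of the sub (matching against its argument rather than `$_`), and the sample lines in the `__DATA__` block are invented purely for illustration:

``` perl
use strict;
use warnings;

## Explicit variant of the sub above: match against the argument.
sub is_valid_phone_number {
  return $_[0] =~ m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
}

## Prints the first, second and fourth sample lines only.
print grep { is_valid_phone_number($_) } <DATA>;

__DATA__
 0044 1148820341
   +44 1148820341
 44-11-4882-0341
(44) 1148820341
  00 1148820341
```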
# Challenge 2 - transposing a file

You are given a text file. Write a script to transpose the contents of the given file.

## Some background - I've seen this before!

This is a problem I have come across before - I work at a genomic institute which handles large amounts of data, some of it in the form of very large tab- or comma-separated value files containing many 1000s of rows and many 1000s of columns, e.g. gene expression vs samples, genes vs variations. These files can be 10G or larger in size...

Often these files are generated by automatic processes - but when you want to perform the next step you would rather work down columns than across rows...!

The problem, as posed on a Python mailing list, was that this was taking forever to run... Firstly, moving back to Python 2 helped, but unless the scripts were run on a big-memory machine {64G+} it still took a long time...

As a non-Python developer on the list, I looked to see if this could be easily achieved in Perl, to show the Pythonistas that Perl was still a better language for this sort of thing. Investigating the problem I realised that the method they were using was a slurp-and-print model... The problem with that for such large files was memory. Once slurped, chopped etc. the machine was either swapping or running out of memory. So I had to come up with a cleaner script...

I will outline three methods of performing this.

## Solution 2a - The simplest solution - load in and split into arrays of arrays

The simplest solution is to slurp the data from the file into an array of array(ref)s. This is just a case of a `map` which splits each line into its chunks - we use `chomp` to remove any unneeded line endings. Once we have an array of arrays we take the first entry out of each row and print these as one line, repeating until we run out of columns... Again this is a simple one-liner...

``` perl
sub transpose_split {
  ## Slurp into an array of arrays
  open my $fh, '<', $_[0];
  my @in = map { chomp; [ split /,/ ] } <$fh>;
  close $fh;
  ## Generate transpose
  open $fh, '>', $_[1];
  say {$fh} join ',', map { shift @{$_} } @in while @{$in[0]};
  close $fh;
}
```

For a 1000 x 1000 file (6.75Mb), this uses approximately 90Mb of memory - roughly a 12-15x space multiplier.

## Solution 2b - Load in but don't split into arrays

This time we want to look at how we can save memory - one of the problems with the solution above is the huge overhead of storing the data in arrays. To resolve this we just slurp the lines into an array of strings, and then use a regex to strip off the characters up to the `,` or end of line and output them, until the strings are empty.

``` perl
sub transpose_regex {
  ## Slurp into an array of lines
  open my $fh, '<', $_[0];
  my @in = <$fh>;
  close $fh;
  ## Generate transpose
  open $fh, '>', $_[1];
  say {$fh} join ',', map { s{^(.*?)[,\r\n]+}{}; $1 } @in while $in[0];
  close $fh;
}
```

This is more efficient - for the 1000 x 1000 example it only uses approximately 9Mb of memory, which is only about a 1.33x space multiplier... Memory still becomes more of an issue as the file size gets larger, though.

## Solution 2c - The ultimate large-file solution - using `tell`, `seek` and `read`

Finally a solution for "very" large files... Rather than slurping all the data into memory in one go, we instead pull the content of each line a few bytes at a time, work with that, and go back to the file system to retrieve more data as needed.

Perl has some simple functions for moving the "head" around in a file:

* `tell` - return the current location in the file;
* `seek` - set the current location in the file.

Additionally it has another way of accessing the file content rather than `<$fh>`: the lower-level `read` command. This reads content from the file and stores it in a scalar variable.

So this requires a "double pass" solution. First of all we use `<$fh>` to retrieve every line of the file, noting the start and end of each line {in byte offsets} and remembering the first `$BYTES` bytes of that line {in this example we set `$BYTES` to 1000}.

We then use a similar approach to the previous solution, using a regex to find the value of the first column. But if we can't find a comma OR the end of the line, we append another `$BYTES` bytes of data to the string (until we can find a comma or the end of the line). If we get to the end of the line we only retrieve the number of bytes left in that line.

``` perl
sub transpose_seek {
  my ( $prev, @pos ) = ( 0 );
  open my $fh, '<', $_[0];
  ( push @pos, [ $prev+$BYTES, tell $fh, substr $_, 0, $BYTES ] )
    && ( $prev = tell $fh ) while <$fh>;
  open my $ofh, '>', $_[1];
  while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
    my $j = '';
    foreach( @pos ) {
      while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
        seek $fh, $_->[0], 0;    ## 0 = from start of file!
        read $fh, $_->[2],
          $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0],
          length $_->[2];
        $_->[0] += $BYTES;
      }
      $_->[2] =~ s{^([^,\r\n]+)[,\r\n]*}{};
      print {$ofh} $j, $1;
      $j ||= ',';
    }
    say {$ofh} '';
  }
}
```
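Before walking through the loops in the notes below, here is a tiny standalone illustration of `tell`, `seek` and the four-argument `read` used above - the file name and contents are made up purely for the demonstration:

``` perl
use strict;
use warnings;
use feature 'say';

my $file = 'demo-seek.txt';                ## throwaway demo file
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "alpha,beta,gamma,delta\n";
close $out;

open my $fh, '<', $file or die "Cannot read $file: $!";

## Read the first 6 bytes into a buffer ("alpha,")...
my $buffer = '';
read $fh, $buffer, 6;
say "buffer = '$buffer', position = ", tell $fh;   ## position = 6

## ...jump to byte 11 ("gamma") and append 5 more bytes onto the end
## of the buffer using the 4-argument read (offset = buffer length).
seek $fh, 11, 0;                           ## 0 = from start of file
read $fh, $buffer, 5, length $buffer;
say "buffer = '$buffer'";                  ## 'alpha,gamma'

close $fh;
unlink $file;
```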
### Notes on this code...

* `@pos` is an array of triples:
  * **`[0]`** - the current offset in the file for the next chunk of this line;
  * **`[1]`** - the end of the current line;
  * **`[2]`** - the buffer for this line (initially the first `$BYTES` bytes).
* Let's look at the three while loops...
* The push/while line slurping in the data. We use `while <$fh>` here rather than `foreach`, as `while` doesn't accumulate the data into an array, whereas `foreach <$fh>` does - saving memory.

``` perl
while( <$fh> ) {
  ( push @pos, [ $prev+$BYTES, tell $fh, substr $_, 0, $BYTES ] )
    && ( $prev = tell $fh );
}
```

* This uses a lot of ***one-liner*** tricks - first of all we push the start, the end and the first `$BYTES` bytes of each line into the `@pos` array. The `push` always evaluates to true, so we then run the second bracketed statement, which records the end of the current line in `$prev` - i.e. where the next line starts in the file.
* Now let us look at the 2nd while loop (outer)...

``` perl
while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
  ...
}
```

This checks that (a) we haven't run out of data still to be retrieved for the first row, and (b) we haven't run out of data already retrieved for it.

* Finally the 3rd while loop (inner)...

``` perl
while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) {
  seek $fh, $_->[0], 0;
  read $fh, $_->[2],
    $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0],
    length $_->[2];
  $_->[0] += $BYTES;
}
```

In this loop we check whether the row's buffer does not contain a comma AND there is still data left in the file for this row. If that is the case we have to retrieve more data from the file. We do this by first `seek`ing to the location in the file that we need to read from. We then retrieve either `$BYTES` bytes of data, or all the data left for the row {if that is less than `$BYTES` bytes}. We then update the offset for that particular row (by always adding `$BYTES` we can ignore the fact that we may have overshot the end of the line).

Note also that we use the 4-parameter version of read: `read $fh, $buffer, $bytes, $offset`. By supplying the offset we can append the new content onto the end of our buffer string. We have to use `length $_->[2]` here: you can use negative indices to write into the buffer at an offset from the end, but that only works for -1, -2 etc., not "-0".

* We then use the regex trick from 2b to get the first column of the data.
* Memory usage:
  * This script does not load the whole file in one go, so it needs a lot less memory (at the cost of more disc accesses). Memory use is linear in the number of lines, e.g. for the 1000-line file we load roughly 1Mb of data at a time, and the memory usage is roughly 1.3Mb.
  * Rather than pushing to an array for the output, we print immediately. We use a prefix `$j` which starts with the value `''` - so we don't print a comma before the first entry - and changes to `','` after the first column (the only time `$j` is "false"). Printing every entry doesn't hurt IO performance as the output is buffered until the print buffer is full, so both methods take the same time - except the array method uses subtly more memory.
  * Note this is `O(n)` in the number of rows: if the rows get longer the number of bytes held per line does not increase.
  * Having played a bit, the sweet spot for `$BYTES` lies somewhere between 1K and 2K. Smaller makes the regex that strips off each column more efficient, larger reduces the (read) file IO overhead.
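The benchmark files used in the next section are plain comma-separated files with a label row and a label column. The exact cell contents aren't shown here, so the random values below are an assumption - but a hypothetical generator along these lines produces files of roughly the right shape (the `in-$n.txt` names follow the file-size table below):

``` perl
use strict;
use warnings;
use feature 'say';

my $n = shift // 1000;    ## number of data rows/columns

open my $fh, '>', "in-$n.txt" or die "Cannot write in-$n.txt: $!";
say {$fh} join ',', 'id', map { "col$_" } 1 .. $n;    ## label row
for my $row ( 1 .. $n ) {
  say {$fh} join ',', "row$row", map { int rand 10_000 } 1 .. $n;
}
close $fh;
```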
### Some information about speed/memory etc...

The following are timings on a single-core, 2G RAM, 4G swap machine.

**Timings:** We list these in order of "memory consumption"...

| Method / size | Time (s)  | Memory (Kbytes) | Resident (Kbytes) | Shared (Kbytes) |
| ------------- | --------: | --------------: | ----------------: | --------------: |
| Seek small    | 0.000     | 16,024          | 7,836             | 5,228           |
| Regex small   | 0.000     | 16,024          | 7,836             | 5,228           |
| Split small   | 0.000     | 16,024          | 7,836             | 5,228           |
| Seek 1000     | 1.277     | 17,196          | 9,320             | 5,228           |
| Seek 2000     | 5.132     | 18,612          | 10,636            | 5,228           |
| Seek 5000     | 35.422    | 22,236          | 14,396            | 5,436           |
| Regex 1000    | 1.181     | 24,868          | 17,288            | 5,228           |
| Seek 30000    | 2,390.686 | 53,228          | 40,332            | 2,968           |
| Regex 2000    | 10.596    | 58,160          | 51,376            | 3,140           |
| Split 1000    | 1.054     | 103,620         | 93,100            | 3,204           |
| Regex 5000    | 128.589   | 258,056         | 248,016           | 3,204           |
| Split 2000    | 4.490     | 360,036         | 349,388           | 3,204           |
| Split 5000    | 598.668   | 2,151,664       | 1,423,468         | 2,764           |

The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (plus row/column labels).

As a guesstimate for the 30,000 x 30,000 case - for which the seek solution uses roughly 50Mb - the regex solution would use around 7GB of memory and the split method about 75GB... Both of these are more than the memory + swap of the machine I'm using!

**File sizes:**

| Name         |   Rows | Columns |       Size | Row size |
| ------------ | -----: | ------: | ---------: | -------: |
| in-small.txt |      5 |       3 |   61 bytes |       12 |
| in-1000.txt  |  1,001 |   1,001 | 6.6 Mbytes |     6.7K |
| in-2000.txt  |  2,001 |   2,001 |  27 Mbytes |    13.5K |
| in-5000.txt  |  5,001 |   5,001 | 165 Mbytes |    33.6K |
| in-30000.txt | 30,001 |  30,001 | 5.8 Gbytes |   201.0K |

If we look at the timings by method we can see that for the smaller files the `split` method is the most efficient {but the difference is relatively small}. As the file size increases it soon becomes the least efficient:

**Timing comparisons:** These are based on a single run of each script - when run multiple times the seek method for 2000 can be faster than the split method. The fastest script is in **bold**, the slowest in *italics*. Only the seek method is used for the largest file, as the split/regex methods cause the script to fail with an out-of-memory error.

| Size   |      Split |      Regex |          Seek |
| -----: | ---------: | ---------: | ------------: |
| small  |  **0.000** |      0.000 |       *0.000* |
| 1000   |  **0.934** |      1.293 |       *1.346* |
| 2000   |  **4.490** |    *9.040* |         5.132 |
| 5000   |  *527.614* |    130.411 |    **35.442** |
| 30000  |          - |          - | **2,390.686** |

The efficiency of the seek method for larger files is almost certainly down to the improved memory management.
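For completeness, a rough sketch of the kind of driver that could produce the wall-clock column above - it assumes the three `transpose_*` subs (and `$BYTES`) defined earlier are in the same file, and it does not capture the memory figures:

``` perl
use strict;
use warnings;
use feature 'say';
use Time::HiRes qw(gettimeofday tv_interval);

## Assumes transpose_split, transpose_regex and transpose_seek (and
## $BYTES) from the solutions above are defined in this file.
my ( $in, $out ) = @ARGV;

my %method = (
  split => \&transpose_split,
  regex => \&transpose_regex,
  seek  => \&transpose_seek,
);

foreach my $name (qw(split regex seek)) {
  my $t0 = [ gettimeofday ];
  $method{$name}->( $in, $out );
  say sprintf '%-5s : %9.3f s', $name, tv_interval $t0;
}
```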