diff options
| author | James Smith <js5@sanger.ac.uk> | 2021-04-27 11:34:46 +0100 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2021-04-27 11:34:46 +0100 |
| commit | d6e47833e1d56c4e3f698b04af5228b0157106b5 (patch) | |
| tree | a7625d6c25a643c1871fd6b39b5e47c3491bd3f7 | |
| parent | 0b20bcb44f62e12d33cc4080e45f93a47b0cddb6 (diff) | |
| download | perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.tar.gz perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.tar.bz2 perlweeklychallenge-club-d6e47833e1d56c4e3f698b04af5228b0157106b5.zip | |
First copy of "blog"
| -rw-r--r-- | challenge-110/james-smith/README.md | 230 |
1 files changed, 229 insertions, 1 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md index 3b736748ea..26fdc5d217 100644 --- a/challenge-110/james-smith/README.md +++ b/challenge-110/james-smith/README.md @@ -1 +1,229 @@ -Solutions by James Smith. +# Challenge 1 - valid phone numbers... + +You are given a text file - Write a script to display all valid phone numbers in the given text file. + +## Solution 1 + +This is what Perl was designed for - pattern matching strings. The format we accept numbers is one of: + +``` + +nn nnnnnnnnnn + (nn) nnnnnnnnnn + nnnn nnnnnnnnnn +``` + +We can write this as a single regex - there are two parts to the pattern - the prefix and the number +itself. The latter is just a simple number match {we don't specify length as in some countries this +is of different lengths}. + +We group the three prefix patterns into a group match with `(`s and `|`s - remembering to add `?:` so +we save memory by not storing the match. + +We wrap this regex in a function call: +``` perl +sub is_valid_phone_number { + return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z}; +} +``` +We can then just use this to grep over the lines of the file.... + +``` perl +print grep { is_valid_phone_number($_) } <>; +``` + +note we use `print` rather than `say` as we haven't stripped the return character from the line! + +# Challenge 2 - transposing a file + +You are given a text file. Write a script to transpose the contents of the given file. + +## Some background - I've seen this before! + +Now this is a problem I have come across before - I work at a genomic institute which handles large +amounts of data - some of these are in the forms of very large tab or comma separated value files +containing many 1000s of rows and many 1000s of columns, e.g. gene expression vs samples, genes vs +variations. These files can be 10G or larger in size.... + +Often these files are generated by automatic processes - but when you want to perform the next step +you would rather work down columns rather than across rows...! + +The problem as was posed on a Python mailing list is that this was taking for ever to run.... +Firstly moving back to python 2 helped, but unless the scripts were run on big memory machine {64G+} +then it was still taking a long time... + +As a non-python developer on the list - I looked to see if this could be easily achieved in Perl, to +show the pythonistas that actually Perl was still a better language for this sort of thing. + +Investigating the problem I realised that the method they were using was a slurp and print model.... +The problem with that for such large files was memory. Once slurped, chopped etc the machine was +swapping OR running out of memory. So had to come up with a cleaner script... I will outline 3 +methods of performing this + +## Solution 2a - The simplest solution - load in and split into arrays of arrays. + +The simplest solution is to slurp the data in the file into an array of array(ref)s. + +This is just a case of a `map` which splits the line into it's chunks - we use chomp +to remove any unneeded line endings. + +Once we have an array of arrays we take the first entry out of each row and printing +these in one line. We repeat this until we reach the end of the lines... Again this +is a simple one liner..... + +``` perl +sub transpose_split { + open my $fh, '<', $_[0]; + my @in = map { chomp;[ split /,/ ] } <$fh>; ## Slurp into array + close $fh; + open $fh, '>', $_[1]; + say {$fh} join ',', map {shift @{$_} } @in while @{$in[0]}; ## Generate transpose; + close $fh; +} +``` + +For an 1000 x 1000 file (6.75Mb), this uses approximately 90Mb of memory. This is +approximately 12-15 x space multiplier.. + +## Solution 2b - Load in but don't split into arrays + +This time we want to look at how we can save memory - one of the problems with the +solution above is the huge overhead of storing data in an array. + +To resolve this we can just slurp all the data into the array, and then use a regex +to strip the characters up to the `,` or "end of line" and output them, until the +strings are empty. + +``` perl +sub transpose_regex { + open my $fh, '<', $_[0]; + my @in = <$fh>; ## Slurp into array + close $fh; + open $fh, '>', $_[1]; + say {$fh} join ',', map { s{^(.*?)[,\r\n]+}{}; $1 } @in while $in[0]; + close $fh; +} +``` +This is more efficient - for the 1000 x 1000 example it only uses approx 9Mb of +memory this is on only about a 1.33 x space multiplier... This still becomes more +of an issue as the file size gets larger + +## Solution 2c - The ultimate large file solution - using `tell`, `seek` and `read`; + +Finally a solution for "very" large-files... + +Rather than slurping all the data into memory in one go - we instead pull content of +each of the lines a few bytes at a time, working with that - and going back to the +file system to retrieve more data. + +Perl has some simple functions for moving the "head" around in the file: + + * `tell` - return the current location in the file + * `seek` - set the current location in the file. + +Additionally it has another way of accessing the file content rather than `<$fh>`, +by using the lower level `read` command. This reads contents for the file and +stores it into a scalar variable. + +So this requires a "double" pass solution - first of all we use `<$fh>` to retrieve +every line of data from the file, noting the start and end of the line {in byte offsets} and +remebering the first `$BYTES` bytes from that line {in this example we set this to 1000 bytes}. + +We then use a similar approach to the previous one - where we use a regex to find the value +of the first column. But if we can't find a comma OR the end of the line - we retrieve another +`$BYTES` bytes of data to the string (until we can find a comma or the end of the line). + +If we get to the end of the line we only retrieve the number of bytes left in that line. + +``` perl +sub transpose_seek { + my($prev,@pos) = (0); + open my $fh, '<', $_[0]; + open my $ofh, '>', $_[1]; + ## Loop through the file and get the start/end of each line + push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) && + ( ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]), $prev=tell $fh) while <$fh>; + + while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) { + my @line; + foreach(@pos) { + while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) { + seek $fh, $_->[0], 0; + read $fh, $_->[2], $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0], length $_->[2]; + $_->[0] = tell $fh; + } + $_->[2] =~ s{^([^,\r\n]+)[,\r\n]*}{}; + push @line, $1; + } + say ${$ofh} join q(,), @line; + } +} +``` + +### Notes on this code... + + * `@pos` is an array of triples: + + * **`[0]`** - the current offset in the file for the next chunk of this line; + + * **`[1]`** - the end of the current line; + + * **`[2]`** - the buffer for this line (initially the first `$BYTES` bytes) + + * Let's look at the three while loops.... + + * The push/while line slurping in data + + We user `while <$fh>` here rather than `foreach` as `while` doesn't accumulate + data into an array, where as `foreach <$fh>` does, saving memory. + +``` perl +while( <$fh> ) { + push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) && + ( + ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]), + $prev=tell $fh + ) +} +``` + This uses a lot of ***1-liner*** tricks - first of all we push the start, end and first + `$BYTES` bytes into the @pos array. This always evaluates to true, so we run the second + bracketed statement. This (a) sets the start of the next block to the end of this current + line if we have overshot, and then (b) updates the `$prev` value so we know where the + next line starts in the file.... + + * Now let us look at the 2nd while loop (outer)... + +``` perl + while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) { + + } +``` + This is checking that (a) we haven't run out of data to be retrieved for the first and + (b) haven't run out of data we have already retrieved. + + + * Finally the 3rd while loop (inner)... + +``` perl + while( $_->[2] !~ m{,} && $_->[0] < $_->[1] ) { + seek $fh, $_->[0], 0; + read $fh, $_->[2], $_->[1]-$_->[0] > $BYTES ? $BYTES : $_->[1]-$_->[0], length $_->[2]; + $_->[0] = tell $fh; + } +``` + In this loop we see if the row does not contain a comma AND there is data left... If this is the + case we have to retrieve more data from the file. We do this by first `seek`ing to the location + in the file that we need to get data from. We then retrieve the either $BYTES `bytes` of data (or + all the data left for the row {if it is less than `$BYTES` bytes.} + We then update the location for that particular row (using `tell`). + + Note also we use the 4 parameter version of read. + + `read $fh, $buffer, $bytes, $offset` + + By adding the offset - we can easily append this content onto the end of our buffer string. We have + to use length `$_->[2]` as you can use -ve indecies to read into the buffer with an offset from the + end - but this only works for -1, -2 etc not "-0". + + * We then use the regex trick to get the first column of the data. + |
