| author | drbaggy <js5@sanger.ac.uk> | 2021-04-28 00:48:16 +0100 |
|---|---|---|
| committer | drbaggy <js5@sanger.ac.uk> | 2021-04-28 00:48:16 +0100 |
| commit | c4f6f131cf58633c068d07d8888abfe7ffc7000a | |
| tree | 3521b6bb23a3bcb81204e78c08256b2e67558b8f | |
| parent | 6ab10bd9b8703390dbfb1b058cbed79844b87984 | |
Update README.md
| -rw-r--r-- | challenge-110/james-smith/README.md | 100 |
1 files changed, 66 insertions, 34 deletions
diff --git a/challenge-110/james-smith/README.md b/challenge-110/james-smith/README.md
index 8e08f1b553..c9e2e1d24b 100644
--- a/challenge-110/james-smith/README.md
+++ b/challenge-110/james-smith/README.md
@@ -1,3 +1,5 @@
+# Perl Weekly Challenge #110
+
 # Challenge 1 - valid phone numbers...
 
 You are given a text file - Write a script to display all valid phone numbers in the given text file.
@@ -20,11 +22,33 @@ We group the three prefix patterns into a group match with `(`s and `|`s - remem
 we save memory by not storing the match.
 
 We wrap this regex in a function call:
+
 ``` perl
 sub is_valid_phone_number {
   return m{\A\s*(?:[+]\d+|00\d+|[(]\d+[)])\s+\d+\s*\Z};
 }
 ```
+
+or "commented" using the "x" modifier...
+
+``` perl
+sub is_valid_phone_number {
+  return m{
+    \A             # Start of line
+    \s*            # Possibly white-space
+    (?:            # Prefix - one of:
+      [+]\d+    |  # +{digits}
+      00\d+     |  # 00{digits}
+      [(]\d+[)]    # ({digits})
+    )
+    \s+            # Some white-space
+    \d+            # String of numbers
+    \s*            # Possibly white-space
+    \Z             # End of line
+  }x;
+}
+```
+
 We can then just use this to grep over the lines of the file....
 
 ``` perl
@@ -57,7 +81,7 @@ show the pythonistas that actually Perl was still a better language for this sor
 Investigating the problem I realised that the method they were using was a slurp and print model....
 The problem with that for such large files was memory. Once slurped, chopped etc the machine was
 swapping OR running out of memory. So had to come up with a cleaner script... I will outline 3
-methods of performing this
+methods of performing this.
 
 ## Solution 2a - The simplest solution - load in and split into arrays of arrays.
 
@@ -72,11 +96,13 @@ is a simple one liner.....
 ``` perl
 sub transpose_split {
+  ## Slurp into array
   open my $fh, '<', $_[0];
-  my @in = map { chomp;[ split /,/ ] } <$fh>; ## Slurp into array
+  my @in = map { chomp;[ split /,/ ] } <$fh>;
   close $fh;
+  ## Generate transpose;
   open $fh, '>', $_[1];
-  say {$fh} join ',', map {shift @{$_} } @in while @{$in[0]}; ## Generate transpose;
+  say {$fh} join ',', map {shift @{$_} } @in while @{$in[0]};
   close $fh;
 }
 ```
@@ -95,9 +121,11 @@ strings are empty.
 
 ``` perl
 sub transpose_regex {
+  ## Slurp into array
   open my $fh, '<', $_[0];
-  my @in = <$fh>; ## Slurp into array
+  my @in = <$fh>;
   close $fh;
+  ## Generate transpose;
   open $fh, '>', $_[1];
   say {$fh} join ',', map { s{^(.*?)[,\r\n]+}{}; $1 } @in while $in[0];
   close $fh;
@@ -139,10 +167,14 @@ sub transpose_seek {
   my($prev,@pos) = (0);
   open my $fh, '<', $_[0];
   open my $ofh, '>', $_[1];
-  ## Loop through the file and get the start/end of each line
+  ## Loop through the file and get the start/end position of each line,
+  ## and the first $BYTES characters of each line...
   push ( @pos, [$prev+$BYTES,tell $fh,substr $_,0,$BYTES]) &&
        ( ($pos[-1][0]>$pos[-1][1]) && ($pos[-1][0]=$pos[-1][1]), $prev=tell $fh)
     while <$fh>;
+  ## While we still have "columns" loop through each row and grab the first
+  ## entry and output results.
+
   while( $pos[0][0] < $pos[0][1] || length $pos[0][2] ) {
     my @line;
     foreach(@pos) {
@@ -241,42 +273,42 @@ sub transpose_seek {
 
 The following are timings on a single core, 2G RAM, 4G swap machine:
 
-| Method/size | Time (s) | Kbytes | resident | shared |
-| ----------- | -------: | -----: | -------: | -----: |
-| Seek small | 0.001 | 16016| 7836| 5228 |
-| Regex small | 0.000 | 16016| 7836| 5228 |
-| Split small | 0.000 | 16016| 7836| 5228 |
-| Seek 1000 | 1.346 | 17388| 9320| 5228 |
-| Seek 2000 | 5.841 | 18848| 10636| 5228 |
-| Seek 5000 | 54.208 | 23044| 14972| 5228 |
-| Regex 1000 | 1.293 | 25492| 17288| 5228 |
-| Seek 30000 | 3003.220 | 57312| 43948| 2720 |
-| Regex 2000 | 9.040 | 63896| 51376| 3140 |
-| Split 1000 | 0.934 | 105784| 93100| 3204 |
-| Regex 5000 | 130.411 | 260432| 248016| 3204 |
-| Split 2000 | 6.780 | 362028| 349388| 3204 |
-| Split 5000 | 527.614 | 2153576| 1423468| 2764 |
+| Method/size | Time (s) | Kbytes  | resident | shared |
+| ----------- | -------: |  -----: | -------: | -----: |
+| Seek small  |    0.001 |   16016 |     7836 |   5228 |
+| Regex small |    0.000 |   16016 |     7836 |   5228 |
+| Split small |    0.000 |   16016 |     7836 |   5228 |
+| Seek 1000   |    1.346 |   17388 |     9320 |   5228 |
+| Seek 2000   |    5.841 |   18848 |    10636 |   5228 |
+| Seek 5000   |   54.208 |   23044 |    14972 |   5228 |
+| Regex 1000  |    1.293 |   25492 |    17288 |   5228 |
+| Seek 30000  | 3003.220 |   57312 |    43948 |   2720 |
+| Regex 2000  |    9.040 |   63896 |    51376 |   3140 |
+| Split 1000  |    0.934 |  105784 |    93100 |   3204 |
+| Regex 5000  |  130.411 |  260432 |   248016 |   3204 |
+| Split 2000  |    6.780 |  362028 |   349388 |   3204 |
+| Split 5000  |  527.614 | 2153576 |  1423468 |   2764 |
 
 The size is the number of rows/columns - so the "1000" file has 1000 rows and 1000 columns (+row/column labels).
 File sizes:
 
-| name | size | row size |
-| ----- | -----: | ----: |
-| small | 61 bytes | 12 |
-| 1000 | 6.6 Mbytes | 6.7K |
-| 2000 | 27 Mbytes | 13.5K |
-| 5000 | 165 Mbytes | 33.6K |
-| 30000 | 5.8 Gbytes | 201.0K |
+| name         | size       | row size |
+| ------------ | ---------: | -------: |
+| in-small.txt |   61 bytes |       12 |
+| in-1000.txt  | 6.6 Mbytes |     6.7K |
+| in-2000.txt  |  27 Mbytes |    13.5K |
+| in-5000.txt  | 165 Mbytes |    33.6K |
+| in-30000.txt | 5.8 Gbytes |   201.0K |
 
 If we look at the timings by method we can see that for the smaller files the `split` is the most
 efficient {but the difference is relatively small}. But as the file size increases then it soon
 becomes the least efficient:
 
-| Size | Split | Regex | Seek |
-| -----: | ----: | ----: | ----: |
-| small | **0.000** | 0.000 | *0.001* |
-| 1000 | **0.934** | 1.293 | *1.346* |
-| 2000 | 6.890 | *9.040* | **5.841** |
-| 5000 | *527.614* | 130.411 | **54.208** |
-| 30000 | - | - | **3003.220** |
+| Size   | Split       | Regex       | Seek         |
+| -----: | ----------: | ----------: | -----------: |
+| small  |   **0.000** |       0.000 |      *0.001* |
+| 1000   |   **0.934** |       1.293 |      *1.346* |
+| 2000   |       6.890 |     *9.040* |    **5.841** |
+| 5000   |   *527.614* |     130.411 |   **54.208** |
+| 30000  |           - |           - | **3003.220** |
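
For readers who want to try the split-based approach timed in the tables above without creating input files, here is a minimal sketch of the same shift-per-row idea working on an in-memory list of CSV lines. The helper name `transpose_lines` and the sample data are hypothetical and not part of the README; the sketch assumes a rectangular table with no empty trailing fields.

```perl
use strict;
use warnings;
use feature 'say';

## Hypothetical helper (not in the README): the same idea as
## transpose_split, but on a list of CSV lines held in memory.
## Assumes every line has the same number of fields.
sub transpose_lines {
  return unless @_;
  ## Split each line into an array of fields
  my @in = map { my @fields = split /,/; \@fields } @_;
  my @out;
  ## Repeatedly shift the first field off every row; each pass
  ## yields one row of the transposed table.
  push @out, join ',', map { shift @{$_} } @in while @{$in[0]};
  return @out;
}

## Illustrative sample data only
say for transpose_lines( 'a,b,c', '1,2,3' );
```

Because every row is held as an array of fields, memory grows with the whole table, which is consistent with the split method having the largest footprint in the timings above.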
