diff options
| author | drbaggy <js5@sanger.ac.uk> | 2021-05-13 09:53:04 +0100 |
|---|---|---|
| committer | drbaggy <js5@sanger.ac.uk> | 2021-05-13 09:53:04 +0100 |
| commit | a55ad5eaa86651991f89035202fc71e8eacdc931 (patch) | |
| tree | 27ff02cbe59322a8aba92b697258d20bb6a91052 /challenge-112/james-smith | |
| parent | 04dcd9b878bf160b94752ea11b92c4dd7451de1d (diff) | |
| download | perlweeklychallenge-club-a55ad5eaa86651991f89035202fc71e8eacdc931.tar.gz perlweeklychallenge-club-a55ad5eaa86651991f89035202fc71e8eacdc931.tar.bz2 perlweeklychallenge-club-a55ad5eaa86651991f89035202fc71e8eacdc931.zip | |
Update README.md
Diffstat (limited to 'challenge-112/james-smith')
| -rw-r--r-- | challenge-112/james-smith/README.md | 807 |
1 files changed, 303 insertions, 504 deletions
diff --git a/challenge-112/james-smith/README.md b/challenge-112/james-smith/README.md index 71e2905051..40f82a0eec 100644 --- a/challenge-112/james-smith/README.md +++ b/challenge-112/james-smith/README.md @@ -1,4 +1,4 @@ -# Perl Weekly Challenge #111 +# Perl Weekly Challenge #112 You can find more information about this weeks, and previous weeks challenges at: @@ -8,592 +8,391 @@ If you are not already doing the challenge - it is a good place to practise your **perl** or **raku**. If it is not **perl** or **raku** you develop in - you can submit solutions in whichever language you feel comfortable with. -# Challenge 1 - Search Matrix +# Challenge 1 - Canonical Path -**You are given 5x5 matrix filled with integers such that each row is sorted from left to -right and the first integer of each row is greater than the last integer of the previous -row.** +**You are given a string path, starting with a slash ‘/'. Write a script to +convert the given absolute path to the simplified canonical path.** -**Write a script to find a given integer in the matrix using an efficient search algorithm.** +In a Unix-style file system: -## Solution to challenge 1 - -Today's is an interesting challenge ... to check to see if a value is in the matrix.... + * A period '.' refers to the current directory + * A double period '..' refers to the directory up a level + * Multiple consecutive slashes ('//') are treated as a single slash '/' -Initially this looks like we need to define some efficient sorting algorithm that flattens -the array and then searches it {e.g. binary search}. +The canonical path format: -We implement this... but we compare it against some alternative simple methods. We don't -get the result we expect... + * The path starts with a single slash '/'. + * Any two directories are separated by a single slash '/'. + * The path does not end with a trailing '/'. + * The path only contains the directories on the path from the root directory to the target file or directory -## Binary search +## Note.... -We first flatten the array. Then we loop until the array is empty or it has a solitary value +Please note there is an ambiguity in the question - when then path contains no +files - as it cannot start with a '/' and not end with a '/' - so we have +to make a choice do we return '/' or do we return ''. - If we get a match with the "middle" value we return 1; - Otherwise we remove the half of the list using `splice` which - the number isn't in. - -We return 0 if the array has length 0. If the length is 1 we return -1 or 0 depending on whether the array is search value or not. +In our case we decide to return it as the empty string. +This has the advantage that there is a level of consistency if you do... -```perl -sub find_val_search { - my( $val, $m, @list ) = ( $_[0], 0, map { @{$_} } @{$_[1]} ); - - $list[ $m = @list >> 1 ] == $val ? ( return 1 ) - : $list[ $m ] > $val ? ( splice @list, $m ) - : ( splice @list, 0, $m+1 ) - while @list>1; - return @list && $list[0] == $val ? 1 : 0; -} -``` +$parent_dir.canonical_path('/a'); -## The other (better) methods: - -### grep on the flattened array - -```perl -sub find_val_grep_map { - my($v,$m)=@_; - return 0 + grep { $_ == $v } map { @{$_} } @{$m}; -} -``` +or -### reducing array with grep and then combining with map +$parent_dir.canonical_path('/'); -Uses `grep` on each sub array to return either an -empty array or an array containing the match. -We then use `grep` to further filter these results -for existances of the array. - -The resultant array will have length `0` or `1`. - -```perl -sub find_val_grep_grep { - my($v,$m)=@_; - return 0 + grep { grep { $_ == $v } @{$_} } @{$m}; -} -``` - -Now to avoid as many of the `grep`s as possible we can do two things. First check the -number is in the range of the matrix - this skips the outer `grep` for some values and -secondly - check the value is in the range for each line you are looping over.... This -leads us to: - -```perl -sub find_val_grep_grep_ext { - my($v,$m)=@_; - return $v<$m->[0][0] || $v > $m->[-1][-1] - ? 0 - : 0 + grep { $v>=$_->[0] && $v<=$_->[-1] && grep { $_ == $v } @{$_} } @{$m}; -} -``` -## Efficiency +then it will always end without a "/"; -We use `cmpthese` to compare the performance... +To change the value the functions return you can replace the return +statement with either `return q(/). join q(/), @list` or +`return $str || q(/)`, depending on whether or not the function +stores the path elements in an array or a string. -Our methods are: - * **Search**: `find_val_search` - * **GrepMap**: `find_val_grep_map` - * **GrepGrep**: `find_val_grep_grep` - * **GrepExt**: `find_val_grep_grep_ext` - * **Flatten**: just flattens the array - basic benchmark to compare efficiency to that function.... +## Solution to challenge 1 -Timings using `Benchmark::cmpthese` +Again another interesting challenge ... we can see if we can improve +performance. -| | Rate | Search | GrepMap | GrepGrep | *Flatten* | GrepExt | -| --------- | --------: | -----: | ------: | -------: | --------: | ------: | -| Search | 4,857/s | -- | -3% | -34% | *-42%* | -68% | -| GrepMap | 4,995/s | 3% | -- | -32% | *-40%* | -67% | -| GrepGrep | 7,391/s | 52% | 48% | -- | *-11%* | -52% | -| *Flatten* | *8,326/s* | *71%* | *67%* | *13%* | *--* | *-46%* | -| GrepExt | 15,337/s | 216% | 207% | 108% | *84%* | -- | +Initially it looks quite complex - there are two solutions classes: + * splitting the string and creating/modifying an array of the + individual parts -Notes: + * splitting the string and creating/modifying a string - * *Flatten* is for comparison only - it actually does nothing other than flatten - the array - this highlights how efficient each algorithm is (and can be) - - * So we see that the grep_grep solution is nearly 60% more efficient than the search - algorithm (this is true for all search method algorithms which flatten - the array first), add the filtering to the outer grep - this becomes faster than - flatten - and around 3 times the speed of search. +## "Expanded perl code" -## Not flattening array +### Array - two loops... -As we have a limit on performance with the flattening operation. To -improve efficiency we will need to consider a different approach: **Which -therefore must not flatten!** + * We first split the directory into path parts and remove any that + are empty or "`.`". -### Do not flatten - find the row containing the number + * We loop through the array until we find a '`..`' if we do we + remove it and the previous entry. - * Rather than using `map`/`grep` on the outer array, we instead use - a "hard coded" binary search. + * We then repeat this until we don't find a '`..`' - * First we return 0 if the number is outside the range of numbers in - the list. + * To jump out of the loop we use `next "label"` to not just skip out + of the inner loop, but to also to restart the parent loop at the + same time. - * Then we look to see if the number is below the first value of the - 4th row. This means it is in rows 1, 2 or 3. So we test for that. - Otherwise it is in rows 4 or 5, so we can test that as well. - - * This function expression returns which row the number *could be in* + * Finally remove an initial "`..`" which wouldn't get removed by this + algorithm. -```perl - $v < $m->[3][0] - ? ( $v < $m->[1][0] ? 0 : $v < $m->[2][0] ? 1 : 2 ) - : ( $v < $m->[4][0] ? 3 : 4 ) -``` - -This leads the whole function to be: + * and join the array together with '`/`' - we add the `''` so that we + get the leading '`/`'. ```perl -sub find_val_dnf { - my($v,$m) = @_; - return $v < $m->[0][0] || $v > $m->[4][4] - ? 0 - : 0 + grep { $v == $_ } @{$m->[ $v < $m->[3][0] - ? ( $v < $m->[1][0] ? 0 : $v < $m->[2][0] ? 1 : 2 ) - : ( $v < $m->[4][0] ? 3 : 4 ) - ]}; +sub canonical_path_double { + my $directory_path = shift; + my @directory_names = grep { $_ ne '' && + $_ ne '.' } + split m{/}, + $directory_path; + + OUTER: while(1) { + foreach (1..$#directory_names) { + next unless $directory_names[$_] eq '..'; + splice @directory_names,$_-1,2; + next OUTER; + } + last; + } + shift @directory_names if @directory_names && $directory_names[0] eq '..'; + return join '/','',@directory_names; } ``` -### Do not flatten - optimal solution +### Array - 1-loop with backtracking - * As well as not using `map` or `grep` on the outside array we - do so on the inner array. +To avoid the double loop - we can use back-tracking so that when we +find a "`..`" instead of restarting the loop we just step back one +entry in the array - we see `$pointer--` after the `splice` +statement below. - * Once we have chosen the correct row then we split it into - two halves and check for a match in each half depending on - whether it matches the middle value or is above/below. + * If `$pointer == 0` then the `..` is at the start of the directory + list so we just remove it (we can't go back beyond the root + directory). ```perl -sub find_val_dnf_optimal { - my($v,$m,$t) = @_; - - return $v < $m->[0][0] || $v > $m->[4][4] - ? 0 - : ( $t = $m->[ $v < $m->[3][0] - ? ( $v < $m->[1][0] ? 0 : $v < $m->[2][0] ? 1 : 2 ) - : ( $v < $m->[4][0] ? 3 : 4 ) - ] ) && - ( return $v == $t->[2] ? 1 : - $v < $t->[2] ? - ( $v == $t->[0] || $v == $t->[1] ? 1 : 0) : - ( $v == $t->[4] || $v == $t->[3] ? 1 : 0) ); +sub canonical_path_array { + my $directory_path = shift; + my @directory_names = grep { $_ ne '' && + $_ ne '.' } + split m{/}, + $directory_path; + my $pointer = 1; + while( $pointer < @directory_names ) { + if( $directory_names[$pointer] eq '..' ) { + if( $pointer > 0 ) { + splice @directory_names, $pointer - 1, 2; + $pointer --; + } else { + shift @directory_names; + } + } else { + $pointer ++; + } + } + return join '/','', @directory_names; } ``` -### Do not flatten - generalised solution - - * Above we used the information that we are supplied with a 5x5 matrix - if the matrix is an arbitrary size we can change the two loops to - use a binary search algorithm. - - * Notes: - - * We use `($l+$r)>>1` to find the mid point - this is more efficient - than using `int(($l+$r)/2)`; Bit-shift operators are useful if you - are multiplying or dividing by 2. - - * The loops differ slightly because one is looking for an entry which - is within a range - the other for a specific value. The differencs - you will note are (1) the first loop do `$r=$n` where in the - second we do `$r=$n-1` - as the first is a range based search the - second an exact match; (2) The second loop checks for equality before - the binary search check - and returns 1 if true. The end eheck also - becomes $l<=$r as $r becomes less than $l when we have missed the - entry. +### String - 1-loop -```perl -sub find_val_general_dnf { - my($v,$m)=@_; - return 0 if$v<$m->[0][0]||$v>$m->[-1][-1]; - my($n,$l,$r)=(0,0,@{$m}-1); - $v>$m->[$n=($l+$r)>>1][-1]?($l=$n+1):($r=$n)while$r!=$l; - ($l,$r)=(0,@{$m=$m->[$l]}-1); - ($v==$m->[$n=($l+$r)>>1])?(return 1):$v>$m->[$n]?($l=$n+1):($r=$n-1)while$l<=$r; - return 0; -} -``` -Timings using `Benchmark::cmpthese` +Rather than store the parts in a list - we use a string to store +this canonical path - and we either add directories to the end of +it or remove them if we come across a "`..`". -| | Rate | Search | GrepMap | GrepGrep | *Flatten* | DNFGen | GrepExt | DNF | DNFOpt | -| --------- | --------: | -----: | ------: | -------: | --------: | -----: | ------: | -----: | -----: | -| Search | 4,316/s | -- | -20% | -41% | *-47%* | -64% | -71% | -79% | -81% | -| GrepMap | 5,391/s | 25% | -- | -26% | *-34%* | -55% | -64% | -73% | -76% | -| GrepGrep | 7,310/s | 69% | 36% | -- | *-10%* | -39% | -52% | -64% | -68% | -| *Flatten* | *8,123/s* | *88%* | *51%* | *11%* | *--* | *-33%* | *-46%* | *-60%* | *-65%* | -| DNFGen | 12,063/s | 179% | 124% | 65% | *48%* | -- | -20% | -40% | -47% | -| GrepExt | 15,106/s | 250% | 180% | 107% | *86%* | 25% | -- | -25% | -34% | -| DNF | 20,161/s | 367% | 274% | 176% | *148%* | 67% | 33% | -- | -12% | -| DNFOpt | 22,883/s | 430% | 324% | 213% | *182%* | 90% | 51% | 14% | -- | + * We achieve the former - by just concatenating a "`/`" and the + name to the end of the string. -We can see that this "optimal method" is about 5 times more efficient than the "search" -function we originally tried - even the generic double binary search is abour 2.5 times as -efficient than the binary search over the flattened matrix. + * The latter we strip this string off with a regex substitution: + `s{/[^/]+\Z}{}`. This works in all cases wherever the "`..`" + is in the list. -## Test script - -For completeness - this is the test and benchmarking script + * Note as we are looping through the array we can ignore the + grep and just skip out of the loop if the name is either "" + or "`.`". ```perl -#!/usr/local/bin/perl - -use strict; - -use warnings; -use feature qw(say); -use Test::More; -use Benchmark qw(cmpthese); - -my $N = @ARGV ? $ARGV[0] : 100_000; - -my $matrix = [ - [ 1, 2, 3, 5, 7 ], - [ 9, 11, 15, 19, 20 ], - [ 23, 24, 25, 29, 31 ], - [ 32, 33, 39, 40, 42 ], - [ 45, 47, 48, 49, 50 ], -]; - -## Create a test set - numbers from -10 to 60... -my %TEST_SET = map { $_ => 0 } (my @KEYS = -10..60); - -## Set all to 0, and then iterate through the elements of the matrix -## and set the numbers in the list to 1.... - -$TEST_SET{$_} = 1 foreach map { @{$_} } @{$matrix}; - -## Run the original PWC test examples... -is( find_val_search( 35, $matrix ), 0 ); -is( find_val_search( 39, $matrix ), 1 ); - -is( find_val_grep_grep( 35, $matrix ), 0 ); -is( find_val_grep_grep( 39, $matrix ), 1 ); -#... -#... -#... -is( find_val_find_val_dnf_optimal( 35, $matrix ), 0 ); -is( find_val_find_val_dnf_optimal( 39, $matrix ), 1 ); - -## Now run our full test set - from -10 to 60. This covers -## all cases within the list and a few either side... - -is( find_val_search( $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS; -is( find_val_grep_grep( $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS; -#... -#... -#... -is( find_val_dnf_optimal( $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS; - -done_testing(); - -cmpthese( $N, { - 'Search' => sub { find_val_search( $_, $matrix ) foreach @KEYS; }, - 'GrepGrep' => sub { find_val_grep_grep( $_, $matrix ) foreach @KEYS; }, - #... - #... - #... - 'DNFOpt' => sub { find_val_dnf_optimal( $_, $matrix ) foreach @KEYS; }, -} ); +sub canonical_path_string_fast { + my $path = shift; + my @directories = split m{/}, $path; + my $canonical_path = ''; + foreach my $directory_name ( @directories ) { + next if $directory_name eq ''; + next if $directory_name eq '.'; + if( $directory_name eq q(..) ) { + $canonical_path =~ s{/[^/]+\Z}{}; + } else { + $canonical_path .= q(/) . $directory_name; + } + } + return $canonical_path; +} ``` +### String fast - 1-loop -## Addendum 1 - using hash keys +Regexs are not the fastest way to perform simple matches strings +(and intern to modify them). We can speed up the trimming of the +canonical path by replacing the regex solution by using `rindex` +and the four-parameter version of `substr`. -As some people have used hash slices to solve this problem - I thought I -would look into this to -- Colin thought this would help in the perl -review! I've added the following function: + * `rindex $str, $needle` finds the index of `$needle` in `$str`. + Here we use it to find the last `/` in string. -```perl -sub find_val_hash { - my %list; - @list{ map { @{$_} } @{$_[1]} } = (); ## Quickest - return exists $list{$_[0]} ? 1: 0; -} -``` + * `substr $str, $offset, $length, $substitute` finds the chunk of + the string `$str` from `$offset` of given length `$length`. If a + fourth parameter is set then this region of the string is replaced + by `$substitute`. -which converts the nested arrayref into an array for the keys of `%list` -and then returns 1 if the value of a key in the hash. Like other methods -this throws away the fact that the array is originally sorted... + * We can use this to truncate the string by doing: + + `substr $path, rindex( $path, '/' ), ~0, '';` -Looking at performance - this is approximately 53% slower than the search -algorithm, and therefore less than a tenth of the speed of the optimal -solution. + In the two-parameter version of `substr` if we omit length then + this returns to the end of the string. In the four-parameter + version - we can't omit this - so have to use an alternative + value - it has to be bigger than the longest string possible. + We use the "bitwise-negation" operator "`~`" to generate the + largest value possible. This is: 18,446,744,073,709,551,615 or + just shy of 16 Exabytes - I believe this should be big enough! -## Addendum 2 - passing an array rather than an arrayref + (Note you can use -ve numbers but there is no way of doing `-0` + to trim to the end of the line) -I quickly checked to see if by passing the matrix as an array rather than -an arrayref was any faster - one less derefernce - but it is not - it -appears to be roughtly the same and speed sometimes faster. +The script then becomes: ```perl -sub flatten_array { - shift @_; - my @list = map { @{$_} } @_; - return 1; -} +sub canonical_path_string_fast { + my $path = shift; + my @directories = split m{/}, $path; + my $canonical_path = ''; + foreach my $directory_name ( @directories ) { + next if $directory_name eq ''; + next if $directory_name eq '.'; + if( $directory_name eq q(..) ) { + substr $canonical_path, + rindex( $canonical_path, '/' ), + ~0, ''; + } else { + $canonical_path .= q(/) . $directory_name; + } + } + return $canonical_path; +} ``` -## Addendum 3 - using `List::Util` methods +## "Compact perl code" AKA 1-liners.. -Just out of interest to compare the core `map` and `grep` against the various `List::Util` functions `first`/`any`, I've added 3 more functions: +Now we can look at how we can compact this code. Here though we +need to consider the trade off between size and performance - the +smallest code is not necessarily the fastest - as some of the +tricks to make the code compact also make it slower. -```perl -sub find_val_list_util { - my($v,$m) = @_; - my $t = first { $_->[-1] >= $v } @{$m}; - return ($t && any { $v == $_ } @{$t}) ? 1 : 0; -} - -sub find_val_any_any { - my($v,$m) = @_; - return (any { $v>=$_->[0] && $v <=$_->[-1] && (any { $_ == $v } @{$_}) } @{$m}) ? 1 : 0; -} +### The array code... -sub find_val_any_any_naive { - my($v,$m) = @_; - return (any { any { $_ == $v } @{$_} } @{$m}) ? 1 : 0; -} -``` +We have two versions of the code - which are slightly different +The `canonical_path_compact_opt` function is probably closer to +the 1-loop array function above. We use nested-ternaries to replace +the `if else` blocks. -The last one is similar to the `grep - grep` method above but uses `any` - now in theory this -should be faster as `any` stops at the first occcurance - but as `grep` is core and `any` is -not there will be a performance penalty - especially for such a small set of values. +As well as using ternaries to make the code shorter - we use a few +other of our "shortening" tricks: -To speed things upw ithe add a check to see if the value is inside the row greater/equal to the -first value and less than/equal to the last value. This makes a big difference. + * We use variable `$a` as this is a variable we don't need to `my`. + * We also store the split path name in `@_`, this has two advantages: + we don't need to `my` it and secondly - we can just use `shift` + without using a variable in the while loop - as `shift` in a + function shifts off `@_` -## Addendum 4 - caching the "flattening" + * We use **yoda** comparisons ( `"value" eq $variable` ) rather tha + the more normal `$variable eq "value"` as it means we can save a + byte { `$var eq''` vs `''eq$var` } as we don't need the extra space. -Optimizations like these are great if you are in the middle of a tight loop and calling the function a number -of times - if this is the same matrix each time - we may get some performance boosts by pre-flattening the matrix. + * We re-order the if/else so that the we unravel it into an: + `if() { } ifelse() {} else {}` format which is better for + nested ternaries - even if it may seen to be less readable. -I've included two more functions "Grep-@" & "Hash-%" which replicate "Grep-Map" and "Hash" to show how this can -speed things up.. The compute on the fly hash is the slowest method - but pre-cacheing the hash is 30 times faster, -the pre-flattened array speeds up the grep - but only by a factor of 2 and so is still only half the speed of -the *do not flatten* versions. + * In the code above we separated the `$pointer - 1` in the `splice` + line with the `$pointer--` in the next line to back-track. We can + use `--$pointer` to achieve this all within the splice statement. -**Note:** The second parameter in each of these is the respective flattened arrayref or hashref. + * To futher shorten the code in `canonical_path_compact` we + replace the `grep` clause with a regular expression which is + 7-bytes shorter. ```perl -sub find_val_grep_pre { - my($v,$m)=@_; - return 0 + grep { $_ == $v } @{$m}; +sub canonical_path_compact_opt { +## Remove the regex in the grep and use ternarys... +$a=1,@_=grep{''ne$_&&'.'ne$_}split/\//,shift; +$_[$a]ne'..'?$a++:$a?splice@_,--$a,2:shift while$a<@_; +join'/','',@_ } -sub find_val_hash_pre { - return exists $_[1]{$_[0]} ? 1: 0; -} -``` - -| Method | Rate | Hash | AANaive | Search | GrepMap | AnyAny | ListUtil | GrepGrep | *Flatten* | *Flatten@* | **preGrep** | DNFGen | GrepExt | DNF | DNFOpt | **preHash** | -| ----------- | -----------: | --------: | --------: | --------: | --------: | --------: | --------: | --------: | ---------: | ---------: | ----------: | -------: | -------: | -------: | -------: | ---------: | -| Hash | 2,019/s | -- | -26% | -59% | -63% | -67% | -69% | -73% | *-76%* | *-76%* | **-82%** | -83% | -87% | -90% | -91% | **-98%** | -| AANaive | 2,714/s | 34% | -- | -44% | -50% | -56% | -58% | -64% | *-68%* | *-68%* | **-75%** | -78% | -82% | -87% | -88% | **-97%** | -| Search | 4,873/s | 141% | 80% | -- | -10% | -20% | -24% | -35% | *-42%* | *-42%* | **-56%** | -60% | -68% | -76% | -79% | **-94%** | -| GrepMap | 5,435/s | 169% | 100% | 12% | -- | -11% | -15% | -27% | *-35%* | *-36%* | **-51%** | -55% | -64% | -73% | -76% | **-94%** | -| AnyAny | 6,127/s | 204% | 126% | 26% | 13% | -- | -5% | -18% | *-27%* | *-27%* | **-44%** | -50% | -60% | -70% | -73% | **-93%** | -| ListUtil | 6,427/s | 218% | 137% | 32% | 18% | 5% | -- | -14% | *-24%* | *-24%* | **-42%** | -47% | -58% | -68% | -72% | **-92%** | -| GrepGrep | 7,452/s | 269% | 175% | 53% | 37% | 22% | 16% | -- | *-11%* | *-12%* | **-32%** | -39% | -51% | -63% | -68% | **-91%** | -| *Flatten* | 8,418/s* | *317%* | *210%* | *73%* | *55%* | *37%* | *31%* | *13%* | *--* | *-0%* | ***-24%*** | *-31%* | *-45%* | *-58%* | *-63%* | ***-90%*** | -| *Flatten@* | 8,446/s* | *318%* | *211%* | *73%* | *55%* | *38%* | *31%* | *13%* | *0%* | *--* | ***-23%*** | *-31%* | *-45%* | *-58%* | *-63%* | ***-90%*** | -| **preGrep** | **11,013/s** | **446%** | **306%** | **126%** | **103%** | **80%** | **71%** | **48%** | ***31%*** | ***30%*** | **--** | **-10%** | **-28%** | **-46%** | **-52%** | **-87%** | -| DNFGen | 12,195/s | 504% | 349% | 150% | 124% | 99% | 90% | 64% | *45%* | *44%* | **11%** | -- | -20% | -40% | -47% | **-86%** | -| GrepExt | 15,244/s | 655% | 462% | 213% | 180% | 149% | 137% | 105% | *81%* | *80%* | **38%** | 25% | -- | -25% | -34% | **-82%** | -| DNF | 20,243/s | 903% | 646% | 315% | 272% | 230% | 215% | 172% | *140%* | *140%* | **84%** | 66% | 33% | -- | -12% | **-76%** | -| DNFOpt | 22,936/s | 1036% | 745% | 371% | 322% | 274% | 257% | 208% | *172%* | *172%* | **108%** | 88% | 50% | 13% | -- | **-73%** | -| **preHash** | **84,746/s** | **4098%** | **3022%** | **1639%** | **1459%** | **1283%** | **1219%** | **1037%** | ***907%*** | ***903%*** | **669%** | **595%** | **456%** | **319%** | **269%** | **--** | - -## Summary... - -So as we can see our **DNFOpt** (`find_val_dnf_optimal`) method is the optimal method running at approx 1,700,000 calculations per second, followed by the unoptimized **DNF** (`find_val_dnf`) method at approximately 1,500,000 caclulations per second - these are both optimized to know the matrix is 5x5. - -Interestingly the fastest method (without explicitly coding for 5x5) is the optimized version of the `grep`x`grep` of the **GrepExt** (`find_val_grep_grep_ext`) method at approximately 1,100,000 calculations per second. This slighlty nudges out the generic form of the nested binary search **DNFGen** (`find_val_general_dnf`) method at 900,000 calculations per second. - -Other things to note..: - * `grep`x`grep` method **GrepGrep** (`find_val_grep_grep`) is faster than all the versions using `List::Util` using `any` and `first` by approximately 15%, actually comparing like for like with **AANaive** (`find_val_any_any_naive`) is 2.75 times faster... I guess this is because `grep` is built in... There may be a sweet point where this changes as the matrix gets large though. - * The worst methods are: **Hash** (`find_val_hash`), **AANaive** (`find_val_any_any_naive`) and **Search** (`find_val_search`) at approximately 150,000, 200,000 and 350,000 calculations per second. - * The range between best and worst is an order of magnitude. - * I haven't included in this summary the two **preGrep** and **preHash** methods - which pre-flatten the matrix - and note that the **preHash** method is the fastest by far - nearly 4 times faster than are *best* method **DNFOpt**, even though the **Hash** method was worse (the ration of the two hash methods is approximately 40:1). This shows that if it is possible to cache some part of the calculation it can greatly improve performance if called lots of time. - * All these comparisons are based on the 5x5 matrix in the question - and relative speeds may change as the matrix dimensions get larger. - -** Quick extra addendum - -I've added another method which implements a binary search across the unflattened matrix - by effectively flattening the matrix and a 0..n location back into x,y-coorinates... - -```perl -sub find_val_binary { - my($v,$m)=@_; - return 0 if$v<$m->[0][0]||$v>$m->[-1][-1]; - my $size = @{$m}; - my($v2,$n,$l,$r)=(0,0,0,$size*$size-1); - while($l<=$r) { - $n=($l+$r)>>1; - return 1 if $v == ($v2 = $m->[$n/$size][$n%$size]); - $v>$v2?($l=$n+1):($r=$n-1); - } - return 0; +sub canonical_path_compact { +## Make shorted by using regex and ternarys +$a=1,@_=grep{!/^\.?$/}split/\//,shift; +$_[$a]ne'..'?$a++:$a?splice@_,--$a,2:shift while$a<@_; +join'/','',@_ } ``` -Interestingly this is approximately 25% slower than the **DNFGen** method... I would think because of the two *division operations* when getting the matrix value. - -| | Rate | Binary | DNFGen | -| ------ | ------: | -----: | -----: | -| Binary | 9200/s | -- | -25% | -| DNFGen | 12225/s | 33% | -- | - -# Challenge 2 - Ordered Letters - -**Given a word, you can sort its letters alphabetically (case insensitive). -For example, “beekeeper” becomes “beeeeekpr” and “dictionary” becomes -“acdiinorty”.** - -**Write a script to find the longest English words that don’t change when -their letters are sorted.** - -## Dictionaries available - -In this example we will consider the dictionaries available in Ubuntu, -as I am in UK - I will use the British English dictionaries: +### The string code - * `/usr/share/dict/british-english-small` - * `/usr/share/dict/british-english-large` - * `/usr/share/dict/british-english-huge` - * `/usr/share/dict/british-english-insane` +Here we re-implement the two string algorithms in compact 1-liners. -## Assumptions into what really is a word + * `canonical_path_fast` and `canonical_path_fastest` correspond + directly to the two methods above. -These dictionaries contain a number of "Words" which we wouldn't -necessarily consider to be words. To simplify this we will only -chose words which are all lowercase. + * `canonical_path_short` replaces the equality checks for "" and + "`.`" with the regex as we saw in the array code. -So we will filter these words out... - -A summary of the four dictionaries above give the following counts: - -| Name: | # words | # trimmed | -| ------ | ------: | --------: | -| small | 50,790 | 39,781 | -| large | 166,828 | 113,695 | -| huge | 344,861 | 245,593 | -| insane | 654,299 | 427,891 | - -## Solution to challenge 2 - -We will collect all of the words that meet this requirement. -We will collect them in an array `@max`. The first value will -be the length of the words in the list and the rest - -There are 4 parts to the loop... - - * The filters as above... - - `!/[^a-z]/` - -* A filter that skips words shorter than the max length - - `$max[0] <= length $_` - - * The calculation to see if the word matches - - `lc $_ eq join q(), sort split //, lc $_` - - * The code to either replace the array `@max` with the - newer longer word *or* to push the new word to the end - of the list. + * `canonical_path_shortest` removes the need for one of the + ternary operators by performing string multiplication when + adding the directory to the list. If the regex returns true, + then `"/$_"x!/.../` is `"/$_"x 0` or "". If the regex returns + false then `"/$_"x!/.../` is `"/$_"x 1` or "`/$_`". - `$max[0] == length $_ ? ( push @max, $_ ) : ( @max = (length $_, $_) )` +```perl +sub canonical_path_shortest { +$a=''; +'..'ne$_?$a.="/$_"x!/^\.?$/:$a=~s#/[^/]+$## for split'/',shift; +$a +} -The full function is.... +sub canonical_path_short { +$a=''; +/^\.?$/?0:'..'ne$_?$a.="/$_":$a=~s#/[^/]+$## for split'/',shift; +$a +} -```perl -sub longest { - open my $fh, q(<), $_[0]; - my @max = (0); - (chomp) ## Remove newline character - && !/[^a-z]/ ## Remove non-lower case characters - && ( $max[0] <= length $_ ) - ## Remove words that are too short - && ( $_ eq join q(), sort split //, $_ ) - ## Check the word is unchanged when the - ## letters are sorted - && ( $max[0] == length $_ - ? ( push @max, $_ ) - : ( @max = (length $_, $_) ) - ) - ## If the word is the same length as the maximal word - ## push it onto @max - so we store all the longest words - ## with maximum length. - ## If the word is longer than the max length (1st entry - ## in @max - reset max to include the new max length and - ## the word. - while <$fh>; - return "$_[0] > @max"; - ## Return the name of the file used, the size of the words - ## and a complete list of the words of that length. +sub canonical_path_fast { +$a=''; +'.'ne$_&&''ne$_&&('..'ne$_?$a.="/$_":$a=~s#/[^/]+$##)for split'/',shift; +$a } -``` -If you like the code more compact - here it is without the comments... +sub canonical_path_fastest { +$a=''; +'.'ne$_&&''ne$_&&('..'ne$_?$a.='/'.$_:substr$a,rindex($a,'/'),~0,'')for split'/',shift; +$a +} -```perl -sub longest_no_comments { - open my $fh, q(<), $_[0]; - my @m = (0); - (chomp)&&!/[^a-z]/&&($m[0]<=length$_) - &&($_ eq join q(),sort split//,$_) - &&($m[0]==length$_?(push@m,$_):(@m=(length$_,$_))) - while <$fh>; - return "$_[0] > @m"; +my $s; +sub canonical_path_global { +$s=''; +'.'ne$_&&''ne$_&&('..'ne$_?$s.='/'.$_:substr$s,rindex($s,'/'),~0,'')for split/\//,shift; +$s } ``` -## The results... - -Here are the results for each of the four databases: - - * `british-english-small`: max length 6 - 21 words - * abhors - * accent - * accept - * access - * accost - * almost - * begins - * bellow - * billow - * cellos - * chills - * chilly - * chimps - * chintz - * choosy - * choppy - * effort - * floors - * floppy - * glossy - * knotty -* `british-english-large`: max length 7 - 1 word - * billowy -* `british-english-huge`: max length 7 - 4 words - * beefily - * billowy - * chikors - * dikkops -* `british-english-insane`: max length 8 - 1 word - * aegilops - -## Some definitions... - -All the 6 letter words and billowy and beefily are quite common words, but there are three that people may not have heard of these are all species names. - - * **chikors** - An alternative spelling of chukars - A species of partridge native to central Asia (*Alectoris chukar*). - * **dikkops** - A bird of the family Burhinidae. The stone curlew, thick-knee. (From afrikaans) *dik*-*kop* or thick head - * **aegilops** - A genus of Eurasian and North American plants in the grass family, Poaceae. They are known generally as goat grasses. Some species are known as invasive weeds in parts of North America. +## Performance of different methods + +We will look at some different versions of the code: + + * whether we use an array or string to accumulate the resultant path + * Whether we use "readable" code or Perl hacks and tricks + +To see what aspects of our code makes it faster or slower + +### Summary of methods.. + * "Long form" Perl... + * `canonical_path_double` - Using a double loop + * `canonical_path_array` - Using backtracking instead of inner loop + * `canonical_path_string` - Use a string as the accumulator and mapping + * `canonical_path_string_fast` - As above - but using substr/rindex + * "One-liner" perl {arrays} + * `canonical_path_compact` - short version of array code + * `canonical_path_compact_opt` - optimized version of above - 1-less regex + * "One-liner" perl {strings} + * `canonical_path_shortest` - most compact method + * `canonical_path_short` - compact method + * `canonical_path_fast` - replace one of the regex with equality checks + * `canonical_path_fastest` - replace other regex with substr/rindex + * `canonical_path_global` - as fastest but with global variable... + +### Performance of each method: + +| | Rate | @-st | $-st | $-sh | @-2l | @-cf | @-ft | $-cd | $-fa | $-cf | $-ft | $-gl | +| -------------- | -------: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | +| @-shortest | 17,483/s | -- | -21% | -22% | -29% | -30% | -37% | -38% | -43% | -52% | -59% | -59% | +| $-shortest | 22,026/s | 26% | -- | -2% | -11% | -11% | -21% | -21% | -28% | -40% | -48% | -49% | +| $-short | 22,472/s | 29% | 2% | -- | -9% | -10% | -20% | -20% | -26% | -39% | -47% | -48% | +| @-code-2loop | 24,631/s | 41% | 12% | 10% | -- | -1% | -12% | -12% | -19% | -33% | -42% | -43% | +| @-code-fastest | 24,876/s | 42% | 13% | 11% | 1% | -- | -11% | -11% | -18% | -32% | -42% | -42% | +| @-fastest | 27,933/s | 60% | 27% | 24% | 13% | 12% | -- | -0% | -8% | -24% | -35% | -35% | +| $-code-fast | 28,011/s | 60% | 27% | 25% | 14% | 13% | 0% | -- | -8% | -24% | -34% | -35% | +| $-fast | 30,488/s | 74% | 38% | 36% | 24% | 23% | 9% | 9% | -- | -17% | -29% | -29% | +| $-code-fastest | 36,765/s | 110% | 67% | 64% | 49% | 48% | 32% | 31% | 21% | -- | -14% | -15% | +| $-fastest | 42,735/s | 144% | 94% | 90% | 74% | 72% | 53% | 53% | 40% | 16% | -- | -1% | +| $-global-fast | 43,103/s | 147% | 96% | 92% | 75% | 73% | 54% | 54% | 41% | 17% | 1% | -- | + |
