| author | James Smith <js5@sanger.ac.uk> | 2021-05-06 15:22:19 +0100 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2021-05-06 15:22:19 +0100 |
| commit | e97c3e59cb2aec7e0fc3d12904536049aec4a60f (patch) | |
| tree | b4ea0b095f90d752f682de6866295f307c96947b /challenge-111/james-smith | |
| parent | 95dcbfcf8f6af38ac4039544183d8b03abffc716 (diff) | |
Update README.md
Diffstat (limited to 'challenge-111/james-smith')
| -rw-r--r-- | challenge-111/james-smith/README.md | 638 |
1 file changed, 360 insertions(+), 278 deletions(-)
diff --git a/challenge-111/james-smith/README.md b/challenge-111/james-smith/README.md
index a83ed7fa85..aa4c6ddce0 100644
--- a/challenge-111/james-smith/README.md
+++ b/challenge-111/james-smith/README.md
@@ -1,337 +1,419 @@

# Perl Weekly Challenge #111

# Challenge 1 - Search Matrix

**You are given a 5x5 matrix filled with integers such that each row is sorted from left to right, and the first integer of each row is greater than the last integer of the previous row.**

**Write a script to find a given integer in the matrix using an efficient search algorithm.**

## Solution to challenge 1

Today's is an interesting challenge... to check whether a value is in the matrix.

Initially this looks like we need an efficient search algorithm that flattens the array and then searches it {e.g. binary search}.

We implement this... but when we compare it against some alternative simple methods, we don't get the result we expect...

## Binary search

We first flatten the array. Then we loop until the array is empty or it has a solitary value:

 * if we get a match with the "middle" value we return 1;
 * otherwise we use `splice` to remove the half of the list which the number isn't in.
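The halving loop just described can be sketched outside Perl for comparison. This is a minimal Python illustration of the same flatten-then-halve idea, using the matrix from the challenge; the function name `find_val_halving` is mine, not from the original post:

```python
MATRIX = [
    [1, 2, 3, 5, 7],
    [9, 11, 15, 19, 20],
    [23, 24, 25, 29, 31],
    [32, 33, 39, 40, 42],
    [45, 47, 48, 49, 50],
]

def find_val_halving(val, matrix):
    vals = [x for row in matrix for x in row]   # flatten row by row
    while len(vals) > 1:
        mid = len(vals) // 2
        if vals[mid] == val:
            return 1                            # matched the "middle" value
        if vals[mid] > val:
            vals = vals[:mid]                   # discard the upper half
        else:
            vals = vals[mid + 1:]               # discard the lower half
    return 1 if vals and vals[0] == val else 0
```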
We return 0 if the array has length 0. If the length is 1 we return 1 or 0 depending on whether the remaining element is the search value or not.

```perl
sub find_val_search {
  my( $val, $m, @list ) = ( $_[0], 0, map { @{$_} } @{$_[1]} );

  $list[ $m = @list >> 1 ] == $val ? ( return 1 )
    : $list[ $m ] > $val ? ( splice @list, $m )
    :                      ( splice @list, 0, $m+1 )
    while @list>1;

  return @list && $list[0] == $val ? 1 : 0;
}
```

## The other (better) methods:

### grep on the flattened array

```perl
sub find_val_grep_map {
  my($v,$m)=@_;
  return 0 + grep { $_ == $v } map { @{$_} } @{$m};
}
```

### reducing array with grep and then combining with map

Uses `grep` on each sub-array to return either an empty array or an array containing the match. We then use `map` to combine the arrays. The resultant array will have length `0` or `1`.

```perl
sub find_val_map_grep {
  my($v,$m)=@_;
  return 0 + map { grep { $_ == $v } @{$_} } @{$m};
}
```

## Efficiency

We use `cmpthese` to compare the performance... Our methods are:

 * `find_val_search`
 * `find_val_grep_map`
 * `find_val_map_grep`
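One thing worth noting before the timings: each `splice` in `find_val_search` rewrites part of the list, so the halving search does more copying than a classic index-based binary search. For comparison only, here is a hedged Python sketch using the standard `bisect` module on the flattened list (the matrix and function name are mine, for illustration):

```python
from bisect import bisect_left

MATRIX = [
    [1, 2, 3, 5, 7],
    [9, 11, 15, 19, 20],
    [23, 24, 25, 29, 31],
    [32, 33, 39, 40, 42],
    [45, 47, 48, 49, 50],
]

def find_val_bisect(val, matrix):
    flat = [x for row in matrix for x in row]   # one flatten, as before
    i = bisect_left(flat, val)                  # O(log n), index-based: no copying
    return 1 if i < len(flat) and flat[i] == val else 0
```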
Timings using `Benchmark::cmpthese`:

| | Rate | Search | Grep-Map | Map-Grep | *Flatten* |
| --------- | -------: | -----: | -------: | -------: | --------: |
| Search | 4,859/s | -- | -10% | -33% | *-42%* |
| Grep-Map | 5,394/s | 11% | -- | -25% | *-36%* |
| Map-Grep | 7,210/s | 48% | 34% | -- | *-14%* |
| *Flatten* | 8,418/s | *73%* | *56%* | *17%* | *--* |

Notes:

 * *Flatten* is for comparison only - it actually does nothing other than flatten the array - this highlights how efficient each algorithm is (and can be).

 * So we see that the map-grep solution is around 50% more efficient than the search algorithm (this is true for all search algorithms which flatten the array first).

## Not flattening array

As the flattening operation puts a limit on performance, to improve efficiency we will need to consider a different approach: **do not flatten!**

### Do not flatten - find the row containing the number

 * Rather than using `map`/`grep` on the outer array, we instead use a "hard-coded" binary search.

 * First we return 0 if the number is outside the range of numbers in the list.

 * Then we look to see if the number is below the first value of the 4th row. If so it is in rows 1, 2 or 3, so we test those; otherwise it is in rows 4 or 5, so we test those instead.

 * This expression returns which row the number *could be* in:

```perl
  $v < $m->[3][0]
    ? ( $v < $m->[1][0] ? 0 : $v < $m->[2][0] ? 1 : 2 )
    : ( $v < $m->[4][0] ? 3 : 4 )
```

This leads the whole function to be:

```perl
sub find_val_dnf {
  my($v,$m) = @_;
  return $v < $m->[0][0] || $v > $m->[4][4]
       ? 0
       : 0 + grep { $v == $_ } @{$m->[ $v < $m->[3][0]
           ? ( $v < $m->[1][0] ? 0 : $v < $m->[2][0] ? 1 : 2 )
           : ( $v < $m->[4][0] ? 3 : 4 )
         ]};
}
```

### Do not flatten - optimal solution

 * As well as avoiding `map`/`grep` on the outer array, we avoid them on the inner array too.

 * Once we have chosen the correct row we split it into two halves, and check for a match in the appropriate half depending on whether the value matches, is below, or is above the middle value.

```perl
sub find_val_dnf_optimal {
  my($v,$m,$t) = @_;

  return $v < $m->[0][0] || $v > $m->[4][4]
    ? 0
    : ( $t = $m->[ $v < $m->[3][0]
          ? ( $v < $m->[1][0] ? 0 : $v < $m->[2][0] ? 1 : 2 )
          : ( $v < $m->[4][0] ? 3 : 4 )
        ] ) &&
      ( return $v == $t->[2] ? 1 :
               $v <  $t->[2] ?
          (( $v == $t->[0] || $v == $t->[1] ) ? 1 : 0) :
          (( $v == $t->[4] || $v == $t->[3] ) ? 1 : 0) );
}
```

Timings using `Benchmark::cmpthese`:

| | Rate | Search | Grep-Map | Map-Grep | *Flatten* | DNF | DNF Opt |
| --------- | -------: | -----: | -------: | -------: | --------: | -----: | ------: |
| Search | 4,859/s | -- | -10% | -33% | *-42%* | -76% | -79% |
| Grep-Map | 5,394/s | 11% | -- | -25% | *-36%* | -73% | -77% |
| Map-Grep | 7,210/s | 48% | 34% | -- | *-14%* | -64% | -69% |
| *Flatten* | 8,418/s | *73%* | *56%* | *17%* | *--* | *-59%* | *-63%* |
| DNF | 20,284/s | 317% | 276% | 181% | *141%* | -- | -12% |
| DNF_opt | 22,989/s | 373% | 326% | 219% | *173%* | 13% | -- |

We can see that this "optimal" method is somewhere between 4.5 and 5 times more efficient than the "search" function.
+```perl +#!/usr/local/bin/perl + +use strict; + +use warnings; +use feature qw(say); +use Test::More; +use Benchmark qw(cmpthese); + +my $matrix = [ + [ 1, 2, 3, 5, 7 ], + [ 9, 11, 15, 19, 20 ], + [ 23, 24, 25, 29, 31 ], + [ 32, 33, 39, 40, 42 ], + [ 45, 47, 48, 49, 50 ], +]; + +## Create a test set - numbers from -10 to 60... +my %TEST_SET = map { $_ => 0 } (my @KEYS = -10..60); + +## Set all to 0, and then iterate through the elements of the matrix +## and set the numbers in the list to 1.... + +$TEST_SET{$_} = 1 foreach map { @{$_} } @{$matrix}; + +## Run the original PWC test examples... +is( find_val_search( 35, $matrix ), 0 ); +is( find_val_search( 39, $matrix ), 1 ); +is( find_val_map_grep( 35, $matrix ), 0 ); +is( find_val_map_grep( 39, $matrix ), 1 ); +is( find_val_grep_map( 35, $matrix ), 0 ); +is( find_val_grep_map( 39, $matrix ), 1 ); +is( find_val_dnf( 35, $matrix ), 0 ); +is( find_val_dnf( 39, $matrix ), 1 ); +is( find_val_dnf_optimal( 35, $matrix ), 0 ); +is( find_val_dnf_optimal( 39, $matrix ), 1 ); + +## Now run our full test set - from -10 to 60. This covers +## all cases within the list and a few either side... 
is( find_val_dnf_optimal( $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS;
is( find_val_dnf(         $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS;
is( find_val_search(      $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS;
is( find_val_map_grep(    $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS;
is( find_val_grep_map(    $_, $matrix ), $TEST_SET{$_} ) foreach @KEYS;

done_testing();

cmpthese(100_000, {
  q(DNF_opt) => sub { find_val_dnf_optimal( $_, $matrix ) foreach @KEYS; },
  q(DNF)     => sub { find_val_dnf(         $_, $matrix ) foreach @KEYS; },
  'Flatten'  => sub { flatten(              $_, $matrix ) foreach @KEYS; },
  'Search'   => sub { find_val_search(      $_, $matrix ) foreach @KEYS; },
  'Grep-Map' => sub { find_val_grep_map(    $_, $matrix ) foreach @KEYS; },
  'Map-Grep' => sub { find_val_map_grep(    $_, $matrix ) foreach @KEYS; },
});
```

## Addendum - using hash keys

As some people had used hash slices to solve this problem, I thought I would look into this too. I added the following function:

```perl
sub find_val_hash {
  my %list;
  @list{ map { @{$_} } @{$_[1]} } = ();
  return exists $list{$_[0]} ? 1 : 0;
}
```

which converts the nested arrayref into a flat list for the keys of `%list`, and then returns 1 if the value is a key in the hash.
Like other methods, this throws away the fact that the array is originally sorted...

Looking at performance - this is approximately 53% slower than the search algorithm, and therefore less than a tenth of the speed of the optimal solution.

## Addendum - passing an array rather than an arrayref

I quickly checked to see if passing the matrix as an array rather than an arrayref was any faster - one less dereference - but it is not: it appears to be the same speed +/- 2%. In most cases it appears to be the same speed or slightly slower.
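The hash-keys trick has a natural analogue in most languages. As a hedged illustration only (matrix and names are mine), here is the same idea in Python with a set - with the same caveat that building the lookup on every call dominates the cost, while a prebuilt lookup makes each membership test cheap:

```python
MATRIX = [
    [1, 2, 3, 5, 7],
    [9, 11, 15, 19, 20],
    [23, 24, 25, 29, 31],
    [32, 33, 39, 40, 42],
    [45, 47, 48, 49, 50],
]

def find_val_set(val, matrix):
    # Build-per-call, like find_val_hash: construction outweighs the lookup.
    return 1 if val in {x for row in matrix for x in row} else 0

# Reusing a prebuilt set amortises the construction cost away entirely.
LOOKUP = {x for row in MATRIX for x in row}

def find_val_prebuilt(val):
    return 1 if val in LOOKUP else 0
```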
## Final timing run with additional two methods in...

| Method | Rate | Hash | Search | Grep-Map | Map-Grep | *Flatten-@* | *Flatten* | DNF | Optimal |
| ----------- | --------: | ------: | -----: | -------: | -------: | ----------: | --------: | -----: | ------: |
| Hash | 2,035/s | -- | -59% | -62% | -72% | *-74%* | *-76%* | -90% | -91% |
| Search | 4,916/s | 142% | -- | -8% | -32% | *-37%* | *-41%* | -76% | -79% |
| Grep-Map | 5,328/s | 162% | 8% | -- | -27% | *-32%* | *-36%* | -73% | -77% |
| Map-Grep | 7,252/s | 256% | 47% | 36% | -- | *-8%* | *-13%* | -64% | -69% |
| *Flatten-@* | *7,855/s* | *286%* | *60%* | *47%* | *8%* | *--* | *-6%* | *-61%* | *-66%* |
| *Flatten* | *8,361/s* | *311%* | *70%* | *57%* | *15%* | *6%* | *--* | *-58%* | *-64%* |
| DNF | 20,080/s | 887% | 308% | 277% | 177% | *156%* | *140%* | -- | -13% |
| Optimal | 23,095/s | 1,035% | 370% | 333% | 218% | *194%* | *176%* | 15% | -- |

# Challenge 2 - Ordered Letters

**Given a word, you can sort its letters alphabetically (case insensitive). For example, "beekeeper" becomes "beeeeekpr" and "dictionary" becomes "acdiinorty".**

**Write a script to find the longest English words that don't change when their letters are sorted.**

## Dictionaries available

In this example we will consider the dictionaries available in Ubuntu; as I am in the UK, I will use the British English dictionaries:

 * `/usr/share/dict/british-english-small`
 * `/usr/share/dict/british-english-large`
 * `/usr/share/dict/british-english-huge`
 * `/usr/share/dict/british-english-insane`

## Assumptions about what really is a word

These dictionaries contain a number of "words" which we wouldn't necessarily consider to be words. To simplify this we will only choose words which are all lowercase, and filter the rest out...

A summary of the four dictionaries above gives the following counts:

| Name: | # words | # trimmed |
| ------ | ------: | --------: |
| small | 50,790 | 39,781 |
| large | 166,828 | 113,695 |
| huge | 344,861 | 245,593 |
| insane | 654,299 | 427,891 |

## Solution to challenge 2

We will collect all of the words that meet this requirement in an array `@max`. The first value will be the length of the words in the list, and the rest are the words themselves.

There are 4 parts to the loop...

 * The filter as above:

   `!/[^a-z]/`

 * A filter that skips words shorter than the max length:

   `$max[0] <= length $_`

 * The check that the word is unchanged when its letters are sorted:

   `lc $_ eq join q(), sort split //, lc $_`

 * The code to either replace the array `@max` with the newer, longer word *or* push the new word onto the end of the list:

   `$max[0] == length $_ ? ( push @max, $_ ) : ( @max = (length $_, $_) )`

The full function is....

```perl
sub longest {
  open my $fh, q(<), $_[0];
  my @max = (0);
  (chomp)              ## Remove newline character
    #&& !/\W/          ## Remove words with non-alpha chars
    && !/[^a-z]/       ## Keep only all-lowercase words
    && ( $max[0] <= length $_ )
                       ## Remove words that are too short
    && ( $_ eq join q(), sort split //, $_ )
                       ## Check the word is unchanged when the
                       ## letters are sorted
    && ( $max[0] == length $_
       ? ( push @max, $_ )
       : ( @max = (length $_, $_) )
       )
                       ## If the word is the same length as the maximal
                       ## word, push it onto @max - so we store all the
                       ## words of maximum length. If the word is longer
                       ## than the max length (1st entry in @max), reset
                       ## @max to the new max length and the word.
    while <$fh>;
  return "$_[0] > @max";
                       ## Return the name of the file used, the size of
                       ## the words and a complete list of the words of
                       ## that length.
}
```

If you like the code more compact - here it is without the comments...

```perl
sub longest_no_comments {
  open my $fh, q(<), $_[0];
  my @m = (0);
  (chomp)&&!/[^a-z]/&&($m[0]<=length$_)
    &&($_ eq join q(),sort split//,$_)
    &&($m[0]==length$_?(push@m,$_):(@m=(length$_,$_)))
    while <$fh>;
  return "$_[0] > @m";
}
```

## The results...

Here are the results for each of the four databases:

```
british-english-small  - max length 6 - 21 words

  abhors accent accept access accost almost
  begins bellow billow cellos chills chilly
  chimps chintz choosy choppy effort floors
  floppy glossy knotty

british-english-large  - max length 7 - 1 word

  billowy

british-english-huge   - max length 7 - 4 words

  beefily billowy chikors dikkops

british-english-insane - max length 8 - 1 word

  aegilops
```

## Some definitions...

All the 6-letter words, plus billowy and beefily, are quite common words; but there are three that people may not have heard of - these are all species names.

 * **chikors** - an alternative spelling of chukars - a species of partridge native to central Asia (*Alectoris chukar*).
 * **dikkops** - a bird of the family Burhinidae; the stone curlew or thick-knee (from Afrikaans *dik-kop*, "thick head").
 * **aegilops** - a genus of Eurasian and North American plants in the grass family, Poaceae, known generally as goat grasses. Some species are known as invasive weeds in parts of North America.
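The filter-and-collect loop in `longest` translates almost line for line into other languages. As a hedged sketch only (the word list and function name are mine, standing in for a real dictionary file), here it is in Python:

```python
def longest_ordered(words):
    # Keep the longest all-lowercase words whose letters are already in
    # alphabetical order; returns (length, [words]).
    best_len, best = 0, []
    for raw in words:
        word = raw.strip()                       # remove newline character
        if not word.isalpha() or not word.islower():
            continue                             # the all-lowercase filter
        if len(word) < best_len:
            continue                             # too short to matter
        if list(word) != sorted(word):
            continue                             # letters not in order
        if len(word) == best_len:
            best.append(word)                    # another word of max length
        else:
            best_len, best = len(word), [word]   # new maximum found
    return best_len, best

print(longest_ordered(["beekeeper", "abhors", "glossy", "billowy", "Ada"]))
# → (7, ['billowy'])
```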
