challenge-099/duncan-c-white/README


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142

Task 1: "Pattern Match

You are given a string $S and a pattern $P.

Write a script to check if given pattern validate the entire string. Print
1 if pass otherwise 0.

The patterns can also have the following characters:
? - Match any single character.
* - Match any sequence of characters.

Example 1:
	Input: $S = "abcde" $P = "a*e"
	Output: 1

Example 2:
	Input: $S = "abcde" $P = "a*d"
	Output: 0

Example 3:
	Input: $S = "abcde" $P = "?b*d"
	Output: 0

Example 4:
	Input: $S = "abcde" $P = "a*c?e"
	Output: 1
"

My notes: oh, cool, simpler than regexes, but nice.  Two basic ways of
doing this: 1). translate it into a regex and use Perl's regex matching,
2). figure out a from-scratch mechanism to solve this problem.

Initially, I did (1) in a few minutes, worked fine.  See ch-1.pl as usual.

After doing task 2, I came back to task 1, and starting to think how to do
(2) - doing it from scratch, and I decided it should be able to extract the
text matching each wildcard ('?' and '*') as well.  See ch-1a.pl (and API.pm
and PosExpr.pm modules) for an incomplete (and much much bigger) attempt:

I defined a set of simple Abstract Pattern Instructions, in API.pm.
For example, a single API might be written as "M'hello'->l", which means:
"Fail unless you can match the literal string "hello" at the End of $s,
assuming you can: store the position at which the trailing "hello" started
in $s into a variable called l".  Another API might be "T l>2", meaning
"only succeed if the value of l is > 2".  Captures are handled by APIs like
"C0 1 l-1", meaning "capture into matchedtext[0] the part of $s from char
pos 1 to char pos l-1".

Then I wrote an API-interpreter for them.  Of course, this has to handle
the general match API (see Interpreter.pm, API.pm and ch-1a.pl for details)
which can match in multiple places - requiring us to try each possible place
recursively.

The API-interpreter successfully does pattern matching, as shown by
"./ch-1a.pl -t" passing all tests.  So this is a successful proof of concept.

Up to the submission deadline, this was all I managed to achieve.  See the
bottom of this README for an update..

Task 2: "Unique Subsequence

You are given two strings $S and $T.

Write a script to find out count of different unique subsequences
in $S matching $T without changing the position of characters.  UPDATE:
2021-02-08 09:00AM (UK TIME) suggested by Jonas Berlin, missing entry [5].

Example 1:

Input: $S = "littleit', $T = 'lit'
Output: 5

    1: [lit] tleit
    2: [li] t [t] leit
    3: [li] ttlei [t]
    4: litt [l] e [it]
    5: [l] ittle [it]

Example 2:

Input: $S = "london', $T = 'lon'
Output: 3

    1: [lon] don
    2: [lo] ndo [n]
    3: [l] ond [on]
"

My notes: nice question.  Of course one could do this by translating $T
into a Task-1-style pattern, so lit becomes *l*i*t* and figuring out
how many different ways that pattern can apply.  But as $T has no meta-chars,
an easy approach may be possible (locate all positions in $S where head($T)
appears, for each of them recursively match subtr($S,pos+1) against tail($T))


Return to Task 1 after the deadline

After the deadline had passed, I spent quite a long time working out
how to translate our patterns such as 'a*c?e' into lists of API calls.
See Translate.pm and ch-1b.pl for the details of what I did.  This half
of the problem was much harder than the Interpreter.

I then invested some more time optimising the Interpreter+Translator:

opt1-noFL:
	Here I realised that "First" (match @ 0) and "Last" (match @ end of
	the string) were special cases, and removed them, thus removing over
	100 lines of Perl without any harm.

opt2-separateRunX:
	I started thinking about merging the Translator and Interpreter,
	here I split out the Interpreter operations for At, Match, Test and
	Capture into separate functions thinking this might make it easier
	to move bits of the Interpreter into the Translator.  It might have
	helped with At, Test and Capture, but Match has to try matching at
	several different places, backtracking, and the structure of this
	would be entirely different in a merged T+I.  So this was a dead end.

opt3-rewrite:
	Here I bit the bullet, and started working out from scratch (reusing
	snippets of code from both Translator and Interpreter) how to do an
	integrated Pattern Match.  I got very bogged down in captures and
	backtracking, before eventually realising that working out which
	substrings to capture should be completely postponed until after the
	match was complete.  All that it needed was to know how many fixed width
	islands (sequences of string literals and '?'s) there were, and
	for each island, the starting position in the string we were matching
	against where that island had matched.  All the captures could be
	worked out from there.  It took two solid evenings work, but eventually
	it worked fine.  See PatternMatch.pm, Tuple.pm (reused ADT) and so on.
	This approach completely abandons the API instructions, the PosExpr
	module and the concept of an interpreter for them.

	It's considerably shorter, and most likely faster (although I haven't
	benchmarked them).

Is it more elegant?  I'm not sure.  There was a great elegance about the
concept of an Abstract Pattern Instruction, with a front end compiling a
pattern into a list of APIs, and a back end interpreting them to perform
the pattern match.  On the other hand, there was a certain elegance about
the concept of fixed width islands, either matched at a fixed location,
or matched floating about, and the startpos array.