#! /opt/local/bin/perl
#
# bag-o-sharks.pl
#
# TASK #2 › Frequency Sort
# Submitted by: Mohammad S Anwar
# You are given a file named input.
#
# Write a script to find the frequency of all the words.
#
# It should print the result so that the first column of each line is
# the frequency, followed by all the words of that frequency arranged
# in lexicographical order. The lines should also be sorted in
# ascending order of frequency.
#
# INPUT file
#
# West Side Story
#
# The award-winning adaptation of the classic romantic tragedy "Romeo
# and Juliet". The feuding families become two warring New York City
# gangs, the white Jets led by Riff and the Latino Sharks, led by
# Bernardo. Their hatred escalates to a point where neither can coexist
# with any form of understanding. But when Riff's best friend (and
# former Jet) Tony and Bernardo's younger sister Maria meet at a dance,
# no one can do anything to stop their love. Maria and Tony begin
# meeting in secret, planning to run away. Then the Sharks and Jets plan
# a rumble under the highway--whoever wins gains control of the streets.
# Maria sends Tony to stop it, hoping it can end the violence. It goes
# terribly wrong, and before the lovers know what's happened, tragedy
# strikes and doesn't stop until the climactic and heartbreaking ending.
#
# NOTE
# For the sake of this task, please ignore the following in the input file:
# . " ( ) , 's --
#
# OUTPUT
#
# 1 But City It Jet Juliet Latino New Romeo Side Story Their Then West
# York adaptation any anything at award-winning away become before begin
# best classic climactic coexist control dance do doesn't end ending
# escalates families feuding form former friend gains gangs goes
# happened hatred heartbreaking highway hoping in know love lovers meet
# meeting neither no one plan planning point romantic rumble run secret
# sends sister streets strikes terribly their two under understanding
# until violence warring what when where white whoever wins with wrong
# younger
#
# 2 Bernardo Jets Riff Sharks The by it led tragedy
#
# 3 Maria Tony a can of stop
#
# 4 to
#
# 9 and the
#
# method:
# A bit of NLP for you all: a naive bag of words, output by
# frequency. We'll start by pretreating the data: scrub the
# specified punctuation and the possessive 's into spaces, then
# lowercase all the text. We make sure to keep single hyphens, so
# "award-winning" stays one word. We won't be doing any name
# recognition, so we won't worry about losing capitalization for
# those entities here; we concern ourselves rather with making sure
# "their" and "Their" get counted as the same word. This is of
# course a judgement call and not specified behavior, but it seems
# fitting for this basic word analysis. Consequently the output
# differs slightly from the example: for instance, 'their' moves to
# the second category, and the output is actually in lexicographic
# order as requested, rather than the example's ASCII sort with
# capital letters first.
#
# Next-level improvements on this method might begin to identify
# Named Entities by selectively removing the capitalization of
# letters only at the beginning of sentences, that is to say after a
# period or certain other punctuation, or at the beginning of a
# paragraph or quote. Then unusually capitalized words could be
# identified in the corpus on the basis of their grammatical
# uniqueness.
#
#
# 2020 colin crain
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
use warnings;
use strict;
use feature ":5.26";
## ## ## ## ## MAIN:
local $/ = undef;
my $input = <DATA>;
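## preproc: replace the ignorable tokens ( . " ( ) , 's -- ) with spaces,
## then lowercase everything so "Their" and "their" count as one word
$input =~ s/[."(),]|'s|--/ /g;
$input = lc($input);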
my %bag;
my %freq;
## proc: count each word, then group the words by their count
my @words = split ' ', $input;    # awk-style split skips any leading whitespace
$bag{$_}++ for @words;
while (my ($key, $value) = each %bag) {
    push $freq{$value}->@*, $key;
}
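## output phase: one block per frequency in ascending numeric order,
## words sorted within each block and aligned under the first word
for (sort { $a <=> $b } keys %freq) {
    say +(sprintf "%-4s", $_) . join "\n    ", sort $freq{$_}->@*;
    say '';
}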
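
## sketch, not part of the original submission: one way the "next-level
## improvements" idea from the header notes might be approached. The sub
## name lc_sentence_starts and the exact punctuation sets are assumptions
## made for illustration. It lowercases a capital letter only at the start
## of the text, after sentence-ending punctuation, or after a blank line,
## optionally skipping an opening quote or parenthesis. Capitals that
## survive are then candidates for Named Entities. It is a naive heuristic:
## a name that opens a sentence or quote is lowercased too.
sub lc_sentence_starts {
    my ($text) = @_;
    ## \K keeps the matched lead-in; \l lowercases just the captured letter
    $text =~ s/(?:\A\s*|[.!?]["')]*\s+|\n\s*\n\s*)["(]*\K(\p{Lu})/\l$1/g;
    return $text;
}
## possible usage, in place of the blanket lc() in the preproc step:
##     $input = lc_sentence_starts($input);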
__DATA__
West Side Story
The award-winning adaptation of the classic romantic tragedy "Romeo
and Juliet". The feuding families become two warring New York City
gangs, the white Jets led by Riff and the Latino Sharks, led by
Bernardo. Their hatred escalates to a point where neither can coexist
with any form of understanding. But when Riff's best friend (and
former Jet) Tony and Bernardo's younger sister Maria meet at a dance,
no one can do anything to stop their love. Maria and Tony begin
meeting in secret, planning to run away. Then the Sharks and Jets plan
a rumble under the highway--whoever wins gains control of the streets.
Maria sends Tony to stop it, hoping it can end the violence. It goes
terribly wrong, and before the lovers know what's happened, tragedy
strikes and doesn't stop until the climactic and heartbreaking ending.