#! /opt/local/bin/perl
#
#       text_or_binary.pl
#
#       28 Task #1:
#           Write a script to check the file content without actually reading
#           the content. It should accept file name with path as command
#           line argument and print “The file content is binary.” or else
#           “The file content is ascii.” accordingly.
#
#       method: This sounds like a koan: read a file without reading it. So I
#           interpreted "reading" here as meaning "open it up and examine it to
#           see whether the characters print as text or as pseudo-random garbage."
#           However, even this simplification proves problematic on close
#           examination.
#
#           I considered opening a filehandle, using 'read' to grab a few
#           thousand bytes, and then using a regex to search for control
#           characters in the bytestream that should never be present in a
#           text file, or checking for too many bytes with the high bit set, or
#           even checking the bytes as hex digits for randomness. These
#           approaches would work pretty well for ASCII but less so for UTF-8.
#           For UTF-8 we would need to unpack the octets and see whether
#           high-bit bytes are proper leading bytes for multi-byte Unicode
#           characters. (A rough sketch of this hand-rolled approach is
#           appended at the end of the file, for comparison.)
#
#           It's also debatable whether using 'read' at all counts as "reading
#           the content", but I have to count that as semantic squabbling and
#           move on.
#
#           The thing is, this statistical method is exactly what the -T and -B
#           operators do: read in a few thousand bytes and make a good educated
#           guess based on nonprinting control characters and characters with
#           the high bit set. -T also checks whether the sample is nothing but
#           valid UTF-8, high bits and all, so we do that filetest first. So why
#           reinvent the wheel? Also, if explicitly using 'read' is not allowed,
#           then implicitly doing the same thing beneath the surface leaves us
#           in the same boat. So here we are.
#
#           It is worth noting that in the end the analysis remains statistical
#           and cannot be perfectly accurate. I've altered the output to reflect
#           this. There will always be edge-case files that defy accurate
#           categorization. PDF files, for example, contain a lot of embedded
#           ASCII plain text surrounded by binary data and, depending on how you
#           look at it, can appear to be either, even though they should
#           properly be considered a binary file containing blocks of textual
#           data. But even this can be taken to a pathological extreme: does a
#           multi-megabyte text file prefaced by a minimal header of a few bytes
#           of binary data suddenly become a binary file?
#
#           I think the answer is context-specific, so it can only be
#           “it depends”.
#
#
#       colin crain

## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##

use warnings;
use strict;
use feature ":5.26";

## ## ## ## ## MAIN

my $file = shift @ARGV;

if (not defined $file) { die "enter a valid file path to check status\n" }
if (not -f $file)      { die "argument \"$file\" does not appear to be a valid file path\n" }

if (-T $file) {
    say "The file content is most likely text.";
}
elsif (-B $file) {
    say "The file content is most likely binary.";
}
## it's not exactly clear whether a pathological case can exist that fails both tests,
## but I suspect it may.
else {
    say "problem grokking file $file, cannot decide what it is.";
}
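
## ## ## ## ## APPENDIX
##
## A rough sketch of the hand-rolled heuristic described in the header comment,
## kept here only for comparison; it is not the submitted solution above. The
## 4096-byte sample size, the use of Encode::decode to validate UTF-8, and the
## roughly-one-third "suspicious byte" threshold are all assumptions chosen for
## illustration, loosely mirroring what the -T and -B filetests do internally.

sub looks_like_text {
    my ($path) = @_;

    ## grab a sample of raw bytes from the start of the file
    open my $fh, '<:raw', $path or die "cannot open $path: $!\n";
    my $bytes_read = read $fh, my $buffer, 4096;
    close $fh;
    defined $bytes_read or die "read failed on $path: $!\n";

    ## an empty file might as well be text
    return 1 if $bytes_read == 0;

    ## a NUL byte is a strong hint of binary content
    return 0 if index($buffer, "\0") >= 0;

    ## if the sample decodes cleanly as UTF-8 we call it text, as -T does
    require Encode;
    my $is_utf8 = eval { Encode::decode('UTF-8', $buffer, Encode::FB_CROAK()); 1 };
    return 1 if $is_utf8;

    ## otherwise count control characters (other than common whitespace) and
    ## high-bit bytes; too large a share of them and we call it binary
    my $suspicious = () = $buffer =~ /[^\x09\x0A\x0D\x20-\x7E]/g;
    return ($suspicious / $bytes_read) < 0.33 ? 1 : 0;
}

## hypothetical usage:
##     say looks_like_text($file) ? "probably text" : "probably binary";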