Since 2000, November has been National Novel Writing Month (or NaNoWriMo), a chance for people around the world to hunker down and spend a month writing a 50,000-word novel. Since 2013, November has also been National Novel Generation Month (NaNoGenMo), a chance for people around the world to hunker down and spend a month writing a program that generates a 50,000-word novel.
The NaNoGenMo 2014 rules are simple: share a 50k+ word novel (and the source code that generated it). There are no rules on meaningfulness or originality, but there is a request that copyright be respected. Beyond that, anything is fair game.
This year, with a little extra time during my Thanksgiving vacation, I joined in for the last quarter of NaNoGenMo. With so little time left, I decided my entry would be a light remixing of existing work.
In some older novels, authors opted to censor the names of towns and people for any of a variety of reasons: to avoid accusations of gossip or libel; to give the impression the novel really was gossip and thus had to be censored; or possibly even to aid suspension of disbelief by allowing readers to substitute in places and people they were already familiar with. These days, that sort of censorship evokes a different context -- redaction of documents for public release.
For my NaNoGenMo experiment, I wanted to make a tool that took source texts and censored them, obscuring targeted portions of the input text. For my first run, I decided that I'd try to target proper nouns for redaction. With that goal in mind, I built the Redactron!
My first instinct was to use this project as an excuse to try out NLTK, which has lots of options for tagging words, including identifying parts of speech (like proper nouns!). However, NLTK relies on lots and lots of corpora, models, and other data to do its tagging. That wasn't quite the lightweight approach to NaNoGenMo I wanted to take -- it was time to try something simpler.
Keeping in mind that a perfect result wasn't necessary to create the same general impression as real proper-noun tagging, I went about trying to get as close as I could with just the standard library and very little code on top of it. I settled on a simple approach:
```python
import re

# Split the text into words:
words = set(re.split(r'\W+', input_text))
# Pick out the words that start with an upper-case letter:
upcase = [word for word in words if len(word) and word[0].isupper()]
# Discard words that appear elsewhere in the text in lower case:
only_up = [word for word in upcase if word.lower() not in words]
# Redact words that remain where they appear. (See repo for details.)
```
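The redaction step itself is the part elided with "see repo for details." A minimal sketch of it -- a whole-word regex substitution that keeps each word's first letter -- might look like the following; the function names here are my own illustration, not necessarily what's in the repo:

```python
import re

def redact_word(word):
    # Keep the first letter and replace the rest with 'x',
    # matching the 'Exxxxxxxx Bxxxxx' style of output.
    return word[0] + 'x' * (len(word) - 1)

def redact_text(input_text, only_up):
    # Nothing targeted: return the text unchanged.
    if not only_up:
        return input_text
    # Build a whole-word pattern from the targeted words.
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, only_up)) + r')\b')
    return pattern.sub(lambda m: redact_word(m.group(0)), input_text)

print(redact_text('Come, Darcy, I must have you dance.', ['Darcy']))
# Come, Dxxxx, I must have you dance.
```

The word-boundary anchors (`\b`) keep the substitution from mangling words that merely contain a target as a substring.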
That approach actually ended up being pretty successful. Here's a representative passage from Jane Austen's Pride and Prejudice (the text I tested with), followed by its redacted version.
Elizabeth Bennet had been obliged, by the scarcity of gentlemen, to sit down for two dances; and during part of that time, Mr. Darcy had been standing near enough for her to hear a conversation between him and Mr. Bingley, who came from the dance for a few minutes, to press his friend to join it. "Come, Darcy," said he, "I must have you dance. I hate to see you standing about by yourself in this stupid manner. You had much better dance."
Exxxxxxxx Bxxxxx had been obliged, by the scarcity of gentlemen, to sit down for two dances; and during part of that time, Mx. Dxxxx had been standing near enough for her to hear a conversation between him and Mx. Bxxxxxx, who came from the dance for a few minutes, to press his friend to join it. "Come, Dxxxx," said he, "I must have you dance. I hate to see you standing about by yourself in this stupid manner. You had much better dance."
Not bad! We've successfully obscured the names while leaving intact the sentence-initial 'Come' and 'You'.
Of course, there's always room for improvement! A little later in the book is this passage:
"I do not mind his not talking to Mxx. Long," said Miss Lxxxx, "but I wish he had danced with Exxxx."
While most of the identities are obscured there, the unfortunate 'Mxx. Long' has a name that also appears elsewhere in the text as a lowercase word. As it is, the very simple model used for targeting words doesn't do a good job with names that are also non-name words. A better targeting model might consider more than one word at a time, or consult a list of known titles.
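One possible shape for that fix -- not part of the current Redactron, just a sketch -- is to scan adjacent word pairs and treat any capitalized word following a known title as a name, even when its lowercase form appears elsewhere:

```python
import re

# An illustrative, deliberately incomplete list of titles.
TITLES = {'Mr', 'Mrs', 'Miss', 'Ms', 'Dr'}

def title_targets(input_text):
    # Tokenize in order, so adjacent pairs can be examined.
    tokens = re.findall(r'\w+', input_text)
    targets = set()
    for prev, cur in zip(tokens, tokens[1:]):
        # A capitalized word right after a title is treated as a name,
        # even if its lowercase form (e.g. 'long') appears in the text.
        if prev in TITLES and cur.istitle():
            targets.add(cur)
    return targets

print(title_targets('"I do not mind his not talking to Mrs. Long," said Miss Lucas.'))
```

This pairwise pass would run alongside the existing case-based targeting, adding its results to the set of words to redact.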
Even more generally, I'd love to take time to split the implementation details out a bit to make the utility more modular and thereby more customizable. Specifically, I think that providing easy ways to switch between targeting models and to switch between censoring methods could be fun! Redacting random words, every third word, or words beginning with some particular letter could potentially produce results that feel very different. Similarly, different replacement options could produce fun results: 'Elizabeth Bennet of Longbourn' currently becomes 'Exxxxxxxx Bxxxxx of Lxxxxxxxx' but could just as easily become 'ɥʇǝqɐzıןǝ ʇǝuuǝq of uɹnoqbuoן' or (using my recent favorite substitution, nearest dinosaur) 'Elmisaurus Belodon of Longisquama'. The possibilities are endless!
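As a sketch of that modular shape -- the `censor` helper and its signature are hypothetical, not the repo's API -- the targeting model and the replacement method could each be passed in as functions:

```python
import re

def censor(text, target, replace):
    # target: word -> bool, decides which words get censored.
    # replace: word -> str, produces the replacement text.
    return re.sub(r'\w+',
                  lambda m: replace(m.group(0)) if target(m.group(0)) else m.group(0),
                  text)

# Example: censor every word starting with 'd', keeping the first letter.
print(censor('I must have you dance.',
             lambda w: w.lower().startswith('d'),
             lambda w: w[0] + 'x' * (len(w) - 1)))
# I must have you dxxxx.
```

Swapping in a different pair of functions -- a stateful counter for every third word, or a table of dinosaur names for the replacements -- would then need no changes to the censoring machinery itself.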