The project for the take-home problem is to design and build a crypto-analysis system. For this project we will use messages that have been encrypted using one of two methods (called ciphers) - rotation and substitution. No information regarding the encoding method used to obtain the encrypted text (called a cryptogram) will be given. Your system will input the cryptogram, analyze it, and output the original, unencrypted message.
The Ciphers
The first type of cipher is known as a Caesar, or rotation, cipher. This is
done by replacing each letter of the alphabet by the letter that is a fixed
number of places to the left or right. For this project, only alpha
characters are encrypted; other characters remain the same, so there are only
26 possible ways to rotate the letters. These cryptograms are easily solved
by trying different rotations until you find one that makes sense.
The second type of cipher is known as a substitution cipher. This is done by replacing each letter of the alphabet with a fixed (unique) substitute letter. This technique is used in the Crypto-quote you may have seen in the daily newspaper. These cryptograms are more difficult, since there are 26! ways to generate substitutions. (A letter may not be substituted for itself.) Your program will have to be smarter to decrypt these.
For this problem, space characters, newlines, and punctuation marks will be left in the cryptogram, so your program can use that information to help encrypt the messages. Also, upper and lower case will be the same as they are in the original text.
The Input
Your Crypto-Analysis system will get its input from a file that contains one
cryptogram for your system to analyze and decipher. The files will be
formatted as shown in this example file called
sampleinput.txt. The name of the file you want to encrypt will be
specified on the command line as follows for C++:
decrypt sampleinput.txt
or in Java:
java Decrypt sampleinput.txt
For each of the test cases we provide, there will be a corresponding answer file. These answer files will be formatted like this.
Note that some of these cryptograms have been generated using a rotation cipher, and others have been generated with a substitution cipher. Your goal should be to write a program that can figure out which method was used and produce a solution.
The Output
Your program should send to standard out (stdout) only the "most likely"
decrypted message. It should be formatted exactly like the cryptogram. For
example, for the sample.txt file referenced above, the output would look like
this:
I do not want what I do not have.
Your program must not output anything other than the decrypted message. Case should be preserved in the output. In other words, if 'X' in the cryptogram is 'A' in the output, then 'x' in the cryptogram should be 'a' in the output. Refer to the sample program for a working sample of how your output should be done.
Test Data
We will provide you with a set of test files (the
Practice Set) and solutions to help you with your program.
On March 15, we will test your program with a different set of input files (the Test Set). A system that uses general techniques should work just as well on the Test Set as it does on the Practice Set. A system that has lots of hacks and tweaks based on the Practice Set, however, probably will perform very poorly on the Test Set. WARNING: You will be given the answers for the Practice Set, but your system is not allowed to use them when deciphering the cryptograms!
Submission
The take home problem should be submitted on the day of the contest when your
team checks in. The following must be submitted:
So that we can better understand your program, include comments and use good programming style. Your comments should explain your algorithms and design. Also, write a "README" file explaining what you did and why, and anything else that you tried that you didn't use in your final solution. Explain what you learned while working on the take-home problem. Finally, so that we can continue to improve the contest, please give us some feedback about what you thought about the problem: how difficult or easy it was, what the hardest and easiest parts were, and other topics you would like to see in a take-home problem.
We will recompile your program from source if necessary. Please make sure that your source code compiles with one of the following compilers: Microsoft Visual C++, GCC/G++ (free), or Java JDK 1.4 (free).
Scoring
The performance of each Crypto-Analysis system will be scored based on the
percentage of cryptograms correctly deciphered. An answer will be scored as
correct if it exactly matches the quote from which the cryptogram was created.
Each project will be graded according to the following criteria:
Each solution should be computed in a timely manner, so any system that takes longer than 30 seconds for any given input will be scored as having incorrectly solved that input (after all, if you can't find out where the end of the rainbow is before it disappears, what good is it to know?).
To compute the final score for the take-home problem, each Crypto-Analysis system will be ranked on the criteria listed above relative to the other systems in the competetion. This score will be scaled so that the team with the highest average receives 200 points.
Encouragement
Cracking rotation and substitution ciphers is plenty hard. There's really a
lot to do just to manage the search and implement an evaluation
function, especially for substitution ciphers.
We encourage everyone to have fun with this project and experiment with lots of ideas. While there are some fairly standard ways of approaching this problem, creativity is highly encouraged!
If you have any questions, please check our Frequently Asked Questions page. If you don't find answers there, email your questions to hspc@cs.utah.edu.
The Sample Skeleton Solution
To get you started we have provided you with a basic crypto-analysis system
implemented in both C++ and Java, as well as a dictionary file to use in your
analysis. This system only works on the first type of cipher, called
rotation. It only compares the resulting words from each rotation to words in
the dictionary.
You can use this code as a basis for your solution or you can start completely from scratch. Either way, your program should behave the same as the basic system. Before you start coding, be sure to carefully study the README file for whichever language you choose to use.
Possible Strategies
You may be wondering, "Okay, so now what do we do?". For starters, you might
find it useful to use your favorite Internet search engine to dig deeper into
rotation and substitution cipher issues and techniques.
One important thing to keep in mind is that some of the words in the input message may not appear in your dictionary. In that case, your program will need to figure out which possible solution is most likely to be the correct one. Many clues can be gleaned from the message in this case.
For example, there are only two single-letter words, which reduces the rotation cipher problem to just two possibilities. It's only a small step toward solving a substitution cipher, however, but at least its a start.
Other clues your program may use relate to common words and common letter combinations ("digraphs" and "trigraphs" are 2-letter and 3-letter combinations). You can find information on these with a web search. (Google is your friend!)
One strategy that might be most productive, particularly for a starter, would be to start out with just rotation ciphers. For each of the 26 possible rotations, you could generate a "confidence" value based upon how many words are found in your dictionary, plus other clues that you are able to generate. The rotation that gives the highest confidence value would be your selected output.
Hint: You may be tempted to generate a huge dictionary, but remember that your program must solve a cryptogram in less than 30 seconds in order to count.