Cryptanalysis of ciphertext substitution using optimization heuristics

Tahar MEKHAZNIA (1), M. Bachir MENAI (2), Abdelmadjid ZIDANI (3)
(1) Computing department, TEBESSA University, ALGERIA, mekhaznia@yahoo.fr
(2) Computing science department, CCIS, King Saud University, RIADH, KSA, menai@ksu.edu.sa
(3) Computing department, BATNA University, ALGERIA, zidani@free.fr

Abstract
This paper presents a first step towards automating, with heuristic algorithms and without manual intervention, several classical techniques for the cryptanalysis of ciphertexts produced by substitution and transposition methods. The tests presented are limited because of the large number of parameters involved, including the statistical tables of the natural languages to which the texts belong. The study focuses mainly on the choice of initial parameters, a question that remains open so far.

Key words: Heuristic algorithm, substitution cipher, cryptanalysis

1. Introduction
Cryptanalysis is the art of transforming a ciphertext into its readable equivalent without prior knowledge of the decryption key. It is one of the important challenges of current research in data security. The techniques used to attack ciphertexts are varied. The most common, and the most costly, is brute force, in which a multitude of keys must be tried until a sufficiently readable plaintext is obtained. The technique is certain to succeed; nevertheless, it consumes considerable resources and is of little interest in practice. Research therefore focuses on heuristic techniques. These offer no guarantee of success, but in practice they are the techniques most commonly used for solving a wide range of combinatorial problems. In cryptanalysis, they progressively eliminate from the search space the candidate keys judged unpromising, based on the characteristics of the natural language used, until a plaintext is obtained. In this paper, an overview of these techniques is presented in Section 2, followed by a summary of work in the field in Section 3.
Sections 4, 5 and 6 describe the mechanism of cryptanalysis by substitution and its adaptation to some heuristic algorithms. Section 7 presents the experiments carried out with these algorithms and the associated results.

2. Encryption techniques
Classical techniques produce ciphertexts by substituting and transposing characters within a text. Ultimately, each ASCII character is replaced by another character of the same set using a key. The key is represented by a table with two rows (vectors of at most 255 characters), the first containing the characters in natural order and the second containing the same characters in a permuted order. Modern techniques use iterative algorithms to produce complex encryption keys, and apply substitution and transposition at the bit level after the text has been converted to binary code.

Heuristic techniques do not guarantee to find the best solution (and if it is found, proving its optimality would be difficult), but they find a good solution in reasonable time [1]. Ant colony algorithms are a class of metaheuristics intended to solve difficult optimization problems. They are inspired by the collective behavior of ants, in particular trail following and pheromone deposition. As it moves, every ant in the colony communicates indirectly with its neighbors through dynamic changes in their environment and thus builds a solution that improves over time. Genetic algorithms are a class of stochastic optimization methods based on the mechanisms of natural evolution: crossover, mutation, selection, etc. They belong to the evolutionary methods. As members of the family of metaheuristic algorithms, their purpose is to obtain a suitable solution in reasonable time.

3.
Previous works

The use of heuristics for solving optimization problems in cryptanalysis has become an important part of the research of recent years, starting with Peleg & Rosenfeld [2], who modeled the cryptanalysis problem as a probabilistic one. Carroll & Martin [3] developed an expert-system approach to decryption using relaxation methods. Safavi-Naini & Forsyth [4], Spillman et al. [5] and Clark [6] used simulated annealing and genetic algorithms to solve various decryption instances. Bahler & King [7] reimplemented the work of Peleg & Rosenfeld using various statistics on the occurrence of characters in natural language. Uddin & Youssef [8] implemented various heuristics for the cryptanalysis of substitution ciphers. Dimovski & Gligoroski [9] used the same techniques for the cryptanalysis of transposition ciphertexts.

The works listed above, and many others, show that little research has been devoted to cryptanalysis by substitution, and that it has relied mostly on basic heuristic methods such as simulated annealing and tabu search. More complex algorithms, including genetic algorithms (GA) and ant colony optimization (ACO), have rarely been treated in depth, and their results were not competitive. This is understandable, since they involve a large number of sensitive parameters that can only be tuned through experiments.

4. Cryptanalysis by substitution

4.1 Definition
Let A(n) = (a0, a1, ..., an-1) be an alphabet and B(n) = (b0, b1, ..., bn-1) another alphabet obtained from A by a bijective function K: A→B, which replaces each character ai of A by another character bi of the same set. The function K is called the encryption key; it performs a permutation of the entire alphabet A to obtain the cipher alphabet B. The inverse function K-1 performs the reverse operation. Cryptanalysis (or decoding) consists in obtaining a plaintext from a ciphertext without, in general, knowing the key K. The following example illustrates the sets A and B and the function K.
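As a hedged illustration of the definition above, the following Python sketch builds K and its inverse as lookup tables and applies them to a text. The key and the sample text are those of the example in this section; the names `make_tables` and `substitute` are illustrative choices and not part of the original work.

```python
import string

ALPHABET = string.ascii_uppercase            # A: characters in natural order
KEY = "POIUYTREZAMSKJHGFDLQNBVCXW"           # B: same characters, permuted


def make_tables(alphabet: str, key: str):
    """Return (K, K_inverse) as dictionaries; reject keys that are not bijective."""
    if sorted(alphabet) != sorted(key):
        raise ValueError("key must be a permutation of the alphabet")
    return dict(zip(alphabet, key)), dict(zip(key, alphabet))


def substitute(text: str, table: dict) -> str:
    """Apply a substitution table; characters outside the alphabet are kept as-is."""
    return "".join(table.get(ch, ch) for ch in text)


K, K_INV = make_tables(ALPHABET, KEY)
ciphertext = substitute("CRYPTANALYSEPARSUBSTITUION", K)
plaintext = substitute(ciphertext, K_INV)
```

Representing K as a dictionary keeps encryption and decryption symmetric: the attack studied in the rest of the paper amounts to searching for `K_INV` when only the ciphertext is known.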
K:
  A = ABCDEFGHIJKLMNOPQRSTUVWXYZ
  B = POIUYTREZAMSKJHGFDLQNBVCXW
Plaintext:  CRYPTANALYSEPARSUBSTITUION
Ciphertext: IDXGQPJPSXLYGPDLNOLQZQNZHJ

Naturally, the alphabet can be extended, depending on the use, to a predefined set of characters of the ASCII table (for example, by adding digits when commercial messages are processed).

4.2 Appearance of characters
The frequency of appearance of a given character of the alphabet in a text differs from one language to another. It also differs between texts of the same language, for example between literary, political and commercial texts. In an English text, the letters (unigrams) ranked by decreasing frequency are: ETAON RISHD LFCMU GYPWB VKXJQ Z [10]. In other words, E is the letter that appears most often in a text. Pairs of letters (bigrams) are ranked by decreasing frequency as follows: TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO, and doubled letters within the same word as: LL EE SS OO TT FF RR NN PP CC.

This ranking is not fixed in all cases. Various other frequency studies are available in the literature [11] [12]. The ICE (International Corpus of English) is the most famous. It includes statistics on several variants of the English language from a dozen English-speaking countries [13]. However, exceptions always arise in specific texts: the rule is distorted by a text on X-ray technology, or by a story about the life of zebras in Qatar, in which the normally rare characters appear most often. In the general case, the average frequencies of appearance of characters are compiled in tables called character frequency tables. They are used as references when deciphering a text, in order to determine the identity of a character from its frequency of appearance in the text.

4.3 Index of coincidence
Let t be a text of length n.
The index of coincidence of t is given by the relation:

$$I_c(t) = \frac{1}{n(n-1)} \sum_{i=a}^{z} p_i (p_i - 1)$$

where p_i is the number of occurrences of the letter i in the text t, written with the 26 letters of the alphabet. For other types of text, such as commercial messages, additional characters such as digits must be taken into account. In theory, if all letters were equally likely, the index of coincidence would be equal to 1/26 ≈ 0.038. In reality, some characters appear more often than others, as mentioned above. The global index of coincidence of a language is given by the relation:

$$I_c = \sum_{i=a}^{z} p_i^{2}$$

where p_i here denotes the relative frequency of the letter i in the language. The value of this index varies from one language to another. It is, for example, equal to 0.065 for English and 0.074 for French.

4.4 Cost function
During decryption, the frequency tables are used to measure the difference between the reference frequency of a character and the frequency of the character that was substituted for it. The smaller this difference, the closer the substitution is to the exact one that yields a clear text. If the difference is zero, the character is the right choice. This treatment can of course be extended to bigrams and trigrams. The cost function of a text is given by the relation:

$$\mathrm{cost}(K) = \alpha \sum_{i} \left| R^{U}_{i} - D^{U}_{i} \right| + \beta \sum_{i,j} \left| R^{B}_{ij} - D^{B}_{ij} \right| + \gamma \sum_{i,j,k} \left| R^{T}_{ijk} - D^{T}_{ijk} \right|$$

where K denotes the key with which the text has been deciphered, R and D denote the proportions found in the reference tables and in the text obtained after decryption, and U, B and T refer to the tables used: unigrams, bigrams and trigrams. The coefficients α, β and γ, between 0 and 1, weight the three terms and can improve the function. Their values will be justified in the experiments.

5. The ACO technique

5.1 Definition
As it moves, an ant deposits a uniform and continuous quantity of pheromone along its path. Its choice of direction is guided by the pheromone trail left by its predecessors. The pheromone also evaporates at a constant rate on contact with air, so that trails carrying little pheromone gradually disappear.
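The statistics of §4.3 and §4.4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: only the letters A to Z are counted, the cost uses absolute differences between relative frequencies, and the `reference` argument (a dictionary mapping the n-gram size 1, 2 or 3 to a frequency table) is a hypothetical data layout, not the tables used in the experiments.

```python
from collections import Counter


def letters_only(text: str) -> str:
    """Keep only the 26 uppercase letters (assumption: no space, no digits)."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def index_of_coincidence(text: str) -> float:
    """I_c(t) = sum p_i (p_i - 1) / (n (n - 1)), with p_i the count of letter i."""
    letters = letters_only(text)
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(p * (p - 1) for p in counts.values()) / (n * (n - 1))


def ngram_frequencies(text: str, size: int) -> dict:
    """Relative frequencies of the n-grams of the given size in the text."""
    letters = letters_only(text)
    grams = [letters[i:i + size] for i in range(len(letters) - size + 1)]
    if not grams:
        return {}
    return {g: c / len(grams) for g, c in Counter(grams).items()}


def cost(decrypted: str, reference: dict,
         alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Weighted sum of |R - D| over unigrams, bigrams and trigrams."""
    total = 0.0
    for size, weight in ((1, alpha), (2, beta), (3, gamma)):
        ref = reference.get(size, {})
        obs = ngram_frequencies(decrypted, size)
        total += weight * sum(abs(ref.get(g, 0.0) - obs.get(g, 0.0))
                              for g in set(ref) | set(obs))
    return total
```

For example, building `reference` from a known text with `ngram_frequencies` and evaluating `cost` on the same text gives zero, while a text with a different letter distribution gives a strictly positive value; a heuristic search can therefore rank candidate keys by the cost of the text they produce.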
After a number of movements, the ants tend to use the paths richest in pheromone, which resist evaporation and provide the shortest distance between the nest and the food.

5.2 Adapting to the problem

a. Initial data
- The field of exploration is a strongly connected graph of 26 nodes (the letters of the alphabet). It can also be extended to other characters (the space, for example). Each move of an ant from one node to another corresponds to the substitution of one character by another. The journey ends when all nodes have been visited; in practice, all the characters of the ciphertext have then been substituted by other characters in order to obtain a clearer version of the text.
- To obtain a homogeneous movement, the ants are initially distributed randomly over the nodes of the graph: this is the basic key K0.
- The distance d(i,j) between two nodes, a parameter that has no direct meaning in cryptanalysis, can be obtained from the cost function by the relation:

$$d(i,j) = \mathrm{cost}(i,j)^{c}$$

  where c is a tuning parameter between -1 and 1, justified in the experiments.

b. Iterations
The ants move in discrete steps, leaving the initial node to build a Hamiltonian path. A control function is needed to avoid moves on arcs already visited. Starting from a node i, the choice of the next node j depends on the distance d(i,j) between them and on the amount of pheromone τ(i,j) on the arc (i,j). It is defined by the equation:

$$p(i \to j) = \frac{\tau(i,j)^{a}\, d(i,j)^{b}}{\sum_{k} \tau(i,k)^{a}\, d(i,k)^{b}}$$

where the sum runs over the nodes k not yet visited, and a and b are tuning variables between 0 and 1 that will be justified in the experiments.

c. Updating the pheromone
At the end of each movement, the pheromone on the arc in question is updated by the relation:

$$\tau(i,j) = \tau(i,j) + \Delta\tau(i,j)$$

where Δτ(i,j) is a positive quantity that depends on the version of the algorithm used. For example, in the case of a virtual ant, it is equal to Q/L, where L is the length of the Hamiltonian path visited by the ant and Q is the cost of the text generated. The amount of pheromone is inversely proportional to the cost of the text: it becomes larger as the text approaches the plaintext.

d. Evaporation
Evaporation is performed in a discrete way at regular intervals, in other words after a set number of ant movements, according to the equation:

$$\tau(i,j) = (1-\rho)\,\tau(i,j)$$

where ρ is a constant between 0 and 1 that must be chosen carefully. If it is close to 1, the arc in question tends to be abandoned because it is quickly drained of pheromone. If it is close to 0, the arc becomes saturated and is therefore visited permanently, which leads to a rapid convergence of the solution and therefore to the persistence of bad solutions.

5.3 Algorithm AntSystem
The proposed algorithm contains, implicitly, the stages that reflect the movement of the ants as well as the update of the pheromone. It is defined as follows:

  Build an initial solution (generally random)
  Repeat
    Improve the solution by choice of new roads
    Update the pheromone
  Until (better solution or max of iterations)

In a more explicit form, the algorithm dedicated to cryptanalysis takes the following shape:

  Calculate the cost of the initial text (that is, S_opt)
  Determine the distances between the various arcs
  Fix the period of evaporation Evap
  Place m ants on the nodes of the graph randomly
  For nb_iter = 1 to max_iter do
    For nb_ant = 1 to m do
      Build a Hamiltonian path S(nb_ant)
      Calculate the cost C of the solution S(nb_ant)
      Update the pheromone on the arcs of S(nb_ant)
      If (S(nb_ant) is better than S_opt) S_opt = S(nb_ant)
    Endfor
    If (nb_iter % Evap) = 0 evap_pheromone
  Endfor

6. Genetic algorithm

6.1 Definition
A genetic algorithm uses the concept of natural evolution. Starting from an initial population of individuals, the operations of selection, crossover and mutation are applied to the individuals to produce a new generation, part of which is, according to certain criteria, included in the initial population.
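The AntSystem procedure of §5.2 and §5.3 can be sketched in Python as follows. This is a simplified variant written under explicit assumptions, not the implementation used in the experiments: each ant assigns a distinct plaintext letter to every ciphertext letter (an assignment view of the path over the 26 nodes), the arc cost cost(i,j) is taken to be the gap between the observed frequency of cipher letter i and the reference frequency of plain letter j, and the deposit 1/(1 + cost) is one possible choice of a quantity that decreases with the cost. The names `ant_system` and `unigram_cost` are illustrative.

```python
import random
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def unigram_cost(text, reference):
    """Sum of |reference frequency - observed frequency| over the alphabet."""
    counts, n = Counter(text), max(len(text), 1)
    return sum(abs(reference.get(ch, 0.0) - counts.get(ch, 0) / n)
               for ch in ALPHABET)


def ant_system(ciphertext, reference, cost_fn, m=20, max_iter=50, evap=5,
               rho=0.1, a=1.0, b=1.0, c=-1.0, seed=0):
    rng = random.Random(seed)
    n = max(len(ciphertext), 1)
    observed = Counter(ciphertext)
    eps = 1e-3  # keeps the arc cost strictly positive (assumption)
    # d(i, j) = cost(i, j) ** c; a negative c favours close frequencies.
    d = {i: {j: (abs(observed.get(i, 0) / n - reference.get(j, 0.0)) + eps) ** c
             for j in ALPHABET} for i in ALPHABET}
    tau = {i: {j: 1.0 for j in ALPHABET} for i in ALPHABET}
    best_key, best_cost = None, float("inf")
    for nb_iter in range(1, max_iter + 1):
        for _ in range(m):
            key, free = {}, list(ALPHABET)
            order = list(ALPHABET)
            rng.shuffle(order)                      # random starting node
            for i in order:
                # p(i -> j) proportional to tau^a * d^b over unused letters
                weights = [tau[i][j] ** a * d[i][j] ** b for j in free]
                j = rng.choices(free, weights=weights)[0]
                key[i] = j
                free.remove(j)                      # control: no revisits
            plain = "".join(key.get(ch, ch) for ch in ciphertext)
            cst = cost_fn(plain)
            for i, j in key.items():                # pheromone update
                tau[i][j] += 1.0 / (1.0 + cst)
            if cst < best_cost:
                best_key, best_cost = key, cst
        if nb_iter % evap == 0:                     # periodic evaporation
            for i in ALPHABET:
                for j in ALPHABET:
                    tau[i][j] *= (1.0 - rho)
    return best_key, best_cost
```

A call such as `ant_system(ciphertext, reference, lambda t: unigram_cost(t, reference))` returns the best key found and its cost. With a unigram-only cost the search is weak on short texts; a cost that also weights bigrams and trigrams, as in §4.4, can be passed through `cost_fn` without changing the loop.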
Thus, after a certain number of iterations, the initial population is transformed into a new one whose characteristics are considered satisfactory with regard to the defined objective. Genetic algorithms have been used successfully to break complex ciphers such as Enigma [15].

6.2 Adaptation to the problem

a. Basic data
- The basic population is a table containing a finite number of keys.
- A key K (chromosome) is a character string of 26 or 27 letters (alphabet and space). A character within the key is a gene.
- Each key Ki is evaluated with the cost function defined in §4.4; its value is Cost(Ki).

b. Initialisation
A function generating random keys is launched. The population table is updated, making sure that duplicates are removed. Each key is evaluated on a ciphertext of the test database.

c. Iterations
The set of characters is the same for all keys; only the position of the characters within the key distinguishes one key from another. At each iteration, the following operations are executed in order:
• Selection: the population table is first sorted by the cost of each key. A selection of Np keys is then made to reproduce the next generation. Whether it is done by rank, roulette, tournament, or simply natural selection, the best choice will be justified in the experiments.
• Crossover: it consists in swapping gene segments between the parents. This process encourages the exploration of the search space and mixes the genetic material widely; however, it may cause the solution to diverge or generate duplicates if the selection operator is poorly chosen. The number of crossover points (loci) and the probability of crossover Pc are variable and can introduce more diversity among individuals. Swapping characters between keys can duplicate some characters in a key while other characters go missing from it. A function that checks the order of the key is necessary in this case. Two tests were made:
- A bilateral cross, where the exchange of segments takes place between two parents.
  To limit the exploration space, a single-point cross was chosen.
- A unilateral cross, which swaps segments within a single parent. In this case, the population able to breed is halved.
• Mutation: the mutation operator alters a gene with a low probability (of the order of 10^-2). In our case, a gene is a character, so the alteration must be an exchange with another of the 26 (or 27, as appropriate) characters; it therefore amounts to a crossing of two distant parts within the same chromosome.
• Replacement: whether stationary, elitist or otherwise, various alternatives are tested using a corpus of the language used. The replacement of the population is conditioned by the absence of duplicates. A check function is triggered after the completion of each operator.

6.3 Algorithm GeneticSystem
The proposed GS algorithm performs the various operations that generate populations by natural genetic evolution. It includes the following tasks:

  Create a population of random initial keys
  Evaluate each key (depending on the cost function used)
  Repeat
    Select Np keys
    Cross keys
    Mutate characters within each key
    Evaluate the new keys
    Replace the population
  Until (acceptable solution or MaxGeneration)

The best results, obtained with a text of 150 characters and a colony of 25 to 90 ants, are illustrated in the following table:

7. Experiments

7.1 Test parameters
The algorithms tested involve a substantial number of parameters that would be difficult to study simultaneously. It is also difficult to fix some of these parameters in the absence of an effective mathematical model to justify the choice. Exhaustive testing would consume considerable resources; preliminary tests were therefore made to fix, approximately, the parameters needed to run the algorithms. Similarly, the results of some tests were reused as a baseline for further testing. The final results look more interesting, especially in terms of resource consumption.
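The GeneticSystem procedure of §6.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the code used in the experiments: selection is by rank, the crossover is the single-point bilateral cross followed by a repair step that plays the role of the check function, mutation swaps two genes, and replacement is elitist without duplicates. The names `repair`, `crossover`, `mutate` and `genetic_system`, and the default parameter values, are illustrative.

```python
import random
import string

ALPHABET = string.ascii_uppercase


def repair(child: str) -> str:
    """Replace duplicated genes by the missing letters (key stays a permutation)."""
    missing = [ch for ch in ALPHABET if ch not in child]
    seen, fixed = set(), []
    for gene in child:
        if gene in seen:
            fixed.append(missing.pop())
        else:
            seen.add(gene)
            fixed.append(gene)
    return "".join(fixed)


def crossover(p1: str, p2: str, rng: random.Random) -> str:
    """Single-point bilateral cross between two parents, then repair."""
    point = rng.randrange(1, len(p1))
    return repair(p1[:point] + p2[point:])


def mutate(key: str, pm: float, rng: random.Random) -> str:
    """With probability pm per gene, swap it with another gene of the key."""
    genes = list(key)
    for i in range(len(genes)):
        if rng.random() < pm:
            j = rng.randrange(len(genes))
            genes[i], genes[j] = genes[j], genes[i]
    return "".join(genes)


def genetic_system(cost_fn, ind=40, n_parents=20, pm=0.01, max_gen=100, seed=0):
    rng = random.Random(seed)
    population = set()                       # a set removes duplicate keys
    while len(population) < ind:
        population.add("".join(rng.sample(ALPHABET, len(ALPHABET))))
    population = sorted(population, key=cost_fn)
    for _ in range(max_gen):
        parents = population[:n_parents]     # selection by rank
        children = set()
        for _ in range(ind):
            p1, p2 = rng.sample(parents, 2)
            children.add(mutate(crossover(p1, p2, rng), pm, rng))
        # elitist replacement: keep the ind best distinct keys
        population = sorted(set(population) | children, key=cost_fn)[:ind]
    return population[0], cost_fn(population[0])
```

Here `cost_fn` takes a candidate key (a 26-letter string) and returns its cost; in the setting of the paper it would decrypt the ciphertext with the key and apply the cost function of §4.4. Because replacement is elitist, the best cost never increases from one generation to the next.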
The following table shows some values of the parameters:

| Parameter | Range | Step | Signification |
|---|---|---|---|
| α, β, γ | 0-1 | 0.1 | Parameters of the cost function |
| a, b | 0-1 | 0.05 | Probability of choice of the next direction |
| c | -1 to 1 | 0.1 | Parameter of the length of arcs |
| τ, ρ | 0-1 | 0.1 | Quantity of pheromone deposited (τ), evaporation (ρ) |
| cevap | 10-100 | 5 | Evaporation cycle |
| Nb_ant | 5-100 | 5 | Number of ants |
| gen | 50-600 | 10 | Number of generations |
| ind | 20-200 | 10 | Number of individuals |
| Np | 1-(ind/2) | 2 | Number of parents |
| Pc | 1-3 | 1 | Number of crossing points |
| Pm | 0.005 | | Probability of mutation |
| Maxcars | 50-150 | 10 | Size of the ciphertext |

The experiments were carried out on diverse texts encrypted with 3 kinds of keys: simple, such as Caesar or Atbash keys; medium, such as Vigenère keys; and more difficult, such as Delastelle keys.

7.2 Variant algorithms

a. Real ants
A real ant deposits pheromone during its movement in a homogeneous and continuous manner. Its path ends at the last node of the graph. However, it may hit a dead end (a dry arc, for example) and cause a chain of blockages nearby, which progressively attracts other ants because of the excess of pheromone on the arcs in the same portion of the graph. In cryptanalysis, this algorithm is used in a reduced way. It is stopped prematurely once part of the plaintext has been revealed, because continuing its execution causes an undesirable re-encryption of the text. The results serve as a platform for further tests with the other algorithms.

| Key | Ants | Parameters | Match Car. Max | Match Car. Avg |
|---|---|---|---|---|
| Simple | 30 | α=0.7, β=0.4, γ=0.5, τ=0.5, Cevap=77 | 26 | 21.4 |
| Middle | 22 | α=0.2, β=0.4, γ=1.0, τ=0.2, Cevap=52 | 18 | 16.3 |
| Difficult | 77 | α=0.9, β=0.6, γ=1.0, τ=0.1, Cevap=52 | 12 | 10.8 |

b. Virtual ants
To avoid the stagnation of pheromone on portions of the search space, which is the real drawback of the real ant, virtual ants deposit the pheromone on the return path, and only when the path has been completed successfully. Similarly, the amount deposited is proportional to the length of this path. Of course, this requires a memory of the path travelled and an additional step for the return.
In practice, this algorithm makes it possible to know the cost of the decrypted text before the decryption operation, so the corresponding key can be ignored if it is not interesting. Concerning the deposit of pheromone, two alternatives were put to the experiment:
- The amount deposited is proportional to the length of the path. It is defined by the relation:

$$\tau(i,j) = \tau(i,j) + \frac{\mathrm{cost}(i,j)}{\sum_{(i,j) \in K} \mathrm{cost}(i,j)}$$

- Only the best path among those identified by all the ants receives pheromone. In this case a fixed amount is deposited.

With more than 100 characters and fewer than 600 iterations, the average of the results is as follows:

| Key | Ants | Parameters | Match Car. Max | Match Car. Avg |
|---|---|---|---|---|
| Simple | 30 | α=0.7, β=0.5, γ=0.5, τ=0.5, Cevap=65 | 18 | 13.51 |
| Middle | 22 | α=0.1, β=0.7, γ=1.0, τ=0.2, Cevap=20 | 15 | 11.3 |
| Difficult | 77 | α=0.8, β=0.5, γ=1.0, τ=0.4, Cevap=41 | 11 | 9.41 |

c. Elitist ants
Proposed in [16], the idea of this algorithm is to grant an additional amount of pheromone to the arcs involved in an interesting path. In other words, some ants, called elitist ants, are allowed to retrace these arcs so that they remain rich in pheromone and invite other ants to pass through. In practice, the key that gives the clearest text is kept for the next iterations and only a few of its characters are changed, in the hope of a better result; otherwise the previous key is restored. Under the same conditions, the average obtained with unigrams and bigrams is illustrated by the following figure:

[Figure: number of correct characters (Car-correct)]

The amount of pheromone granted is defined by the relation:

$$\tau(i,j) = \tau(i,j) + \frac{\mathrm{Bonus}}{\sum_{(i,j) \in K} \mathrm{cost}(i,j)}$$

The experiments show that this amount is significantly higher than the quantity of pheromone deposited by the other ants, while remaining close to it.
The best results, obtained with a number of iterations close to 850, a population of 60 ants and a text of 180 characters, are illustrated in the following table:

| Key | Parameters | Match Char |
|---|---|---|
| Difficult | α=0.8, β=0.5, γ=-1.0, τ=0.2, Evap=1, BI=40, BS=200, Cvp=55 | 13 |

d. Genetic algorithm
In this algorithm, only the change of the basic parameters can give effective results. However, various alternatives were tested, including the manner of selecting individuals (elitist, by tournament, etc.) and of replacing them within the base population (stationary, elitist, etc.). The average results obtained are shown in the following table:

| Key | Parameters | Match Car. Max | Match Car. Avg |
|---|---|---|---|
| Simple | Gen=80, Np=40, Pc=1 | 23 | 18.31 |
| Middle | Gen=360, Np=60, Pc=1 | 17 | 11.3 |
| Difficult | Gen=120, Np=61, Pc=1 | 10 | 9.04 |

7.3 Synthesis
As mentioned above, each algorithm can give its best results under specific conditions, including the choice of initial parameters. A summary of the various algorithms with the same parameters, a text of 120 characters and a number of iterations close to 550, gave the following results:

[Figure: correct characters (Char Corrects) versus message size for real ants (F-reelles), virtual ants (F-virtuelles), elitist ants (elitistes) and the genetic algorithm (A-Génétique)]

Each test was started with a random choice of keys. Each iteration of the various algorithms gives birth to one or more new keys, obtained by changing the order of a few characters of the keys available in the previous iteration. Thus, the number of keys handled grows as the processing proceeds, which consumes more resources. The number of keys generated by the different algorithms is shown in the following figure:

[Figure: generated keys versus message size for real ants (F-reelles), virtual ants (F-virtuelles), elitist ants (elitistes) and the genetic algorithm (A-Génétique)]

The processing was carried out on a dual processor 2.0. The execution time for the various algorithms was as follows:

[Figure: time (ms) versus message size for real ants (F-reelles), virtual ants (F-virtuelles), elitist ants (elitistes) and the genetic algorithm (A-Génétique)]

8. Conclusion
In this paper, we presented the results of a comparison of some algorithms belonging to the class of heuristics. The field of exploration is a set of texts encrypted by substitution and transposition techniques of medium difficulty, and made more difficult by various modern cryptosystems.

The first achievement is the control of most parameters of the algorithms used. This was obtained by transferring the results of some algorithms as input data for others, which improved these results noticeably.

The second achievement is obtaining results equivalent to those present in the literature under minimal conditions, including short texts and a fairly reasonable processing time.

The synthesis of the tests showed that the ACO algorithms can yield better results than those generated by genetic algorithms; the latter, however, are more efficient in terms of resource consumption and remain competitive for deciphering texts of significant volume.

The major problem in this kind of research is the existence of various statistical tables derived from the corpora of several languages, whose use makes the results vary noticeably. To obtain satisfactory results in this area, a statistical study and a classification of these tables according to the specific texts to be deciphered is essential.

9. References
[1] A. Malapert, G. Jeantet, « Métaheuristique d’un ordonnancement Juste à temps », Université Pierre et Marie Curie, 2005.
[2] S. Peleg and A. Rosenfeld, "Breaking substitution ciphers using a relaxation algorithm", Communications of the ACM, vol. 22(11), 1979.
[3] J. Carroll and S. Martin, "The automated cryptanalysis of substitution ciphers", Cryptologia, vol. 10(4), 1986.
[4] W. S. Forsyth and R. Safavi-Naini, "[...] cryptanalysis of substitution ciphers", Cryptologia, vol. 17(4), 1993.
[5] R. Spillman, M. Janssen, B. Nelson and M. Kepner, "Use of a genetic algorithm in the cryptanalysis of simple substitution ciphers", Cryptologia, vol. 17(1), 1993.
[6] A. J. Clark, "Optimisation Heuristics for Cryptology", PhD thesis, Queensland University of Technology, 1998.
[7] D. Bahler and J. King, "An implementation of probabilistic relaxation in the cryptanalysis of simple substitution systems", Cryptologia, vol. 16(3), 1992.
[8] M. Faisal Uddin, Amr M. Youssef, "Life Technique for the Cryptanalysis of Simple Substitution Ciphers", IEEE CCECE/CCGEI, Ottawa, May 2006.
[9] A. Dimovski, D. Gligoroski, "Alphabetic substitution cipher using a parallel genetic algorithm domain cooperation through SCOPES PROJECT", Ohrid, Macedonia, 2003.
[10] H. S. Zim, Codes and secret writing (abridged edition), Scholastic Book Services, fourth printing, 1962.
[11] H. Beker, F. Piper, Cipher Systems: The Protection of Communications, 1982.
[12] R. Lewand, Cryptological Mathematics, The Mathematical Association of America, 2000.
[13] G. Nelson, S. Wallis and B. Aarts, Exploring Natural Language. Working with the British Component of the International Corpus of English, 2002.
[14] Christophe Ritzenthaler, "The cryptology Course", Université de Marseille, 2006.
[15] A. J. Bagnall, Les applications des algorithmes génétiques en cryptanalyse, 1996.
[16] M. Dorigo, V. Maniezzo, A. Colorni, "Ant System: Optimization by a colony of cooperating agents", IEEE Transactions on Systems, Man, and Cybernetics-Part B, 26(1):29-41.