From: gnunet
Subject: [GNUnet-SVN] [taler-schemafuzz] branch master updated: last version of internship report
Date: Wed, 05 Sep 2018 12:46:16 +0200

This is an automated email from the git hooks/post-receive script.

erwan-ulrich pushed a commit to branch master
in repository schemafuzz.

The following commit(s) were added to refs/heads/master by this push:
     new 076defc  last version of internship report
076defc is described below

commit 076defc8dca31937efabe8319fb257325c543aca
Author: Feideus <address@hidden>
AuthorDate: Wed Sep 5 12:46:12 2018 +0200

    last version of internship report
---
 docs/Documentation.pdf   | Bin 1001428 -> 1088803 bytes
 docs/Documentation.tex   | 253 ++++++++++++++++++++++++-----------------------
 docs/EndToEndDiagram.pdf | Bin 38035 -> 38160 bytes
 docs/EndToEndDiagram.tex |   4 +-
 4 files changed, 130 insertions(+), 127 deletions(-)

diff --git a/docs/Documentation.pdf b/docs/Documentation.pdf
index 52124b9..95b945b 100644
Binary files a/docs/Documentation.pdf and b/docs/Documentation.pdf differ
diff --git a/docs/Documentation.tex b/docs/Documentation.tex
index 1b06161..2f28466 100644
--- a/docs/Documentation.tex
+++ b/docs/Documentation.tex
@@ -1,6 +1,5 @@
 \documentclass{article}
 \usepackage[utf8]{inputenc}
-%\usepackage[document]{ragged2e}
 \usepackage{hyperref}
 \usepackage{tikz}
 \usepackage{pifont}
@@ -14,7 +13,7 @@
 \usepackage{subcaption}
 \usetikzlibrary{shapes.arrows,chains}
 \usepackage[english]{babel}
-
+\newcommand\tab[1][1cm]{\hspace*{#1}}
 % Definition of \maketitle
 \makeatletter         
 address@hidden
@@ -29,8 +28,7 @@
 \end{minipage}
 \end{figure}
 
-\bigskip
-
+\vspace*{\fill}
 \begin{center}
 {\Huge \bfseries \sffamily address@hidden }\\[4ex] 
 {\Large  address@hidden 
@@ -47,90 +45,103 @@
 \maketitle
 
 \vspace*{\fill}
-\underline{Supervisor}: Christian Grothoff}    \\
-\underline{Promotion}: 2017-2018
+\underline{Supervisors}: Christian Grothoff, Pierre Boudes     \\*
+\tab[0.4cm] \underline{Promotion}: 2017-2018
 \clearpage
 
 \tableofcontents
+\clearpage
 
+       \abstract
+Fuzz testing consists of submitting unexpected data to a piece of
software. It is used in software development in order to reveal errors or
approximations in the source code of a project. Several approaches exist to
pursue this goal, such as focusing on a specific type of data or borrowing
concepts from the field of machine learning. This project is an
implementation of the fuzz testing strategy focusing on the management of a
database's content.  \\*
+The purpose of this documentation is to provide an introduction to the concept
of fuzz testing and to precisely describe an example implementation. \\*
+It is hoped that it will give more insight into the philosophy and the drive
that motivated this work, as well as a description of the design of the
SchemaFuzz tool for its users and contributors.
 
+\bigskip
+
+The concept of fuzz testing consists of submitting inappropriate data to a
target program. This concept is used during development in order to reveal
implementation errors as well as security flaws in the source code of a
software project. Several different approaches can achieve this goal, such as
focusing on a particular type of data or using external concepts such as
machine learning [...]
+The purpose of this documentation is to provide an introduction to the concept
of fuzz testing and to precisely describe an example implementation. \\*
+We hope that it will provide further details on the philosophy and the
motivations behind this project, as well as on the design of the SchemaFuzz
tool for its users and contributors.
+
+       
        \section{Introduction} 
        
-SchemaFuzz is a free software command line tool incorporated inside the Gnu 
Taler package 
-which is a free software electronic payment system providing anonymity for 
customers.
-The main goal of this project is to provide an efficient debugging tool that 
uses a "fuzzing" strategy oriented on databases.  
-Where a traditional fuzzer would send malformed input to a program, SchemaFuzz 
modifies the content of a database to test that program's behavior when 
stumbling on such unexpected data. \\*
-Obviously, this tool is meant to be used as a mean of debugging as the goal is 
to pop bugs or put into light the security breaches that the code may contain 
regarding the retrieving, usage and saving of a database's content.
-As this tool is being developed as a master's thesis project, its current 
state is far from being finished and there are many options and optimizations 
that deserve to be implemented that are not yet available.
-
+This project is meant to provide a secure development tool that uses a
database oriented "fuzzing" strategy.
+Where a traditional fuzzer would send malformed input to a program, SchemaFuzz
modifies the content of a database to test that program's behavior when
stumbling on unexpected data. \\*
+This tool's objective is to bring up the bugs and security breaches that the
code may contain regarding the retrieval and usage of a database's content by
stressing it with corrupted data.
+This tool is still in an alpha state; its development is still in progress
and will be the subject of further publications, which will lead to updates
in this document.
        \clearpage
 
        
        \section{Context and Perimeter} 
                \subsection{Context}
-SchemaFuzz's development enrolls in the global dynamic of the past decades 
regarding Internet  that sustain great efforts to make it a more fluid, 
pleasant but more importantly a safer space.
 
-It uses the principle of "fuzz testing" or "fuzzing" to help find out which 
are the weak code paths of one's project. 
-                               \begin{quotation}
-Traditional fuzzing is defined as "an automated software testing technique 
that involves providing invalid, unexpected, or random data as inputs to a 
computer program". \end{quotation}\cite{fuzzing}         
+SchemaFuzz uses the principle of "fuzz testing" or "fuzzing" to help find the
weak code paths of a project. Fuzzing is usually defined as:
 
-This quote is   well illustrated by the following example :
+\begin{quotation} "an automated software testing technique that involves 
providing invalid, unexpected, or random data as inputs to a computer program". 
\end{quotation}\cite{fuzzing}         
+       
+This quote is well illustrated by the following example:
                                \begin{quotation}
-Lets consider an integer in a program, which stores the result of a user's 
choice between 3 questions. When the user picks one, the choice will be 0, 1 or 
2. Which makes three practical cases. But what if we transmit 3, or 255 ? We 
can, because integers are stored a static size variable. If the default switch 
case hasn't been implemented securely, the program may crash and lead to 
"classical" security issues: (un)exploitable buffer overflows, DoS, ... 
+Let's consider an integer in a program, which stores the result of a user's
choice among 3 questions. When the user picks one, the choice will be 0, 1 or
2, which makes three practical cases. But what if we transmit 3, or 255? We
can, because integers are stored in a static-size variable. If the default case
hasn't been implemented securely, the program may crash and lead to any kind of
security issue such as: (un)exploitable buffer overflows, DoS, ...
                                \end{quotation}
 
+Fuzz testing has been used since the '90s to test software robustness.
+The first generation of fuzzing tools was composed of primitive input
injectors, which were described with the "infinite monkey theorem" analogy.
 It is divided into several categories that each focus on a specific type of
input.
  
-UI fuzzing focuses on button sequences and more generically any kind of user 
input during the execution of a program. The above example falls into this 
category.
-This principle had already successfully been used in existing fuzzing tool 
such as  "MonkeyFuzz".
-File format fuzzing generates multiple malformed samples, and opens them 
sequentially.
+UI fuzzing focuses on button sequences and more generically any kind of user
input during the execution of a program. This principle has already been
successfully used in existing fuzzing tools such as "MonkeyFuzz".
 
 Certificate fuzzing is another interesting fuzzing approach that has emerged
especially after it was introduced by \cite{Certif} in \textit{Development of
Intelligent Digital Certificate Fuzzer Tool}.
 
 However, SchemaFuzz is a database oriented fuzzer. This means that it focuses
on triggering unexpected behavior related to the usage of external database
content.
 
 This tool is meant to help developers, maintainers and more generically anyone
who makes use of data coming from a database under their influence. A good way
to sum up the effect of this tool is to compare it with a
"cyber attack simulator".
-This means that the idea behind it is to emulate the damage that an attacker 
may cause subtly or not to a database he illegally gained privileges on. This 
might in theory go from a simple boolean flip (subtle modifications) to 
removing/adding content to purely and simply destroying or erasing all the 
content of the database.
-SchemaFuzz focuses on the first part : modification of the content of the 
database by single small modification that may or may not overlap. These 
modifications may be   aggressive of   subtle.
+This means that the idea behind it is to emulate the damage that an attacker
may cause, subtly or not, to a database he possesses privileges on. This might
in theory range from a simple boolean flip (a subtle modification) to
removing/adding content, to purely and simply destroying or erasing all the
content of the database.
+SchemaFuzz focuses on the first part: modification of the content of the
database by sequential modifications that may or may not overlap. These
modifications may be aggressive or subtle.
 It is interesting to point out that this last point also qualifies SchemaFuzz
as a good "database structural flaw detector".
 That is to say that errors typically triggered by poor management of a
database (wrong data type usage, incoherence between database structure and use
of the content, etc.) might also appear clearly during the execution.
-For a more in depth description of different fuzzing approaches and , one can 
refer to $A$ $systematic$ $review$ $of$ $fuzzing$ $techniques$ \cite{Systematic}
+For a more in-depth description of different fuzzing approaches, one can refer
to \textit{A systematic review of fuzzing techniques} \cite{Systematic}.
+
                \subsection{Perimeter}
-This tool implement's some of the SchemaSpy tool's source code. More 
precisely, it uses the portion of the code that detect and stores the target 
database's structure.
-The main goal of this project is to build on top of this piece of existing 
code the functionalities required to test the usage of the database content by 
any kind of utility.                 
+This tool reuses some of the SchemaSpy tool's source code. More precisely,
it uses the portion of the code that detects and stores the target database's
structure as well as the data type of each of the fields that it contains. This
project's design revolves around the adaptation of this existing code.
Developing the functionalities required to test the usage of the database
content by any kind of utility is the main goal of SchemaFuzz.
 The resulting software will generate a group of human readable reports on each 
modification that was performed.                
                \begin{figure} [h!]
                \includegraphics[width=\textwidth]{codeOriginDiagram.pdf}
-               \caption{Shows the nature of the code for every distinct 
component. The slice size is a rough estimation.}
+               \caption{Shows the nature of the code for every component. The 
slice size is a rough estimation.}
                \end{figure}
                \subsection{When to use it}
-SchemaFuzz is a   useful tool for anyone trying secure a piece of software 
that uses database resources. The target software should be GDB(introduce GDB) 
compatible and the DBMS(introduce acronym) has to grant access to the target 
database through credentials passed as argument to this tool.
-
----It is   strongly advice to use a copy of the target database rather than on 
the production material. Doing so may result in the database being corrupted 
and not usable for any useful mean.
+SchemaFuzz is a useful tool for developers trying to secure a piece of
software that uses database resources. In its current state, SchemaFuzz is only
compatible with the C language, and more specifically with GDB-compatible
software. The DBMS (which is the SQL interpreter used by the target software)
has to grant access to the target database through credentials passed as
arguments to this tool.

+Since this tool does not yet have a rollback feature to restore the initial
state of the target database, the database being fuzzed will end up in a
randomly modified state that may not be suitable for production. Thus, it is
strongly advised to use a copy of the target database rather than the
production material.
                \clearpage
 
        \section{Design}
                \subsection{Generic explanation}
-SchemaFuzz implementation is based on some bits of the SchemaSpy project 
source code.
-The majority of this project is built on top of this already existing code and 
is organized as follows :
+The code for the core of the program is organized as follows:
                \begin{itemize}
-               \item{mutation/data-set used as a way to store the 
inputs,outputs and other interesting data from the modification that was 
performed on the target database}
-               \item{the mutation Tree, used to store the mutations coherently}
-               \item{an analyzer that scores the mutations to influence the 
paths that will be explored afterwards}
+               \item{Mutation/data-set used as a way to store the inputs, 
outputs and other interesting data from the modification that was performed on 
the target database}
+               \item{The mutation tree, used to store the mutations coherently}
+               \item{An analyzer that scores the mutations to influence the 
paths that will be explored afterwards}
                \end{itemize}
 This organization will be detailed and discussed in the following sections.
 
+\begin{figure} [h!]
+\includegraphics[scale=0.7]{StructuralDiagram.pdf}
+\caption{Structural organization of the different components}
+\end{figure}
+
                 \clearpage
 \begin{figure} [h!]
 \centering
 \includepdf[pages={2},scale=0.85]{EndToEndDiagram.pdf}
-\caption{Main Loop}
+\caption{Main Loop work flow chart}
 \end{figure}
                 \clearpage              
                 
+                
+                
                \subsection{SchemaSpy legacy/meta data extraction}
-SchemaSpy source code has provided the meta data extraction routine. The only 
job of this routine is to initialize the connection to the database and 
retrieve its meta data at the   beginning of the execution (before any actual 
SchemaFuzz code is run). These meta data include data types, table and table 
column names, views and foreign/primary key constraints. Having this pool of 
meta data under the shape of Java objects allows the main program to properly 
frame what the possibilities are [...]
+The only job of the meta data extraction routine is to initialize the
connection to the database and retrieve its meta data at the beginning of the
execution (before any actual SchemaFuzz code is run). These meta data include
data types, table and table column names, views and foreign/primary key
constraints. Having this pool of meta data in the shape of Java objects allows
the main program to properly mark off what the possibilities are in terms of
modifications as well as dealing with the  [...]
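+
+As a minimal illustration, a hypothetical sketch of such an extraction through
plain JDBC (not the actual SchemaSpy-derived code; the URL and credentials are
made up) could look like:
+\begin{verbatim}
+import java.sql.*;
+
+public class MetaDataSketch {
+    public static void main(String[] args) throws SQLException {
+        // Hypothetical connection data; SchemaFuzz takes these as arguments.
+        String url = "jdbc:postgresql://localhost/targetdb";
+        try (Connection conn =
+                DriverManager.getConnection(url, "user", "password")) {
+            DatabaseMetaData meta = conn.getMetaData();
+            // List every table of the public schema with its column types.
+            try (ResultSet tables = meta.getTables(null, "public", "%",
+                    new String[] {"TABLE"})) {
+                while (tables.next()) {
+                    String table = tables.getString("TABLE_NAME");
+                    try (ResultSet cols =
+                            meta.getColumns(null, "public", table, "%")) {
+                        while (cols.next()) {
+                            System.out.println(table + "."
+                                    + cols.getString("COLUMN_NAME") + " : "
+                                    + cols.getString("TYPE_NAME"));
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+\end{verbatim}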
 
 \begin{figure} [h!]
 \centering
@@ -140,13 +151,13 @@ SchemaSpy source code has provided the meta data 
extraction routine. The only jo
 
 In order to do that, the user shall provide this set of mandatory
database-related arguments:
                        \begin{itemize}
-                               \item The driver to the corresponding database 
RDBMS (only support PostGres at the moment)
+                               \item The driver to the corresponding database
DBMS (only PostgreSQL is supported at the moment)
                                \item The credentials to be used to access the 
database.
-                               \item The name of the database (duh)
+                               \item The name of the database
                        \end{itemize}
                \subsection{SchemaFuzz Core}            
-                       \subsubsection{Constrains}
-The target database often contains constraints on one or several tables. These 
constraints have to be taken into account in the process of fabricating 
mutations as most of the time they restrict the possible values that the 
pointed field can take. These restrictions can take the shape of a \underline 
{Not Null} constraint, \underline{Check} constraint, {Foreign key} constraint 
(value has to exist in some other table's field) or \underline{Primary key} 
constraint (no doublets of value all [...]
+                       \subsubsection{Constraints}
+The target database often contains constraints on one or several tables. These
constraints have to be taken into account in the process of fabricating
mutations as most of the time they restrict the possible values that the
pointed field can take. These restrictions can take the shape of a \underline
{Not Null} constraint, \underline{Check} constraint, \underline{Foreign key}
constraint (value has to exist in some other table's field) or
\underline{Primary key} constraint (no duplicates of [...]
 \bigskip
 
 \begin{figure} [h!]
@@ -157,89 +168,81 @@ The target database often contains constraints on one or 
several tables. These c
 
 \bigskip
 
-The last two ones are the problematic ones. They imply specific work before 
applying any mutations to make sure that the value respect all the 
restrictions. before doing anything else after the meta data extraction is 
done, SchemaFuzz performs an update of all the existing constraints on the 
database to add the CASCADE clause. This allows the values bonded by a foreign 
key constraints to take effect. This update reverts to take the constraints 
back to their initial state before the progr [...]
-                               \paragraph{Primary key constraints (PKC)} :
+The last two are the problematic ones. They imply specific work before
applying any mutations to make sure that the value respects all the
restrictions. Before doing anything else after the meta data extraction is
done, SchemaFuzz updates all the existing constraints on the database to add
the CASCADE clause. This allows modifications of values bound by a foreign
key constraint to take effect. The tool reverts these updates to take the
constraints back to their initial state befor [...]
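+
+As an illustration, re-creating a FKC with the CASCADE clause could be done
with statements of the following shape (a hypothetical method-level sketch;
table and constraint names are made up, java.sql imports assumed):
+\begin{verbatim}
+// Re-create a foreign key constraint with the CASCADE clause so that
+// fuzzing a referenced ("father") value automatically propagates to
+// the referencing ("child") rows.
+static void addCascade(Connection conn) throws SQLException {
+    try (Statement st = conn.createStatement()) {
+        st.executeUpdate(
+            "ALTER TABLE child_table DROP CONSTRAINT child_fk");
+        st.executeUpdate(
+            "ALTER TABLE child_table ADD CONSTRAINT child_fk "
+            + "FOREIGN KEY (father_id) REFERENCES father_table (id) "
+            + "ON UPDATE CASCADE");
+    }
+}
+\end{verbatim}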
+                               \paragraph{Primary key constraints (PKC):}
 The primary key constraints require an extra DB query that checks the 
existence of the value in the column. If the value already exists (the query's 
result is not empty), the mutation will be dropped before being executed.
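+
+A hedged sketch of this pre-check (hypothetical helper; java.sql imports and a
trusted table/column name are assumed):
+\begin{verbatim}
+// Returns true when the candidate value already exists in the primary
+// key column, in which case the mutation is dropped.
+static boolean violatesPrimaryKey(Connection conn, String table,
+                                  String pkColumn, Object candidate)
+        throws SQLException {
+    String sql = "SELECT 1 FROM " + table + " WHERE " + pkColumn + " = ?";
+    try (PreparedStatement ps = conn.prepareStatement(sql)) {
+        ps.setObject(1, candidate);
+        try (ResultSet rs = ps.executeQuery()) {
+            return rs.next(); // non-empty result: value already taken
+        }
+    }
+}
+\end{verbatim}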
-                               \paragraph{Foreign key constraints (FKC)} :
-The foreignKey constraint is the trickiest one. Its inherent nature bonds two 
values of different table column where the value being referenced is called the 
father, and the referencing field, the child. To be precise, in order to change 
one of the two values, the other has to be changed accordingly in the same 
statement.SchemaFuzz uses the power of the CASCADE clause to make the change 
possible. This clause allows the DBMS to automatically change the value of the 
child if the father has [...]
+                               \paragraph{Foreign key constraints (FKC):}
+The foreign key constraint is the trickiest one. Its inherent nature binds two
values of different table columns, where the value being referenced is called
the father, and the referencing field, the child. In order to change one of the
two values, the other has to be changed accordingly in the same transaction.
SchemaFuzz uses the semantics of the CASCADE clause to make the change
possible. This clause allows the DBMS to automatically change the value of the
child if the father has been  [...]
 This mechanism makes it possible to change any of the bound values by changing
the father's value.
-To do so, the software has a way to transfer the mutation from a child to its 
parent (called the mutationTransfer).
+To do so, the software has a way to transfer the mutation from a child to its 
parent.
 
                                
                        \subsubsection{Mutations}
-                               \paragraph{What is a Mutation}
-A mutation is a Java object that bundles all the informations that are used to 
perform a modification in the database. Every is linked to its parent and 
inherits some of his parent's data. In the case of a follow up mutation the 
child inherits the the database row that was his parent's target.Therefore the 
initial state (state before the injection of the modification) of its target is 
exactly the final state (state after injection of the modification) of his 
parent's target. A mutation i [...]
-It also holds the informations concerning the result of the injection in the 
shape of a data vector. This data vector is then used to perform a clustering 
calculus to determine the "uniqueness" of the mutation. This value is also 
stored inside the mutation object and is used as the weight of this mutation in 
the tree.
+A mutation is a Java object that bundles all the information that is used to
perform a modification in the database. Every mutation is linked to its parent
and inherits some of its parent's data. In the case of a follow-up mutation the
child inherits the database row that was its parent's target. Therefore the
initial state (state before the injection of the modification) of its target is
exactly the final state (state after injection of the modification) of its
parent's target. A muta [...]
+It also holds the information concerning the result of the injection in the
shape of a data vector. This data vector is then used to perform a clustering
calculation to determine the "uniqueness" of the mutation. This value is also
stored inside the mutation object and is used as the weight of this mutation in
the tree.
+
+\bigskip
 
 \begin{figure} [h!]
 \centering
-\includegraphics[width=\textwidth]{MutationClassDiagram-1.pdf}
+\includegraphics[scale=0.9]{MutationClassDiagram-1.pdf}
 \caption{Structure of a Mutation}
 \end{figure}
 
-\bigskip                               
+\clearpage
                                
 A branch is a succession of mutations that share the same database row as
their modification target.
-The heuristics determining the next mutation's modification are still   
primitive and will be thinly justed in futures versions.                        
                                        
+The heuristics determining the next mutation's modification are still
primitive and will be finely adjusted in future versions.
                                \paragraph{Creating malformed data} 
-                               %%ne veux rien dire.
-As the goal of running this tool is to submit unexpected or invalid data to 
the target software it is necessary to understand what t
-Fuzzing a complex type such a timestamps variables has nothing to do with 
fuzzing a trivial boolean. In practice, a significant part o
-and this matter could absolutely be the subject of a more abstract work. We 
focused here on a   simple approach (as a first step).
-After retrieving the current row being fuzzed (may it be a new row or a 
previously fuzzed row), the algorithm explores the different
-The algorithm then builds the possible modification for each of the fields for 
the current row.
-At the moment, the supported types are : % add a list of the supported types.
-More primitives types will be added in the future.
-The possible modifications that this tool can produce at the moment are : \\ % 
add complete list of the modifications that CAN be gener$ 
-                               Int Types:
+As the goal of running SchemaFuzz is to submit unexpected or invalid data to
the target software, it is necessary to understand that fuzzing a complex type
such as a timestamp variable has nothing to do with fuzzing a trivial boolean.
In practice, a significant part of this matter could absolutely be the subject
of a more abstract work. We focused here on a simple approach (as a first step).
+After retrieving the current row being fuzzed (be it a new row or a
previously fuzzed row), the main loop explores the different modification
possibilities.
+The main loop then builds the possible modifications for each of the fields of
the current row. The possible modifications that this tool can produce at the
moment are: \\ % add complete list of the modifications that CAN be gener$
+                               Boolean types:
+                               \begin{itemize}                                 
        
+                                       \item Swapping the existing value (F 
$\mapsto$ T OR T $\mapsto$ F)
+                               \end{itemize}
+                               Integer types:
                                \begin{itemize}
                
-                                       \item Extreme values ($0 \mapsto 
\texttt{MAXVALUE}$ etc...)
-                                       \item Random value 
($0<\texttt{value}<\texttt{MAXVALUE}$ etc...)
+                                       \item Extreme values ($0, 
\texttt{MAXVALUE}$ etc.)
+                                       \item Random value 
($0<\texttt{value}<\texttt{MAXVALUE}$ etc.)
                                        \item Increment/Decrement the existing 
value ($332 \mapsto 333$ OR $332 \mapsto 331$)
                                \end{itemize}
-                               String Types:
+                               String types:
                                \begin{itemize}
-                                       \item Change string to "aaa" ("Mount 
Everest" $\mapsto$ "aaa")
-                                       \item Increment/Decrement ASCII 
character at a random position in the string ("Mount Everest" $\mapsto$ "Mount 
Fverest")
-                               \end{itemize}
-                                       Boolean Types:
-                               \begin{itemize}                                 
        
-                                       \item Swapping the existing value (F 
$\mapsto$ T OR T $\mapsto$ F)
+                                       \item Change string to ``aaa" (``Mount 
Everest" $\mapsto$ ``aaa")
+                                       \item Increment/Decrement ASCII 
character at a random position in the string (``Mount Everest" $\mapsto$ 
``Mount Fverest")
                                \end{itemize}
                                        Date types: (implemented but not fully
functional)
                                \begin{itemize}
                                        \item Increment/Decrement date by 1 
day/minutes depending on the precision of the date
-                                       \item Set date to $00/00/0000$ 
+                                       \item Set date to $01/01/1970$ 
                                \end{itemize}
-These "abnormal" values might in fact be totally legit in some cases. in that 
case the analyzer 
-will rank the mutation rather poorly, which will lead to this tree path not 
being   likely to be developed further more.
+These "abnormal" values might in fact be totally legitimate in some cases. In
that case the analyzer will rank the mutation rather poorly, which will lead to
this tree path not being likely to be developed further.
                                \\*
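+
+As a hedged sketch (a hypothetical helper, not the actual SchemaFuzz
heuristics), deriving candidates from a column's SQL type could look like:
+\begin{verbatim}
+import java.util.*;
+
+// Method-level sketch: map a column's SQL type and current value to
+// the candidate modifications listed above.
+static List<String> candidateValues(String sqlType, String current) {
+    switch (sqlType) {
+        case "boolean":   // swap the existing value
+            return List.of(current.equals("t") ? "f" : "t");
+        case "integer": { // extremes, random, increment/decrement
+            long v = Long.parseLong(current);
+            return List.of("0",
+                    String.valueOf(Integer.MAX_VALUE),
+                    String.valueOf(new Random().nextInt(Integer.MAX_VALUE)),
+                    String.valueOf(v + 1), String.valueOf(v - 1));
+        }
+        case "text": {    // "aaa" and a one-character ASCII bump
+            char[] c = current.toCharArray();
+            if (c.length > 0) c[new Random().nextInt(c.length)]++;
+            return List.of("aaa", new String(c));
+        }
+        default:          // unsupported type: no modification yet
+            return List.of();
+    }
+}
+\end{verbatim}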
                                \paragraph{SQL handling}
-All the SQL statements are generated within the code. This means that the data 
concerning the current and future state of the mutations have to be   precise. 
Otherwise, the SQL statement is   likely to fail. Sadly, since SchemaFuzz only 
supports postgreSQL, the implemented syntax follow the one of postgres
-DBMS. This is already a   big axis for future improvements and will be 
detailed in the dedicated section.
-The statement is built to target the row as precisely as possible, meaning 
that it uses all of the non fuzzed values from the row to avoid updating other 
row accidentally. Only the types that can possibly be fuzzed will be used in 
the building of the SQL statement. Since this part of the code is   delicate in 
the sense that it highly depends on an arbitrary large pool of variables from 
various types it is a good bug provider. 
+All the SQL statements are generated within the code. This means that the data
concerning the current and future state of the mutations have to be precise.
Otherwise, the SQL statement is likely to fail. Sadly, since SchemaFuzz only
supports PostgreSQL, the implemented syntax follows that of the PostgreSQL
+DBMS. The statement is built to target the row as precisely as possible,
meaning that it uses all of the non-fuzzed values from the row to avoid
accidentally updating other rows. Only the types that can possibly be fuzzed
will be used in the building of the SQL statement. Since this part of the code
is delicate, in the sense that it highly depends on an arbitrarily large pool
of variables of various types, it has a tendency to crash.
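+
+A hedged sketch of this statement building (hypothetical helper; a real
implementation would quote values according to their type):
+\begin{verbatim}
+import java.util.Map;
+
+// Method-level sketch: build an UPDATE that pins the target row with
+// every non-fuzzed column value so that no other row is touched.
+static String buildUpdate(String table, String fuzzedColumn,
+                          String newValue, Map<String, String> row) {
+    StringBuilder sql = new StringBuilder("UPDATE " + table + " SET "
+            + fuzzedColumn + " = '" + newValue + "' WHERE ");
+    String sep = "";
+    for (Map.Entry<String, String> e : row.entrySet()) {
+        if (e.getKey().equals(fuzzedColumn)) continue;
+        sql.append(sep).append(e.getKey())
+           .append(" = '").append(e.getValue()).append("'");
+        sep = " AND ";
+    }
+    return sql.toString();
+}
+\end{verbatim}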
                                
-                               \paragraph{Injecting} :
-The injection process sends the built statement to the DBMS so that the 
modification can be operated. After the execution of the query, depending of 
the output of the injection (one modification, several modifications, transfer) 
informations are updated so that they can match the database state after the 
modification. If the modification failed, no trace of this mutation is kept, it 
is erased and running goes on like nothing happened.                            
     
-                               \paragraph{Special Case(MutationTransfer)} :
+                               \paragraph{Injecting:}
+The injection process sends the built statement to the DBMS so that the
modification can be operated. After the execution of the query, depending on
the output of the injection (one modification, several modifications,
transfer), the stored information is updated so that it matches the database
state after the modification. If the modification failed, no trace of this
mutation is kept; it is erased and the execution goes on as if nothing
happened.
+                               \paragraph{Special case (mutationTransfer):}
 The mutation transfer is a special case of a modification being applied to the 
database.
 It is triggered when the value that was supposed to be fuzzed is under the 
influence of a FKC as the child.
-In the case a FKC (In CASCADE mode), only the father can be changed, which 
also triggers the same modification on all of his children. The algorithm then 
"transfers" the modification from the original mutation to its father.
+In the case of a FKC (in CASCADE mode), only the father can be changed, which
also triggers the same modification on all of its children. The algorithm then
"transfers" the modification from the original mutation to its father.
 After injecting the transferred mutation, the child mutation is modified, but
the modification cascades onto parts of the database that were not meant to
be changed.
 Fortunately, this does not impact the life of the algorithm until this
mutation is reverted (see next paragraph).
-                               \paragraph{Do/Undo routine} :
+                               \paragraph{Do/Undo routine:}
 The Do/Undo mechanism is at the center of this software. Its behavior is
crucial for the execution and will have a strong impact on the coherence of the
data nested in the code or inside the target database throughout the runtime.
 This mechanism allows the algorithm to revert a previous mutation or, if
necessary, inject it one more time.
 Undoing a mutation applies the exact opposite modification to the one that was
originally applied to the database, ending up recovering the same database
state as before the mutation was injected.
 Reverting mutations is the key to flawlessly shifting the current position in 
the mutation tree.
 The case of the transferred mutation is no exception to this. In this case,
the mutation applied changes to an unknown number of fields in the database.
But the FKC still binds all the children to their father at this point (this is
always the case unless this software is not used as intended).
-Changing the father's field value back to its original state will splash the 
original values back on all the children.
+Changing the father's field value back to its original state will propagate
the original values back to all the children.
 This mechanism might trigger failing mutations in some cases (usually 
mutations following a transfer). This issue will be addressed in the known 
issues section. 
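+
+Under the same assumptions as the buildUpdate sketch above (and with a
hypothetical Mutation record holding the table, column and both states),
undoing could reduce to:
+\begin{verbatim}
+// Method-level sketch: undoing a mutation injects the exact opposite
+// modification, writing the pre-mutation value back.
+static void undo(Mutation m, Connection conn) throws SQLException {
+    String sql = buildUpdate(m.table(), m.column(),
+            m.initialValue(),        // value before the mutation
+            m.rowAfterInjection());  // row state the database holds now
+    try (Statement st = conn.createStatement()) {
+        st.executeUpdate(sql);
+    }
+}
+\end{verbatim}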
 
                                \subsubsection{Choosing pattern}
-For each iteration of the main loop, a modification has to be picked up as the 
next step in the fuzzing process. This is done by considering the current state 
of the tree.
-Three parallel code paths can be triggered from this point.
+For each iteration of the main loop, a modification has to be picked up as the 
next step in the fuzzing process. This is done by considering the current state 
of the tree. Three parallel code paths can be triggered from this point.
                                \begin{itemize}
                                \item{Continue on the current branch of the 
tree (triggered if the last mutation scored better than its parent)}
                                \item{Pick an existing branch in the tree and 
grow it (triggered if the last mutation scored worse than its parent on a 50/50 
chance with the next bullet)}
@@ -249,20 +252,19 @@ Three parallel code paths can be triggered from this 
point.
                                \bigskip
                        \begin{figure}[ht!]
                        
\includegraphics[width=\textwidth]{pickingPaternDiagram.pdf}
-                       \caption{Diagram that shows, for each iteration of the 
main loop, the mutation is }
+                       \caption{Diagram that shows, for each iteration of the 
main loop, how the next modification is chosen}
                        \end{figure}                    
                
                                                \bigskip
                        \subsubsection{Tree Based data structure}
-All the mutations that are injected at least once in the course of the 
execution of this software will be stored properly in a tree data structure. 
Having such a data structure makes parent-children relations between mutations 
possible. The tree follows the traditional definition of the a n-ary 
algorithmic tree.
+All the mutations that are injected at least once in the course of the
execution of this software are stored properly in a tree data structure. Having
such a data structure makes parent-children relations between mutations
possible. The tree follows the traditional definition of an n-ary
algorithmic tree.
 It is made of nodes (mutations) including a root (the first mutation to be
processed, on a field selected randomly in the database).
 Each node has a number of children that depends on the ranking of its mutation
and the number of potential modifications that it can perform.
                                \paragraph{Weight}
-Weighting the nodes is an important part of the runtime. Each mutation has a 
weight that is equal to the analyzer's output. This value reflects the 
mutation's value. If it had an interesting impact on the target program 
behavior (if it triggered new bugs or uncommon code paths) than this value is 
high and vice-versa. The weight is then used as a mean of determining the 
upcoming modification. The chance that a mutation gets a child is directly 
proportional to its weight.
-This value currently isn't biased by any other parameter, but this might 
change in the future.  
-                               \paragraph{Path} 
-Since the weighting of the mutation allows to go back to previous more 
interesting mutations, 
-there is a need for a path finder mechanism. In practice, this routine 
resolves the chain of node that separates two nodes in the tree. This is done 
by, from both nodes, going in the direction of the root until a common ancestor 
is found. Fusing the lists of both chains results in creating the full path 
between the two nodes. The path is then used when the main loop goes through 
the undo mechanism. Undoing from mutation A to mutation B is implemented as 
undoing every mutation between A and B
+Weighting the nodes is an important part of the runtime. Each mutation has a
weight that is equal to the ranking the analyzer returns as an output. This
value reflects the mutation's worth. If it had an interesting impact on the
target program's behavior (if it triggered new bugs or uncommon code paths)
then this value is high. The weight is then used as a means of determining the
upcoming modification. The chance that a mutation gets a child is directly
proportional to its weight.
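+
+As a hedged sketch (Mutation::weight is a hypothetical accessor), such a
weight-proportional draw could look like:
+\begin{verbatim}
+import java.util.*;
+
+// Method-level sketch: the chance a node receives a child is
+// proportional to its analyzer score.
+static Mutation pickParent(List<Mutation> nodes, Random rng) {
+    double total = nodes.stream().mapToDouble(Mutation::weight).sum();
+    double ticket = rng.nextDouble() * total;
+    for (Mutation m : nodes) {
+        ticket -= m.weight();
+        if (ticket <= 0) return m;
+    }
+    return nodes.get(nodes.size() - 1); // guard against rounding
+}
+\end{verbatim}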
+
+                               \paragraph{Path:}
+Since the weighting of the mutations allows going back to previous, more
interesting mutations, there is a need for a path finding mechanism. In
practice, this routine determines the chain of nodes that separates two nodes
in the tree. This is done by, from both nodes, going in the direction of the
root until a common ancestor is found. Fusing the lists of both chains results
in creating the full path between the two nodes. The path is then used when the
main loop goes through the undo mech [...]
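+
+A minimal sketch of this common-ancestor walk (Node and its parent() accessor
are hypothetical):
+\begin{verbatim}
+import java.util.*;
+
+// Walk from each node toward the root, then fuse both chains at the
+// first common ancestor to obtain the full path between the nodes.
+static List<Node> path(Node a, Node b) {
+    List<Node> fromA = chainToRoot(a);
+    List<Node> fromB = chainToRoot(b);
+    Node common = fromA.stream()
+            .filter(fromB::contains).findFirst().orElseThrow();
+    List<Node> result = new ArrayList<>(
+            fromA.subList(0, fromA.indexOf(common) + 1)); // a .. ancestor
+    List<Node> tail = fromB.subList(0, fromB.indexOf(common));
+    Collections.reverse(tail);
+    result.addAll(tail);                                  // ancestor .. b
+    return result;
+}
+
+static List<Node> chainToRoot(Node n) {
+    List<Node> chain = new ArrayList<>();
+    for (Node cur = n; cur != null; cur = cur.parent()) chain.add(cur);
+    return chain;
+}
+\end{verbatim}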
 
 \bigskip
 
@@ -274,24 +276,24 @@ there is a need for a path finder mechanism. In practice, 
this routine resolves
 
 \bigskip
                        \subsubsection{The analyzer}
-Analyzing the output of the target program is another critical part of 
SchemaFuzz. The analyzer parses in the stack trace of the target software's 
execution in order measure how interesting the output of the execution was. 
Since crashes and unexpected behavior from the target software is what the tool 
is triggering, it is the main criteria of a valuable mutation. A stack trace is 
a text block structured to present all the information related to a crash 
during a software's execution. The  [...]
+Analyzing the output of the target program is another critical part of
SchemaFuzz. The analyzer parses the stack trace of the target software's
execution in order to measure how interesting the output of the execution was.
Since crashes and unexpected behavior from the target software are what the
tool is trying to trigger, they are the main criterion of a valuable mutation.
A stack trace is a text block structured to present all the information related
to a crash during a software's execution. The  [...]
                                \paragraph{Stack Trace Parser}
-The stack trace parser is a separate Bash script that processes stack traces 
generated by the GDB C language debugger and stores all the relevant 
informations (function's name, line number, file name) into a Java object. The 
parser also generates as a secondary job a human readable file for each 
mutation that synthesizes the stack trace values as well as additional 
interesting information useful for other mechanisms (that also require 
parsing). These additional informations include the p [...]
+The stack trace parser is a separate Bash script that processes stack traces
generated by the GDB C language debugger and stores all the relevant
information (function name, line number, file name) into a Java object. The
parser also generates a human readable file for each mutation that synthesizes
the stack trace values as well as additional interesting information useful for
other mechanisms (that also require parsing). This additional information
includes the path from root to mu [...]
                                \paragraph{Hashing}
-The clustering algorithm used to compute the crash rank take a triplet of 
numerical values as an input.Therefore, the stack trace of a mutation has to be 
hashed into a triplet of numerical values. This set of value is used as a 
representation of the original stack trace object.
+The clustering algorithm takes a triplet of numerical values as an input.
Therefore, the stack trace of a mutation has to be hashed into a triplet of
numerical values. This set of values is used as a representation of the
original stack trace object.
 Hashing is usually defined as follows:
                                \begin{quotation}
 "A hash value (or simply hash), also called a message digest, is a number 
generated from a string of text. The hash is substantially smaller than the 
text itself, and is generated by a formula in such a way that it is extremely 
unlikely that some other text will produce the same hash value."
-                               \end{quotation}
+                               \end{quotation} \cite{Hashing} \\*
                                
 In the present case, we used a different approach. Since proximity between two 
stack traces is the key to a relevant ranking, it is mandatory to have a 
hashing function that preserves the proximity of two strings. 
-In that regards, we implemented a version of the Levenshtein Distance 
algorithm.
+In that regard, we implemented a version of the Levenshtein Distance algorithm.
 This algorithm can be explained by the following statement:
                                \begin{quotation}
 "The Levenshtein distance between two words is the minimum number of 
single-character edits (insertions, deletions or substitutions) required to 
change one word into the other."
-                               \end{quotation}                          
+                               \end{quotation} \cite{LevensteinDistance}\\*    
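+
+For reference, a minimal dynamic-programming sketch of this distance (the
example below additionally normalizes it as distance $\div$ length $\times$
100):
+\begin{verbatim}
+// Classic Levenshtein distance between two strings.
+static int levenshtein(String a, String b) {
+    int[][] d = new int[a.length() + 1][b.length() + 1];
+    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
+    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
+    for (int i = 1; i <= a.length(); i++)
+        for (int j = 1; j <= b.length(); j++) {
+            int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
+            d[i][j] = Math.min(Math.min(
+                    d[i - 1][j] + 1,        // deletion
+                    d[i][j - 1] + 1),       // insertion
+                    d[i - 1][j - 1] + sub); // substitution
+        }
+    return d[a.length()][b.length()];
+}
+\end{verbatim}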
         
                                
-\begin{figure} 
+\begin{figure} [h!]
 \centering
 \begin{tabular}{ | l | l | l | l | l | l | c | r | }
   \hline                       
@@ -308,8 +310,8 @@ The distance for this example is $2\div8\times100$
 After hashing the file name and the function name into numerical values
through the Levenshtein distance, the analyzer creates the triplet that
numerically represents the stack trace being parsed. This triplet will be used
in the clustering method detailed in the following paragraph.
 It is interesting to note that this triplet is not the most accurate
representation of a stack trace. The analyzer will be improved in the future in
that regard.
 
-                               \paragraph{The Scoring mechanism}
-The "score" (or rank) of a mutation is a numerical value that reflects how 
interesting the outcome was. Unique crashes and unexpected behavior are what 
makes a mutation valuable since it indicates a wrongly implemented code piece 
in the target source in most cases. \cite{CrashUniQ} This value is calculated 
through a modified version of a clustering method.
+                               \paragraph{The scoring mechanism}
+The "score" (or rank) of a mutation is a numerical value that reflects how
interesting the outcome was. Unique crashes and unexpected behavior are what
make a mutation valuable, since in most cases they indicate a wrongly
implemented piece of code in the target source. Unique crash detection and
processing is detailed in \cite{Unique}, which I recommend consulting for
further explanation. This value is calculated through a modified version of a
K-means clustering method  [...]
 This clustering mechanism runs as follows:
        \begin{itemize}
        \item{Represent the triplets in a 3 dimensional space}
@@ -319,22 +321,27 @@ This clustering mechanism runs as follows:
        \item{Add up all the distances generated by the last step into a single 
value}
        \end{itemize}
 
-The centroid of a cluster is the triplet of values that define the center of a 
cluster.
+The centroid of a cluster is the triplet of values that define the 
mathematical center of a cluster.
 The Euclidean distance is defined as
        \begin{quotation}
        In mathematics, the Euclidean distance or Euclidean metric is the 
"ordinary" straight-line distance between two points in Euclidean space
-       \end{quotation}
+       \end{quotation}\cite{EuclideanDistance} \\*
        
 If a triplet represents a unique crash, it will be placed far away in the
Euclidean space. This implies that the sum of the Euclidean distances to the
centroids will be high compared to that of a common crash. This sum is then
used as the "score" of the mutation.
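+
+A minimal sketch of this scoring step (assuming the centroids have already
been computed by the clustering round):
+\begin{verbatim}
+import java.util.List;
+
+// Score a stack-trace triplet as the sum of its Euclidean distances
+// to every cluster centroid; unique crashes land far from all of them.
+static double score(double[] t, List<double[]> centroids) {
+    double sum = 0;
+    for (double[] c : centroids) {
+        double dx = t[0] - c[0], dy = t[1] - c[1], dz = t[2] - c[2];
+        sum += Math.sqrt(dx * dx + dy * dy + dz * dz);
+    }
+    return sum;
+}
+\end{verbatim}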
 
-Mutations that do note trigger any crash result in having a null score. 
Therefore, the side of the tree they are in has a lower statistical chance of 
being chosen for further exploration.
+Mutations that do not trigger any crash result in having a null score. 
Therefore, the side of the tree they are in has a lower statistical chance of 
being chosen for further exploration.
 
-For a more concrete view of what the analyzer outputs, please refer to the 
Result and Example section. 
-\begin{figure} [h!]
-  \includegraphics[width=\textwidth]{Scoring.png}
+For a more concrete view of what the analyzer outputs, please refer to the 
Result and Example section.
+
+\clearpage
+
+\begin{figure}
+\centering
+  \includegraphics[scale=0.13]{Scoring.png}
+  \caption{2D example of a Euclidean distance calculation}
 \end{figure}   
                \subsection{Known issues}               
-About one mutation out of 15 will fail for invalid reasons.
+About one mutation out of 15 will fail for undetermined reasons.
                        \subsubsection{Context Coherence}
 A significant amount of the failing mutations do so because of the mutation
transfer mechanism. As said in the dedicated section, this mechanism applies
more than one change to the database (potentially the whole database). In
specific cases, this property can become problematic.
 More specifically, this happens when the main loop chooses the next mutation
and its parent has been the subject of a transfer. In this case, the data
embedded in the SchemaFuzz data structure may not match the data present in the
database; this delta may induce a wrong SQL statement that will result in an
SQL error (in practice, the DBMS indicates that 0 rows were updated by the
statement).
@@ -428,7 +435,9 @@ If the program crashed:
                \subsection{Results on the GNU Taler database}
 
 The outcome of the first executions of SchemaFuzz against a sample of the GNU
Taler database was promising. The tool itself properly fuzzed the target and
the execution ended with a success
-code on 9 of the 10 attempts.          
+code on 9 of the 10 attempts.
+
+\clearpage             
        
                \bigskip
                \begin{figure} [h!]
@@ -450,7 +459,7 @@ For instance, the tool would crash if meeting the following 
criteria:
        
 By running the tool on a denser database, the bug vanished. This allowed us to
locate the origin of the issue.
 
-
+               \clearpage
                \bigskip
                \begin{figure} [h!]
                        \includegraphics[width=\textwidth]{sc4.png}
@@ -458,15 +467,12 @@ By running the tool on a more dense database, the bug had 
vanished. This allowed
                \end{figure}
                \bigskip
                        
-               
-                       \clearpage
+       
        \section{Upcoming features and changes}
 This section will provide more insight into the future features that may be
implemented, as well as the changes in the existing code.
-Any suggestion will be greatly appreciated as long as it is relevant. All the 
relevant information regarding the contributions are detailed in the so called 
section.
 
                \subsection{General Report}
-In its future state, SchemaFuzz will generate a synthesized report concerning 
the overall execution of the tool. This general report will primarily contain 
the most "interesting" mutations (meaning the mutations with the highest score 
mark) for the whole run.
-A more advanced version of this report would also take into account the code 
coverage ratio for each mutation and execute a last clustering round at the end 
of the execution to generate a "global" score that would represent the global 
value of each mutations.
+In its future state, SchemaFuzz will generate a synthesized report concerning
the overall execution of the tool. This general report will primarily contain
the most "interesting" mutations (meaning the mutations with the highest
scores) for the whole run. A more advanced version of this report would also
take into account other parameters in calculating the ratio for each mutation.
By doing so, the analyzer could generate a "global" score that would represent
the global value of each [...]
        
                \subsection{Code coverage}
 We are considering changing, or simply adding, code coverage in the clustering
method as a parameter. Not only would this increase the accuracy of the
scoring but also give more detail on what the mutation triggered in the target
software's code, therefore helping locate the origin of the crash. By adding
code coverage this tool could make a concrete difference, in terms of scoring
and information generated in the reports, between a mutation with a new
stack trace in a common code pat [...]
@@ -476,16 +482,16 @@ This idea for this feature to be is to implement some 
kind of "auto learning" me
 To be more precise, this routine is meant to perform a statistical analysis
on a representative portion of the database's content. This analysis would
provide the rest of the program with the most common values encountered for
each field. More generically, this would allow the software to have a global
view over the format of the data that the database holds.
 Such a global understanding of the content format is interesting to make the
modification possibilities more relevant. Indeed, one of the major limitations
of SchemaFuzz is its "blindness".
 That is to say that some of the modifications performed in the course of the
execution of the program are irrelevant due to the lack of information on what
is supposed to be stored in this precise field.
-For instance, a field that only holds numerical values that go from 1 to 1000 
even if it has enough bits to encode from -32767 to 32767 would have a   low 
chance of triggering a crash if this software modifies its value from 10 to 55.
-on the other end, if the software modifies this   same field from 10 to 
-12000, then a crash is much more likely to pop up.
+For instance, a field that only holds hexadecimal values that go from A01 to
A0A, even if it has enough bits to encode from 000 to FFF, would have a low
chance of triggering a crash if this software modifies its value from A01 to
A02.
+On the other hand, if the software modifies this same field from A01 to B01,
then a crash is much more likely to be discovered.
 The same principle applies to strings. Suppose a field can encode 10
characters.
-the pre-analysis, detected that, for this field, most of the value were 
surnames beginning with the letter "a". Changing this field from "Sylvain" to 
"Sylvaim" will probably not be effective. However, changing this same field 
from "Sylvain" to "NULL" might indeed triggered an unexpected behavior. 
+The pre-analysis detected that, for this field, most of the values were
surnames beginning with the letter ``a". Changing this field from ``Sylvain" to
``Sylvaim" will probably not be effective. However, changing this same field
from ``Sylvain" to ``NULL" might indeed trigger an unexpected behavior.
   
 This pre-analysis routine would only be executed once at the start of the 
execution, right after the meta data extraction. The result of this analysis 
will be held by a specific object. 
-this object's lifespan is equal to the duration of the main loop's execution 
(so that every mutation can benefits from the analysis data.)
+This object's lifespan is equal to the duration of the main loop's execution.
That way, every mutation can benefit from the analysis data.
                
                \subsection{Centralized anonymous user data}
-SchemaFuzz's efficiency is tightly linked to the quality of its heuristics. 
this term includes the following points 
+SchemaFuzz's efficiency is tightly linked to the quality of its heuristics.
This term includes the following points:
                \begin{itemize}
                \item{Quality of the possible modifications for a single field}
                \item{Quality of the possible modifications for each data type}
@@ -497,10 +503,7 @@ Knowing this, we are also considering for futures 
enhancements an anonymous data
 
 
        \section{Contributing}
-You can send your ideas at  \\*
-               address@hidden
-Or directly create a pull request on the official repository to edit this 
document and/or the code itself
-
+You can send your ideas and patch proposals at address@hidden \\*
 
 \appendix      
 \newpage
diff --git a/docs/EndToEndDiagram.pdf b/docs/EndToEndDiagram.pdf
index 80bc7cd..b90c547 100644
Binary files a/docs/EndToEndDiagram.pdf and b/docs/EndToEndDiagram.pdf differ
diff --git a/docs/EndToEndDiagram.tex b/docs/EndToEndDiagram.tex
index 3f7f7cf..602060c 100644
--- a/docs/EndToEndDiagram.tex
+++ b/docs/EndToEndDiagram.tex
@@ -84,9 +84,9 @@
             (g) rectangle (h);
     \end{pgfonlayer}    
     
-    \path (asrt1.north) + (3.5,0.7) node (asrs) {Main Loop};
+    \path (asrt1.north) + (3.95,0.7) node (asrs) {Main Loop};
         
-    \path (syscomb.east)+(-2.0,-11.5) node (block3) [sc2] {Rolling back 
Constraints \\ + \\ Building graphical tree};    
+    \path (syscomb.east)+(-2.0,-11.5) node (block3) [sc2] {Rolling back 
Constraints \\ + \\ Building mutation tree representation};    
    
   
     \begin{pgfonlayer}{background}

-- 
To stop receiving notification emails like this one, please contact
address@hidden


