changing report folder to doc

This commit is contained in:
nachocano 2014-12-02 11:28:20 -08:00
parent 2fab05c83e
commit e4abca9494
4 changed files with 10 additions and 5 deletions

@ -30,14 +30,14 @@ In this work, we propose RABIT, an AllReduce library suitable for distributed ma
\section{Introduction}
Distributed machine learning is an active research area that has seen incredible growth in recent years. Several approaches have been proposed, e.g., the parameter server abstraction and graph-based approaches, among others \cite{paramServer,DuchiAW12,Zinkevich,Dekel,Low}. The work closest to ours is that of Agarwal et al. \cite{Agarwal}, which uses a tree-shaped communication infrastructure to efficiently accumulate and broadcast values to every node involved in a computation.
\todo{add more}
\section{AllReduce}
In the AllReduce setting, nodes are organized in a tree structure. Each node holds a portion of the data and computes some values on it. Those values are passed up the tree and aggregated until a global aggregate is computed at the root node (reduce). The global value is then passed down to all other nodes (broadcast).
Figure \ref{allreduce} shows an example of an AllReduce sum operation. The leaf nodes pass data to their parents (interior nodes). These interior nodes compute an intermediate aggregate and pass it to the root, which in turn computes the final aggregate and then passes the result back to every node in the cluster.
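As a minimal, purely sequential sketch of this reduce-then-broadcast pattern, the following C++ snippet simulates an AllReduce sum over a complete binary tree of seven nodes; the node count, values, and indexing scheme are illustrative only and do not correspond to RABIT's actual communication code.
\begin{verbatim}
#include <cstdio>
#include <vector>

// Sequential illustration of an AllReduce sum on a complete binary tree of
// seven nodes: node 0 is the root and the children of node i are 2i+1, 2i+2.
int main() {
  std::vector<double> local = {1, 2, 3, 4, 5, 6, 7};  // one value per node
  int n = static_cast<int>(local.size());
  std::vector<double> partial = local;
  // Reduce: walk from the leaves upward; each node adds its children's sums.
  for (int i = n - 1; i > 0; --i) partial[(i - 1) / 2] += partial[i];
  double global = partial[0];            // final aggregate held by the root
  // Broadcast: the root's aggregate is passed back down to every node.
  std::vector<double> result(n, global);
  std::printf("every node now holds %g\n", result[3]);  // prints 28
  return 0;
}
\end{verbatim}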
\todo{add more}
\begin{figure}[tb]
\centering
@ -55,7 +55,7 @@ Figure \ref{allreduce} shows an example of an AllReduce sum operation. The leaf
The design of RABIT was motivated by the following needs:
\begin{enumerate}
\item \emph{Distributed}: machine learning algorithms are inherently iterative and computation intensive. Given the vast amount of data they can work on, it may be intractable to perform all the processing on a single machine. Instead, we want to divide the computation across different nodes, each in charge of computing statistics on some portion of the data, and then a combination step would take place, where all those independent local solutions are aggregated into a single result.
\item \emph{Distributed}: machine learning algorithms are inherently iterative and computation intensive. Given the vast amount of data they can work on, it may be intractable to perform all the processing on a single machine. Instead, we want to divide the computation across different nodes, each in charge of computing statistics on some portion of the data, and then have a combination step, where all those independent local solutions are aggregated into a single result.
\item \emph{Scalability}: we want our solution to handle a growing amount of work gracefully, i.e., we should be able to accommodate data and computation growth by adding more nodes.
\item \emph{Fault Tolerance}: we assume an environment where failures happen: machines can go down and communication failures can occur. Given the computation-intensive nature of machine learning problems, we want to continue operating properly in the event of a failure instead of starting the process all over again.
\item \emph{Programmability}: we want to provide a clean interface that can be easily used by programmers. With a few lines of code, they should be able to have a fault-tolerant AllReduce implementation (see the sketch after this list).
@ -64,9 +64,14 @@ The design of RABIT was motivated by the following needs:
\item \emph{Footprint}: we want a low memory footprint while running, as well as a lightweight library.
\end{enumerate}
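The sketch below illustrates the level of programmability we aim for: a distributed sum of locally computed statistics in a handful of lines. The call names (\texttt{rabit::Init}, \texttt{rabit::Allreduce}, \texttt{rabit::Finalize}) follow the public headers of the rabit code base, but since the interface is only described later (and is still being written), they should be read as an assumed rather than final API.
\begin{verbatim}
#include <rabit.h>   // assumed header name, taken from the rabit code base
#include <cstdio>
#include <vector>

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);                // join the distributed job
  std::vector<float> stats(16, 0.0f);
  // ... fill `stats` with statistics computed on this node's data ...
  stats[0] = static_cast<float>(rabit::GetRank());
  // In-place sum across all nodes; afterwards every node holds the total.
  rabit::Allreduce<rabit::op::Sum>(&stats[0], stats.size());
  if (rabit::GetRank() == 0) {
    std::printf("aggregated stats[0] = %f\n", stats[0]);
  }
  rabit::Finalize();
  return 0;
}
\end{verbatim}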
\subsection{Interface}
\subsection{Proposed Solution}
\todo{add sync module interface, example of how to use the library}
\todo{what we did}
\subsubsection{Interface}
\todo{API, how to use it}
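Until the API description is filled in, the following sketch shows how a fault-tolerant iterative job could look when written against the library. The checkpointing calls (\texttt{LoadCheckPoint}, \texttt{CheckPoint}) and the \texttt{Serializable} base class mirror the rabit headers but are assumptions here, and the model contents and update rule are purely illustrative.
\begin{verbatim}
#include <rabit.h>   // assumed header; checkpoint calls are an assumed API
#include <cstdio>
#include <vector>

// Illustrative model: a flat weight vector that can serialize itself, as
// required by the assumed checkpointing interface.
struct Model : public rabit::Serializable {
  std::vector<float> weight;
  explicit Model(size_t n) : weight(n, 0.0f) {}
  virtual void Load(rabit::Stream *fi) {
    fi->Read(&weight[0], weight.size() * sizeof(float));
  }
  virtual void Save(rabit::Stream *fo) const {
    fo->Write(&weight[0], weight.size() * sizeof(float));
  }
};

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);
  Model model(64);
  // Resume from the last globally agreed checkpoint after a restart,
  // instead of recomputing every iteration from scratch.
  int start_iter = rabit::LoadCheckPoint(&model);
  for (int iter = start_iter; iter < 100; ++iter) {
    std::vector<float> grad(model.weight.size(), 0.0f);
    // ... compute local gradients on this node's data partition ...
    rabit::Allreduce<rabit::op::Sum>(&grad[0], grad.size());
    for (size_t i = 0; i < grad.size(); ++i) model.weight[i] -= 0.1f * grad[i];
    rabit::CheckPoint(&model);  // record globally consistent progress
  }
  rabit::Finalize();
  return 0;
}
\end{verbatim}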
\section{Evaluation}