diff --git a/paper/paper.pdf b/paper/paper.pdf
index e3cd39f325dcb6771acfb55850884c121162daf4..9cf8a0c60387ff894514e94beaeb1a06d7f45a02 100644
Binary files a/paper/paper.pdf and b/paper/paper.pdf differ
diff --git a/paper/paper.tex b/paper/paper.tex
index 01846a7f1aecf491a44879d79bf0e267dd3cd98c..db6be898c1b35e3e4c43b394429f65cd87397b3b 100644
--- a/paper/paper.tex
+++ b/paper/paper.tex
@@ -10,7 +10,7 @@
 \setlength{\pdfpageheight}{11in}
 \pdfinfo{
 /Title (Sequence-to-sequence Architecture Using BERT)
- /Author (Claudio Scheer, Fernando Possebon)
+ /Author (Claudio Scheer, José Fernando Possebon)
 }
 \setcounter{secnumdepth}{0}
 \begin{document}
@@ -18,10 +18,10 @@
 % proceedings, working notes, and technical reports.
 %
 \title{Sequence-to-sequence Architecture\\Using BERT}
-\author{Claudio Scheer \and Fernando Possebon\\
+\author{Claudio Scheer \and José Fernando Possebon\\
 Pontifical Catholic University of Rio Grande do Sul - PUCRS\\
 claudio.scheer@edu.pucrs.br,
- claudio.scheer@edu.pucrs.br
+ jose.possebon@edu.pucrs.br
 }
 \maketitle
 
@@ -41,9 +41,11 @@
 Related works.
 
 \section{Deep Learning}
 
+In this section, we discuss the sequence-to-sequence model built with recurrent neural networks and with transformers, and briefly describe how a BERT model works.
+
 \subsection{Sequence-to-sequence}
-The encoder-decoder architecture was initially proposed by \cite{DBLP:journals/corr/ChoMGBSB14}. Although simple, the idea was powerful: use a recurrent network to encode the input data and a decoder to transform the encoded input into the desirable output.
+The encoder-decoder architecture was initially proposed by \cite{DBLP:journals/corr/ChoMGBSB14}. Although simple, the idea is powerful: use one recurrent neural network to encode the input data and a second recurrent neural network to decode the encoded input into the desired output. Thus, two neural networks are trained.
 
 \cite{DBLP:journals/corr/Graves13} - Generating sequences with LSTM
 
@@ -56,19 +58,20 @@ The encoder-decoder architecture was initially proposed by \cite{DBLP:journals/c
 \cite{DBLP:journals/corr/abs-1810-04805} - BERT
 
-
-\subsection{Sequence-to-sequence BERT}
-
 Similarly to the original sequence-to-sequence model using a recurrent neural network, the model discussed in this paper uses two BERT neural network: one neural network to encode the input and another to decode the input encoded.
 
 \section{Dataset}
-As we focused our project on automatic email response, we used The Enron Email Dataset\footnote{\href{https://www.kaggle.com/wcukierski/enron-email-dataset}{https://www.kaggle.com/wcukierski/enron-email-dataset}} to train our model. The dataset contains only the raw data from the emails. Therefore, we created a parser\footnote{\href{https://www.kaggle.com/claudioscheer/extract-reply-emails}{https://www.kaggle.com/claudioscheer/extract-reply-emails}} to extract the email and the replies from each email.
+As our project focuses on automatic email reply, we used the Enron Email Dataset\footnote{\href{https://www.kaggle.com/wcukierski/enron-email-dataset}{https://www.kaggle.com/wcukierski/enron-email-dataset}} to train our model. The dataset contains only the raw email data. Therefore, we created a parser\footnote{\href{https://www.kaggle.com/claudioscheer/extract-reply-emails}{https://www.kaggle.com/claudioscheer/extract-reply-emails}} to extract the original message and the reply from each email.
+
+To identify whether an email has a reply, we look for emails that contain the string \texttt{-----Original Message-----}. After keeping only emails with non-empty replies, we parse each email into an input sequence (the original email) and a target sequence (the reply). The entire extraction was done automatically, that is, we did not manually extract or adjust any email.
+
+We used two libraries to parse the dataset: \texttt{talon}\footnote{\href{https://github.com/mailgun/talon}{https://github.com/mailgun/talon}}, provided by Mailgun, and \texttt{email}, from the Python standard library. The \texttt{email} package returns the email body with the entire thread. To extract only the last reply from a thread, we use the \texttt{talon} package.
 
-To identify whether an email has a reply or not, we look for emails that contain the string \texttt{-----Original Message-----}. After filtering only emails with non-empty replies, we parse those emails in an input sequence (the original email) and in the target sequence (the reply email). The entire extraction was done automatically, that is, we did not manually extract or adjust any input.
+The original dataset contains 517,401 raw emails. After parsing, we obtained a dataset with 110,205 input and target pairs.
 
-The original dataset contains 517,401 raw emails. After parsing the raw dataset, we created a dataset with 110,205 input and target pairs. Not all pairs were parsed correctly. In the parsed dataset, 8,368 pairs have specifics patterns. As this data does not represent a large part of the dataset, we trained the dataset with this "wrong" data.
+In the parsed dataset, 8,368 pairs have specific email and reply patterns that were not parsed correctly. Since these pairs do not represent a large part of the dataset, we kept this ``wrong'' data for training.
 
 \bibliographystyle{aaai}
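The two technical pieces introduced by the paper.tex changes above can be sketched in a few lines of Python. First, the \section{Deep Learning} text describes the model as two BERT networks, one encoding the input email and one decoding it into the reply. A minimal sketch of that idea, using the Hugging Face transformers library (our choice of library; the diff does not name an implementation), could look as follows; the model still has to be fine-tuned on the (original email, reply) pairs before it generates useful replies.

# Sketch only: a pretrained BERT encoder tied to a BERT decoder with
# cross-attention. The cross-attention weights are newly initialized,
# so the model needs fine-tuning on the email/reply pairs.
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Can we move the meeting to Friday?", return_tensors="pt")
reply_ids = model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))

Second, the \section{Dataset} text describes the parsing step: detect replies through the \texttt{-----Original Message-----} marker, read the full body with Python's email package, and strip the quoted thread with talon. The sketch below shows that step for a single raw message; the function name extract_pair and the handling of the quoted headers are ours, and the authors' actual parser is the Kaggle notebook linked in the footnote.

# Sketch of the (input, target) pair extraction; not the authors' parser.
import email

from talon import quotations

REPLY_MARKER = "-----Original Message-----"


def extract_pair(raw_message: str):
    """Return an (original email, reply) pair, or None if there is no reply."""
    # Enron messages are plain text, so get_payload() returns the body
    # string, including the quoted thread.
    body = email.message_from_string(raw_message).get_payload()

    # Only emails that quote an original message are treated as replies.
    if REPLY_MARKER not in body:
        return None

    # talon strips the quoted thread, leaving only the last reply.
    reply = quotations.extract_from_plain(body).strip()

    # Everything after the marker is the quoted original email; its quoted
    # headers (From:, Sent:, To:, Subject:) may still need to be stripped.
    original = body.split(REPLY_MARKER, 1)[1].strip()

    if not reply or not original:
        return None
    return original, reply  # input sequence, target sequence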