2026-05-22 16:38:58 +02:00
parent e356879542
commit 78ce34d1bc
5 changed files with 167 additions and 82 deletions


@@ -1,50 +1,102 @@
\section{Theory\label{sec:theor}}
%\subsection{EDA - Exploratory Data Analysis}
%\subsubsection{Data preprocessing and cleaning}
%\subsubsection{Data reliability and consistency}
%\subsubsection{Data visualisation principles}
%\subsection{Web traffic}
% HAR files
%\subsubsection{HTTP requests and responses}
%
%\subsubsection{Cookies and tracking parameters}
%\subsubsection{Search engines and privacy}
%%%
\subsection{Web Tracking Technologies}
\subsubsection{HTTP Requests and Responses}
\subsubsection{Cookies and Tracking Parameters}
\subsection{Data Collection and Preprocessing}
\subsubsection{HAR Files}
\subsubsection{Data Preprocessing and Cleaning}
\subsection{Exploratory Data Analysis (EDA)}
\subsection{Data Visualisation Principles}
This section presents the theory needed for this work. The theory has two main parts: one that covers Noroff's course materials, and one that covers the topics needed to follow the progress of this work. These topics, \texttt{EDA - Exploratory Data Analysis}, \texttt{VENV - Virtual Environment}, and \texttt{HAR - HTTP Archive}, are explored in subsections~\ref{sec:theor:eda}, \ref{sec:theor:venv}, and~\ref{sec:theor:har} respectively.
% “data preprocessing” → you actually do the HAR → CSV transformation
% “EDA” → you create concrete Power BI visualisations
% “data pipeline” → you automate the whole workflow
% “data reliability” → you standardise browser/proxy conditions
% “data collection methodology” → you document the Playwright/Tor setup
\subsection{EDA - Exploratory Data Analysis\label{sec:theor:eda}}
Exploratory Data Analysis (EDA) is a crucial part of the data analysis process. It starts from the raw dataset as it was retrieved and investigates how to present it professionally and academically~\cite{noroff_modules}.
To visualize a raw, unfiltered dataset, some decisions and restrictions need to be made before the data can be presented. Take, for example, a dataset where a lot of data is missing and some values in the table are misleading. The EDA process is meant to make the unfiltered data ready for visualization. During that process, several important questions need to be asked in order to narrow the research topic and the findings to one specific theme.
One way the EDA process may narrow the dataset is to focus on one specific variable and investigate the trends related to that variable. In this course, for instance, every dataset is filtered for the entries where the variable \texttt{tracking\_hint} is equal to \texttt{yes}, because this work tries to identify trends in \texttt{HTTP} requests and responses for the entries that the dataset flags as possible tracking parameters.
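The filtering step described above can be sketched in Python as follows. This is a minimal sketch and not the pipeline used in this work: the column names \texttt{url} and \texttt{cookie\_count} and the sample rows are assumptions made for illustration, and only \texttt{tracking\_hint} is taken from the dataset described here.
\begin{lstlisting}[language=Python, caption={Sketch: keeping only the rows where tracking\_hint is yes}]
import csv
import io

# Hypothetical sample data; only the tracking_hint column comes from the text.
SAMPLE_CSV = """url,tracking_hint,cookie_count
https://example.com/search?q=test,no,1
https://example.com/pixel?uid=123,yes,4
https://example.org/ads?gclid=abc,yes,7
"""


def filter_tracking_rows(csv_text):
    """Return the rows whose tracking_hint column equals 'yes'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row["tracking_hint"].strip().lower() == "yes"
    ]


if __name__ == "__main__":
    for row in filter_tracking_rows(SAMPLE_CSV):
        print(row["url"], row["cookie_count"])
\end{lstlisting}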
The EDA process may also investigate outliers visually, that is, entries that do not seem to reflect the rest of the dataset. One example is why some entries have a much higher cookie count than others. If the outliers do reflect the data correctly, the next question is why the trend differs in some cases.
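Before outliers are inspected visually, they can also be flagged numerically. The sketch below uses the interquartile range (IQR) rule, a common convention that is swapped in here for illustration; the text does not state which rule, if any, was used in this work. The list \texttt{cookie\_counts} is made-up example data.
\begin{lstlisting}[language=Python, caption={Sketch: flagging outliers in cookie counts with the IQR rule}]
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical cookie counts per entry.
cookie_counts = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

if __name__ == "__main__":
    print("possible outliers:", iqr_outliers(cookie_counts))
\end{lstlisting}
A flagged value is only a candidate: whether it is an error or a genuine property of the data still has to be judged from the entry itself.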
\subsection{VENV - Virtual Environment\label{sec:theor:venv}}
This subsection covers the theory behind the methodology of this work. To automate the retrieval of raw datasets as \texttt{HAR}~files, a virtual environment was used to create and run scripts that simulate manually retrieving raw \texttt{HAR}~files while performing web~queries. More on that in \autoref{sec:metho}.
\begin{lstlisting}[caption={Creating the virtual environment using Python for this work}, label={lst:theor:venv}]
cd "$WORKSPACE"
python -m venv .example-env
source .example-env/bin/activate
pip install pytest-playwright
playwright install chromium
playwright install firefox
\end{lstlisting}
\autoref{lst:theor:venv} demonstrates how to create a virtual environment using Python in order to run specific Python scripts with specific packages. It also illustrates how to install Playwright and some web~browsers for use in that virtual environment. Packages installed with \texttt{pip} are stored in the folder \texttt{"\$WORKSPACE"/.example-env/}, which is created when running line~2 in \autoref{lst:theor:venv}. Once line~3 has been executed, the programs in the folder \texttt{"\$WORKSPACE"/.example-env/bin/} can be run directly from the command~line \autocite{VENV,Playwright}.
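The same environment can also be created from Python itself with the standard-library \texttt{venv} module. The sketch below is an illustration under stated assumptions rather than the setup used in this work: the helper name \texttt{create\_environment} is hypothetical, the environment is created in a temporary folder instead of \texttt{"\$WORKSPACE"}, and \texttt{pip} is left out to keep the example quick.
\begin{lstlisting}[language=Python, caption={Sketch: creating a virtual environment with the venv module}]
import sys
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir):
    """Create a virtual environment and return its executables folder."""
    # with_pip=False keeps the sketch fast; "python -m venv" as used in
    # the shell listing installs pip by default.
    venv.create(env_dir, with_pip=False)
    # Executables live in "Scripts" on Windows and in "bin" elsewhere.
    subdir = "Scripts" if sys.platform == "win32" else "bin"
    return Path(env_dir) / subdir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        bin_dir = create_environment(Path(workspace) / ".example-env")
        print("executables are stored in:", bin_dir)
        print("folder exists:", bin_dir.is_dir())
\end{lstlisting}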
\subsection{HAR - HTTP Archive\label{sec:theor:har}}
\texttt{HAR} is short for HTTP Archive. Every time a web~page is loaded, the web~browser sends many requests and receives a response to each of them. Each exchange is recorded in a table in the web~browser, where all its parameters can be inspected. A request may be, for example, a \texttt{POST} or a \texttt{GET} request \autocite{HAR}; either way, each request and its response appear as an entry in this table. The table, which holds the data for the latest web~searches, can be downloaded for inspection. It is not downloaded as a \texttt{CSV}~file but as a \texttt{.har}~file. To make use of that dataset, it has to be transformed into a format that Power~BI can read. This process is explained in \autoref{sec:metho}.
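A \texttt{.har}~file is a JSON document whose entries sit under \texttt{log.entries}, so the transformation to \texttt{CSV} can be sketched with the Python standard library alone. The sketch below is not the transformation used in this work, which is described in \autoref{sec:metho}: the sample \texttt{HAR} content, the chosen columns, and the function names \texttt{har\_to\_rows} and \texttt{rows\_to\_csv} are assumptions made for illustration.
\begin{lstlisting}[language=Python, caption={Sketch: flattening HAR entries into CSV rows}]
import csv
import io
import json

# A tiny, hand-written HAR document; real exports contain many more fields.
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {
                "request": {"method": "GET",
                            "url": "https://example.com/",
                            "cookies": []},
                "response": {"status": 200,
                             "cookies": [{"name": "sid", "value": "abc"}]},
            },
            {
                "request": {"method": "POST",
                            "url": "https://example.com/track",
                            "cookies": [{"name": "sid", "value": "abc"}]},
                "response": {"status": 204, "cookies": []},
            },
        ]
    }
})

FIELDS = ["method", "url", "status", "request_cookies", "response_cookies"]


def har_to_rows(har_text):
    """Flatten each HAR entry into one dictionary per request."""
    rows = []
    for entry in json.loads(har_text)["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        rows.append({
            "method": request["method"],
            "url": request["url"],
            "status": response["status"],
            "request_cookies": len(request.get("cookies", [])),
            "response_cookies": len(response.get("cookies", [])),
        })
    return rows


def rows_to_csv(rows):
    """Serialise the flattened rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(har_to_rows(SAMPLE_HAR)))
\end{lstlisting}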
Cookies are essential for many websites to work normally. They can hold, for example, login and session information for the current user. When a customer is logged in to a web~shop, the shopping~cart uses cookies to store information about the customer's current session, so that the next time the customer opens the same website, the same items are still in the shopping~cart \autocite{MDNcookies}. This is convenient in most cases, but cookies can also be used for tracking and for personalizing ads in the web~browser.
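To make the shopping~cart example concrete, the sketch below parses the value of a \texttt{Set-Cookie} response header with Python's standard \texttt{http.cookies} module. The cookie name \texttt{cart\_id}, its value, and the helper name \texttt{parse\_set\_cookie} are hypothetical and chosen only for illustration.
\begin{lstlisting}[language=Python, caption={Sketch: parsing a Set-Cookie header value}]
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Return name, value, and selected attributes of each cookie."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    parsed = {}
    for name, morsel in cookie.items():
        parsed[name] = {
            "value": morsel.value,
            "path": morsel["path"],
            # Lifetime in seconds, kept as the string sent by the server.
            "max_age": morsel["max-age"],
        }
    return parsed


if __name__ == "__main__":
    # A hypothetical cookie that keeps a shopping cart for one day.
    print(parse_set_cookie("cart_id=abc123; Path=/; Max-Age=86400"))
\end{lstlisting}
A long lifetime is what lets the website recognise the same browser on a later visit, which is useful for a shopping~cart and equally useful for tracking.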
\subsection{Proxy\label{sec:theor:proxy}}
A proxy acts as an intermediary between a client and the destination website. For every web~query performed, the web traffic passes through another server before it reaches its destination. This helps to hide the user's identity on the web, because the destination only sees the intermediary and not the client itself. A visited website therefore cannot see the client's public IP address, only the public IP address of the proxy being used.
This is where Tor comes in. Tor routes the traffic through multiple servers using encryption. When a request reaches its destination, its source is therefore very difficult to trace \autocite{TOR,CloudflareProxy}.
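A browser automated with Playwright can be pointed at such a proxy when it is launched. The sketch below assumes that a Tor client is running locally and listening on its default SOCKS port 9050, and that Playwright and its Chromium build are installed; the helper names \texttt{tor\_proxy\_settings} and \texttt{fetch\_title\_via\_proxy} are hypothetical, and this is not necessarily the configuration used in this work.
\begin{lstlisting}[language=Python, caption={Sketch: launching a Playwright browser behind a local Tor proxy}]
def tor_proxy_settings(host="127.0.0.1", port=9050):
    """Build Playwright proxy settings for a local SOCKS5 proxy."""
    # Assumption: 9050 is the SOCKS port of a locally running Tor client.
    return {"server": f"socks5://{host}:{port}"}


def fetch_title_via_proxy(url, proxy):
    """Open url through the given proxy and return the page title."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=proxy)
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
    return title


# Example usage (needs Playwright, Chromium, and a running Tor client):
#   settings = tor_proxy_settings()
#   print(fetch_title_via_proxy("https://check.torproject.org/", settings))
\end{lstlisting}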
% %\subsection{EDA - Exploratory Data Analysis}
% %\subsubsection{Data preprocessing and cleaning}
% %\subsubsection{Data reliability and consistency}
% %\subsubsection{Data visualisation principles}
% %\subsection{Web traffic}
% % HAR files
% %\subsubsection{HTTP requests and responses}
% %
% %\subsubsection{Cookies and tracking parameters}
% %\subsubsection{Search engines and privacy}
% %%%
% \subsection{Web Tracking Technologies}
% \subsubsection{HTTP Requests and Responses}
% \subsubsection{Cookies and Tracking Parameters}
% \subsection{Data Collection and Preprocessing}
% \subsubsection{HAR Files}
% \subsubsection{Data Preprocessing and Cleaning}
% \subsection{Exploratory Data Analysis (EDA)}
% \subsection{Data Visualisation Principles}
% % “data preprocessing” → you actually do the HAR → CSV transformation
% % “EDA” → you create concrete Power BI visualisations
% % “data pipeline” → you automate the whole workflow
% % “data reliability” → you standardise browser/proxy conditions
% % “data collection methodology” → you document the Playwright/Tor setup