Compare commits
2 Commits
e356879542
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 77be37b8f5 | | |
| | 78ce34d1bc | | |
2
.gitignore
vendored
@@ -9,7 +9,7 @@
|
||||
|
||||
# PDF (optional)
|
||||
*.pdf
|
||||
|
||||
!main.pdf
|
||||
# Temporary
|
||||
*.blg
|
||||
*.bbl
|
||||
|
||||
@@ -1 +0,0 @@
|
||||
{"rule":"WANT_TO_NN","sentence":"^\\QThe main priority is the variable Trackinghints, as the work tries to identity correspondanses between trackings and cookies.\\E$"}
|
||||
2
report/.vscode/settings.json
vendored
@@ -28,7 +28,7 @@
|
||||
}
|
||||
],
|
||||
|
||||
"ltex.language": "en-GB",
|
||||
"ltex.language": "en-US",
|
||||
"ltex.enabled": [
|
||||
"latex"
|
||||
],
|
||||
|
||||
BIN
report/main.pdf
Normal file
Binary file not shown.
@@ -53,7 +53,15 @@
|
||||
\begin{abstract}
|
||||
\label{abs:abstr}
|
||||
\centering
|
||||
\lipsum[1]
|
||||
|
||||
|
||||
%This report aims to find trends in Web~Search activity connected to \texttt{tracking\_hints} and cookies. Playwright is used automate the process of performing web-queries. Data is downloaded as \texttt{HAR}, and transformed to \texttt{CSV} files using automation scripts. \texttt{CSV} files are loaded into Power~BI, to be transformed and visualized using Power~Query. It is found that private-focused tools tend to respect privacy in web-queries. And at last, Google leaks all user privacy.
|
||||
|
||||
|
||||
This report aims to find trends in Web~Search activity connected to \texttt{tracking\_hints} and cookies. Playwright is used to automate the process of performing web-queries. Data is downloaded as \texttt{HAR} files and transformed into \texttt{CSV} files using automation scripts. The \texttt{CSV} files are loaded into Power~BI, where the data is transformed and visualized using Power~Query. It is found that privacy-focused tools tend to better respect privacy in web-queries, while Google shows significantly more tracking-related behavior and privacy leakage compared to the other Search~Engines.
|
||||
|
||||
|
||||
|
||||
\end{abstract}
|
||||
\clearpage
|
||||
\tableofcontents
|
||||
|
||||
@@ -10,7 +10,18 @@
|
||||
langid = {english},
|
||||
file = {Snapshot:/home/tvh/snap/zotero-snap/common/Zotero/storage/R5P9688K/about-tor-browser.html:text/html},
|
||||
}
|
||||
|
||||
@online{CloudflareProxy,
|
||||
author = {{Cloudflare}},
|
||||
title = {What is a reverse proxy? Proxy servers explained},
|
||||
url = {https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/},
|
||||
urldate = {2026-05-22}
|
||||
}
|
||||
@online{MDNcookies,
|
||||
author = {{MDN Web Docs}},
|
||||
title = {Using HTTP cookies},
|
||||
url = {https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Cookies},
|
||||
urldate = {2026-05-22}
|
||||
}
|
||||
@online{HAR,
|
||||
title = {Network request list — Firefox Source Docs documentation},
|
||||
url = {https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/request_list/index.html?utm_source=chatgpt.com},
|
||||
|
||||
@@ -1,50 +1,102 @@
|
||||
\section{Theory\label{sec:theor}}
|
||||
|
||||
|
||||
%\subsection{EDA - Exploratory Data Analysis}
|
||||
|
||||
%\subsubsection{Data preprocessing and cleaning}
|
||||
%\subsubsection{Data reliability and consistency}
|
||||
%\subsubsection{Data visualisation principles}
|
||||
|
||||
%\subsection{Web traffic}
|
||||
|
||||
% HAR files
|
||||
|
||||
%\subsubsection{HTTP requests and responses}
|
||||
|
||||
%
|
||||
|
||||
%\subsubsection{Cookies and tracking parameters}
|
||||
|
||||
|
||||
%\subsubsection{Search engines and privacy}
|
||||
|
||||
|
||||
%%%
|
||||
\subsection{Web Tracking Technologies}
|
||||
|
||||
\subsubsection{HTTP Requests and Responses}
|
||||
|
||||
\subsubsection{Cookies and Tracking Parameters}
|
||||
|
||||
\subsection{Data Collection and Preprocessing}
|
||||
|
||||
\subsubsection{HAR Files}
|
||||
|
||||
\subsubsection{Data Preprocessing and Cleaning}
|
||||
|
||||
\subsection{Exploratory Data Analysis (EDA)}
|
||||
|
||||
\subsection{Data Visualisation Principles}
|
||||
This section presents the theory needed for this work. The theory has two main parts: one that covers Noroff's course materials, and one that covers topics needed to follow the progress of this work. The topics \texttt{EDA - Exploratory Data Analysis}, \texttt{VENV - Virtual Environment}, and \texttt{HAR - HTTP Archive} are explored in subsections~\ref{sec:theor:eda}, \ref{sec:theor:venv}, and~\ref{sec:theor:har}, respectively.
|
||||
|
||||
|
||||
|
||||
% “data preprocessing” → you actually perform the HAR → CSV transformation
% “EDA” → you create concrete Power BI visualizations
% “data pipeline” → you automate the whole workflow
% “data reliability” → you standardize browser/proxy conditions
% “data collection methodology” → you document the Playwright/Tor setup
|
||||
\subsection{EDA - Exploratory Data Analysis\label{sec:theor:eda}}
|
||||
|
||||
|
||||
Exploratory Data Analysis (EDA) is a crucial part of the data analysis process. It starts from the raw dataset as retrieved and investigates how to present it professionally and academically \cite{noroff_modules}.
|
||||
|
||||
|
||||
To visualize a raw, unfiltered dataset, some decisions and restrictions need to be made before the data can be presented. Take, for example, a dataset where a lot of data is missing and some of the data in the table is misleading. The EDA process is meant to make the unfiltered data ready for visualization. During that process, several important questions need to be asked in order to narrow the research topic and the findings to one specific theme.
|
||||
|
||||
One way the EDA process may narrow the dataset is to focus on one specific variable and investigate the trends related to that variable. In this work, for instance, the whole dataset is filtered for entries where the variable \texttt{tracking\_hint} is equal to yes, because this work tries to identify trends in \texttt{HTTP} requests and responses for entries that the dataset flags as possible tracking.
|
||||
|
||||
The EDA process may also investigate, visually, outliers that do not seem to reflect the rest of the dataset, for example why some entries have a much higher cookie count than others. If the outliers do reflect the data correctly, the next question is why the trend differs in some cases.
|
||||
|
||||
|
||||
\subsection{VENV - Virtual Environment\label{sec:theor:venv}}
|
||||
|
||||
This section covers the theory behind the methodology for this work. To automate the retrieval of the raw dataset as \texttt{HAR}~files, a Python virtual environment was used to create and run scripts that simulate the manual process of exporting raw \texttt{HAR}~files when performing web~queries. More on that in \autoref{sec:metho}.
|
||||
|
||||
|
||||
\begin{lstlisting}[caption={Creating the virtual environment using Python for this work}, label={lst:theor:venv}]
|
||||
cd "$WORKSPACE"
|
||||
python -m venv .example-env
|
||||
source .example-env/bin/activate
|
||||
pip install pytest-playwright
|
||||
playwright install chromium
|
||||
playwright install firefox
|
||||
\end{lstlisting}
|
||||
|
||||
\autoref{lst:theor:venv} demonstrates how to create a Python virtual environment for running specific Python scripts with specific packages. It also shows how to install Playwright and some web~browsers in that virtual environment. Packages and programs installed into the virtual environment are stored in the folder \texttt{"\$WORKSPACE"/.example-env/}, which is created by line~2 of \autoref{lst:theor:venv}. Once line~3 has been executed, the executables in \texttt{"\$WORKSPACE"/.example-env/bin/} are available from the command~line \autocite{VENV,Playwright}.
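To connect \autoref{lst:theor:venv} to the data collection, the following is a minimal Python sketch of how a script inside the virtual environment might record one web~query as a \texttt{HAR} file with Playwright. It is not the automation script used in this work. The function names, the output folder \texttt{data/}, the search URL templates, and the proxy address mentioned in the comments are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlencode

# Search URL templates (assumed for illustration, not taken from the report's scripts).
SEARCH_URLS = {
    "bing": "https://www.bing.com/search",
    "brave": "https://search.brave.com/search",
    "duckduckgo": "https://duckduckgo.com/",
    "google": "https://www.google.com/search",
}


def build_search_url(engine, query):
    """Return the search URL for one engine, with the query URL-encoded."""
    return f"{SEARCH_URLS[engine]}?{urlencode({'q': query})}"


def har_path(engine, index, folder="data"):
    """Return a hypothetical output path such as data/google_03.har."""
    return Path(folder) / f"{engine}_{index:02d}.har"


def record_query(browser_name, engine, query, index, proxy_server=None):
    """Open one fresh browser context, run one query, and save the HAR file."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    launch_options = {"headless": True}
    if proxy_server is not None:
        # For example "socks5://127.0.0.1:9050" for a local Tor client (assumed).
        launch_options["proxy"] = {"server": proxy_server}

    out_file = har_path(engine, index)
    out_file.parent.mkdir(parents=True, exist_ok=True)

    with sync_playwright() as p:
        # browser_name is "chromium" or "firefox", matching the installed browsers.
        browser = getattr(p, browser_name).launch(**launch_options)
        # A new context starts without cookies; record_har_path enables HAR capture.
        context = browser.new_context(record_har_path=str(out_file))
        page = context.new_page()
        page.goto(build_search_url(engine, query))
        context.close()  # the HAR file is written when the context closes
        browser.close()
    return out_file
```

Under these assumptions, a call such as \texttt{record\_query("firefox", "duckduckgo", "example query", 1)} would write \texttt{data/duckduckgo\_01.har}. Starting a new browser context for every query is what gives each query a clean cookie state.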
|
||||
|
||||
|
||||
\subsection{HAR - HTTP Archive\label{sec:theor:har}}
|
||||
|
||||
\texttt{HAR} is short for HTTP Archive. Every time a web~page is loaded, the web~browser sends many requests and receives a response to each of them. Each one is stored in a table in the web~browser, where all parameters can be inspected. A request may, for example, be a \texttt{POST} request or a \texttt{GET} request \autocite{HAR}; either way, each request and its response appears as an entry in this table. The table may be downloaded for inspection, and all data for the latest web~searches can be found in it. The table is not downloaded as a \texttt{CSV}~file, but as a \texttt{.har} file. To make use of the dataset, it has to be transformed into something that Power~BI can read. This process is explained in \autoref{sec:metho}.
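As an illustration of the transformation from \texttt{.har} to \texttt{CSV}, the following is a minimal Python sketch that flattens the entries of a \texttt{HAR} document into rows. It is not the conversion script used in this work. The sample data is hand-written, and the column names other than \texttt{startedDateTime} and \texttt{time\_ms} are assumptions.

```python
import csv
import io
import json

# Tiny hand-written HAR document (illustrative data, not from the report's dataset).
SAMPLE_HAR = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {
                "startedDateTime": "2026-01-01T12:00:00.000Z",
                "time": 42.0,
                "request": {"method": "GET",
                            "url": "https://example.org/search?q=test",
                            "cookies": []},
                "response": {"status": 200,
                             "cookies": [{"name": "sid", "value": "abc"}]},
            },
            {
                "startedDateTime": "2026-01-01T12:00:01.000Z",
                "time": 13.5,
                "request": {"method": "POST",
                            "url": "https://example.org/log",
                            "cookies": [{"name": "sid", "value": "abc"}]},
                "response": {"status": 204, "cookies": []},
            },
        ],
    }
})

# Column order of the CSV output (names partly assumed, see the text above).
FIELDS = ["startedDateTime", "method", "url", "status", "time_ms",
          "request_cookie_count", "response_cookie_count"]


def har_to_rows(har_text):
    """Flatten each HAR entry (one request/response pair) into one dict."""
    entries = json.loads(har_text)["log"]["entries"]
    rows = []
    for entry in entries:
        request, response = entry["request"], entry["response"]
        rows.append({
            "startedDateTime": entry["startedDateTime"],
            "method": request["method"],
            "url": request["url"],
            "status": response["status"],
            "time_ms": entry["time"],
            "request_cookie_count": len(request.get("cookies", [])),
            "response_cookie_count": len(response.get("cookies", [])),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(har_to_rows(SAMPLE_HAR)))
```

Counting cookies per entry, rather than keeping the cookie values, keeps the table small and is enough for comparing cookie counts across entries.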
|
||||
|
||||
|
||||
Cookies are essential for many websites to work normally. They can hold login and session information for the current user. For example, when a customer is logged in to a web shop, the shopping~cart uses cookies to store the current session, so that the next time the customer opens the same website, the same items are still in the shopping~cart \autocite{MDNcookies}. This is convenient in most cases, but cookies can also be used for tracking and for personalizing ads in the web~browser.
|
||||
|
||||
|
||||
|
||||
\subsection{Proxy\label{sec:theor:proxy}}
|
||||
|
||||
|
||||
|
||||
A proxy acts as an intermediary between a client and the destination website. This means that all web traffic goes through a virtual tunnel before it reaches the destination. For every web query performed, the web traffic has to go through another server before it reaches its destination. This helps to hide the user's identity on the web because the destination can only see the intermediary and not the client itself. This means that when you visit a website, it cannot see your public IP address, only the public IP address of the proxy being used.
|
||||
|
||||
This is where Tor comes in. Tor routes traffic through multiple servers using encryption. Therefore, when a request reaches its destination, its source is very difficult to trace \autocite{TOR,CloudflareProxy}.
|
||||
|
||||
|
||||
|
||||
% %\subsection{EDA - Exploratory Data Analysis}
|
||||
|
||||
% %\subsubsection{Data preprocessing and cleaning}
|
||||
% %\subsubsection{Data reliability and consistency}
|
||||
% %\subsubsection{Data visualisation principles}
|
||||
|
||||
% %\subsection{Web traffic}
|
||||
|
||||
% % HAR files
|
||||
|
||||
% %\subsubsection{HTTP requests and responses}
|
||||
|
||||
% %
|
||||
|
||||
% %\subsubsection{Cookies and tracking parameters}
|
||||
|
||||
|
||||
% %\subsubsection{Search engines and privacy}
|
||||
|
||||
|
||||
% %%%
|
||||
% \subsection{Web Tracking Technologies}
|
||||
|
||||
% \subsubsection{HTTP Requests and Responses}
|
||||
|
||||
% \subsubsection{Cookies and Tracking Parameters}
|
||||
|
||||
% \subsection{Data Collection and Preprocessing}
|
||||
|
||||
% \subsubsection{HAR Files}
|
||||
|
||||
% \subsubsection{Data Preprocessing and Cleaning}
|
||||
|
||||
% \subsection{Exploratory Data Analysis (EDA)}
|
||||
|
||||
% \subsection{Data Visualisation Principles}
|
||||
|
||||
|
||||
|
||||
% % “data preprocessing” → you actually perform the HAR → CSV transformation
% % “EDA” → you create concrete Power BI visualizations
% % “data pipeline” → you automate the whole workflow
% % “data reliability” → you standardize browser/proxy conditions
% % “data collection methodology” → you document the Playwright/Tor setup
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -1,6 +1,25 @@
|
||||
\section{Introduction\label{sec:intro}}
|
||||
|
||||
\begin{figure}[h]
|
||||
\subsection*{Background}
|
||||
This project was assigned by Noroff Academy. Students could choose to work in a group or independently. The topic could either be assigned by the professors and teachers at Noroff Academy or chosen by the student.
|
||||
|
||||
|
||||
For this scope, the student chose to work independently with a self-chosen topic. In the student's experience, work may lose its appeal if no curiosity drives the project. For that reason, the topic for this work is to find trends in \texttt{tracking\_hints} in web-queries. The student has a strong interest in networking, system administration, and Linux-based environments through personal projects and private experimentation. The topic was chosen to make use of these personal skills in an academic work. Overall, the main work environment and methods used in this project are taught at Noroff Academy. Because the topic is of personal interest to the student, it creates motivation to complete the project.
|
||||
|
||||
|
||||
|
||||
As visualized in \autoref{fig:intro:01_thirdparty-domains}, which presents entries returning \texttt{tracking\_hints=yes}, Google was the only Search~Engine that consistently interacted with third-party domains. This finding is significant because Google also heavily used cookies on entries associated with \texttt{tracking\_hints}, as further discussed throughout this paper.
|
||||
|
||||
|
||||
|
||||
|
||||
\subsection*{Hypothesis}
|
||||
|
||||
The student's hypothesis for this work is that DuckDuckGo and Brave will return fewer instances of \texttt{tracking\_hints} and cookies. On the other hand, the student expects Google and Bing to use cookies alongside \texttt{tracking\_hints}, as these Search~Engines are known for tracking user activity and providing personalised advertisements in the web browser.
|
||||
|
||||
|
||||
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[
|
||||
width=0.95\linewidth,
|
||||
@@ -9,10 +28,12 @@
|
||||
\label{fig:intro:01_thirdparty-domains}
|
||||
\end{figure}
|
||||
|
||||
\subsection{Background\label{sec:intro:background}}
|
||||
|
||||
\subsection{Problem statement\label{sec:intro:statement}}
|
||||
|
||||
\subsection{Research objectives\label{sec:intro:research}}
|
||||
|
||||
\subsection{Hypotheses\label{sec:intro:hypotheses}}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -31,12 +31,15 @@ Furthermore, request cookies counts and response cookies counts will be presente
|
||||
\end{figure}
|
||||
|
||||
|
||||
Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} illustrate the cookie counts for each tracking hint. DuckDuckGo and Brave show no cookies across all \texttt{tracking\_hints}. In contrast, only Bing and Google use cookies on entries identified as tracking hints.
|
||||
Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} illustrate the cookie counts for each tracking hint. DuckDuckGo and Brave show no cookies across all \texttt{tracking\_hints}. In contrast, only Bing and Google use cookies on entries identified as tracking hints. The number above the name of the Search~Engine indicates the number of cookies, and the left bar indicates how many instances of those are associated with a tracking hint. Both Bing and Google show cookie counts on requests, but only Google shows cookie counts on responses.
|
||||
|
||||
|
||||
Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} show how using a proxy influences tracking hints on the web. Using Tor as a proxy returns no response cookies on tracking hints, while some cookies still leak through in request cookies associated with tracking hints.
|
||||
|
||||
|
||||
|
||||
|
||||
\begin{figure}[h]
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[
|
||||
width=\linewidth,
|
||||
@@ -48,43 +51,7 @@ Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} i
|
||||
|
||||
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[
|
||||
width=\linewidth,
|
||||
]{figures/png/05_bing-cookies.png}
|
||||
\caption{Cookies comparison in Bing}
|
||||
\label{fig:resul:05_bing-cookies}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[
|
||||
width=\linewidth,
|
||||
]{figures/png/06_brave-cookies.png}
|
||||
\caption{Cookies comparison in Brave}
|
||||
\label{fig:resul:06_brave-cookies}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[
|
||||
width=\linewidth,
|
||||
]{figures/png/07_ddg-cookies.png}
|
||||
\caption{Cookies comparison in DuckDuckGo}
|
||||
\label{fig:resul:07_ddg-cookies}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[
|
||||
width=\linewidth,
|
||||
]{figures/png/08_google-cookies.png}
|
||||
\caption{Cookies comparison in Google}
|
||||
\label{fig:resul:08_google-cookies}
|
||||
\end{figure}
|
||||
|
||||
|
||||
Additional visualizations comparing cookies on requests and responses are included in \autoref{sec:appen:figures}.
|
||||
|
||||
% \subsection{Dataset overview}
|
||||
|
||||
|
||||
@@ -1,10 +1,136 @@
|
||||
\section{Discussion\label{sec:discu}}
|
||||
|
||||
%%% CAPTCHA with Google!! hence pdf 04
|
||||
% \subsection{Interpretation of findings}
|
||||
Throughout this paper, several statements and choices have been made without being accounted for in earlier sections. This section seeks to explain them and connect the loose threads: why \texttt{HAR} files are used, and the order and purpose of each part of the methodology. Lastly, the results are discussed in \autoref{sec:discu:result}.
|
||||
%%% Differences in Search Engines and browsers
|
||||
|
||||
% \subsection{Privacy implications}
|
||||
%%% Proxy reduces the cookies left behind on the web. Tracking hints are going down.
|
||||
|
||||
% \subsection{Reliability and limitations}
|
||||
% Use of AI.
|
||||
|
||||
% \subsection{Ethical considerations}
|
||||
% 4.1 Discussion of Methodology and Theoretical Choices
|
||||
\subsection{Discussion of Methodology and Theory\label{sec:discu:choices}}
|
||||
% Here you discuss:
|
||||
|
||||
|
||||
As explained, this work aims to connect the trends between \texttt{tracking\_hint} and cookies in web~queries. There are many ways to approach this. Each web~query had to be performed anonymously, because any cookies stored in a web~browser leak personalized data about the current user when a web~query is performed. To keep the data collection as anonymous as possible, a new browser profile was created for this purpose, and the cookies in this profile had to be deleted for each web~query. In addition, all DNS blocking and personalized configuration on the student's private network was disabled for a limited period of time. The student's personal computer was used for the data collection. The only data that might point to the student is the public IP address. Such IP addresses are dynamically assigned by the ISP, so this is not considered a crucial leakage of data.
|
||||
|
||||
The only data parameters visualized in Power~BI are those of importance for supporting the student's theories, such as \texttt{Request Cookies}, \texttt{Response Cookies}, and \texttt{tracking\_hint}.
|
||||
|
||||
|
||||
\subsubsection*{Dataset parameters}
|
||||
|
||||
For this work, only entries where \texttt{tracking\_hints=yes} are inspected. Any entry where \texttt{tracking\_hints=no} is filtered out. This follows from the main problem of this work, which is to find trends in web~queries when an entry hints at tracking. All other entries are therefore of no interest for this work.
|
||||
|
||||
The parameter \texttt{tracking\_hint} flags any web~request or response entry that could be related to tracking. A flagged entry is not necessarily tracking, only a candidate for it. The number of \texttt{tracking\_hints} therefore does not reflect how much tracking actually takes place in a given web~query. Instead, if a Search~Engine returns many \texttt{tracking\_hints}, it may reflect that the Search~Engine is cautious towards any hint of tracking, much like a motorcyclist who is wary of every car on the road. This is a good thing, because it allows the user to inspect the parameters that hint at tracking and the source of each entry. What is worth noticing, however, is when a tracking hint appears together with request or response cookies on the same entry. Cookies hold information about the current session of the user in the web~browser. If an entry returns \texttt{tracking\_hints} = yes and no cookies, the server at the other end of the request has received no cookie data about the user. In contrast, if the entry returns both cookies and \texttt{tracking\_hints} = yes, the server at the other end, which might be tracking the user's activity on the web, has personal data about the user from the web~queries.
|
||||
|
||||
|
||||
|
||||
DuckDuckGo has a high occurrence rate of \texttt{tracking\_hints}, which could have indicated that the Search~Engine did not filter out any entries that may point to tracking. However, that is not the case: the high occurrence rate reflects all cases that could be tracking, including those that other Search~Engines did not detect. A high occurrence rate of \texttt{tracking\_hints} therefore does not mean that user data is actually being tracked. It is only when an entry has tracking hint equal to yes and also returns a cookie count that user data could be tracked. In the figures in \autoref{sec:resul}, only entries from Google and Bing return cookies where \texttt{tracking\_hints} is equal to yes. When performing the data collection, the student by habit rejected cookies on the Google Search~Engine, and the Search~Engine still returned cookies, while DuckDuckGo and Brave did not return any cookies on either requests or responses.
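The rule described above, that an entry only suggests possible tracking of user data when it has \texttt{tracking\_hints} equal to yes and a non-zero cookie count, can be sketched in Python as follows. This is an illustration and not the Power~BI logic used in this work. The engine names \texttt{engine\_a} and \texttt{engine\_b}, the column names, and all values are made up.

```python
from collections import Counter

# Made-up rows; "engine_a" and "engine_b" are placeholders, not real results.
ROWS = [
    {"search_engine": "engine_a", "tracking_hint": "yes",
     "request_cookies": 0, "response_cookies": 0},
    {"search_engine": "engine_a", "tracking_hint": "yes",
     "request_cookies": 0, "response_cookies": 0},
    {"search_engine": "engine_a", "tracking_hint": "no",
     "request_cookies": 2, "response_cookies": 0},
    {"search_engine": "engine_b", "tracking_hint": "yes",
     "request_cookies": 3, "response_cookies": 1},
    {"search_engine": "engine_b", "tracking_hint": "yes",
     "request_cookies": 0, "response_cookies": 0},
]


def classify(row):
    """Label one entry according to the rule in the text."""
    if row["tracking_hint"] != "yes":
        return "filtered_out"  # entries without a tracking hint are ignored
    has_cookies = row["request_cookies"] > 0 or row["response_cookies"] > 0
    return "hint_with_cookies" if has_cookies else "hint_only"


def summarize(rows):
    """Count the labels per search engine, skipping filtered-out entries."""
    summary = {}
    for row in rows:
        label = classify(row)
        if label == "filtered_out":
            continue
        summary.setdefault(row["search_engine"], Counter())[label] += 1
    return summary


if __name__ == "__main__":
    for engine, counts in summarize(ROWS).items():
        print(engine, dict(counts))
```

In this made-up data, \texttt{engine\_a} has more tracking hints in total than it has entries with cookies, which mirrors the point that a high number of hints alone says little about cookie-based tracking.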
|
||||
|
||||
|
||||
|
||||
\subsubsection*{Tor proxy}
|
||||
|
||||
|
||||
|
||||
Tor proxy was used to help hide the user's identity on the web. As explained in \autoref{sec:theor:proxy}, a proxy routes traffic through an intermediary server before accessing the web. Tor proxy tunnels traffic through several relay servers, making direct identification of the user more difficult. This is because the destination web-server does not see the user's original IP address, but instead observes the IP address of the Tor exit relay.
|
||||
|
||||
|
||||
The results show fewer cookie entries for \texttt{tracking\_hints} when a Tor proxy is used. A likely reason is that web servers only see the IP address of the Tor exit relay, so any tracking does not point directly to the user. This makes it harder for malware or malicious actors to track the user's activity on the web. However, ordinary websites are able to identify Tor exit relays. For example, finn.no blocks incoming requests from Tor exit relays, because a Tor exit relay is not treated as a trusted source. Trusted sources are typically home routers in ordinary households using, for example, Telenor as ISP, since anyone accessing the web through Telenor is assumed to be a person in Norway. Interestingly, even when a Tor proxy is used, only Google and Bing return cookies on requests, and only Google returns cookies on responses as well. This means that Google still uses cookies when the user is behind a proxy.
|
||||
|
||||
|
||||
|
||||
|
||||
%%Tor
|
||||
|
||||
|
||||
|
||||
\subsubsection*{Working environment} %% such as Playwright, why it was used instead of manually performing web queries
|
||||
|
||||
%The working environment for this project was segregated into two steps. One for data collection, and one for data analysis. The process of data collection used in this project is out of the scope for the learning materials at Noroff Academy. However, the EDA process plays a crucial role for this process. What parameters to show, how to retrieve data efficiently, etc. The count of different queries is twelve, meaning, it would take too much time resources to collect this data doing it manually. Codex was used to create the automated scripts to automate the process of retrieving data, see sections \ref{sec:appen:auto}. However, the student has investigated the dataset, and anyone can investigate them themselves. After inspecting the automation scripts, it can be confirmed that the only thing it does is to open a web~browser with a qeury using the Search~Engine of choice, and downloading the result into a raw \texttt{HAR} file. It is essensially doing the same as explained in Methodology, visualized in \autoref{fig:metho:manually_har} and \autoref{fig:metho:export_har}, over and over again. Just to confirm, this process is not plagiarism, it is an automated process of collecting a large amount of data. And larger the dataset is, the more trustworthy is the results. Because any outliers or tendencies that do not follow the overall trend tend to be neglected in a larger data collections.
|
||||
|
||||
|
||||
|
||||
The working environment for this project was divided into two steps: one for data collection and one for data analysis.
|
||||
|
||||
|
||||
|
||||
\subsubsection*{Working environment - data collection}
|
||||
|
||||
|
||||
|
||||
The data collection process used in this project is outside the scope of the learning materials at Noroff Academy. However, the EDA process plays a crucial role in it, for example in deciding which parameters to show and how to retrieve data efficiently. There are twelve different queries, so collecting this data manually would take too much time and resources. Codex was used to create the scripts that automate the data retrieval, see section~\ref{sec:appen:auto}. The student has inspected the dataset, and anyone can inspect it themselves. After inspecting the automation scripts, it can be confirmed that the only thing they do is open a web~browser with a query using the Search~Engine of choice and download the result as a raw \texttt{HAR} file. This is essentially the same as the procedure explained in the Methodology, visualized in \autoref{fig:metho:manually_har} and \autoref{fig:metho:export_har}, repeated throughout the data collection process. A larger dataset may reduce the impact of isolated outliers or irregular observations that do not reflect the overall trend.
|
||||
|
||||
|
||||
|
||||
\subsubsection*{Working environment - data analysis}
|
||||
|
||||
|
||||
For this part, Power~Query is used to perform data analysis on a large dataset. With twelve different queries performed once in each browser and Search~Engine of choice, the dataset would be too large for an ordinary Excel sheet. Power~BI is designed to handle such large datasets, first by converting the data tables and then by merging them into one large dataset. Since one \texttt{HAR} file may contain between approximately 2 and 500 entries, it is reasonable to expect that the full dataset is too big for Excel to handle comfortably; on a typical student computer, Excel starts to lag after about 200 entries. That is the reason for performing the data analysis in Power~BI. The automation scripts do not record which proxy and browser were used. A student could instruct the automation scripts to use a specific proxy and browser, but the output \texttt{HAR} files do not indicate either. That is why \autoref{lst:appen:bash} was created: it retrieves raw data in a controlled environment and sorts the results into different folders, one per combination of proxy and browser. The \texttt{HAR} files are stored in a folder \texttt{./data/} inside the case folder, because each case produces 48 \texttt{HAR} files. Sorting all \texttt{HAR} files into \texttt{./data/} inside each case folder made it much cleaner to run the Python scripts, inspect the content of each case folder, and later load the data into separate tables in Power~BI.
|
||||
|
||||
|
||||
|
||||
\autoref{lst:appen:pq} is an example of how one table was read into Power~BI. This was done once for each of the four cases that result from using two browsers and two proxies, and the methodology is identical for all four. \autoref{lst:appen:pq} performs data cleaning and data conversion. Power~BI automatically recognizes data types when a new data source is loaded, so the only thing that needed to be done was to make sure that all conversion was finished, then duplicate the query and change the data source according to which table to read into Power~BI. For each query, two columns were added that hardcode the Proxy and the Browser of the case. This is important because \autoref{lst:appen:merge} merges all queries into one large table that is used to create all the figures in \autoref{sec:resul}. Without hardcoded variables for Proxy and Browser, it would not be possible to filter on them when creating those figures.
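The idea of tagging each case table with its Proxy and Browser before merging can be sketched in Python as follows. This is an analogue for illustration only, not the Power~Query code in \autoref{lst:appen:pq} or \autoref{lst:appen:merge}. The function names, the sample rows, and the label \texttt{none} for the case without Tor are assumptions.

```python
# Illustrative per-case tables, keyed by (proxy, browser); all rows are made up.
CASES = {
    ("none", "Chromium"): [{"url": "https://example.org/a", "tracking_hint": "yes"}],
    ("none", "Firefox"): [{"url": "https://example.org/b", "tracking_hint": "no"}],
    ("Tor", "Chromium"): [{"url": "https://example.org/c", "tracking_hint": "yes"}],
    ("Tor", "Firefox"): [{"url": "https://example.org/d", "tracking_hint": "yes"}],
}


def tag_rows(rows, proxy, browser):
    """Return copies of the rows with hardcoded Proxy and Browser columns."""
    return [{**row, "Proxy": proxy, "Browser": browser} for row in rows]


def merge_cases(cases):
    """Tag every case table and append all of them into one large table."""
    merged = []
    for (proxy, browser), rows in cases.items():
        merged.extend(tag_rows(rows, proxy, browser))
    return merged


def filter_rows(rows, **conditions):
    """Keep rows whose columns equal all the given values."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in conditions.items())]


if __name__ == "__main__":
    merged = merge_cases(CASES)
    print(len(merged), "rows in the merged table")
    print(filter_rows(merged, Proxy="Tor", Browser="Firefox"))
```

Because the proxy and browser are not stored in the raw rows, filtering the merged table by \texttt{Proxy} or \texttt{Browser} only works after the tagging step.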
|
||||
|
||||
|
||||
Power~BI makes it possible to select only the variables of interest when creating figures. This saves time during data analysis, because the student can ignore any parameter of no interest for this work and choose only the parameters that matter when creating figures. For example, parameters such as \texttt{Proxy}, \texttt{Browser}, \texttt{Search Engine}, and \texttt{tracking\_hint} are essential for this work. In contrast, parameters such as \texttt{time\_ms} and \texttt{startedDateTime} are not essential for the results.
|
||||
|
||||
|
||||
|
||||
Due to the size and complexity of the dataset, the scope of the analysis had to be narrowed to selected variables and trends relevant to the research topic. Some additional parameters would be interesting to analyse for how they trend with \texttt{tracking\_hints}. However, this work has already produced many figures, and creating too many would be overwhelming without supporting the overall findings. In future work, it would be interesting to look into the parameter \texttt{is\_third\_party\_domain} and visualize how it trends with other parameters. In this work, \texttt{is\_third\_party\_domain} is only used in \autoref{sec:intro} as an introduction to what the trends may indicate.
|
||||
|
||||
% why HAR files were used
% why cookies/\texttt{tracking\_hints} were chosen
% why EDA was relevant
% why visualizations were useful
% why controlled environments were necessary
% proxy/Tor as a methodological choice
% reliability limitations
% preprocessing challenges
% noisy/unstructured data

% This matches the syllabus very well regarding:
|
||||
|
||||
|
||||
|
||||
\subsection{Discussion of the Result\label{sec:discu:result}}
|
||||
|
||||
|
||||
|
||||
|
||||
Now that the dataset, working environment, and parameters have been discussed, the findings and results of this work can be further analyzed. This includes discussing how the findings may affect users and what can be learned from the statistics presented in this work.
|
||||
|
||||
|
||||
All discussion in this section refers to the figures in \autoref{sec:resul}. DuckDuckGo appears to identify many instances of \texttt{tracking\_hints}, while the other Search~Engines show fewer. However, as discussed in \autoref{sec:discu:choices}, that does not mean that DuckDuckGo leaks user information. It means that DuckDuckGo is the most aggressive Search~Engine when it comes to flagging possible hints of tracking. Keep in mind that all these results come from single web-queries, not from browsing over a longer period of time, and that every query started with a clean slate of cookies, so the web~browser and Search~Engines had no previous data on the user. This is because Playwright performs each web-query with no previous data when a new query instance is started, which is more efficient than manually deleting cookies between each query for each Browser and Search~Engine. In total, 192 web-queries were performed for this work, which is why the automation scripts came in handy.
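The total of 192 can be checked with a short calculation, assuming the four Search~Engines named in this report (Bing, Brave, DuckDuckGo, and Google) and the four cases formed by two browsers and two proxies:
\[
12~\text{queries} \times 4~\text{Search Engines} = 48~\text{HAR files per case},
\qquad
48 \times 4~\text{cases} = 192~\text{web-queries}.
\]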
|
||||
|
||||
|
||||
One aspect that can be learned from the results is how data security relates to web tracking and cookies. Google tries to track activity on web even when you submit do not share any cookies, which the student did by habit. What's common is that using tools that promote security actually work. Take for instance \autoref{fig:appen:08_google-cookies} in \autoref{sec:appen:figures} shows that using Firefox actually reduces cookies on entries with \texttt{tracking\_hints}, and overall Firefox caches fewer \texttt{tracking\_hints} than Chromium. And using a Proxy such as Tor does reduce cookies on \texttt{tracking\_hints}. Overall, only Brave and DuckDuckGo do not return any cookies on \texttt{tracking\_hints}, while Goolge and Bing do.

% reliability
% data quality
% error handling
% EDA
% communication of findings

% 4.2 Discussion of Results

% Here you discuss:

% what the results actually showed
% differences between Google/Bing and DDG/Brave
% the proxy effect
% \texttt{tracking\_hints} without cookies
% patterns
% anomalies
% real-world implications

@@ -1,5 +1,21 @@
\clearpage
\section{Conclusion\label{sec:concl}}

This section concludes the work presented in this paper.

%From the discussion in \autoref{sec:discu} regarding the findings in \autoref{sec:resul}, it can be concluded that Google appears to be a less privacy-focused Search~Engine when it comes to tracking-related behavior, and Bing follows Google just behind. Brave and DuckDuckGo were the only ones that respected privacy in tracking-related behavior, and DuckDuckGo aggressively identifies more tracking behavior than all the other cookies. After all, there is no privacy leakage related to that due to cookies leakage on DuckDuckGo.

From the discussion in \autoref{sec:discu} regarding the findings in \autoref{sec:resul}, it can be concluded that Google appears to be a less privacy-focused Search~Engine when it comes to tracking-related behavior, with Bing following closely behind. Brave and DuckDuckGo were the only Search~Engines that consistently respected privacy in terms of tracking-related behavior. DuckDuckGo also identified significantly more instances of \texttt{tracking\_hints} than the other Search~Engines. However, despite these detections, DuckDuckGo did not return cookies related to those entries, indicating that identifying potential tracking-related behavior does not necessarily imply privacy leakage through cookies. Using Firefox rather than Chromium, and using a Tor proxy, also reduced instances of cookies related to \texttt{tracking\_hints}.
%The student's hypothesis were partly true. The student though that using tools related to privacy-focus would reduce tracking instances, which was right. However, what surprised is that DuckDuckGo had a high count to \texttt{tracking\_hints}, however no cookies were related to those instaces.

The student's hypothesis was partly correct. The student expected that using privacy-focused tools would reduce tracking-related instances, which proved to be the case. However, it was surprising that DuckDuckGo returned a high count of \texttt{tracking\_hints}, while no cookies were associated with those instances. For future work, it would be relevant to further investigate the parameter \texttt{is\_third\_party\_domain}, as mentioned in \autoref{sec:discu}.

% \subsection{Summary}
@@ -1,6 +1,58 @@
\section{Appendices\label{sec:appen}}

\subsection{Additional visualizations of Browser-specific cookies\label{sec:appen:figures}}

\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/05_bing-cookies.png}
\caption{Cookies comparison in Bing}
\label{fig:appen:05_bing-cookies}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/06_brave-cookies.png}
\caption{Cookies comparison in Brave}
\label{fig:appen:06_brave-cookies}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/07_ddg-cookies.png}
\caption{Cookies comparison in DuckDuckGo}
\label{fig:appen:07_ddg-cookies}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/08_google-cookies.png}
\caption{Cookies comparison in Google}
\label{fig:appen:08_google-cookies}
\end{figure}

\subsection{Automation scripts\label{sec:appen:auto}}
@@ -52,3 +104,6 @@
label={lst:appen:merge}
]{scripts/power_query_merge.txt}