report finished

This commit is contained in:
2026-05-28 17:48:53 +02:00
parent 78ce34d1bc
commit 77be37b8f5
9 changed files with 185 additions and 15 deletions

2
.gitignore vendored
View File

@@ -9,7 +9,7 @@
# PDF (valgfritt)
*.pdf
!main.pdf
# Temporary
*.blg
*.bbl

View File

@@ -1 +0,0 @@
{"rule":"WANT_TO_NN","sentence":"^\\QThe main priority is the variable Trackinghints, as the work tries to identity correspondanses between trackings and cookies.\\E$"}

View File

@@ -28,7 +28,7 @@
}
],
"ltex.language": "en-GB",
"ltex.language": "en-US",
"ltex.enabled": [
"latex"
],

BIN
report/main.pdf Normal file

Binary file not shown.

View File

@@ -53,7 +53,15 @@
\begin{abstract}
\label{abs:abstr}
\centering
\lipsum[1]
%This report aims to find trends in Web~Search activity connected to \texttt{tracking\_hints} and cookies. Playwright is used automate the process of performing web-queries. Data is downloaded as \texttt{HAR}, and transformed to \texttt{CSV} files using automation scripts. \texttt{CSV} files are loaded into Power~BI, to be transformed and visualized using Power~Query. It is found that private-focused tools tend to respect privacy in web-queries. And at last, Google leaks all user privacy.
This report aims to find trends in Web~Search activity connected to \texttt{tracking\_hints} and cookies. Playwright is used to automate the process of performing web-queries. Data is downloaded as \texttt{HAR} files and transformed into \texttt{CSV} files using automation scripts. The \texttt{CSV} files are loaded into Power~BI, where the data is transformed and visualized using Power~Query. It is found that privacy-focused tools tend to better respect privacy in web-queries, while Google shows significantly more tracking-related behavior and privacy leakage compared to the other Search~Engines.
\end{abstract}
\clearpage
\tableofcontents

View File

@@ -1,6 +1,25 @@
\section{Introduction\label{sec:intro}}
\begin{figure}[h]
\subsection*{Background}
This project was assigned by Noroff Academy. Students could choose to work either in a group or independently. The topic could either be assigned by the professors and teachers at Noroff Academy or be chosen by the student.
For this project, the student chose to work independently on a self-chosen topic. In the student's experience, a project can lose its appeal if there is no curiosity driving it. For that reason, the topic of this work is to find trends in \texttt{tracking\_hints} in web-queries. The student has a strong interest in networking, system administration, and Linux-based environments, developed through personal projects and private experimentation. The topic was chosen to make use of these personal skills in an academic work. The main work environment and methods used in this project are taught at Noroff Academy. Because the topic is of personal interest to the student, it provides motivation to complete the project.
As visualized in \autoref{fig:intro:01_thirdparty-domains}, which presents entries returning \texttt{tracking\_hints=yes}, Google was the only Search~Engine that consistently interacted with third-party domains. This finding is significant because Google also heavily used cookies on entries associated with \texttt{tracking\_hints}, as further discussed throughout this paper.
\subsection*{Hypothesis}
The student's hypothesis for this work is that DuckDuckGo and Brave will return fewer instances of \texttt{tracking\_hints} and cookies. On the other hand, the student expects Google and Bing to use cookies alongside \texttt{tracking\_hints}, as these Search~Engines are known for tracking user activity and providing personalized advertisements in the web browser.
\begin{figure}[H]
\centering
\includegraphics[
width=0.95\linewidth,
@@ -9,10 +28,12 @@
\label{fig:intro:01_thirdparty-domains}
\end{figure}
\subsection{Background\label{sec:intro:background}}
\subsection{Problem statement\label{sec:intro:statement}}
%\subsection{Research objectives\label{sec:intro:research}}
\subsection{Hypotheses\label{sec:intro:hypotheses}}

View File

@@ -39,7 +39,7 @@ Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} e
\begin{figure}[h]
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,

View File

@@ -1,10 +1,136 @@
\section{Discussion\label{sec:discu}}
%%% CAPTCHA med google!! derfor pdf 04
% \subsection{Interpretation of findings}
Throughout this paper, several statements and choices have been made without being accounted for in earlier sections. This section seeks to explain them and to connect all the loose threads. It explains why \texttt{HAR} files are used and walks through each part of the methodology in order. Lastly, the results are discussed in \autoref{sec:discu:result}.
%%% Differences in in Search Engines and browser
% \subsection{Privacy implications}
%%% Proxy reduces the cookies lest behind on the web. Tacking hint is going down.
% \subsection{Reliability and limitations}
% Use of AI.
% \subsection{Ethical considerations}
% 4.1 Discussion of Methodology and Theoretical Choices
\subsection{Discussion of Methodology and Theory\label{sec:discu:choices}}
% Her diskuterer du:
As explained, this work aims to connect the trends between \texttt{tracking\_hint} and cookies in web~queries. There are many ways to approach this. Each web~query had to be performed anonymously, because any cookies stored in a web~browser leak personalized data about the current user when a web~query is performed. To make the data collection for the EDA as anonymous as possible, a new browser profile was created for this purpose. However, for each web-query, the cookies in this new browser profile had to be deleted. In addition, all DNS blocking and personalized configuration on the student's private network was disabled for a limited period of time. The student's personal computer was used in this EDA process. The only data that might point to the student is the public IP address. Such IP addresses are dynamically assigned by the ISP, so this is not considered a crucial leakage of data.
The only parameters visualized in Power~BI are those that are important for supporting the student's theories, such as \texttt{Request Cookies}, \texttt{Response Cookies} and \texttt{tracking\_hint}.
\subsubsection*{Dataset parameters}
For this work, only entries where \texttt{tracking\_hints=yes} are inspected. Any entry where \texttt{tracking\_hints=no} is filtered out. This follows from the main problem of this work, which is to find trends in web~queries when an entry hints at tracking. All other entries are therefore of no interest for this work.
The parameter \texttt{tracking\_hint} indicates whether an entry in a web~request or response could be related to tracking. A flagged entry is not necessarily tracking; it is only an indication that it may be. This means that the number of \texttt{tracking\_hints} does not reflect how much tracking actually takes place in a given web~query. Instead, if a Search~Engine returns many \texttt{tracking\_hints}, it may reflect that the Search~Engine is paranoid toward anything that hints at tracking, much like a motorcyclist may be wary of every car on the road. This is a good thing, because it allows the user to inspect the entries that hint at tracking and to see where each entry originates. What is worth noticing, however, is when entries with a tracking hint also show request or response cookies. Cookies are pieces of data about the user's current session in the web~browser. If an entry returns \texttt{tracking\_hints=yes} and no cookies, the server on the other end, from which the response is retrieved, has no data on the user. In contrast, if the entry returns both cookies and \texttt{tracking\_hints=yes}, the server on the other end that might track the user's activity on the web has personal data on the user from the web~queries.
DuckDuckGo has a high occurrence rate of \texttt{tracking\_hints}, which could be taken to indicate that the Search~Engine does not filter out entries that may point to tracking. However, that is not the case: the high occurrence rate instead reflects that DuckDuckGo flags cases that could be tracking and that the other Search~Engines did not detect. A high occurrence rate of \texttt{tracking\_hints} therefore does not mean that the user's data is actually being tracked. It is only when an entry has \texttt{tracking\_hints} equal to yes and also returns a cookie count that the user's data could be tracked. In the figures in \autoref{sec:resul}, only Google and Bing return cookies on entries where \texttt{tracking\_hints} is equal to yes. When performing the data collection, the student rejected cookies on Google by habit, and the Search~Engine still returned cookies, while DuckDuckGo and Brave did not return any cookies on either requests or responses.
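The distinction between entries that are only flagged and entries that are flagged and also carry cookies can be sketched in a few lines of Python. This is a minimal illustration and not the script used in this work: the field names \texttt{search\_engine}, \texttt{tracking\_hints}, \texttt{request\_cookies} and \texttt{response\_cookies}, as well as the sample rows and the engine labels, are assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample rows; field names and values are assumptions for illustration.
SAMPLE_ENTRIES = [
    {"search_engine": "engine_a", "tracking_hints": "yes",
     "request_cookies": 2, "response_cookies": 1},
    {"search_engine": "engine_a", "tracking_hints": "no",
     "request_cookies": 0, "response_cookies": 0},
    {"search_engine": "engine_b", "tracking_hints": "yes",
     "request_cookies": 0, "response_cookies": 0},
    {"search_engine": "engine_b", "tracking_hints": "yes",
     "request_cookies": 0, "response_cookies": 0},
]


def classify_entry(entry):
    """Label one entry by its tracking hint and its total cookie count."""
    if entry["tracking_hints"] != "yes":
        # Entries without a tracking hint are outside the scope of the analysis.
        return "filtered_out"
    cookies = entry["request_cookies"] + entry["response_cookies"]
    return "hint_with_cookies" if cookies > 0 else "hint_without_cookies"


def summarize(entries):
    """Count flagged entries per (search engine, label) pair."""
    counts = Counter()
    for entry in entries:
        label = classify_entry(entry)
        if label != "filtered_out":
            counts[(entry["search_engine"], label)] += 1
    return counts


summary = summarize(SAMPLE_ENTRIES)
for key, count in sorted(summary.items()):
    print(key, count)
```

In this sketch, a Search~Engine with many flagged entries but no \texttt{hint\_with\_cookies} rows corresponds to the situation described for DuckDuckGo and Brave.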
\subsubsection*{Tor proxy}
Tor proxy was used to help hide the user's identity on the web. As explained in \autoref{sec:theor:proxy}, a proxy routes traffic through an intermediary server before accessing the web. Tor proxy tunnels traffic through several relay servers, making direct identification of the user more difficult. This is because the destination web-server does not see the user's original IP address, but instead observes the IP address of the Tor exit relay.
The results show fewer cookie entries for \texttt{tracking\_hints} when a Tor proxy is used. This is presumably because web servers only observe the IP address of the Tor exit relay, so any tracking does not point back to the user. As a result, malware or unethical hackers cannot easily track the user's activity on the web. However, ordinary websites are able to identify Tor exit relays. For example, finn.no blocks incoming requests from Tor exit relays, because a Tor exit relay is not treated as a trusted source of traffic. Trusted sources are normally home routers in ordinary households that use, for example, Telenor as ISP, since anyone accessing the web through Telenor is assumed to be a person in Norway. Interestingly, even when a Tor proxy is used, only Google and Bing return cookies on requests, and only Google returns cookies on responses as well. This means that Google still uses cookies when the user is behind a proxy.
%%Tor
\subsubsection*{Working environment} %% such as Playwright, why it was used instead of manually performing web queires
%The working environment for this project was segregated into two steps. One for data collection, and one for data analysis. The process of data collection used in this project is out of the scope for the learning materials at Noroff Academy. However, the EDA process plays a crucial role for this process. What parameters to show, how to retrieve data efficiently, etc. The count of different queries is twelve, meaning, it would take too much time resources to collect this data doing it manually. Codex was used to create the automated scripts to automate the process of retrieving data, see sections \ref{sec:appen:auto}. However, the student has investigated the dataset, and anyone can investigate them themselves. After inspecting the automation scripts, it can be confirmed that the only thing it does is to open a web~browser with a qeury using the Search~Engine of choice, and downloading the result into a raw \texttt{HAR} file. It is essensially doing the same as explained in Methodology, visualized in \autoref{fig:metho:manually_har} and \autoref{fig:metho:export_har}, over and over again. Just to confirm, this process is not plagiarism, it is an automated process of collecting a large amount of data. And larger the dataset is, the more trustworthy is the results. Because any outliers or tendencies that do not follow the overall trend tend to be neglected in a larger data collections.
The working environment for this project was split into two steps: one for data collection and one for data analysis.
\subsubsection*{Working environment - data collection}
The data collection process used in this project is outside the scope of the learning materials at Noroff Academy. However, the EDA process plays a crucial role in it, for example in deciding which parameters to show and how to retrieve data efficiently. There are twelve different queries, meaning it would take too much time and too many resources to collect this data manually. Codex was used to create the scripts that automate the data retrieval, see \autoref{sec:appen:auto}. The student has inspected the dataset, and anyone can inspect it themselves. After inspecting the automation scripts, it can be confirmed that the only thing they do is open a web~browser with a query using the Search~Engine of choice and download the result as a raw \texttt{HAR} file. This is essentially the same as the procedure explained in the Methodology, visualized in \autoref{fig:metho:manually_har} and \autoref{fig:metho:export_har}, repeated throughout the data collection process. A larger dataset may reduce the impact of isolated outliers or irregular observations that do not reflect the overall trend.
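To illustrate what a raw \texttt{HAR} file contains, the following Python sketch reads the \texttt{log.entries} list defined by the HAR format and counts the request and response cookies of each entry. It is a simplified illustration under stated assumptions, not the automation script in \autoref{sec:appen:auto}: the sample document, its URLs and cookie values are made up, and the output column names are chosen for the example. How \texttt{tracking\_hints} is derived is not part of the HAR format and is therefore left out.

```python
import json

# Minimal hand-written HAR document; URLs and cookie values are made up.
SAMPLE_HAR = json.loads("""
{"log": {"entries": [
  {"startedDateTime": "2026-01-01T00:00:00.000Z", "time": 12.5,
   "request": {"method": "GET", "url": "https://example.com/search?q=test",
               "cookies": [{"name": "sid", "value": "abc"}]},
   "response": {"status": 200, "cookies": []}},
  {"startedDateTime": "2026-01-01T00:00:01.000Z", "time": 3.0,
   "request": {"method": "GET", "url": "https://example.org/pixel.gif",
               "cookies": []},
   "response": {"status": 200,
                "cookies": [{"name": "uid", "value": "1"},
                            {"name": "t", "value": "2"}]}}
]}}
""")


def har_to_rows(har):
    """Flatten the entries of a parsed HAR document into one row per entry."""
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        rows.append({
            "startedDateTime": entry["startedDateTime"],
            "time_ms": entry["time"],
            "url": request["url"],
            # Cookie lists may be empty; only their length is kept here.
            "request_cookies": len(request.get("cookies", [])),
            "response_cookies": len(response.get("cookies", [])),
        })
    return rows


rows = har_to_rows(SAMPLE_HAR)
for row in rows:
    print(row["url"], row["request_cookies"], row["response_cookies"])
```

Rows of this shape could then be written to a \texttt{CSV} file with the standard \texttt{csv} module before being loaded into Power~BI.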
\subsubsection*{Working environment - data analysis}
For this part, Power~Query is used to analyze a large dataset. With twelve different queries performed once for each browser and Search~Engine of choice, the dataset would be too large for an ordinary Excel sheet. Power~BI is designed to handle such large datasets, first converting the individual tables and then merging them into one large dataset. Since one \texttt{HAR} file can contain approximately between 2 and 500 entries, it is reasonable that the dataset is too big for Excel to handle. Excel only handles about 200 entries before it starts to lag on a normal student's computer. This is the reason for performing the data analysis in Power~BI. The automation scripts do not record which proxy and which browser were used. This means a student could prompt the automation scripts to use a specific proxy and browser, but the output \texttt{HAR} files do not indicate the proxy or browser of choice. That is why \autoref{lst:appen:bash} was created to retrieve raw data in a controlled environment, sorting the results into different folders that separate each proxy and browser of choice. The \texttt{HAR} files are stored in a folder \texttt{./data/} inside each case folder, because each case created 48 \texttt{HAR} files. Sorting all \texttt{HAR} files into \texttt{./data/} made running the Python scripts much cleaner, both when viewing the content of each case folder and when later loading the data into separate tables in Power~BI.
\autoref{lst:appen:pq} is an example of how one table was read into Power~BI. This was done once for each of the four cases that result from using two browsers and two proxies, and the method is identical for all four. \autoref{lst:appen:pq} performs data cleaning and data conversion. Power~BI automatically recognizes data types when a new data source is loaded. The only thing that needed to be done was therefore to make sure that all conversion was finished, then duplicate the query and change the data source according to which table to read into Power~BI. For each query, two columns were added: one to hardcode the Proxy of the case and one to hardcode the Browser of the case. This is important because \autoref{lst:appen:merge} merges all queries into one large table that is used to create all the figures in \autoref{sec:resul}. If the dataset did not have hardcoded variables for Proxy and Browser, it would not be possible to filter on those variables when creating the figures.
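The idea of hardcoding \texttt{Proxy} and \texttt{Browser} before merging can also be expressed in Python. The sketch below is an analogy to the Power~Query steps and not a translation of \autoref{lst:appen:pq} or \autoref{lst:appen:merge}: the function names, the sample rows, the engine labels, and the proxy label \texttt{No proxy} are assumptions made for the example.

```python
def tag_rows(rows, proxy, browser):
    """Return copies of the rows with constant Proxy and Browser columns added."""
    return [{**row, "Proxy": proxy, "Browser": browser} for row in rows]


def merge_cases(cases):
    """Tag the rows of every case and append them into one combined table."""
    merged = []
    for (proxy, browser), rows in cases.items():
        merged.extend(tag_rows(rows, proxy, browser))
    return merged


# Hypothetical per-case tables, keyed by (proxy, browser); values are made up.
CASES = {
    ("Tor", "Firefox"): [
        {"Search Engine": "engine_a", "tracking_hint": "yes"},
    ],
    ("Tor", "Chromium"): [
        {"Search Engine": "engine_a", "tracking_hint": "no"},
    ],
    ("No proxy", "Firefox"): [
        {"Search Engine": "engine_b", "tracking_hint": "yes"},
    ],
    ("No proxy", "Chromium"): [
        {"Search Engine": "engine_b", "tracking_hint": "yes"},
        {"Search Engine": "engine_a", "tracking_hint": "yes"},
    ],
}

merged = merge_cases(CASES)

# Filtering by case is only possible because the columns were added before merging.
tor_firefox = [
    row for row in merged
    if row["Proxy"] == "Tor" and row["Browser"] == "Firefox"
]
print(len(merged), len(tor_firefox))
```

Without the two added columns, the rows of the merged table would be indistinguishable by case, which is the reason the columns are hardcoded per query.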
Power~BI makes it possible to select only the variables of interest when creating figures. This saves time during data analysis, because the student could ignore any parameter of no interest to this work and choose only the parameters that are important when creating figures. For example, parameters such as \texttt{Proxy}, \texttt{Browser}, \texttt{Search Engine} and \texttt{tracking\_hint} are essential for this work. In contrast, parameters such as \texttt{time\_ms} and \texttt{startedDateTime} are not essential for the results.
Due to the size and complexity of the dataset, the scope of the analysis had to be narrowed to selected variables and trends relevant to the research topic. Some additional parameters would be interesting to analyze with respect to how they trend with \texttt{tracking\_hints}. However, this work has already produced many figures, and creating too many figures would be overwhelming rather than supporting the overall findings. In future work, it would be interesting to look into the parameter \texttt{is\_third\_party\_domain} and visualize how it trends with other parameters. In this work, the parameter \texttt{is\_third\_party\_domain} is only used in \autoref{sec:intro} as an introduction to what the trends may indicate.
% hvorfor HAR-filer ble brukt
% hvorfor cookies/\texttt{tracking\_hints} ble valgt
% hvorfor EDA var relevant
% hvorfor visualiseringer var nyttige
% hvorfor controlled environments var nødvendig
% proxy/Tor som metodevalg
% reliability limitations
% preprocessing challenges
% noisy/unstructured data
% Dette matcher veldig godt pensum om:
\subsection{Discussion of the Result\label{sec:discu:result}}
Now that the dataset, working environment, and parameters have been discussed, the findings and results of this work can be further analyzed. This includes discussing how the findings may affect users and what can be learned from the statistics presented in this work.
All discussion in this section refers to the figures in \autoref{sec:resul}. DuckDuckGo appears to identify many more instances of \texttt{tracking\_hints} than the other Search~Engines. However, as discussed in \autoref{sec:discu:choices}, that does not mean that DuckDuckGo leaks user information. It means that DuckDuckGo is the most aggressive Search~Engine when it comes to flagging possible hints of tracking. Keep in mind that all these results come from single web-queries, not from browsing the web over a longer period of time. All queries were also performed with a clean slate of cookies, meaning that the web~browser and Search~Engine had no previous data on the user. This is because Playwright performs each web-query with no previous data when a new query instance is started, which is more efficient than manually deleting cookies between queries for each Browser and Search~Engine. In total, 192 web-queries were performed for this work, which is why the automation scripts came in handy.
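The total of 192 web-queries is consistent with twelve queries run for four Search~Engines, two browsers and two proxies, and with 48 \texttt{HAR} files per case. The short Python sketch below reproduces this count. It assumes that the four Search~Engines are the ones named in this paper; the query names and the proxy label \texttt{No proxy} are placeholders introduced for the example.

```python
from itertools import product

SEARCH_ENGINES = ["Google", "Bing", "DuckDuckGo", "Brave"]
BROWSERS = ["Firefox", "Chromium"]
PROXIES = ["Tor", "No proxy"]  # second label is an assumption
QUERIES = [f"query_{i:02d}" for i in range(1, 13)]  # placeholder names

# Every combination corresponds to one web-query and one HAR file.
runs = list(product(PROXIES, BROWSERS, SEARCH_ENGINES, QUERIES))

# A case is one combination of proxy and browser.
runs_per_case = {}
for proxy, browser, engine, query in runs:
    key = (proxy, browser)
    runs_per_case[key] = runs_per_case.get(key, 0) + 1

print(len(runs))
print(runs_per_case)
```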
One aspect that can be learned from the results is how data security relates to web tracking and cookies. Google still tries to track activity on the web even when the user chooses not to share any cookies, which the student did by habit. A common finding is that tools that promote security actually work. For instance, \autoref{fig:appen:08_google-cookies} in \autoref{sec:appen:figures} shows that using Firefox reduces cookies on entries with \texttt{tracking\_hints}, and overall Firefox registers fewer \texttt{tracking\_hints} than Chromium. Using a proxy such as Tor also reduces cookies on entries with \texttt{tracking\_hints}. Overall, only Brave and DuckDuckGo do not return any cookies on entries with \texttt{tracking\_hints}, while Google and Bing do.
% reliability
% data quality
% error handling
% EDA
% communication of findings
% 4.2 Discussion of Results
% Her diskuterer du:
% hva resultatene faktisk viste
% forskjeller mellom Google/Bing vs DDG/Brave
% proxy-effekten
% \texttt{tracking\_hints} uten cookies
% patterns
% anomalies
% real-world implications

View File

@@ -1,5 +1,21 @@
\clearpage
\section{Conclusion\label{sec:concl}}
This section concludes the overall work throughout this paper.
%From the discussion in \autoref{sec:discu} regarding the findings in \autoref{sec:resul}, it can be concluded that Google appears to be a less privacy-focused Search~Engine when it comes to tracking-related behavior, and Bing follows Google just behind. Brave and DuckDuckGo were the only ones that respected privacy in tracking-related behavior, and DuckDuckGo aggressively identifies more tracking behavior than all the other cookies. After all, there is no privacy leakage related to that due to cookies leakage on DuckDuckGo.
From the discussion in \autoref{sec:discu} regarding the findings in \autoref{sec:resul}, it can be concluded that Google appears to be a less privacy-focused Search~Engine when it comes to tracking-related behavior, with Bing following closely behind. Brave and DuckDuckGo were the only Search~Engines that consistently respected privacy in terms of tracking-related behavior. DuckDuckGo also identified significantly more instances of \texttt{tracking\_hints} than the other Search~Engines. However, despite these detections, DuckDuckGo did not return cookies related to those entries, indicating that identifying potential tracking-related behavior does not necessarily imply privacy leakage through cookies. Using Firefox rather than Chromium and a Tor proxy also reduces instances of cookies related to \texttt{tracking\_hints}.
%The student's hypothesis were partly true. The student though that using tools related to privacy-focus would reduce tracking instances, which was right. However, what surprised is that DuckDuckGo had a high count to \texttt{tracking\_hints}, however no cookies were related to those instaces.
The student's hypothesis was partly correct. The student expected that using privacy-focused tools would reduce tracking-related instances, which proved to be correct. However, it was surprising that DuckDuckGo returned a high count of \texttt{tracking\_hints}, while no cookies were associated with those instances. For future work, it would be relevant to further investigate the parameter \texttt{is\_third\_party\_domain}, as mentioned in \autoref{sec:discu}.
% \subsection{Summary}