\section{Method\label{sec:metho}}
%This section describes the methodology used throughout the research process. Some technical concepts and terminology referenced in this section are further explained in the Theory section and later discussed in the Discussion section.
This section describes the methodology used in this research. Technical concepts and terminology referenced in this section are explained further in Section~\ref{sec:theor} and discussed in Section~\ref{sec:discu}.
\subsection{Research design\label{sec:metho:research_design}}
% Keywords:
% observational study
% browser-based network measurements
% same searches across search engines
% comparison between search engines, browsers, and network modes
%This research is design using tools to simulate human interaction with simple web searches. Each search is design to be anomynous, with no browser histories and cookies before each search was done. For each web search the browser history is cleaned and cookies removed. Browser profile used has no login data, so the web queries can not connect the search to any real person.
%4 Search Engines are used for this prosess, once in each webbrowser, Firefox and Chromium. Which means each web query is don 8 times. For this work, several queries are create to widen the data collection. The web quires are following:
This research uses tools that simulate human interaction with simple web searches. Each search is designed to be anonymous: before each search, the browser history is cleared and cookies are removed, so no history or cookies are stored when the search is performed. The browser profiles used contain no login data, so the web queries are not tied to any user account.
Four search engines are used (Brave, Bing, DuckDuckGo and Google), each in two web browsers, Firefox and Chromium. This means that each web query is performed eight times per network mode. Twelve queries are used to widen the data collection. The web queries are as follows:
\begin{multicols}{2}
\begin{itemize}[noitemsep, topsep=0pt]
\item weather oslo
\item migraine symptoms
\item vitamin d deficiency
\item running shoes
\item coffee grinder
\item best laptop for students
\item electric car charging
\item cheap flights to london
\item home insurance
\item python list tutorial
\item banana bread recipe
\item news norway
\end{itemize}
\end{multicols}
Data collection is performed in two network modes: through a Tor proxy, which helps hide the identity of the person performing the web searches, and over a normal network connection, where the web traffic may be used to identify the user.
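
The design above can be read as a small capture matrix. The following Python sketch enumerates it, assuming that every combination of search engine, browser, network mode and query is captured exactly once. The names \texttt{ENGINES}, \texttt{BROWSERS}, \texttt{NETWORK\_MODES} and \texttt{capture\_matrix} are illustrative and are not taken from the project scripts.

\begin{verbatim}
from itertools import product

# Illustrative labels (assumptions), based on the design described above.
ENGINES = ["brave", "bing", "duckduckgo", "google"]
BROWSERS = ["firefox", "chromium"]
NETWORK_MODES = ["normal", "tor"]
QUERIES = [
    "weather oslo", "migraine symptoms", "vitamin d deficiency",
    "running shoes", "coffee grinder", "best laptop for students",
    "electric car charging", "cheap flights to london",
    "home insurance", "python list tutorial",
    "banana bread recipe", "news norway",
]


def capture_matrix():
    """Return every (engine, browser, mode, query) combination once."""
    return list(product(ENGINES, BROWSERS, NETWORK_MODES, QUERIES))


if __name__ == "__main__":
    print(len(capture_matrix()))
\end{verbatim}

Under that assumption the matrix contains $4 \times 2 \times 2 \times 12 = 192$ captures, eight per query in each network mode.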
%ata collected is filtered using either Tor proxy to hide the identity of the person premforing websearch, and not using any proxy, where any web traffic can identify you traffic.
\subsection{Test environment\label{sec:metho:test_environment}}
% Keywords:
% operating system / controlled environment
% Playwright
% Chromium and Firefox
% normal network and Tor proxy
% clean browser context
% cookies allowed
% same wait condition and timeout
%When you tap ctrl + shift + C
Pressing \texttt{Ctrl + Shift + C} in the browser and selecting the \texttt{Network} tab opens a log of the network traffic. With this panel open, the browsing history is cleared before a web search is performed manually. This gives a clean, anonymous log of the web traffic from a single web query, as shown in Figure~\ref{fig:metho:manually_har}. To the right of \texttt{No throttling} is a settings icon. Clicking that button gives the options shown in Figure~\ref{fig:metho:export_har}.
Each query could be performed manually, or the data collection could be automated. For this research, the collection of first-hand raw data was automated. The web queries are automated with Python using Playwright \parencite{Playwright}. Playwright is installed in a Python virtual environment \parencite{VENV}, and the capture is run with the script \texttt{script/capture\_search\_har.py}.
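
A minimal sketch of such an automated capture is given below. It is not the project script itself: the URL templates, the function names \texttt{build\_search\_url} and \texttt{capture\_har}, and the argument names are assumptions made for illustration. The sketch assumes a headed browser, the \texttt{load} wait condition and a timeout of 60000~ms, and it relies on Playwright writing the HAR file when the browser context is closed.

\begin{verbatim}
from urllib.parse import quote_plus

# Assumed URL templates; the project script may build URLs differently.
SEARCH_URLS = {
    "brave": "https://search.brave.com/search?q={q}",
    "bing": "https://www.bing.com/search?q={q}",
    "duckduckgo": "https://duckduckgo.com/?q={q}",
    "google": "https://www.google.com/search?q={q}",
}


def build_search_url(engine, query):
    """Return the results-page URL for one engine and one query."""
    return SEARCH_URLS[engine].format(q=quote_plus(query))


def capture_har(engine, query, browser_name, har_path, proxy_server=None):
    """Run one search in a fresh context and record it to har_path."""
    # Imported here so the URL helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser_type = getattr(p, browser_name)  # p.chromium or p.firefox
        launch_args = {"headless": False}        # headed browser
        if proxy_server:
            launch_args["proxy"] = {"server": proxy_server}
        browser = browser_type.launch(**launch_args)
        # A new context starts without cookies, history or login data.
        context = browser.new_context(record_har_path=har_path)
        page = context.new_page()
        page.goto(build_search_url(engine, query),
                  wait_until="load", timeout=60000)
        context.close()  # the HAR file is written when the context closes
        browser.close()
\end{verbatim}

For the Tor mode, \texttt{proxy\_server} could be set to a local SOCKS address such as \texttt{socks5://127.0.0.1:9050}. That port is the usual default of a locally running Tor service and is an assumption here.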
\begin{figure}[h]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/09_importing_har_manually.png}
\caption{Network traffic from a simple web search.}
\label{fig:metho:manually_har}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[
width=0.27\linewidth,
]{figures/png/10_har_options.png}
\caption{Options for downloading HAR files.}
\label{fig:metho:export_har}
\end{figure}
%\subsection{Search engines and search queries\label{sec:metho:search_engines}}
% Keywords:
% Google
% Bing
% DuckDuckGo
% Brave Search
% list of search queries
% same query used across all engines
%\subsection{Variables and measurements\label{sec:metho:Variables_measurements}}
% Keywords:
% requests_total
% unique_domains
% third_party_requests
% request_cookies_total
% response_cookies_total
% query_params_total
% post_requests_total
% tracking_hint_requests
% transferred_kb_approx
% page_load_ms
% HTTP status groups
\subsection{Data collection\label{sec:metho:data_collection}}
% Keywords:
% HAR files
% one HAR file per search engine/query/browser/network mode
% capture_search_har script
% headed browser
% wait-until load
% timeout 60000 ms
% Tor via SOCKS proxy where applicable
\subsection{Data processing\label{sec:metho:data_processing}}
% Keywords:
% HAR files converted to CSV
% har_entries.csv: one row per HAR entry/request
% har_summary.csv: one row per HAR file
% Power Query used to combine summary files
% folder names used to identify browser/network mode
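
A minimal sketch of how one HAR file could be reduced to a single summary row is given below. It assumes the standard HAR layout, in which each request is an item of \texttt{log.entries} with a \texttt{request.url} and a \texttt{request.method}. The function names and the definitions used here are assumptions for illustration: a request is counted as third-party when its hostname is neither the search engine's domain nor one of its subdomains, and \texttt{unique\_domains} counts distinct hostnames.

\begin{verbatim}
import json
from urllib.parse import urlsplit


def load_har(path):
    """Read a HAR file (JSON) from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarise_har(har, first_party_domain):
    """Return one summary row (a dict) for one parsed HAR file."""
    entries = har["log"]["entries"]
    hosts = [urlsplit(e["request"]["url"]).hostname or ""
             for e in entries]
    # Assumed definition: first party is the domain or a subdomain of it.
    third_party = [
        h for h in hosts
        if h != first_party_domain
        and not h.endswith("." + first_party_domain)
    ]
    return {
        "requests_total": len(entries),
        "unique_domains": len(set(hosts)),
        "third_party_requests": len(third_party),
        "post_requests_total": sum(
            1 for e in entries if e["request"]["method"] == "POST"
        ),
    }
\end{verbatim}

Each returned dictionary corresponds to one row of a summary table, and the rows from all HAR files could then be written to a CSV file with Python's \texttt{csv.DictWriter}.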
\subsection{Limitations of the method\label{sec:metho:limitations}}
% Keywords:
% HAR shows observable browser-side traffic only
% cannot prove server-side storage
% Playwright may differ from manual browsing
% Tor may change website behaviour
% cookie consent state affects results
% tracking_hint is keyword-based, not proof of tracking