Abstract |
In a networked world, content on the Web is blossoming, and it is available to anyone
who has access to a computer and the Internet. The Web is usually accessed
through a query on a standard search engine, e.g. Google. But is what we see online
all there is, or is something hidden underneath all that information?
The World Wide Web content that is not indexed by conventional search
engines is referred to as the Deep Web. This master's thesis explores
several aspects of the Deep Web concerning Personally Identifiable Information
(PII).
We conduct two extensive privacy case studies that expose Personally Identifiable
Information inside the Deep Web. First, we examine database content as Deep Web
content. To this end, we highlight the privacy issues that have arisen from the
introduction of the Greek Social Security Number (AMKA), in connection with the
availability of personally identifiable information on Greek web sites. Second, we
conduct a case study that treats document metadata as Deep Web content.
We analyze the metadata stored in over fifteen million documents (DOC,
PDF, XLS and PPT) found online and present the privacy leaks that emerge
from the analysis.
We also present countermeasures that shield our digital life against the disclosure
of sensitive information. We propose an information-retrieval-based method for information
leak detection that improves on cyclical hashing so
as to both accelerate leak detection and increase the accuracy of the result. Experiments
were conducted on real-world data to demonstrate the efficiency and effectiveness
of the proposed solution.
|