Welcome to DataSecurity, this lab is a basic demonstration of a bad hashing algorithm versus a best practice algorithm
Last Updated: 2024-01-08
Hashes are used to help identify integrity of a system. This lab demonstrates what a hash algorithm should not behave like
We will cover the following topics:
Hashing is the process of converting input of any size into a fixed-size string of text, using a mathematical function known as a hash function. The output, known as a hash value or hash code, typically looks like a random string of characters.
While hash functions are designed to minimize collisions (situations where different inputs produce the same output), they're not entirely avoidable due to the fixed size of the hash output. Good hash functions make these collisions extremely rare.
Seeing is believing
Please open this in a separate window (in Chrome) https://austriandatalab.github.io/webvm/
First some basics
user@:~$ cd examples/hashing user@:~/examples/hashing$ cat txt*
You should now see that there are 3 very similar entries in those 3 text files:
Now, lets check with the first hash function md5
and now it's your job to explain why txt2 and txt3 differ:
user@:~/examples/hashing$ md5 txt* f51996e00c0d41e8958e01982bb2232b txt1 d48ff54287ef1a430b076c2d942e5040 txt2 a8c81c60e0c485be93c4d1e991263695 txt3
In the same folder, you find two pdfs, let's have a quick look:
user@:~/examples/hashing$ ls -lah | grep pdf -rw-r--r-- 1 root root 413K Jan 8 13:10 shattered-1.pdf -rw-r--r-- 1 root root 413K Jan 8 13:10 shattered-2.pdf
Ok, now lets create the hash of those two, as well . This time we use
user@:~/examples/hashing$ shasum shattered* 38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-1.pdf 38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-2.pdf
This means, they are the same, right?
Exercise: the two files are on our github https://github.com/AustrianDataLAB/webvm/blob/features/hashinglab/examples/hashing/shattered-1.pdf and https://github.com/AustrianDataLAB/webvm/blob/features/hashinglab/examples/hashing/shattered-2.pdf
You don't have to download them, but you can in-browser view them. Would you say they are the same?
user@:~/examples/hashing$ shasum -a 256 shattered* 2bb787a73e37352f92383abe7e2902936d1059ad9f1ba6daaa9c1e58ee6970d0 shattered-1.pdf d4488775d29bdef7993367d541064dbdda50d383f89f0aa13a6ff2e0894ba5ff shattered-2.pdf
And here we go: they are not the same, so our eyes work just fine.
SQLite is a software library that provides a relational database management system (RDBMS). It is light-weight in terms of setup, database administration, and required resources.
Here are some key characteristics of SQLite:
SQLite is often used as a local database for small to medium-sized applications, as well as for development and testing. For larger applications where multiple users or high volumes of data are expected, a larger database system like MySQL, PostgreSQL, or SQL Server might be more appropriate.
Play around, you can't destroy anything
Please open this in a separate window (in Chrome) https://austriandatalab.github.io/webvm/
Don't press refresh unless you want to completely wipe all progress.
Get started in the webVM
user@:~$ cd examples/db user@:~/examples/db$ sqlite3 chinook.db
.tables
.schema albums
Here we can observe the following
SELECT
*
FROM albums LIMIT
10;
SELECT albums.Title, artists.Name FROM albums INNER JOIN artists ON albums.ArtistId = artists.ArtistId LIMIT
10;
INSERT INTO artists (Name)
VALUES
('New Artist');
UPDATE artists SET
Name
=
'Updated Artist'
WHERE
Name
=
'New Artist';
DELETE
FROM artists WHERE
Name
=
'Updated Artist';
.quit
The chinook database is a good example of a normalized data model. It has separate tables for artists, albums, tracks, genres, media types, playlists, customers, invoices, and employees. Each table has a primary key, and there are foreign keys to link related records across tables.
For example, to get the details of a track, you might use a query like this:
SELECT
tracks.Name AS Track, albums.Title AS Album, artists.Name AS Artist, genres.Name AS Genre
FROM
tracks
INNER JOIN albums ON tracks.AlbumId = albums.AlbumId
INNER JOIN artists ON albums.ArtistId = artists.ArtistId
INNER JOIN genres ON tracks.GenreId = genres.GenreId
WHERE
tracks.TrackId =
1;
This query performs well because it's able to use the relational structure of the database to efficiently gather data from multiple tables.
A bad data model might involve denormalization that leads to data redundancy and update anomalies. For example, imagine if the chinook database had a single table where each row contained all the details of a track, including the album title, artist name, and genre name. This would mean that the same album title, artist name, and genre name would be repeated in many rows.
Lets create the table first:
CREATE
TABLE
denormalized_tracks (
TrackId INTEGER
PRIMARY KEY,
TrackName TEXT
NOT NULL,
AlbumTitle TEXT,
ArtistName TEXT,
GenreName TEXT
);
Now, insert data into it
INSERT INTO denormalized_tracks (TrackId, TrackName, AlbumTitle, ArtistName, GenreName)
SELECT
tracks.TrackId, tracks.Name, albums.Title, artists.Name, genres.Name
FROM
tracks
INNER JOIN albums ON tracks.AlbumId = albums.AlbumId
INNER JOIN artists ON albums.ArtistId = artists.ArtistId
INNER JOIN genres ON tracks.GenreId = genres.GenreId;
Now, we have a table that give us a lot of flexibility (seemingly)
SELECT
TrackName, AlbumTitle, ArtistName, GenreName
FROM
denormalized_tracks
WHERE
TrackId =
1;
While this query might seem simpler, the underlying data model is inefficient. The redundancy in the data increases the size of the database and the risk of inconsistencies. Also, updates to the data (like changing the name of an artist) would require changes in many places, which can be slow and error-prone.
Think about, e.g. how would you update a typo in the GenreName for the denomalized model? (select * from denormalized_tracks limit 10)
Can you draw on paper how the update statement would be different for a denormalized vs a normalized model?
Comment: There are databases specialized in the exact opposite (e.g. column databases such as SAP Hana would prefer the denomalized model for a results table, because its engine is designed for it).
Here are some examples to inspire you, feel free to change them
CREATE
TABLE
Patients
(
PatientId INTEGER
PRIMARY KEY,
FirstName TEXT
NOT NULL,
LastName TEXT
NOT NULL,
DateOfBirth TEXT,
Gender TEXT
);
CREATE
TABLE
Treatments
(
TreatmentId INTEGER
PRIMARY KEY,
TreatmentName TEXT
NOT NULL,
Description
TEXT
);
CREATE
TABLE
Researchers
(
ResearcherId INTEGER
PRIMARY KEY,
FirstName TEXT
NOT NULL,
LastName TEXT
NOT NULL,
Department TEXT
);
CREATE
TABLE
Studies
(
StudyId INTEGER
PRIMARY KEY,
PatientId INTEGER,
TreatmentId INTEGER,
ResearcherId INTEGER,
Result TEXT,
Date
TEXT,
FOREIGN KEY
(PatientId)
REFERENCES Patients (PatientId),
FOREIGN KEY
(TreatmentId)
REFERENCES Treatments (TreatmentId),
FOREIGN KEY
(ResearcherId)
REFERENCES Researchers (ResearcherId)
);
SQLite doesn't have built-in data masking features like some other database systems. However, you can achieve a similar effect by creating views that exclude the sensitive columns.
For example, if the Patients table has a sensitive column DateOfBirth, you can create a view that includes all columns except this one:
CREATE
VIEW
Patients_Public
AS
SELECT PatientId, FirstName, LastName, Gender
FROM Patients;
Now, you can query Patients_Public instead of Patients to get patient information without revealing any dates of birth:
SELECT
*
FROM Patients_Public;
Remember, views do not provide security against unauthorized access. They are just a convenience for queries. To protect sensitive data, you should implement proper access controls at the database or application level.
For more advanced data masking, you would typically use a more feature-rich database system, or handle the masking at the application level. For example, you could replace sensitive data with placeholders or hashed values before inserting it into the database.
Discuss if the following is a good idea:
CREATE
TABLE
AnonymizedPatients
AS
SELECT
PatientId,
'Patient '
|| PatientId AS FirstName,
NULL
AS LastName,
NULL
AS DateOfBirth,
Gender
FROM
Patients;
Please see
https://www.enisa.europa.eu/publications/data-protection-engineering
https://www.enisa.europa.eu/publications/deploying-pseudonymisation-techniques
Congratulations, you've successfully completed this training on hashing collisions and some datamodelling.