Exploring semantic footprints of books
Lyrical footprints
A few years ago, Vox released a video about diagrams made from lyrics of songs which let you visualise their repetition. Here’s the so-called “lyrical footprint” for “Radioactive” by Imagine Dragons:
How are these diagrams constructed? Imagine that the lyrics of the song run both horizontally and vertically across the top and side of the diagram; when the same word appears in two different places in the song, the cell where those two positions meet is highlighted.
In order to avoid highlighting every occurrence of very common words (causing a checkerboard pattern across the whole diagram) only repeated phrases are actually highlighted. The diagram for “Radioactive” above won’t be lit up at every occurrence of “the”, but it will be lit up for the refrain “I’m radioactive, radioactive”.
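The construction described above can be sketched in a few lines of Python. This is only one possible way to implement the "repeated phrases only" filter, assuming a phrase means a diagonal run of at least `min_run` matching words; it is not necessarily how SongSim does it, and the names `tokenize`, `footprint` and `min_run` are made up for this illustration.

```python
PUNCTUATION = ".,!?;:\"'()"


def tokenize(lyrics):
    """Lowercase the lyrics, split on whitespace and strip basic punctuation."""
    stripped = (word.strip(PUNCTUATION).lower() for word in lyrics.split())
    return [word for word in stripped if word]


def footprint(words, min_run=2):
    """Return an n x n grid of booleans.

    Cell (i, j) is lit when words i and j are equal AND that match belongs
    to a diagonal run of at least `min_run` consecutive matching words.
    The run-length rule is an assumption made for this sketch.
    """
    n = len(words)
    match = [[words[i] == words[j] for j in range(n)] for i in range(n)]
    lit = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if not match[i][j]:
                continue
            # Only measure a run from its first cell, not from the middle.
            if i > 0 and j > 0 and match[i - 1][j - 1]:
                continue
            length = 0
            while i + length < n and j + length < n and match[i + length][j + length]:
                length += 1
            if length >= min_run:
                for k in range(length):
                    lit[i + k][j + k] = True
    return lit


words = tokenize("the cat sat, the dog ran, the cat sat")
grid = footprint(words)
for row in grid:
    print("".join("#" if cell else "." for cell in row))
```

With `min_run=2`, the isolated repeats of "the" stay dark while the repeated phrase "the cat sat" lights up as a short diagonal away from the main one.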
However, one feature of these diagrams is that only literal repetitions of phrases are highlighted, not repetitions in theme. No matter how similar two phrases are in meaning, they won’t be lit up unless the words match. For example, in “A Day in the Life” by The Beatles, the first and third verse start like so:
I read the news today, oh boy
About a lucky man who made the grade
...and...
I saw a film today, oh boy
The English Army had just won the war
The opening lines “I read the news today, oh boy” and “I saw a film today, oh boy” are very semantically similar but wouldn’t be highlighted because they are not literally the same phrase. This is not really a problem when making diagrams for songs, but it becomes relevant if you want to create these diagrams for books. Books have a lot less literal repetition, but there’s often a lot of repetition in themes or meaning across the text.
To solve this problem, one approach is to instead use vector embeddings of passages and compare the embeddings rather than the text. Vector embeddings use machine learning magic in order to map a piece of text to a long list of numbers which can then be compared with something like cosine similarity. word2vec was a famous example of this in 2013, where individual words were converted to vectors which somehow distilled their meaning.
For example, $\mathbf{king}$, $\mathbf{man}$ and $\mathbf{woman}$ could all be converted to lists of numbers and adding and subtracting these lists component-wise gives the remarkable equation:
\[\mathbf{king} - \mathbf{man} + \mathbf{woman} \approx \mathbf{queen}\]
(where “$\approx$” means that the word on the right is the one whose vector is closest to the actual result).
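To make the comparison step concrete, here is a minimal sketch of cosine similarity and of the "closest vector" lookup. The two-dimensional vectors are hand-made toys chosen so the arithmetic works out (one axis loosely standing for "royalty", the other for "gender"); they are not real word2vec output, and the helper names are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embeddings: [royalty, gender].
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}


def nearest(target, vocab):
    """Return the word in `vocab` whose vector is most similar to `target`."""
    return max(vocab, key=lambda word: cosine_similarity(target, vocab[word]))


# Component-wise king - man + woman.
result = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
print(nearest(result, toy))  # prints "queen"
```

Real embeddings have hundreds of dimensions and the result only lands near the target word rather than exactly on it, which is why the comparison is by similarity rather than equality.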
Results
Here’s what the results look like. But first, some of the topographical features to look out for:
- Like with the lyrical footprints, every diagram is symmetric because if section A is similar to section B, then section B is equally similar to section A.
- Bands of darkness running vertically and horizontally are sections of the book that (according to the embeddings) are very dissimilar to other sections of the book. They don’t necessarily divide the book up into different chapters.
- Clumps of brightness around the diagonal are sections which are very self-similar.
- Splotches of brightness off the diagonal are places where two different parts of the text are similar.
The left hand side of each diagram is the raw semantic similarity between passages, as computed by the all-MiniLM-L6-v2 model. The right hand side highlights the entries which are more than two standard deviations away from the mean, and are anomalously similar compared to the rest of the book.
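As a rough sketch of the two panels: the left one is a matrix of pairwise cosine similarities between passage embeddings, and the right one keeps only the anomalously high entries. The code below works on plain lists of numbers so it runs without any model; in practice the embeddings might come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(passages)` in the sentence-transformers package. Two things here are assumptions rather than details from the diagrams themselves: the statistics are taken over every entry of the matrix (diagonal included), and only entries above the mean are highlighted.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Left panel: similarity between every pair of passages."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


def highlight_outliers(sim, num_std=2.0):
    """Right panel: True where an entry exceeds mean + num_std * std.

    Assumption: mean and standard deviation are computed over all
    entries of the matrix, including the diagonal.
    """
    values = [value for row in sim for value in row]
    threshold = statistics.fmean(values) + num_std * statistics.pstdev(values)
    return [[value > threshold for value in row] for row in sim]


# Stand-in "embeddings": ten one-hot vectors in nine dimensions, so that
# passage 9 is identical to passage 0 and unrelated to everything else.
embeddings = [[1.0 if d == i % 9 else 0.0 for d in range(9)] for i in range(10)]
sim = similarity_matrix(embeddings)
highlighted = highlight_outliers(sim)
print(highlighted[0][9], highlighted[0][1])
```

The symmetry noted in the list above falls out directly, because `cosine_similarity(a, b)` and `cosine_similarity(b, a)` are the same number.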
A Mathematician’s Apology
“It is a melancholy experience for a professional mathematician to find himself writing about mathematics…”
[[A Mathematician’s Apology, Hardy]]N is a famous book by G.H. Hardy defending studying mathematics outside of its applications. It’s written as a series of very short chapters (some just one page) which are all inter-connected, which I thought might give the diagram an interesting structure.
The part of this diagram which stands out the most is the dark band just before the 40th section. This band corresponds to Chapter 12 and Chapter 13, where Hardy presents Euclid’s proof of the infinitude of the primes and Pythagoras’ proof of the irrationality of root 2. Most of the book is very non-mathematical, and these two chapters are the only ones containing actual mathematics, so it makes sense that these are dissimilar to the rest of the book.
Looking at features and then coming up with explanations for them isn’t very scientific, because I’m starting with the conclusion and then coming up with the explanation. So don’t trust everything I say!
The Apology of Socrates
“I know not, O Athenians! how far you have been influenced by my accusers for my part, in listening to them I almost forgot myself, so plausible were their arguments however, so to speak, they have said nothing true.”
The Apology is the second dialogue in [[The Trial and Death of Socrates, Plato]]N where Socrates defends himself on trial before the court of Athens. It roughly splits into three sections:
- In the first, Socrates defends himself against the accusations of impiety and corrupting the Athenian youth,
- In the second, Socrates is voted as guilty and argues about how he should be punished,
- In the third, Socrates is condemned to death.
The King James Bible
“In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters…”
The King James Bible is an early English translation of the Bible from the 1600s. The pattern is a lot finer than the others since the Bible is by far the longest book on this list. I don’t know much about the Bible, so I can’t explain the features here.
Frankenstein
“You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday; and my first task is to assure my dear sister of my welfare, and increasing confidence in the success of my undertaking…”
[[Frankenstein, Shelley]]N is written as a frame narrative as letters between a captain of an expedition to the North Pole and his sister.
The clump of brightness around the 150th section of the book corresponds to a trial of one of the characters in the book, and the band of darkness around the 225th section happens during the extended description of the DeLacey family.
Gone Girl
“When I think of my wife, I always think of her head. The shape of it, to begin with. The very first time I saw her, it was the back of the head I saw, and there was something lovely about it, the angles of it.”
[[Gone Girl, Flynn]]N is a novel about a disappearance under strange circumstances. It alternates between two different narratives, written by two different people. This might explain the checkerboard pattern in the diagram.
Harry Potter and the Philosopher’s Stone
“Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense…”
[[Harry Potter and the Philosopher’s Stone, Rowling]]N is the first book in the Harry Potter series.
The large clump of brightness at the start describes a trip to the zoo that doesn’t end very well. Some of the far off-diagonal entries are interesting too: the very small clump around the coordinate $(20, 375)$ corresponds to two parts of the text that both discuss Dumbledore.
One Hundred Years of Solitude
“Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.”
[[One Hundred Years of Solitude, Márquez]]N is a novel telling the story of seven generations of the Buendía family in Colombia.
Meditations
“Of my grandfather Verus I have learned to be gentle and meek, and to refrain from all anger and passion…”
[[Meditations, Aurelius]]N was the private journal of Marcus Aurelius, who was the Roman emperor between 161 AD and 180 AD. There are a lot more off-diagonal highlights in this book, but the overall structure seems quite muddy.
Slaughterhouse Five
“Listen: Billy Pilgrim has come unstuck in time.”
[[Slaughterhouse Five, Vonnegut]]N is a semi-autobiographical account of Kurt Vonnegut’s experiences during the fire-bombing of Dresden. It’s written from the perspective of “Billy Pilgrim” and repeatedly jumps non-linearly between many points in Billy Pilgrim’s life.
The book opens with a long non-fiction prologue about how Kurt Vonnegut came to write it, which could explain why it clearly separates into two sections. Maybe the checkerboard pattern in the second half is because of the narrative jumping around so much?
The Adventures of Sherlock Holmes
“To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name.”
[[The Adventures of Sherlock Holmes, Doyle]]N is the first collection of short stories about Sherlock Holmes and John Watson, containing 12 stories.
Cross-text similarity
If you concatenate the text for one book after another, you can see sections of cross-text similarity.
I was hoping that you could spot lots of cross-text similarity this way, but most of the time it’s hard to make anything out. I think part of the reason is that it’s a lot easier for a book to be self-similar than similar with another text, so the self-similarity outliers drown out the cross-text similarity outliers.
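One way to look at the cross-text part on its own is to slice it out of the combined matrix. The sketch below assumes the combined similarity matrix has already been computed for the concatenated passages and that `n_first` (a name invented here) is the number of passages in the first book; the small matrix is made-up illustration data, not output from any of the books.

```python
def cross_text_block(sim, n_first):
    """Rows are passages of the first book, columns passages of the second."""
    return [row[n_first:] for row in sim[:n_first]]


def top_cross_pairs(sim, n_first, k=3):
    """Return the k most similar (row, column, similarity) cross-text pairs,
    with indices given in the coordinates of the combined matrix."""
    block = cross_text_block(sim, n_first)
    pairs = [
        (value, i, n_first + j)
        for i, row in enumerate(block)
        for j, value in enumerate(row)
    ]
    pairs.sort(reverse=True)
    return [(i, j, value) for value, i, j in pairs[:k]]


# Made-up 4 x 4 matrix: passages 0-1 belong to book A, passages 2-3 to book B.
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.7, 0.0],
    [0.1, 0.7, 1.0, 0.9],
    [0.2, 0.0, 0.9, 1.0],
]
print(cross_text_block(sim, 2))
print(top_cross_pairs(sim, 2, k=1))
```

Ranking pairs inside this block only is a way of stopping the much brighter within-book entries from dominating the comparison.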
Meditations + Apology
Where one book ends and the other begins is apparent. Note also that there is a copy of the individual diagrams in this one:
If you squint, there are a couple of bright dots around $(150, 350)$. These correspond to a point in Meditations where Aurelius is talking about Socrates.
Harry Potter + The Adventures of Sherlock Holmes
There’s not much cross-text similarity here:
Although there is one dot of light around $(680, 220)$. This corresponds to the passages:
…We had pulled up in front of a large villa which stood within its own grounds. A stable-boy had run out to the horse’s head, and springing down, I followed Holmes up the small, winding gravel-drive which led to the house. As we approached, the door flew open, and a little blonde woman stood in the opening, clad in some sort of light mousseline de soie, with a touch of fluffy pink chiffon at her neck and wrists. She stood with her figure outlined against the flood of light, one hand upon the door, one half-raised in her eagerness, her body slightly bent, her head and face protruded, with eager eyes and parted lips, a standing question.
“Well?” she cried, “well?” And then, seeing that there were two of us, she gave a cry of hope which sank into a groan as she saw that my companion shook his head and shrugged his shoulders.
“No good news?”
“None.”
“No bad?”
“No.”
“Thank God for that. But come in. You must be weary, for you have had a long day.”
“This is my friend, Dr. Watson. He has been of most vital use to me in several of my cases, and a lucky chance has made it possible for me to bring him out and associate him with this investigation.”
“I am delighted to see you,” said she, pressing my hand warmly. “You will, I am sure, forgive anything that may be wanting in our arrangements, when you consider the blow which has come so suddenly upon us.”…
and
…“Go away.” “All right, but I warned you, you just remember what I said when you’re on the train home tomorrow, you’re so –”
But what they were, they didn’t find out. Hermione had turned to the portrait of the Fat Lady to get back inside and found herself facing an empty painting. The Fat Lady had gone on a nighttime visit and Hermione was locked out of Gryffindor tower.
“Now what am I going to do?” she asked shrilly.
“That’s your problem,” said Ron. “We’ve got to go, we’re going to be late.”
They hadn’t even reached the end of the corridor when Hermione caught up with them.
“I’m coming with you,” she said.
“You are not.”
“D’you think I’m going to stand out here and wait for Filch to catch me? If he finds all three of us I’ll tell him the truth, that I was trying to stop you, and you can back me up.”
“You’ve got some nerve –” said Ron loudly.
“Shut up, both of you!” said Harry sharply. “I heard something.”
It was a sort of snuffling.
“Mrs. Norris?” breathed Ron, squinting through the dark.
It wasn’t Mrs. Norris. It was Neville. He was curled up on the floor, fast asleep, but jerked suddenly awake as they crept nearer…
I can’t really understand why these sections were marked as similar.
References and more links
- semantic-footprints, code used to generate diagrams.
- “Why we really really really like repetition in music” by Vox.
- SongSim by Colin Morris, the original application for generating the lyrical diagrams.
- The diagram about how the lyrical footprints work comes from this page.
- Song Visualiser, a more colourful version of SongSim.
- The Illustrated Word2vec by Jay Alammar, source of the diagram for embeddings.
- Project Gutenberg, source for the text of some of the books I used.
- [[About this website]]B, also discusses embeddings.
- “Using GPT4 to measure the passage of time in fiction” by Ted Underwood talks about using LLMs to measure how fast time passes in different novels.