Eamonn O'Brien-Strain


There are many trust and safety challenges in the new generative AI technologies, but there is one area where they could increase trust and user empowerment. These technologies provide an opportunity to offer the kind of transparency that will allow meaningful control of how people use complex online systems, including control of privacy.

This opportunity comes from two observations: (1) that the biggest problem in privacy is explaining to the user how their data is used, and (2) that one of the notable abilities of LLMs (large language models) is to summarize complex data understandably.

Over my years working on Internet systems, I have seen big improvements in protecting privacy. Some of this improvement is driven by the increasing public awareness of the importance of privacy and the necessity for companies to address privacy if they want to maintain user trust. Some of it is driven by the need for regulatory compliance, initially with GDPR in Europe, but increasingly with new regulations in various countries and US states.

But what do companies actually do to retain trust and stay in compliance? Let’s divide privacy protection measures into two categories: backend and frontend.

Backend privacy protection is where most of the effort has gone. Much of the work here is around data flows: identifying and controlling how personal data is transmitted through and stored in the complex infrastructure behind large modern Internet systems. While doing this in practice can be a difficult engineering task, the requirements are generally well understood.

Frontend privacy protection is much more of an open problem. Understanding and consensus are limited to a few areas, such as which “dark patterns” should be avoided and how to create cookie consent UIs (which everyone hates). In particular, the biggest unsolved problem remains: how to give people meaningful agency over how their data is used, given that the systems are so complex that even the engineers building and running the services find them difficult to explain.

But now we see the opportunity. Explaining complex subjects is one thing that LLMs are good at.

LLM privacy transparency

One approach: given an existing system with personal data flowing through it, we generate, for a particular person using the system, a comprehensive description of all their data and how it is used, perhaps in the context of a particular feature they are using. This raw description would be voluminous, highly technical, and might contain references to proprietary information, so it would not be at all useful or appropriate to display to the person. However, an LLM with an appropriate prompt could summarize this raw dump in a way that could be safely and meaningfully displayed to the person. This could provide transparency customized to the particular context. With different prompts, the LLM output could be adjusted to match the reading level of the person, and to the size and formatting constraints of the part of the UI in which it is displayed.
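As a purely hypothetical sketch of that summarization step, using the OpenAI chat API (the rawDataFlowReport input, the prompt wording, and the explainDataUse function are all illustrative assumptions, not a description of any real system):

import { Configuration, OpenAIApi } from 'openai'

const openai = new OpenAIApi(new Configuration({
  apiKey: process.env.OPENAI_API_KEY
}))

// Hypothetical: turn a voluminous, technical data-flow dump into a short
// plain-language explanation that is safe to show to the person.
export async function explainDataUse (rawDataFlowReport, readingLevel, maxSentences) {
  const completion = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    messages: [{
      role: 'user',
      content: `
Summarize how this person's data is used, in at most ${maxSentences} sentences,
at a ${readingLevel} reading level. Omit internal system names and any
proprietary details.

${rawDataFlowReport}
`
    }]
  })
  return completion.data.choices[0].message.content
}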

This transparency is good, and it would help give a sense of agency to the person. But is there a way to take this further and additionally use LLMs to provide controls?

LLM privacy controls

Well, yes, in some cases. If an LLM is incorporated into the system and helps personalize the output, then we can take advantage of the fact that the “API” of an LLM is natural language. That means that somewhere deep in the data flow there is human-meaningful text being ingested into an LLM. So we have an opportunity to reveal that text to the person using the system and allow them to modify it, perhaps simply by adding or editing freeform natural-language text.
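Here is a minimal hypothetical sketch of that mechanism, with the person's own editable text injected directly into the prompt (the names and in-memory storage are made up for illustration; a real system would need persistence, validation, and abuse protections):

import { Configuration, OpenAIApi } from 'openai'

const openai = new OpenAIApi(new Configuration({
  apiKey: process.env.OPENAI_API_KEY
}))

// Hypothetical in-memory store of each person's own preference text.
// The UI would show this string and let the person edit it directly.
const preferenceText = new Map()

export function setPreferenceText (userId, text) {
  preferenceText.set(userId, text)
}

export async function personalizedResponse (userId, request) {
  // The person's own words go straight into the prompt, so whatever
  // they edit is exactly what the model sees.
  const preferences = preferenceText.get(userId) ?? ''
  const completion = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    messages: [{
      role: 'user',
      content: `Preferences, written by the user themselves:
${preferences}

${request}`
    }]
  })
  return completion.data.choices[0].message.content
}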

Of course, there are many challenges and possible hazards to using LLMs in these ways. For the transparency proposal, LLMs can hallucinate and generate incorrect summaries of personal data, which could be confusing or even disturbing to the person. Even if the summary is factual, it could be presented in a biased manner, for example using gender or racial stereotypes. There is also the possibility that the summary, even if correct and unbiased, could be alarming to the person, but that is arguably a case of “working as intended”: it is better for long-term trust for the person to learn this sooner rather than later, and thus be able to take prompt action to control how their data is used.

I’m not aware of any such systems having launched yet, but I’m hoping it will happen, harnessing the power of generative AI to empower people to make the appropriate trade-off, in each context, between how much personal data they allow to be used and the benefit they get in return.



As the planet warms due to climate change, the threat of heat waves looms larger than ever. Extreme heat isn't just uncomfortable; it can be deadly, especially when combined with high humidity.

To help visualize this growing danger, I've created a new website: Dangerous Heatwaves

What Makes a Heat Wave Dangerous?

The site focuses on a key metric called the wet-bulb temperature. This isn't the temperature you see on the thermometer. Instead, it's the lowest temperature you can reach by evaporating water – a crucial concept for understanding how humans handle heat.

We cool down by sweating, a process that relies on evaporation. When the wet-bulb temperature gets too close to our body temperature, sweating becomes ineffective. That's when the risk of heatstroke and other heat-related illnesses skyrockets.

  • Low humidity: Even with high temperatures, low humidity means a lower wet-bulb temperature, reducing the danger.
  • High humidity: This is the worst-case scenario. When it's both hot and humid, the wet-bulb temperature rises, making conditions extremely hazardous.
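For those who want to compute it, a widely used way to estimate wet-bulb temperature from ordinary forecast data is Stull's 2011 empirical approximation, valid for roughly 5%–99% relative humidity and −20°C to 50°C. Here is a small JavaScript version of that published formula (not necessarily the exact calculation the site uses):

// Stull (2011) approximation: temperature in °C, relative humidity in percent.
function wetBulbC (tempC, rhPct) {
  return tempC * Math.atan(0.151977 * Math.sqrt(rhPct + 8.313659)) +
    Math.atan(tempC + rhPct) -
    Math.atan(rhPct - 1.676331) +
    0.00391838 * Math.pow(rhPct, 1.5) * Math.atan(0.023101 * rhPct) -
    4.686035
}

console.log(wetBulbC(40, 20).toFixed(1)) // hot but dry: ≈ 22.7 °C
console.log(wetBulbC(35, 90).toFixed(1)) // hot and humid: ≈ 33.5 °C, near body temperature

This makes the bullet points above concrete: the nominally hotter 40°C dry day has a far lower wet-bulb temperature than the cooler but humid one.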

How the Site Works

The Dangerous Heatwaves site analyzes weather forecasts for locations around the world. It highlights the areas with the highest predicted wet-bulb temperatures in the coming days, giving you a real-time snapshot of where the risk of dangerous heat is greatest.

Why This Matters

Understanding wet-bulb temperature and its impact is essential for preparedness and planning. Whether you're concerned about your health, outdoor activities, or the well-being of vulnerable populations, this tool can help you stay informed and make smart decisions in the face of extreme heat.


Fascinating swarm dynamics in this flow of ants down my driveway in Calistoga


How simple can a blogging platform be?

I tried to build a simple blog for anyone with a GitHub account.

How to use it

All you do is

  1. Fork a repo
  2. Do a small amount of configuration of your new GitHub repo
  3. Use the GitHub web UI to edit markdown files
  4. Your blog gets automatically published as GitHub Pages

The GitHub repo with full instructions is at simplestblog

An example of a blog that uses this is eobrain.github.io/mysimplestblog

How it was built

It is a simple Node.js JavaScript app that is built on a simple foundation:

  • A markdown library that converts markdown to HTML
  • The Mustache library for building pages from templates
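The core of such a pipeline is only a few lines. Here is an illustrative sketch using the marked and mustache npm packages as stand-ins (the repo may use different libraries, and the file names here are made up):

import { readFile, writeFile } from 'node:fs/promises'
import { marked } from 'marked'
import Mustache from 'mustache'

// Render one markdown post into an HTML page via a Mustache template.
// The template would use {{title}} and {{{content}}} placeholders.
const template = await readFile('template.html', 'utf8')
const markdown = await readFile('posts/hello.md', 'utf8')

const html = Mustache.render(template, {
  title: 'Hello',
  content: marked.parse(markdown)
})

await writeFile('docs/hello.html', html) // GitHub Pages can serve the docs/ directory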

WriteFreely is a fantastic minimalist blogging platform with Fediverse integration. If you're self-hosting WriteFreely (like this blog!), it's wise to maintain backups for peace of mind. Here's how to export your WriteFreely blog using a tool I created called writefreely-export.

Prerequisites:

  1. Shell access to your WriteFreely server
  2. Node.js installed (I recommend using nvm for managing Node.js versions – installation instructions here)

Steps:

Clone the repository and cd into it:

git clone https://github.com/eobrain/writefreely-export.git
cd writefreely-export

Ensure Node.js compatibility (if using nvm):

nvm use

Finally, to do the export, run:

npm install
npm run export

This creates a content directory containing Markdown files of your WriteFreely posts.

Using the Exported Files

One option is to import your Markdown files into a static site generator like simplestblog. This approach gives you a static backup and an alternate site. For example, this blog is mirrored at mysimplestblog.

Additional Notes:

Consider automating the export process for regular backups.
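For example, a crontab entry like this (the paths and commit message are illustrative; adjust for your setup) would run the export nightly and commit the result into git:

# Run the WriteFreely export every night at 03:00 and commit the result.
0 3 * * * cd /home/me/writefreely-export && npm run export && git add content && git commit -qm "nightly export"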


I've long admired how the Tufte CSS project allowed you to create web pages in the style the legendary Edward Tufte developed.

One improvement I've wished for is the ability to use Tufte CSS with well-structured semantic HTML, without sprinkling class attributes throughout the markup.

So I forked the project and created a new one, which you can check out at: Classless Tufte CSS.

You can try it out by including this in the <head> section of your HTML:

<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/eobrain/classless-tufte-css@v1.0.1/tufte.min.css"/>

Your HTML requires no class attributes but should be in a standard semantic structure like this:

<body>
  <article>
    <h1> title </h1>
    <p> subtitle </p>
    <section>
       <h2> section header </h2>
       ... paragraphs, lists, code blocks, figures etc
    </section>
    <section>
    ...

The main features lost in moving to the classless form were:

  • There is no automatic sidenote numbering.
  • The <aside>s used to implement margin material must be between paragraphs; they cannot be embedded in the middle of a paragraph.

There are some added features though:

  • Lists that are in between paragraphs are also put in the margin.
  • The margin material does not disappear for narrow screens. Instead, it is shown inline, indented.
  • Code blocks have a subtle background shading.
  • Tables have some lines.

Here is a comparison of the HTML between Tufte CSS and Classless Tufte CSS, so you can see the simplification.

  • All-caps initial text
    Tufte CSS: ...<p><span class="newthought">A new thought</span> comes to me...
    Classless: ...<section><p>A new thought comes to me...

  • Epigraph
    Tufte CSS: ...<div class="epigraph"><blockquote><p>...
    Classless: ...</h2><blockquote><p>...

  • Sidenote reference
    Tufte CSS: <label for="sn-extensive-use-of-sidenotes" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-extensive-use-of-sidenotes" class="margin-toggle"/>
    Classless: (no separate reference markup shown)

  • Sidenote
    Tufte CSS: <span class="sidenote">This is a sidenote.</span>
    Classless: ...</p><aside><sup>1</sup> This is a sidenote.</aside>

  • Margin note
    Tufte CSS: <label for="mn-demo" class="margin-toggle">&#8853;</label><input type="checkbox" id="mn-demo" class="margin-toggle"/><span class="marginnote">This is a margin note. Notice there isn’t a number preceding the note.</span>
    Classless: ...</p><aside>This is a margin note. Notice there isn’t a number preceding the note.</aside>

  • Fullwidth figure
    Tufte CSS: ...<figure class="full-width">...
    Classless: </section><figure>...

  • iframe wrapper
    Tufte CSS: <figure class="iframe-wrapper"><iframe width="853" height="480" src="https://www.youtube.com/embed/YslQ2625TR4" frameborder="0" allowfullscreen></iframe></figure>
    Classless: <figure><iframe width="853" height="480" src="https://www.youtube.com/embed/YslQ2625TR4" frameborder="0" allowfullscreen></iframe></figure>

What prompted me to do this is that I'm currently working to convert this blog to a static version, and I wanted to use Tufte CSS for plain semantic HTML generated from markdown.


As part of my work on dealing with regulation in the EU, I’ve come to a simplistic understanding of EU governing bodies by analogy with equivalent US governing bodies:

  1. Lower House: The European Parliament is the US House of Representatives.
  2. Upper House: The European Council is the US Senate from the early US, when state legislatures appointed senators.
  3. Executive: The European Commission is the US President and Cabinet, except there is exactly one cabinet member from each state.

On both sides of the Atlantic, both the lower and upper house must agree on legislation before it is passed.

One difference is that in the EU only the executive can propose legislation, whereas in the US legislation is proposed by members of the upper or lower house.

Another practical difference is the US bodies have only two political parties because of the first-past-the-post voting system. In contrast, the EU bodies have many political parties because of greater diversity and more proportional voting systems.

And of course in Europe, the individual member states retain much more sovereignty than US states do (despite what Texas might think).

Update 2024-03-01, more analogs in response to a comment on Bluesky:

EU conciliation committees are US congressional conference committees, except also including some cabinet members, and with a deadline.

Trilogue is when the president invites leaders of both houses of Congress to the White House for an informal chat to move some legislation forward.


Big Tech has become pretty good at back-end data protection flows, but many problems still remain in the front-end, user-facing aspects of privacy.

Here are some hard problems that I spend my days thinking about:

  1. What exactly is good consent, and how do we make sure users are giving it?
  2. How can we give average, non-technical users the agency to manage the trade-off between privacy and functionality, given how insanely complex the data systems and products are?
  3. How do we measure whether we are meeting people's privacy needs and expectations? Can we make these measurements in a way that is actionable in informing how we change our products?
  4. How can we empower particularly vulnerable people to protect themselves? (e.g. victims of domestic abuse, dissidents in repressive regimes, LGBTQ people in non-accepting cultures, people seeking abortion information in certain US states)
  5. How do we avoid adding usability burdens that reduce the product value for the majority of times when people are not particularly concerned about privacy, while still making sure they are empowered to take privacy-protective measures for sensitive user journeys?
  6. What are the privacy threat models, and what are the different ways of adding UI features to counter them? Some of the threat models can be countered by controlling data collection: such as threats from state actors subpoenaing user data. Some of the threat models can be countered by controlling data use, such as threats from people shoulder surfing or compelling physical access to devices or accounts.
  7. How do we avoid the unintended consequence of well-meaning trust measures actually making people more vulnerable? For example, providing transparency of what we know about a user is good for empowering them to take action, but it also adds a new privacy attack vector by providing a convenient UI for a bad actor who has access to the user's account. Or adding controls that allow the user to specify topics or URLs they consider sensitive and not to be tracked creates a list that is itself very sensitive and could be harmful if revealed. Or if we try to protect particularly vulnerable people by noticing that they are vulnerable, that detection of their status might itself be privacy-invasive.

TL;DR: Fediverse instances that federate with Threads could filter profiles to counter reidentification threats.

The new Threads app has had a very successful launch, and Meta has said it will add ActivityPub support, allowing Threads to federate with Mastodon and other Fediverse instances.

This has caused a lot of angst on Mastodon, with many people philosophically opposed to interoperation with such a “surveillance capitalism” platform. Many instance operators have vowed to defederate from Threads, blocking the cross-visibility of posts and accounts between Threads and their instances.

However, @gargron@mastodon.social, the founder and CEO of Mastodon, has written positively about allowing federation with Threads, including saying:

Will Meta get my data or be able to track me?

Mastodon does not broadcast private data like e-mail or IP address outside of the server your account is hosted on. Our software is built on the reasonable assumption that third party servers cannot be trusted. For example, we cache and reprocess images and videos for you to view, so that the originating server cannot get your IP address, browser name, or time of access. A server you are not signed up with and logged into cannot get your private data or track you across the web. What it can get are your public profile and public posts, which are publicly accessible.

I decided to see how true this is by examining my own @eob@social.coop Mastodon account, to determine what personal data of mine Meta would see if it fetched only the data that a well-behaved federated server would fetch.

I emulated the server-to-server calls that Threads would make when a Threads user looked me up on the Mastodon instance social.coop, which hosts my account.

  1. Use webfinger to find my endpoints

    curl 'https://social.coop/.well-known/webfinger?resource=acct:eob@social.coop'

  2. Amongst the data returned from the above is the link to my profile, which allows the profile information to be fetched

    curl -H "Accept: application/json" https://social.coop/@eob
    
  3. In the profile is the link to my “outbox” URL https://social.coop/users/eob/outbox, from which all my public posts can be fetched.
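
The outbox can be fetched in the same way, with a request like:

    curl -H "Accept: application/json" https://social.coop/users/eob/outbox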

This seems fine so far. These posts are public anyway, so it's fine that Threads can see them.

However, there is a problem with Step #2 above. The profile data is not just pointers to URLs, but also a lot of other personal information. Here it is (converted from JSON to YAML, which is easier to read, and with some boilerplate and media data removed):

...
id: 'https://social.coop/users/eob'
type: Person
following: 'https://social.coop/users/eob/following'
followers: 'https://social.coop/users/eob/followers'
inbox: 'https://social.coop/users/eob/inbox'
outbox: 'https://social.coop/users/eob/outbox'
featured: 'https://social.coop/users/eob/collections/featured'
featuredTags: 'https://social.coop/users/eob/collections/tags'
preferredUsername: eob
name: Éamonn
summary: >-
  <p><a href="https://social.coop/tags/Privacy" class="mention hashtag"
  rel="tag">#<span>Privacy</span></a> <a href="https://social.coop/tags/UI"
  class="mention hashtag" rel="tag">#<span>UI</span></a> in <a
  href="https://social.coop/tags/SanFrancisco" class="mention hashtag"
  rel="tag">#<span>SanFrancisco</span></a>, leading a team building UI changes
  giving transparency and control over personal data in a large search
  engine.</p><p>Previously infrastructure for developer tools. Earlier HP Labs
  (IoT and computational aesthetic), a dot-com bust startup, and  high level
  chip design software at Cadence, Bell Labs, and GEC Hirst Research
  Centre.</p><p><a href="https://social.coop/tags/Irish" class="mention hashtag"
  rel="tag">#<span>Irish</span></a> born and bred, now living in Northern
  California.</p><p>Opinions here are my own; I&#39;m definitely not speaking
  for my employer.</p><p><a href="https://social.coop/tags/tfr" class="mention
  hashtag" rel="tag">#<span>tfr</span></a> <a
  href="https://social.coop/tags/fedi22" class="mention hashtag"
  rel="tag">#<span>fedi22</span></a></p>
url: 'https://social.coop/@eob'
manuallyApprovesFollowers: false
discoverable: true
published: '2022-10-29T00:00:00Z'
devices: 'https://social.coop/users/eob/collections/devices'
alsoKnownAs:
  - 'https://sfba.social/users/eob'
publicKey:
  id: 'https://social.coop/users/eob#main-key'
  owner: 'https://social.coop/users/eob'
  publicKeyPem: |
    -----BEGIN PUBLIC KEY-----
    MIIBIjA...AQAB
    -----END PUBLIC KEY-----
tag:
...
attachment:
  - type: PropertyValue
    name: Born
    value: 'Dublin, Ireland'
  - type: PropertyValue
    name: Lives
    value: 'San Francisco, USA'
  - type: PropertyValue
    name: Pronouns
    value: he/him
  - type: PropertyValue
    name: GitHub
    value: >-
      <a href="https://github.com/eobrain" target="_blank" rel="nofollow
      noopener noreferrer me"><span class="invisible">https://</span><span
      class="">github.com/eobrain</span><span class="invisible"></span></a>
endpoints:
  sharedInbox: 'https://social.coop/inbox'
icon:
...
image:
...

We can assume that Threads would put all this data into a Meta database keyed off my Mastodon identifier @eob@social.coop or equivalent.

I also have Facebook and Instagram accounts, whose data is likewise stored in a Meta database, keyed off my Facebook user ID.

The big question, and the privacy threat model, is whether Meta can associate (“join”) these two database entries as belonging to the same person so that they can use Mastodon data to optimize ad targeting and feed algorithms.

The good news is that the profile data above does not include an explicit data field that could be used as a joining identifier.

But there is a lot of free-form text that could be fed into matching algorithms, and even though I was a little careful (for example, not including my last name), I suspect that Meta could reidentify me and associate the Mastodon account with my Facebook account.

So one response to this would be to ask people on Mastodon to edit their profiles to make them more anonymous, as most people did not write them on the assumption that the data would be fed into the maw of the Meta ad machine.

But maybe a better solution would be to modify the ActivityPub software in the server to filter out identifying profile information when federating with an instance like Threads. For example:

  • The attachment section could be removed, as it contains easily harvested structured personal data.
  • The summary section could be replaced with a link to the profile on the Mastodon server, so the human could click through and see it, but Meta servers would not.
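
A minimal sketch of such a filter, assuming the server has the actor document as a parsed JSON object (the function and the domain list are hypothetical, not existing Mastodon internals):

// Hypothetical filter applied just before serving an ActivityPub actor
// document to a remote server on a list of untrusted instances.
const UNTRUSTED_INSTANCES = new Set(['threads.net'])

export function filterProfile (profile, requestingDomain) {
  if (!UNTRUSTED_INSTANCES.has(requestingDomain)) return profile
  return {
    ...profile,
    // Drop the structured personal data, which is the easiest to harvest.
    attachment: [],
    // Replace the free-form bio with a link a human can click through,
    // so the remote server never ingests the text itself.
    summary: `<a href="${profile.url}">profile</a>`
  }
}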

The result would be something more privacy-preserving like:

...
id: 'https://social.coop/users/eob'
type: Person
following: 'https://social.coop/users/eob/following'
followers: 'https://social.coop/users/eob/followers'
inbox: 'https://social.coop/users/eob/inbox'
outbox: 'https://social.coop/users/eob/outbox'
featured: 'https://social.coop/users/eob/collections/featured'
featuredTags: 'https://social.coop/users/eob/collections/tags'
preferredUsername: eob
name: Éamonn
summary: '<a href="https://social.coop/@eob">profile</a>'
url: 'https://social.coop/@eob'
manuallyApprovesFollowers: false
discoverable: true
published: '2022-10-29T00:00:00Z'
devices: 'https://social.coop/users/eob/collections/devices'
alsoKnownAs:
  - 'https://sfba.social/users/eob'
publicKey:
  id: 'https://social.coop/users/eob#main-key'
  owner: 'https://social.coop/users/eob'
  publicKeyPem: |
    -----BEGIN PUBLIC KEY-----
    MIIBIjA...AQAB
    -----END PUBLIC KEY-----
tag:
...
attachment:
endpoints:
  sharedInbox: 'https://social.coop/inbox'
icon:
...
image:
...

Last month, as described in Building a Mastodon AI Bot, I built @elelem@botsin.space, an AI-powered bot that responds to anyone who mentions it. It's been running for a while and has produced some interesting output, especially when caught in loops with other bots like @kali@tooted.ca and @scream@botsin.space (Unleashing an AI Bot on Mastodon).

This weekend I tried creating another, less interactive bot. This one, @delayedheadlines@botsin.space, scrapes Wikipedia to find events that happened on this day in history 50, 100, 200, 300, etc. years ago. It then uses ChatGPT to summarize the historical events in the style of a tabloid headline. For example:

Sunday, May 20, 1923:

SHOCKING! British Prime Minister RESIGNS due to CANCER!

MEDICAL advisers release DISTURBING announcement regarding Prime Minister's HEALTH!

KING GEORGE V graciously accepts RESIGNATION from Right Honorable A. Bonar Law!

ALSO: Mestalla Stadium OPENS in Valencia, Spain and former Russian Imperial Army General EXECUTED for TREASON!

https://en.m.wikipedia.org/wiki/May_1923#May_20,_1923_(Sunday)

[Architectural diagram of the bot]

The source code is on GitHub, comprising about 160 lines of server-side JavaScript, using pretty much the same architecture as for the @elelem@botsin.space bot.

const accessToken = process.env.MASTODON_ACCESS_TOKEN
const mastodonServer = process.env.MASTODON_SERVER
const baseUrl = `https://${mastodonServer}`

const headers = {
  Authorization: `Bearer ${accessToken}`
}

export async function toot (status) {
  const body = new URLSearchParams()
  body.append('status', status)
  await fetch(`${baseUrl}/api/v1/statuses`, {
    method: 'POST',
    headers,
    body
  })
}

The Mastodon interface was the easiest part, as it just needed a single toot function to post the status, which is the tabloid-headline text. The nice thing is that the Mastodon REST API is so straightforward that for simple cases like this there is no need for a JavaScript library. You can just use the built-in fetch function.

Note the security precaution of not including the Mastodon access token in the checked-in code. Instead, I keep it in a file that is not checked in and pass it to the running program in an environment variable.

import { Configuration, OpenAIApi } from 'openai'

const STYLE = 'supermarket tabloid headlines'

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY
})
const openai = new OpenAIApi(configuration)

export async function addPersonality (text) {
  const completion = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    messages: [{
      role: 'user',
      content: `
  Rewrite the following text in the style of ${STYLE}:

  ${text}
  
  `
    }]
  })
  return completion.data.choices[0].message.content
}

For the LLM, I used a different API call than I did for Elelem. For that one I used the lower-level completion API, where you have full control of the context fed into the LLM, which for that bot was a Mastodon thread and some personality-setting instructions.

For this bot I just needed the LLM to rewrite one piece of text, so I used the higher-level chat API, which is similar to the online ChatGPT app. It is also significantly cheaper than the completion API.

import { parse } from 'node-html-parser'

function filter (text) {
    ...
    if (line.match(/Born:/)) {
      on = false
    }
    ...
}

export async function thisDay (yearsAgo) {
  ...
  const monthUrl = `https://en.m.wikipedia.org/wiki/${monthString}_${year}`
  const monthResponse = await fetch(monthUrl)
    ...
    const monthHtml = await monthResponse.text()
    const monthRoot = parse(monthHtml)
    const ids = [
      `${monthString}_${day},_${year}_(${weekdayString})`,
      `${weekdayString},_${monthString}_${day},_${year}`
    ]
    const citations = []
    for (const id of ids) {
      const nephewSpan = monthRoot.getElementById(id)
      ...
      const parent = nephewSpan.parentNode
      const section = parent.nextSibling
      const text = filter(section.innerText)

      return { found, text, then, citation }
    ...
  }

  const yearUrl = `https://en.m.wikipedia.org/wiki/${year}`
  const yearResponse = await fetch(yearUrl)
  ...
    const yearHtml = await yearResponse.text()
    const yearRoot = parse(yearHtml)
    const pattern = `${monthString} ${day} . `
    for (const li of yearRoot.querySelectorAll('li')) {
      if (li.innerText.match(pattern) && !li.innerText.match(/ \(d\. /))
        ...
        const text = li.innerText.slice(pattern.length)
        ...
        return { found, text, then, citation }
      ...

The most complex part of the app's code is the scraping of Wikipedia. It turns out that daily events are not stored in a very consistent way. For the last century or so there is a Wikipedia page per month, so the scraper has to find the section within the page, using one of two different patterns of HTML id attributes to locate a nearby DOM element and navigate to the desired text. For older centuries there is a Wikipedia page per year, requiring a different way of finding the target day's events. In both cases I filtered out births, because the headlines of the day would, in most cases, not have known which births would turn out to be significant.

What made writing this code easier was the node-html-parser library, which provides a browser-style DOM API to this server-side code.

So have a look at @delayedheadlines@botsin.space to see what headlines have been posted so far, and you might want to follow the bot if you have a Mastodon account. Or if you want to try rolling your own bot, you could use this code as a starting point.