‘It’s terrifying’: WhatsApp AI helper mistakenly shares user’s number (www.theguardian.com)

submitted 1 day ago by [email protected] to c/[email protected]

[–] [email protected] 1 points 21 hours ago (1 children)

I know, I'm just saying it's not theoretically impossible to have a phone number as a token. It's just probably not what happened here.

> the choice of the next token is really random

It's not random in the sense of a uniform distribution which is what is implied by "generate a random [phone] number".

[–] [email protected] 1 points 20 hours ago* (last edited 20 hours ago) (1 children)

Unless the person is using math terms elsewhere, I always assume people mean 'unexpected' when they say random.

> It’s not random in the sense of a uniform distribution which is what is implied by “generate a random [phone] number”.

Yeah, true.

There, I was speaking more to the top level comment's statement that an LLM cannot generate random numbers. Random numbers are pretty core to how chatbots work... which is what I assumed they meant instead of the literal language model.

You could say that they're technically correct in that the actual model only produces a deterministic output vector for any given input. Randomness is added in the implementation of the chatbot software through the design choice of having the software treat the language model's softmax'd output as a distribution from which it randomly chooses the next token.
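To make that distinction concrete, here is a minimal sketch of the two halves: a deterministic score vector on one side, and a sampling step bolted on by the chatbot software on the other. The logits, the function names and the temperature parameter are all made up for illustration; this is not how any particular chatbot is implemented.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits):
    """Deterministic choice: always the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_next_token(logits, rng, temperature=1.0):
    """Random choice: treat the softmax output as a distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a 3-token vocabulary (assumed values).
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)
samples = [sample_next_token(logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(logits))]
```

The same logits always give the same greedy token, while the sampled tokens vary from call to call but follow the softmax weights, so the result is random without being uniform.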

But, I'm assuming that the person isn't actually making that kind of distinction because of the second sentence that they wrote.

[–] [email protected] 2 points 19 hours ago

The point of my second statement is that if you made an AI that stores and retrieves phone numbers, the model could reasonably use phone number chunks in its random number generation. A phone number can normally be broken into 3 to 6 chunks of 1 to 5 digits, which are reasonable sizes to tokenize. If you then asked it for a random number, I think it is reasonable that it would be as likely, if not more likely, to use the data from the phone number list as it would to use the core 0 to 9 tokenized number list, unless you specifically tried to split the two.
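As a rough illustration of what I mean by chunks, here is a toy greedy longest-match tokenizer over digits. The vocabulary is invented, the example number is from the range reserved for fiction, and real tokenizers do not necessarily split numbers this way; it is only a sketch of the idea.

```python
def tokenize_digits(number, vocab, max_len=5):
    """Split a phone number into chunks of up to max_len digits.

    Prefers the longest chunk found in vocab and falls back to
    single digits (the "core 0 to 9" tokens) when nothing matches.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    tokens = []
    i = 0
    while i < len(digits):
        for size in range(min(max_len, len(digits) - i), 0, -1):
            piece = digits[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical multi-digit tokens learned from a phone number list.
VOCAB = {"07700", "900", "123"}

tokens = tokenize_digits("07700 900123", VOCAB)
fallback = tokenize_digits("07700 900123", set())
```

With the made-up vocabulary the number comes out as three chunks, and with an empty vocabulary it falls back to eleven single-digit tokens.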

This is a WhatsApp AI, so I think asking it for Tim's number is a use case they trained on. It needs to be a phone book. My guess is they said that list A is a list of public numbers for training things like what a phone number looks like, and list B is a list of private user numbers. Now, while a random number could be a random string of digits, it could also be that the LLM is too likely to pull a combination that is actually a real number.

So is this a case where it randomly pulled together 11 digits that magically hit the roughly 1 in 100 chance that a random string of digits shaped like a UK phone number would be the number of a user? Or is it a case where it pulled from a public combo list of 4 tokens and randomly re-formed a real number that was both public and private? That seems more likely to me. We probably won't ever get to know.

If I were making this AI chatbot, I would have it check against the most privacy-critical data I have before it shared anything as a random number, though. WhatsApp phone numbers are its users' IDs. Even if it truly randomly generates one, it should check whether it is a private number and not output it, as it showed it could do when questioned about where the number came from.
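The kind of check I have in mind could be sketched like this. Everything here is hypothetical: the `PRIVATE_NUMBERS` set stands in for whatever user ID store the service really has, the regex is a loose guess at "looks like a phone number", and the example number is from the range reserved for fiction.

```python
import re

# Hypothetical stand-in for the service's private user numbers.
PRIVATE_NUMBERS = {"07700900123"}

# Loose pattern for phone-like runs of digits, spaces, dashes and brackets.
PHONE_LIKE = re.compile(r"\+?\d[\d\s\-()]{6,}\d")

def normalize(number):
    """Strip everything except digits so formatting does not matter."""
    return re.sub(r"\D", "", number)

def redact_private_numbers(reply, private_numbers):
    """Replace any phone-like string in reply that matches a private number."""
    def replace(match):
        if normalize(match.group()) in private_numbers:
            return "[number withheld]"
        return match.group()
    return PHONE_LIKE.sub(replace, reply)

blocked = redact_private_numbers("Try 07700 900123 instead.", PRIVATE_NUMBERS)
allowed = redact_private_numbers("Try 07700 900456 instead.", PRIVATE_NUMBERS)
```

This runs on the finished reply, so it does not matter whether the number was retrieved, recombined from chunks or generated digit by digit. A real version would also need to handle country-code variants such as +44, which this sketch does not.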