What is inside the Adobe hacked database file?

Adobe break-in

Adobe recently suffered a break-in in which intruders were able to get hold of Adobe users’ data, including email addresses, encrypted passwords, password hints, etc.

Adobe acknowledged the break-in. Note that its acknowledgement page carries no date or timestamp (at least not as of Nov-17); it only mentions ‘recently’.

The posting on the Sophos blog by Paul Ducklin provides a very interesting overview of the cryptographic blunders made by Adobe. In this post I’ll focus on the content of the file, not the different cryptographic weaknesses.

Data dump file

It didn’t take long before a data dump, purporting to be the hacked customer database, was available for download on the Internet. The file was a 4G .tar.gz file called Base_users_adobe.com.tar.gz. It extracted to a 9G file called cred with 153004874 lines. These are the md5sums of the files:

e3eda0284c82aaf7a043a579a23a09ce  Base_users_adobe.com.tar.gz
020aaacc56de7a654be224870fb2b516  cred
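
A downloaded copy can be compared against these checksums before any processing. The following is a minimal sketch using PHP's md5_file(); the helper name verify_md5() is hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: compare a file's MD5 digest with an expected hex string.
function verify_md5(string $path, string $expected): bool
{
    if (!is_readable($path)) {
        return false;                       // missing or unreadable file
    }
    $actual = md5_file($path);              // lowercase hex digest, or false on failure
    return $actual !== false && hash_equals(strtolower($expected), $actual);
}
```

For example, verify_md5('cred', '020aaacc56de7a654be224870fb2b516') should return true for an unmodified copy of the extracted file.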

I have no reason to believe this file was crafted, but I also have no valid proof that this is the “real” database dump. Regardless, the data in the file is useful enough to run some statistics on.

If you think that your account might be affected or is in the data dump file then use the online verification tool from LastPass.

Structure of the dump file

The structure of the file is already described in the post on the Sophos blog. Basically every record contains six fields, separated by a pipe sign (|). The data was always embedded between hyphens (-).

  • 1 : the userID
  • 2 : a blank field, only containing the hyphens
  • 3 : the email address
  • 4 : password data
  • 5 : the password hint
  • 6 : a blank field, only containing the hyphens

Some unusual aspects were found in the file:

  • The third field, containing the email address, did not contain a valid email address in a couple of cases. It contained what seemed to be only a username (without the @domain part). Maybe these were older entries where usernames were not equal to the email address? See further down this post for the statistics on valid and non valid email addresses;
  • Not all the records had a password hint;
  • The first userID was 103238704, the last userID was 209850522;
  • The last line of the data file contained the string “152989508 rows selected.”.

Processing the file

I didn’t think it made sense to process all the records; a sample of the data would also return reasonably valid results. I created a PHP script that read every line of the file but only processed every 250th line.

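$rec_processed = 0;
$skip_row = 250;
while (($buffer = fgets($handle, 4096)) !== false) {
  // Assumed completion: skip every line whose counter is not a multiple of $skip_row.
  if (++$rec_processed % $skip_row) continue;
  // ... process the sampled line ...
}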
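
The sampling step can also be isolated in a small generator. This is a sketch only; sample_lines() is a hypothetical name, and it reads whole lines rather than at most 4096 bytes per call.

```php
<?php
// Yield every $step-th line of an open stream (lines 250, 500, ... for $step = 250).
// sample_lines() is a hypothetical helper, not the original script's function.
function sample_lines($handle, int $step): Generator
{
    $seen = 0;
    while (($buffer = fgets($handle)) !== false) {
        $seen++;
        if ($seen % $step !== 0) {
            continue;                   // not a sampled line, skip it
        }
        yield rtrim($buffer, "\r\n");   // strip the line ending only
    }
}
```

A caller would iterate with foreach (sample_lines($handle, 250) as $line) and hand each $line to the parsing step.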

Processing the line consisted of exploding the data into an array, where the split pattern is the pipe followed by the hyphen (|-).

$line = explode("|-", $buffer);

I then used a combination of substr and again explode to put the values into different fields.
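
A different way to get at the same fields is to split on the pipe alone and strip the wrapping hyphens afterwards. The sketch below does that; parse_record() is a hypothetical name, and it assumes each field carries at most one wrapping hyphen on each side, as the structure description suggests. A hint that itself contains a pipe would be rejected.

```php
<?php
// Hypothetical parser; the record layout is an assumption based on the description above.
function parse_record(string $line): ?array
{
    $parts = explode('|', rtrim($line, "\r\n"));
    if (count($parts) !== 6) {
        return null;                    // unexpected layout, e.g. a hint containing a pipe
    }
    // Remove at most one wrapping hyphen on each side of every field.
    $clean = array_map(
        fn(string $f): string => preg_replace('/^-|-$/', '', $f),
        $parts
    );
    return [
        'userid' => $clean[0],
        'email'  => $clean[2],
        'hash'   => $clean[3],          // the "password data" field
        'hint'   => $clean[4],
    ];
}
```

Lines that do not have six fields, such as the trailing “152989508 rows selected.” line, come back as null and can simply be skipped.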

I used a couple of booleans to check for specific domain types.

if (strpos($split_email[1], "mil.") !==false)   $mail_mil = true;
if (strpos($split_email[1], "gov.") !==false)   $mail_gov = true;
if (strpos($split_email[1], "fgov.") !==false)  $mail_gov = true;
if (strpos($split_email[1], "fed.") !==false)   $mail_gov = true;
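
Note that a substring test such as strpos($domain, "mil.") matches anywhere in the string and only when a dot follows, so a domain that ends in .mil is not flagged. A stricter alternative, sketched here under the assumption that whole domain labels are what matters, compares labels instead; domain_flags() is a hypothetical name and its counts would differ from the ones reported in this post.

```php
<?php
// Hypothetical alternative to the substring checks: compare whole domain labels.
function domain_flags(string $domain): array
{
    $labels = explode('.', strtolower($domain));
    return [
        'mail_mil' => in_array('mil', $labels, true),
        'mail_gov' => count(array_intersect(['gov', 'fgov', 'fed'], $labels)) > 0,
    ];
}
```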

In the last step I stored all the variables in a database table with the structure below. Most of these fields are self-explanatory. The table is not optimized for speed or storage, but it was perfect for this analysis.

  email varchar(100) NOT NULL,
  tld varchar(10) NOT NULL,
  domain varchar(50) NOT NULL,
  fulldomain varchar(250) NOT NULL,
  lasttwo varchar(100) NOT NULL,
  resetpw varchar(255) NOT NULL,
  resetpw_len int(11) NOT NULL,
  resetpw_data tinyint(1) NOT NULL,
  resetpw_numeric tinyint(1) NOT NULL,
  resetpw_count int(11) NOT NULL,
  mail_mil tinyint(1) NOT NULL,
  mail_gov tinyint(1) NOT NULL,
  userid varchar(15) NOT NULL,
  hash varchar(255) NOT NULL,
  validemail tinyint(1) NOT NULL,
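
How the domain-related columns might be derived from an email address can be sketched as follows. The meaning of each column is inferred from its name (tld as the last label, domain as the label before it, lasttwo as the last two labels, fulldomain as everything after the “@”), and email_columns() is a hypothetical helper.

```php
<?php
// Hypothetical helper: the column meanings are inferred from their names.
function email_columns(string $email): ?array
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return null;                       // username only, no @domain part
    }
    $fulldomain = strtolower(substr($email, $at + 1));
    $labels = explode('.', $fulldomain);
    $n = count($labels);
    return [
        'fulldomain' => $fulldomain,
        'tld'        => $labels[$n - 1],
        'domain'     => $n >= 2 ? $labels[$n - 2] : '',
        'lasttwo'    => implode('.', array_slice($labels, -2)),
    ];
}
```

Under this reading an address at hotmail.co.uk gets the domain value “co”, which would explain why “co” and “com” show up as domains in the statistics.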

The whole processing and storing in a database took a while and resulted in 609524 records and a table of 156.8 MB.


Remember that the statistics below are all based on the sampled data (1 in 250). To estimate the actual numbers, multiply by a factor of 250 at most.
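
As a worked example of that scaling, a small sketch (estimate_total() is a hypothetical name):

```php
<?php
// Scale a count from the 1-in-250 sample up to an upper estimate for the full file.
function estimate_total(int $sample_count, int $step = 250): int
{
    return $sample_count * $step;
}
```

With the sampled figure of 2057 non valid email addresses, estimate_total(2057) gives 514250 as the upper estimate for the whole file.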

Valid and non valid email address

Out of a total of 609524 records there were 2057 records with a non valid email address and 607467 with a valid email address.

Valid / non valid email address
Valid email       | Non valid email
607467 (99.66%)   | 2057 (0.34%)
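
The post does not say which validity test was applied. One plausible way to make the split, shown here purely as an assumption, is PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL; is_valid_email() is a hypothetical name.

```php
<?php
// Assumption: filter_var() stands in for whatever validity test the original script used.
function is_valid_email(string $value): bool
{
    return filter_var($value, FILTER_VALIDATE_EMAIL) !== false;
}
```

A bare username without the @domain part fails this check, which matches the kind of entry described as non valid above.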

Records with password reset data

There were 172076 records with password reset data and 437448 records had an empty password reset field. Note that out of the 172076 records with password reset data there were 1456 records that did not have a valid email address.

Records with password reset data
With password reset | Without password reset
172076 (28.23%)     | 437448 (71.77%)

Password reset length

The maximum password reset length used was 50 (used in 93 records). Strangely enough there were also 2339 records with a password reset length of 1. Most password reset data had a length of approx. 10 or less. Below is a sample of records (uid, email, password reset data) with password reset length of 1.

10487xxxx | @live.co.uk    | !
10858xxxx | @yahoo.com.tw  | 0
10866xxxx | @hotmail.com   | c
10867xxxx | @aon.at        | s

This is a sample of records with password reset length of 12.

onesty-brokerpark.de    | wie Apple ID
hotmail.com             | usual p/word 
casema.nl               | voornaam+100 
vlaspand.be             | 2 x deurcode 
dds.nl                  | Lelijk woord 
gmail.com               | myself ph no  
alliancemediagroup.net  | name, usual#

Password reset length
Password reset length | Occurrences | Share
1 to 10               | 135506      | 22.23%
11 to 20              | 31054       | 5.09%
21 to 30              | 4453        | 0.73%
31 to 40              | 773         | 0.13%
41 to 50              | 290         | 0.05%
0 or no data          | 437448      | 71.77%
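
The grouping into ranges of ten can be sketched like this. length_bucket() is a hypothetical name, and the use of strlen() (byte length) is an assumption; the original script may have counted characters differently.

```php
<?php
// Map a hint to one of the length ranges used in the table.
function length_bucket(string $hint): string
{
    $len = strlen($hint);            // byte length; mb_strlen() would count characters instead
    if ($len === 0) {
        return '0 or no data';
    }
    $upper = (int) (ceil($len / 10) * 10);   // round up to the next multiple of ten
    return ($upper - 9) . ' to ' . $upper;
}
```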

Password reset number of words

I also counted the number of words used in the password reset data. A word is anything that PHP’s str_word_count considers a word (see PHP.net).

Most password reset data consisted of one single word (109768 records). The longest password reset data was 14 words long.
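
To illustrate what str_word_count counts, here is a short sketch using a few of the sample hints from this post. It assumes the default locale, where only ASCII letters, apostrophes and hyphens are treated as word characters.

```php
<?php
// Digits and symbols act as separators, so a purely numeric hint has zero words.
$samples = ['wie Apple ID', '2 x deurcode', 'voornaam+100', '111111'];
$counts = array_map('str_word_count', $samples);
// $counts is [3, 2, 1, 0]
```

This may explain why the number of records with zero words is larger than the number of records without password reset data: hints made up of digits or non-Latin characters only would count as zero words.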

Password reset number of words
Number of words    | Occurrences | Share
14 words           | 1           |
13 words           | 3           |
12 words           | 10          |
11 words           | 18          |
10 words           | 30          |
9 words            | 79          | 0.01%
8 words            | 141         | 0.02%
7 words            | 335         | 0.05%
6 words            | 772         | 0.13%
5 words            | 1917        | 0.31%
4 words            | 4739        | 0.78%
3 words            | 12024       | 1.97%
2 words            | 28826       | 4.73%
1 word             | 109768      | 18.01%
0 words or no data | 450861      | 73.97%

Out of curiosity, I did a check on ‘non-polite’ words, the presence of ‘adobe’, ‘linux’, ‘microsoft’ or a reference to a loved-one in the password reset data. The low numbers seem to indicate that people watch their language when choosing password reset data …

‘Non-polite’ words
Type      | Occurrences
f**k y*u  | 22
adobe     | 761
d*mn      | 11
sh*t      | 57
honey     | 61
love      | 1310
microsoft | 6
linux     | 14

All numeric data in password reset

In total 5324 records had password reset data that consisted of only a numeric value. A couple of samples are below.

yahoo.com     | 111111
ymail.com     | 041117
live.com      | 987654321
naver.com     | 1626
telenet.be    | 695
hotmail.com   | 20858
hotmail.com   | 2410560724105607
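
A check for this kind of data can be sketched with ctype_digit(). The assumption here is that “only a numeric value” means ASCII digits only; is_all_numeric() is a hypothetical name.

```php
<?php
// Assumption: "all numeric" means the hint consists of ASCII digits only.
function is_all_numeric(string $hint): bool
{
    return ctype_digit($hint);      // false for '' and for anything containing a non-digit
}
```

The looser is_numeric() would also accept strings such as '1e5' or '-3.5', which is why the sketch avoids it.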

Numeric password reset data
All numeric    | Not all numeric (incl. no data)
5324 (0.87%)   | 604200 (99.13%)

Military or government email addresses

There were 383 email addresses with ‘mil’ in the address and 515 with ‘gov’, ‘fgov’ or ‘fed’ in the address.

Military or government email addresses
Type                       | Occurrences
‘mil.’                     | 383
‘gov.’, ‘fgov.’ or ‘fed.’  | 515

Top TLDs / Country

The majority of the email addresses are in the .com TLD. This is not unusual considering the popularity of the major email providers gmail.com, hotmail.com and yahoo.com. Leaving .com, .net, .org and .edu out of the list, most of the remaining accounts are in the TLDs of Germany (.de), France (.fr), the United Kingdom (.uk) and Japan (.jp). The percentages match a base of the 607467 records with a valid email address.

Top TLDs / Country
TLD  | Occurrences | Share
.com | 408609      | 67.26%
.net | 31091       | 5.12%
.de  | 21232       | 3.50%
.fr  | 16568       | 2.73%
.uk  | 14161       | 2.33%
.jp  | 12883       | 2.12%
.it  | 9663        | 1.59%
.ru  | 9136        | 1.50%
.edu | 7841        | 1.29%
.br  | 6755        | 1.11%
.ca  | 5426        | 0.89%
.au  | 4614        | 0.76%
.nl  | 4355        | 0.72%
.es  | 4191        | 0.69%
.org | 4020        | 0.66%
.pl  | 3858        | 0.64%
.be  | 1611        | 0.27%
.eu  | 425         | 0.07%

Top domains

There are two ways to look at the ‘top domains’: either take only the part directly before the TLD, ignoring the TLD itself, or take everything after the “@”. I made both comparisons.

Regardless of the approach, the top three are always hotmail, gmail and yahoo.
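
Either ranking comes down to counting occurrences per value and expressing each as a share of the total. A minimal in-memory sketch follows; top_counts() is a hypothetical helper, and the original analysis was presumably run as queries against the database table instead.

```php
<?php
// Count occurrences per value and express each as a share of the total, largest first.
function top_counts(array $values, int $limit = 10): array
{
    $counts = array_count_values($values);   // value => number of occurrences
    arsort($counts);                          // sort by count, descending, keeping keys
    $total = count($values);
    $top = [];
    foreach (array_slice($counts, 0, $limit, true) as $value => $n) {
        $top[$value] = [$n, round($n / $total * 100, 2)];
    }
    return $top;
}
```

Feeding it the domain column gives the first ranking, and feeding it the fulldomain column gives the second.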

Top domains
Domain  | Occurrences | Share
hotmail | 142336      | 23.43%
gmail   | 96312       | 15.85%
yahoo   | 79609       | 13.11%
co      | 24369       | 4.01%
com     | 20788       | 3.42%
aol     | 14010       | 2.31%
live    | 9769        | 1.61%
gmx     | 6266        | 1.03%
mail    | 5845        | 0.96%
msn     | 5677        | 0.93%
other   | 202486      | 33.33%

Top full domains
Domain        | Occurrences | Share
hotmail.com   | 129690      | 21.35%
gmail.com     | 95922       | 15.79%
yahoo.com     | 70996       | 11.69%
aol           | 13836       | 2.28%
hotmail.fr    | 6143        | 1.01%
msn.com       | 5628        | 0.93%
hotmail.co.uk | 5620        | 0.93%
mail.ru       | 4994        | 0.82%
web.de        | 4893        | 0.81%
live.com      | 4889        | 0.80%
other         | 264856      | 43.60%

I was also interested in the number of .be (Belgium) and .eu (Europe) domains that were affected. Below is an overview. In total there were 1611 .be records and 425 .eu records.

Top .be domains
.be domains | Occurrences | Share
skynet.be   | 336         | 20.86%
telenet.be  | 315         | 19.55%
pandora.be  | 122         | 7.57%
live.be     | 119         | 7.39%
hotmail.be  | 58          | 3.60%
scarlet.be  | 43          | 2.67%
fgov.be     | 3           | 0.19%
other       | 681         | 38.36%

Top .eu domains
.eu domains  | Occurrences | Share
onet.eu      | 87          | 20.47%
interia.eu   | 47          | 11.06%
ec.europa.eu | 3           | 0.71%
other        | 288         | 67.76%
