[XML], XML format, 12KB download, 129KB uncompressed
field |
type |
comment |
uuid |
INT |
ID for log entry (sequential) |
uid |
INT |
If user is not a primary nick, this field is set equal
to the UUID of the primary nick for the account (based upon manual
coding) |
nickhash |
VARBINARY(40) |
a salted SHA1 hash of the original user nick |
time_first |
DATETIME |
timestamp of first msg |
time_last |
DATETIME |
timestamp of first msg |
tenure |
INT |
the user's tenure (amount of days in between time_first
and time_last) |
participation |
INT |
amount of msgs sent by this UUID |
isjunk |
BOOL |
Was user classified as a "junk" user (inactive) in the
R&K study? |
isdupe |
BOOL |
Was uuid classified as a "duplicate" user (not primary
account) in the R&K study? |
isbot |
BOOL |
Was uuid classified as a "bot" in the R&K study? |
Examples
Som example queries on the dataset to get you started, shown in MySQL.
How many distinct users in a given day?
SELECT date(ts), count(distinct p.uuid) AS amount_unique_users
FROM public_logs p, public_users n
WHERE (p.uuid = n.uuid) AND
isjunk = 0 AND isdupe = 0
GROUP BY date(ts)
ORDER BY count(distinct p.uuid) DESC
LIMIT 5;
+------------+---------------------+
| date(ts) | amount_unique_users |
+------------+---------------------+
| 2005-10-31 | 24 |
| 2005-11-07 | 23 |
| 2005-11-14 | 23 |
| 2005-10-26 | 23 |
| 2006-02-23 | 22 |
+------------+---------------------+
On what day of week are the most questions asked?
SELECT dayname(ts), count(msg_hash)
FROM public_logs
WHERE ends_question = TRUE
GROUP BY dayofweek(ts);
+-------------+-----------------+
| dayname(ts) | count(msg_hash) |
+-------------+-----------------+
| Sunday | 3533 |
| Monday | 5712 |
| Tuesday | 6279 |
| Wednesday | 4871 |
| Thursday | 5388 |
| Friday | 2192 |
| Saturday | 2961 |
+-------------+-----------------+
More complex statistical analysis
Here's a fancy complicated SQL query that was actually used in the analysis for our paper! This was to answer the question "is there more link sharing during the week when people are in class?"
SELECT *, (urls / msgs)*100 AS urlpercent FROM
(
SELECT dayname(ts) AS day,
DAYOFWEEK(ts) in (1,7) AS weekend,
COUNT(msg_hash) AS msgs,
SUM(contains_url) AS urls
FROM public_logs
GROUP BY dayofweek(ts)
) AS sub;
+-----------+---------+-------+------+------------+
| day | weekend | msgs | urls | urlpercent |
+-----------+---------+-------+------+------------+
| Sunday | 1 | 31983 | 424 | 1.3257 |
| Monday | 0 | 49634 | 759 | 1.5292 |
| Tuesday | 0 | 56182 | 903 | 1.6073 |
| Wednesday | 0 | 43010 | 655 | 1.5229 |
| Thursday | 0 | 47342 | 708 | 1.4955 |
| Friday | 0 | 20686 | 299 | 1.4454 |
| Saturday | 1 | 27654 | 372 | 1.3452 |
+-----------+---------+-------+------+------------+
7 rows in set (1.29 sec)
Is the difference significant? If you import these numbers into a stats package (I highly recommend GNU R), you'll see link sharing is 13.8% greater during the week, and is statistically significant at p =~ 0.001
. Here's an example of how to test that in R:
> mean(urlpercent[weekend==0]) / mean(urlpercent[weekend==1])
[1] 1.138238
> t.test(urlpercent[weekend==0], urlpercent[weekend==1])
Welch Two Sample t-test
data: urlpercent[weekend == 0] and urlpercent[weekend == 1]
t = 6.5721, df = 4.81, p-value = 0.001418
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1115354 0.2576846
sample estimates:
mean of x mean of y
1.52006 1.33545