The SIMS Backchannel Public Dataset
PRERELEASE DATA!
PLEASE DO NOT REDISTRIBUTE OR SHARE.
About
Public release of the dataset used in for log file analysis in "Social Uses of Communication Backchannels in a Shared Physical Space" by Rothenberg and King (2006).
Licensing
Pending legal advisement. Intention is free for academic usage with attribution.
Database Files
public_logs
Description: Anonymized chat logs for the SIMS Backchannel from
October 2004 to March 2006.
Version: 1.1
Size: 276,491 records.
Download:
[SQL], SQL format, 7.2MB download, 46MB uncompressed
[CSV], CSV format, 6.9MB download, 28MB uncompressed
[XML], XML format, 7.2MB download, 111MB uncompressed
field |
type |
comment |
id |
INT |
ID for log entry (sequential) |
ts |
DATETIME |
timestamp |
uuid |
INT |
unique userid of user for entry |
msg_hash |
VARBINARY(40) |
a salted SHA1 hash of the original message |
msg_length |
INT |
length of the original message (in characters) |
bot_cmd |
BOOL |
did msg begin with a "@" symbol? |
ends_question |
BOOL |
did msg end with a "?" symbol? |
ends_exclamation |
BOOL |
did msg end with a "!" symbol? |
ends_period |
BOOL |
did msg end with a "." symbol? |
contains_url |
BOOL |
did msg contain a "http://" string? |
public_users
Description: Table of UserIDs built from public_logs, with manual
coding added
Version: 1.0
Size: 335 records
Download:
[SQL], SQL format, 12KB download, 65KB uncompressed
[CSV], CSV format, 11KB download, 40KB uncompressed
[XML], XML format, 12KB download, 129KB uncompressed
field |
type |
comment |
uuid |
INT |
ID for log entry (sequential) |
uid |
INT |
If user is not a primary nick, this field is set equal
to the UUID of the primary nick for the account (based upon manual
coding) |
nickhash |
VARBINARY(40) |
a salted SHA1 hash of the original user nick |
time_first |
DATETIME |
timestamp of first msg |
time_last |
DATETIME |
timestamp of first msg |
tenure |
INT |
the user's tenure (amount of days in between time_first
and time_last) |
participation |
INT |
amount of msgs sent by this UUID |
isjunk |
BOOL |
Was user classified as a "junk" user (inactive) in the
R&K study? |
isdupe |
BOOL |
Was uuid classified as a "duplicate" user (not primary
account) in the R&K study? |
isbot |
BOOL |
Was uuid classified as a "bot" in the R&K study? |
Examples
Som example queries on the dataset to get you started, shown in MySQL.
How many distinct users in a given day?
SELECT date(ts), count(distinct p.uuid) AS amount_unique_users
FROM public_logs p, public_users n
WHERE (p.uuid = n.uuid) AND
isjunk = 0 AND isdupe = 0
GROUP BY date(ts)
ORDER BY count(distinct p.uuid) DESC
LIMIT 5;
+------------+---------------------+
| date(ts) | amount_unique_users |
+------------+---------------------+
| 2005-10-31 | 24 |
| 2005-11-07 | 23 |
| 2005-11-14 | 23 |
| 2005-10-26 | 23 |
| 2006-02-23 | 22 |
+------------+---------------------+
On what day of week are the most questions asked?
SELECT dayname(ts), count(msg_hash)
FROM public_logs
WHERE ends_question = TRUE
GROUP BY dayofweek(ts);
+-------------+-----------------+
| dayname(ts) | count(msg_hash) |
+-------------+-----------------+
| Sunday | 3533 |
| Monday | 5712 |
| Tuesday | 6279 |
| Wednesday | 4871 |
| Thursday | 5388 |
| Friday | 2192 |
| Saturday | 2961 |
+-------------+-----------------+
More complex statistical analysis
Here's a fancy complicated SQL query that was actually used in the analysis for our paper! This was to answer the question "is there more link sharing during the week when people are in class?"
SELECT *, (urls / msgs)*100 AS urlpercent FROM
(
SELECT dayname(ts) AS day,
DAYOFWEEK(ts) in (1,7) AS weekend,
COUNT(msg_hash) AS msgs,
SUM(contains_url) AS urls
FROM public_logs
GROUP BY dayofweek(ts)
) AS sub;
+-----------+---------+-------+------+------------+
| day | weekend | msgs | urls | urlpercent |
+-----------+---------+-------+------+------------+
| Sunday | 1 | 31983 | 424 | 1.3257 |
| Monday | 0 | 49634 | 759 | 1.5292 |
| Tuesday | 0 | 56182 | 903 | 1.6073 |
| Wednesday | 0 | 43010 | 655 | 1.5229 |
| Thursday | 0 | 47342 | 708 | 1.4955 |
| Friday | 0 | 20686 | 299 | 1.4454 |
| Saturday | 1 | 27654 | 372 | 1.3452 |
+-----------+---------+-------+------+------------+
7 rows in set (1.29 sec)
Is the difference significant? If you import these numbers into a stats package (I highly recommend GNU R), you'll see link sharing is 13.8% greater during the week, and is statistically significant at p =~ 0.001
. Here's an example of how to test that in R:
> mean(urlpercent[weekend==0]) / mean(urlpercent[weekend==1])
[1] 1.138238
> t.test(urlpercent[weekend==0], urlpercent[weekend==1])
Welch Two Sample t-test
data: urlpercent[weekend == 0] and urlpercent[weekend == 1]
t = 6.5721, df = 4.81, p-value = 0.001418
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1115354 0.2576846
sample estimates:
mean of x mean of y
1.52006 1.33545