hub-db

adult images dataset for ML, NLP, and whatever else

hub-db contains data associated with images from the adult pornography website PornHub. The repository holds the code used to scrape the data, as well as a data/ folder which contains all the raw and processed data. No images have been downloaded, only links to the images. Along with the image links we crawl image and album metadata (tags, comments, views, votes, upvote percentages, etc.), which should make the dataset widely applicable.

Downloading The Data

Option 1 - download specific data files without the code used to generate them

1. Go to http://kinolien.github.io/gitzip/
2. Enter https://github.com/cdipaolo/hub-db/tree/master/data into the search bar and click 'search'
3. Download any files you'd like

Option 2 - clone the repository

# using https
$ git clone https://github.com/cdipaolo/hub-db.git

# using ssh
$ git clone git@github.com:cdipaolo/hub-db.git

Data Schema

In each of the processed data files (except for the tsv/csv tag datasets), every line is a JSON object. Here we show the structure of each file's JSON objects in Go syntax (to show types and allow for easy processing if you choose to use Go) as well as an example object. Note that the Go syntax structure docs weren't (necessarily) used to crawl the data, but you could use them to pull data from the preprocessed dataset.

The raw data isn't documented here, but the repository contains the code used to generate it as well as the Go structs that describe its types. The raw data is organized per search page, not per album or some other unit. Within each page, albums are crawled, recursively finding all associated images and tags. If you want albums as a unit, with quick access to their images, comments, etc., you might want to look at the raw data instead of the preprocessed datasets.

albums.json

The preprocessed albums dataset contains broad information about albums without including all information about the associated images (besides the associated image ids). Note that tags within albums are the same across all images within that album.

STRUCTURE

// Album contains album information without
// directly containing the images associated
type Album struct {
  Votes          uint64      `json:"votes"`          // number of upvotes
  Title          string      `json:"title"`          // title of album
  Views          uint64      `json:"views"`          // number of album views
  NumberOfImages uint64      `json:"num_images"`     // number of images in album
  AlbumId        uint64      `json:"album_id"`       // unique album id (from PornHub)
  URI            string      `json:"uri"`            // href link to the album
  Images         []uint64    `json:"images"`         // array of associated image ids
  UpvotePercent  float64     `json:"upvote_percent"` // fraction of votes that were upvotes, as opposed to downvotes (e.g. 0.99)
  Segment        PHubSegment `json:"segment"`        // segment of site
  Tags           []string    `json:"tags"`           // associated album tags
}

// PHubSegment describes all possible
// segments of the PornHub website.
// Relatively few albums are in the 'Gay',
// 'Shemale', 'Miscellaneous', or ''
// segments
type PHubSegment string
const (
  Blank         PHubSegment = ""
  Gay           PHubSegment = "Gay"
  Miscellaneous PHubSegment = "Miscellaneous"
  Shemale       PHubSegment = "Shemale"
  SoloFemale    PHubSegment = "Solo Female"
  SoloMale      PHubSegment = "Solo Male"
  StraightSex   PHubSegment = "Straight Sex"
)

EXAMPLE

{
  "votes": 75,
  "title": "Tanner Mayes of Foot Fetish Daily",
  "views": 11184,
  "num_images": 16,
  "album_id": 100062,
  "URI": "http://pornhub.com/album/100062",
  "images": [892148, 892155, 892157, 892160, 892162, 892164, 892166, 892168, 892170, 892171, 892174, 892176, 892178, 892180, 892181, 892183],
  "upvote_percent": 0.99,
  "segment": "Solo Female",
  "tags": ["Feet", "Fetish", "Foot", "Mayes", "Puerto", "Rican", "Soles", "Tanner", "Teen", "Toes"]
}
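
Because every line of the file is one JSON object, the file can be read line by line rather than as a single JSON document. The following is a minimal sketch of that idea, not the code used to build the dataset: parseAlbum and readAlbums are hypothetical helper names, Segment is reduced to a plain string to keep the block self-contained, and the 16 MiB line limit is an arbitrary assumption. The example object above spells the key "URI" while the struct tag is "uri"; encoding/json matches keys case-insensitively when decoding, so either spelling fills the field.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Album follows the documented structure. Segment is a plain string
// here only to keep the sketch self-contained.
type Album struct {
	Votes          uint64   `json:"votes"`
	Title          string   `json:"title"`
	Views          uint64   `json:"views"`
	NumberOfImages uint64   `json:"num_images"`
	AlbumId        uint64   `json:"album_id"`
	URI            string   `json:"uri"`
	Images         []uint64 `json:"images"`
	UpvotePercent  float64  `json:"upvote_percent"`
	Segment        string   `json:"segment"`
	Tags           []string `json:"tags"`
}

// parseAlbum (hypothetical helper) decodes a single line of albums.json.
func parseAlbum(line string) (Album, error) {
	var a Album
	err := json.Unmarshal([]byte(line), &a)
	return a, err
}

// readAlbums (hypothetical helper) decodes one Album per non-empty line.
func readAlbums(r io.Reader) ([]Album, error) {
	var albums []Album
	sc := bufio.NewScanner(r)
	// Albums with many images produce long lines, so raise the scanner's
	// default 64 KiB token limit (16 MiB is an arbitrary choice).
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		a, err := parseAlbum(line)
		if err != nil {
			return nil, fmt.Errorf("decoding album: %w", err)
		}
		albums = append(albums, a)
	}
	return albums, sc.Err()
}

func main() {
	// Shortened, partly made-up sample lines. In practice, pass an
	// *os.File opened on the albums file instead of a string reader.
	sample := `{"votes": 75, "album_id": 100062, "URI": "http://pornhub.com/album/100062", "images": [892148, 892155], "upvote_percent": 0.99}
{"votes": 3, "album_id": 100063, "uri": "http://pornhub.com/album/100063", "images": [], "upvote_percent": 0.5}`
	albums, err := readAlbums(strings.NewReader(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range albums {
		fmt.Println(a.AlbumId, a.URI, len(a.Images))
	}
}
```

The same loop should work for comments.json and the images files by swapping in the matching struct.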

comments.json

The comments dataset holds every comment on every image that has been crawled within the full dataset.

STRUCTURE

// Comment holds information about individual
// comments on images
type Comment struct {
  Username   string      `json:"username"`    // commenter's PornHub username
  AlbumId    uint64      `json:"album_id"`    // unique album identifier from PornHub
  NetUpvotes uint64      `json:"net_upvotes"` // number of net upvotes on the comment
  ImageId    uint64      `json:"image_id"`    // unique image identifier from PornHub
  Text       string      `json:"text"`        // comment text
  AlbumTitle string      `json:"album_title"` // title of associated album
  Segment    PHubSegment `json:"segment"`     // segment of associated album
}

EXAMPLE

{
  "username": "Cockinside",
  "album_id": 6619972,
  "net_upvotes": 0,
  "image_id": 100012292,
  "text": "Jesus, so fuckable",
  "album_title": "Me :3",
  "segment": "Solo Female"
}

images.json ⟶ {images_1.json, images_2.json}

The images dataset holds all images (without comments), along with their associated timestamps, tags, etc. and the ids referencing the image and its album. This is where the CDN image links are held.

STRUCTURE

type Image struct {
  Votes            uint64      `json:"votes"`        // number of image votes
  Views            uint64      `json:"views"`        // number of image views
  Timestamp        time.Time   `json:"timestamp"`    // when the image was posted
  AlbumId          uint64      `json:"album_id"`     // unique album identifier from PornHub
  URI              string      `json:"uri"`          // CDN image link (ie. image.jpg file resource)
  ImageId          uint64      `json:"image_id"`     // unique image identifier from PornHub
  NumberOfComments uint64      `json:"num_comments"` // number of comments on the image
  AlbumTitle       string      `json:"album_title"`  // title of associated album
  Segment          PHubSegment `json:"segment"`      // segment of associated album
  Tags             []string    `json:"tags"`         // associated album tags
}

EXAMPLE

{
  "votes": 0,
  "views": 383,
  "timestamp": "2015-01-14T00:00:00Z",
  "album_id": 4928501,
  "uri": "http://i0.cdn2b.image.pornhub.phncdn.com/m=e-yaaGqaa/pics/albums/004/928/501/72364841/original_72364841.jpg",
  "image_id": 72364841,
  "num_comments": 0,
  "album_title": "Anime/Cartoons/3D",
  "segment": "Gay",
  "tags": ["anime-sex", "batman-hentai", "cartoon-sex", "dragon-ball-z", "hentai", "marvel-super-heroes", "penes-enormes", "superman-hentai"]
}
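
Since albums only list image ids, a common first step is to group the image records by album_id so each album's CDN links can be looked up. The sketch below shows one way to do that under a few assumptions: parseImage and urisByAlbum are hypothetical helper names, only four of the documented fields are kept, and the sample lines use made-up placeholder URLs. time.Time decodes RFC 3339 strings such as the timestamp in the example above without extra code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Image keeps only the fields this sketch needs; the remaining
// documented fields are simply ignored by the decoder.
type Image struct {
	Timestamp time.Time `json:"timestamp"`
	AlbumId   uint64    `json:"album_id"`
	URI       string    `json:"uri"`
	ImageId   uint64    `json:"image_id"`
}

// parseImage (hypothetical helper) decodes one line of an images file.
func parseImage(line string) (Image, error) {
	var img Image
	err := json.Unmarshal([]byte(line), &img)
	return img, err
}

// urisByAlbum (hypothetical helper) groups CDN links by album id.
func urisByAlbum(images []Image) map[uint64][]string {
	out := make(map[uint64][]string)
	for _, img := range images {
		out[img.AlbumId] = append(out[img.AlbumId], img.URI)
	}
	return out
}

func main() {
	// Made-up sample lines with placeholder URLs.
	lines := []string{
		`{"timestamp": "2015-01-14T00:00:00Z", "album_id": 4928501, "uri": "http://cdn.example.invalid/a.jpg", "image_id": 72364841}`,
		`{"timestamp": "2015-01-15T00:00:00Z", "album_id": 4928501, "uri": "http://cdn.example.invalid/b.jpg", "image_id": 72364842}`,
	}
	var images []Image
	for _, l := range lines {
		img, err := parseImage(l)
		if err != nil {
			fmt.Println("skipping line:", err)
			continue
		}
		images = append(images, img)
	}
	for album, uris := range urisByAlbum(images) {
		fmt.Println(album, len(uris), uris)
	}
}
```

Holding every image in memory may not be practical for the full files; streaming the lines and writing each group out as it completes is an alternative.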

tag_frequencies.tsv

A compiled list of the frequencies of all tags with a string length of at least 3 that were seen more than 5 times (i.e. "xx" would not be recorded, but "hardcore-anal" and "xxx" would be if each has more than 5 occurrences). All tags are converted to lowercase. The format for this file (and tags.csv) is not JSON: it is a tab-separated file with 2 columns, {tag, count}. The file is sorted alphabetically by tag.

STRUCTURE (note this won't be trivially unmarshalled because it's not JSON)

type Frequency struct {
  Tag   string // the name of the tag
  Count uint64 // the number of times the tag has been seen in an album
}
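
Because the file is not JSON, the struct above has to be filled by hand. One possible approach, sketched here rather than taken from the repository, is to use encoding/csv with the delimiter set to a tab, which also strips the double quotes around each tag. parseFrequencies is a hypothetical helper name, and the sketch assumes every row has exactly two columns, as in the example rows of this section.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

type Frequency struct {
	Tag   string // the name of the tag
	Count uint64 // the number of times the tag has been seen in an album
}

// parseFrequencies (hypothetical helper) parses tab-separated rows of
// the form "tag"<TAB>count into Frequency values.
func parseFrequencies(data string) ([]Frequency, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.Comma = '\t'        // columns are separated by tabs, not commas
	r.FieldsPerRecord = 2 // assumption: exactly {tag, count} per row
	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	freqs := make([]Frequency, 0, len(rows))
	for _, row := range rows {
		n, err := strconv.ParseUint(row[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad count for %q: %w", row[0], err)
		}
		freqs = append(freqs, Frequency{Tag: row[0], Count: n})
	}
	return freqs, nil
}

func main() {
	// Two rows copied from the head of the file as shown in this section.
	sample := "\"abs\"\t18\n\"amateur\"\t1083\n"
	freqs, err := parseFrequencies(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range freqs {
		fmt.Println(f.Tag, f.Count)
	}
}
```

If some tags turn out to contain stray quote characters, setting LazyQuotes on the reader may be needed; that has not been checked against the full file.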

EXAMPLE (from $ head -n 20 tag_frequencies.tsv)

"18yo"	6
"3some"	7
"abs"	18
"abuse"	7
"abused"	11
"action"	7
"actress"	11
"add"	7
"adult"	24
"african"	7
"akira"	7
"album"	9
"aletta"	7
"alexis"	18
"all"	41
"allie"	9
"amateur"	1083
"amateur-blowjob"	6
"amateur-couple"	6
"amateur-milf"	8

tags.csv

The tags dataset holds the tag graph. Despite the .csv extension, each row of tags.csv is tab-separated: a tag (left column) followed by an array of all tags that have been seen with that tag, each paired with the number of times the two have been seen together. As a unit, this file describes an undirected, weighted graph. The file is sorted alphabetically by node tag.

Note that each edge array includes the node tag itself (a self-referencing edge). Its count tells you the total frequency of the tag throughout the dataset. It also means that every node in the file has a non-empty edge array.

STRUCTURE (note this won't be trivially unmarshalled because it's not JSON)

type TagNode struct {
  Tag   string // the tag
  Edges []Edge // an array of edges
}

type Edge struct {
  Tag  string // the edge tag
  Seen uint64 // number of times the edge tag has been seen together with the node tag
}
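
Each row mixes two formats: a tab separates the columns, while the edge column itself looks like a JSON array of [tag, count] pairs. A minimal sketch of parsing one row into the structs above follows. It assumes the tag column is a valid JSON string and the edge column is valid JSON, which matches the single example row in this section but has not been verified for the whole file. parseTagNode and totalFrequency are hypothetical helper names, and the sample row uses made-up weights.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type TagNode struct {
	Tag   string // the tag
	Edges []Edge // an array of edges
}

type Edge struct {
	Tag  string // the edge tag
	Seen uint64 // times seen together with the node tag
}

// UnmarshalJSON decodes a two-element array such as ["horny", 1].
func (e *Edge) UnmarshalJSON(b []byte) error {
	var pair []json.RawMessage
	if err := json.Unmarshal(b, &pair); err != nil {
		return err
	}
	if len(pair) != 2 {
		return fmt.Errorf("edge has %d elements, want 2", len(pair))
	}
	if err := json.Unmarshal(pair[0], &e.Tag); err != nil {
		return err
	}
	return json.Unmarshal(pair[1], &e.Seen)
}

// parseTagNode (hypothetical helper) parses one row of tags.csv.
func parseTagNode(line string) (TagNode, error) {
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) != 2 {
		return TagNode{}, fmt.Errorf("expected 2 tab-separated columns, got %d", len(parts))
	}
	var node TagNode
	if err := json.Unmarshal([]byte(parts[0]), &node.Tag); err != nil {
		return TagNode{}, fmt.Errorf("decoding tag: %w", err)
	}
	if err := json.Unmarshal([]byte(parts[1]), &node.Edges); err != nil {
		return TagNode{}, fmt.Errorf("decoding edges: %w", err)
	}
	return node, nil
}

// totalFrequency (hypothetical helper) returns the weight of the
// self-referencing edge, i.e. the tag's total frequency.
func totalFrequency(n TagNode) (uint64, bool) {
	for _, e := range n.Edges {
		if e.Tag == n.Tag {
			return e.Seen, true
		}
	}
	return 0, false
}

func main() {
	// Made-up row in the documented layout.
	row := "\"abs\"\t[[\"abs\", 18], [\"amateur\", 3]]"
	node, err := parseTagNode(row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	total, _ := totalFrequency(node)
	fmt.Println(node.Tag, len(node.Edges), total)
}
```

Giving Edge its own UnmarshalJSON keeps the pair-to-struct conversion in one place, so the edge column decodes straight into []Edge.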

EXAMPLE (from $ head -n 1 tags.csv)

"#fuckme"	[["#fuckme", 1], ["horny", 1], ["#wantsex", 1]]