hub-db contains data associated with images from the adult pornography website PornHub. The repository holds the code used to scrape the data, as well as a data/ folder containing all the raw and processed data. Along with image links (no images have been downloaded, only links to them), we crawl image and album metadata such as tags, comments, views, votes, and upvote percentages, which should make the dataset widely applicable.
Option 1 - download specific data files and/or ignore the code to generate the data
1. Go to http://kinolien.github.io/gitzip/
2. Enter https://github.com/cdipaolo/hub-db/tree/master/data into the search bar and click 'search'
3. Download any files you'd like
Option 2 - clone the repository
    # using https
    $ git clone https://github.com/cdipaolo/hub-db.git

    # using ssh
    $ git clone git@github.com:cdipaolo/hub-db.git
In each of the processed data files (except for the tsv/csv tag datasets), every line is a JSON object. Here we show the structure of each file's JSON objects in Go syntax (to show types and allow for easy processing if you choose to use Go) as well as an example object. Note that the Go syntax structure docs weren't (necessarily) used to crawl the data, but you could use them to pull data from the preprocessed dataset.
The raw data isn't documented here, but you can see the code used to generate it as well as the associated Go structs which describe types within the repository. That data is crawled per search page, not per album or some other unit. Within each page, albums are crawled, recursively finding all associated images and tags. If you want to get albums as a unit, quickly being able to find the images and comments, etc., then you might want to look at that instead of the preprocessed datasets.
The preprocessed albums dataset contains broad information about albums without including all information about the associated images (besides the associated image IDs). Note that tags within an album are the same across all images in that album.
STRUCTURE
    // Album contains album information without
    // directly containing the images associated
    type Album struct {
        Votes          uint64      `json:"votes"`          // number of upvotes
        Title          string      `json:"title"`          // title of album
        Views          uint64      `json:"views"`          // number of album views
        NumberOfImages uint64      `json:"num_images"`     // number of images in album
        AlbumId        uint64      `json:"album_id"`       // unique album id (from PornHub)
        URI            string      `json:"uri"`            // href link to the album
        Images         []uint64    `json:"images"`         // array of associated image ids
        UpvotePercent  float64     `json:"upvote_percent"` // percentage of votes that were upvotes (as opposed to downvotes)
        Segment        PHubSegment `json:"segment"`        // segment of site
        Tags           []string    `json:"tags"`           // associated album tags
    }

    // PHubSegment describes all possible
    // segments of the PornHub website.
    // Relatively few albums are in the 'Gay',
    // 'Shemale', 'Miscellaneous', or ''
    // segments
    type PHubSegment string

    const (
        Blank         PHubSegment = ""
        Gay           PHubSegment = "Gay"
        Miscellaneous PHubSegment = "Miscellaneous"
        Shemale       PHubSegment = "Shemale"
        SoloFemale    PHubSegment = "Solo Female"
        SoloMale      PHubSegment = "Solo Male"
        StraightSex   PHubSegment = "Straight Sex"
    )
EXAMPLE
{ "votes": 75, "title": "Tanner Mayes of Foot Fetish Daily", "views": 11184, "num_images": 16, "album_id": 100062, "URI": "http://pornhub.com/album/100062", "images": [892148, 892155, 892157, 892160, 892162, 892164, 892166, 892168, 892170, 892171, 892174, 892176, 892178, 892180, 892181, 892183], "upvote_percent": 0.99, "segment": "Solo Female", "tags": ["Feet", "Fetish", "Foot", "Mayes", "Puerto", "Rican", "Soles", "Tanner", "Teen", "Toes"] }
The comments dataset holds every comment on every image that has been crawled within the full dataset.
STRUCTURE
    // Comment holds information about individual
    // comments on images
    type Comment struct {
        Username   string      `json:"username"`    // commenter's PornHub username
        AlbumId    uint64      `json:"album_id"`    // unique album identifier from PornHub
        NetUpvotes uint64      `json:"net_upvotes"` // number of net upvotes on the comment
        ImageId    uint64      `json:"image_id"`    // unique image identifier from PornHub
        Text       string      `json:"text"`        // comment text
        AlbumTitle string      `json:"album_title"` // title of associated album
        Segment    PHubSegment `json:"segment"`     // segment of associated album
    }
EXAMPLE
{ "username": "Cockinside", "album_id": 6619972, "net_upvotes": 0, "image_id": 100012292, "text": "Jesus, so fuckable", "album_title": "Me :3", "segment": "Solo Female" }
The images dataset holds all images (without comments), along with their associated timestamps, tags, etc., and the IDs referencing the image and its album. This is where the CDN image links are held.
STRUCTURE
    // Image holds information about individual images
    // (requires the standard library "time" package)
    type Image struct {
        Votes            uint64      `json:"votes"`        // number of image votes
        Views            uint64      `json:"views"`        // number of image views
        Timestamp        time.Time   `json:"timestamp"`    // when the image was posted
        AlbumId          uint64      `json:"album_id"`     // unique album identifier from PornHub
        URI              string      `json:"uri"`          // CDN image link (i.e. image.jpg file resource)
        ImageId          uint64      `json:"image_id"`     // unique image identifier from PornHub
        NumberOfComments uint64      `json:"num_comments"` // number of comments on the image
        AlbumTitle       string      `json:"album_title"`  // title of associated album
        Segment          PHubSegment `json:"segment"`      // segment of associated album
        Tags             []string    `json:"tags"`         // associated album tags
    }
EXAMPLE
{ "votes": 0, "views": 383, "timestamp": "2015-01-14T00:00:00Z", "album_id": 4928501, "uri": "http://i0.cdn2b.image.pornhub.phncdn.com/m=e-yaaGqaa/pics/albums/004/928/501/72364841/original_72364841.jpg", "image_id": 72364841, "num_comments": 0, "album_title": "Anime/Cartoons/3D", "segment": "Gay", "tags": ["anime-sex", "batman-hentai", "cartoon-sex", "dragon-ball-z", "hentai", "marvel-super-heroes", "penes-enormes", "superman-hentai"] }
A compiled list of the frequencies of all tags with a string length of at least 3 that have been seen more than 5 times (i.e. "xx" would not be recorded, but "hardcore-anal" and "xxx" would be if each has more than 5 occurrences). All tags are converted to lowercase. The format for this file (and tags.csv) is not JSON. The file is tab-separated with 2 columns: {tag, count}. The file is sorted alphabetically by tag.
STRUCTURE (note this won't be trivially unmarshalled because it's not JSON)
    type Frequency struct {
        Tag   string // the name of the tag
        Count uint64 // the number of times the tag has been seen in an album
    }
EXAMPLE (from $ head -n 20 tag_frequencies.tsv)

    "18yo"	6
    "3some"	7
    "abs"	18
    "abuse"	7
    "abused"	11
    "action"	7
    "actress"	11
    "add"	7
    "adult"	24
    "african"	7
    "akira"	7
    "album"	9
    "aletta"	7
    "alexis"	18
    "all"	41
    "allie"	9
    "amateur"	1083
    "amateur-blowjob"	6
    "amateur-couple"	6
    "amateur-milf"	8
The tags dataset holds the tag graph. Each row of tags.csv is a tab-separated association between a tag (left column) and an array of all tags that have been seen with that tag, along with the number of times they have been seen together. As a unit, this file describes an undirected, weighted graph. The file is sorted alphabetically by node tag.
Note that each edge array includes the node tag itself (a self-referencing edge). Its count tells you the total frequency of the tag throughout the dataset. It also means that every node included has a non-empty edge array.
STRUCTURE (note this won't be trivially unmarshalled because it's not JSON)
    type TagNode struct {
        Tag   string // the tag
        Edges []Edge // an array of edges
    }

    type Edge struct {
        Tag  string // the edge tag
        Seen uint64 // number of times the tag has been seen
    }
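Although a row of tags.csv is not a JSON object as a whole, its two columns each look like JSON values: a quoted string and an array of [tag, count] pairs. Under that assumption (plus a single tab between the columns), a row can be parsed by splitting on the tab and unmarshalling each half. This is only a sketch; `parseTagNode` and the custom `UnmarshalJSON` method are hypothetical additions, not part of the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type TagNode struct {
	Tag   string // the tag
	Edges []Edge // an array of edges
}

type Edge struct {
	Tag  string // the edge tag
	Seen uint64 // number of times the tag has been seen
}

// UnmarshalJSON decodes a two-element array such as ["horny", 1] into an Edge.
func (e *Edge) UnmarshalJSON(data []byte) error {
	var pair []json.RawMessage
	if err := json.Unmarshal(data, &pair); err != nil {
		return err
	}
	if len(pair) != 2 {
		return fmt.Errorf("edge has %d elements, want 2", len(pair))
	}
	if err := json.Unmarshal(pair[0], &e.Tag); err != nil {
		return err
	}
	return json.Unmarshal(pair[1], &e.Seen)
}

// parseTagNode (hypothetical helper) parses one row: "tag"<TAB>[[tag, n], ...].
func parseTagNode(line string) (TagNode, error) {
	var node TagNode
	tagPart, edgePart, ok := strings.Cut(line, "\t")
	if !ok {
		return node, fmt.Errorf("no tab separator in line %q", line)
	}
	// Assumption: both columns are valid JSON values.
	if err := json.Unmarshal([]byte(tagPart), &node.Tag); err != nil {
		return node, fmt.Errorf("bad tag column: %w", err)
	}
	if err := json.Unmarshal([]byte(edgePart), &node.Edges); err != nil {
		return node, fmt.Errorf("bad edge column: %w", err)
	}
	return node, nil
}

func main() {
	row := "\"#fuckme\"\t[[\"#fuckme\", 1], [\"horny\", 1], [\"#wantsex\", 1]]"
	node, err := parseTagNode(row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range node.Edges {
		if e.Tag == node.Tag {
			// The self-referencing edge carries the tag's total frequency.
			fmt.Println("total frequency:", e.Seen)
		}
	}
}
```

If the real file turns out to escape quotes or separate columns differently, the two `json.Unmarshal` calls are the places to adjust.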
EXAMPLE (from $ head -n 1 tags.csv)
"#fuckme" [["#fuckme", 1], ["horny", 1], ["#wantsex", 1]]