| Student's name | SCIPER |
|---|---|
| Johanna Jacome Noia Nuding | 342074 |
| Paul Tissot-Daguette | 341190 |
| Pol Fuentes Camacho | 346020 |
Milestone 1 • Milestone 2 • Milestone 3
10% of the final grade
This is a preliminary milestone to let you set up goals for your final project and assess the feasibility of your ideas. Please, fill the following sections about your project.
(max. 2000 characters per section)
We chose the dataset International Bestsellers: The Dataset. It describes more than 7000 books along with author and publishing information. It is made available by the National Endowment for the Humanities (NEH) of the U.S. government, which we consider a trustworthy source. Furthermore, the columns are simple and well formatted, so we expect little to no pre-processing effort. We believe it will give us valuable insights into the research questions described below.
To enable a more detailed analysis, we add the Best Books Ever Dataset, which contains details about each book but lacks information about the author's gender and nationality. This dataset is the result of coursework by students of the Universitat Oberta de Catalunya (UOC). Its authors are transparent about how the data was collected: mainly by scraping Goodreads, a website that holds extensive information about books and is used by millions of readers.
Our project aims to analyze published books more closely. We want to look at aspects such as representation, nationality and gender across different genres. Our target audience is people who want to learn more about book publishing and equality.
We have thought about a few different questions that we could try to answer such as:
- Which voices are published the most, and where do they come from?
- Is there a correlation between the author's gender and the genre?
- Is there a discrepancy in the way books are rated depending on the author's gender?
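As a rough illustration of how the last two questions could be approached, the sketch below computes the share of each genre per author gender and the mean rating per author gender. It uses only the Python standard library and a handful of made-up rows; the field names `gender`, `genre` and `rating` are assumptions and not the actual column names of either dataset.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy rows. The keys "gender", "genre" and "rating" are
# placeholders, not the real column names of the datasets.
books = [
    {"gender": "female", "genre": "romance", "rating": 4.1},
    {"gender": "female", "genre": "crime", "rating": 3.9},
    {"gender": "male", "genre": "crime", "rating": 3.7},
    {"gender": "male", "genre": "crime", "rating": 4.0},
    {"gender": "male", "genre": "romance", "rating": 3.5},
]


def genre_shares_by_gender(rows):
    """Return {gender: {genre: share of that gender's books}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["gender"]][row["genre"]] += 1
    return {
        gender: {genre: n / sum(genres.values()) for genre, n in genres.items()}
        for gender, genres in counts.items()
    }


def mean_rating_by_gender(rows):
    """Return {gender: mean rating of that gender's books}."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["gender"]].append(row["rating"])
    return {gender: mean(values) for gender, values in ratings.items()}


shares = genre_shares_by_gender(books)
avg_ratings = mean_rating_by_gender(books)
```

A difference in these shares or means on the real data would only be a starting point: whether it is meaningful depends on sample sizes per group, which this sketch does not check.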
Using the second dataset to enrich our analysis, we could explore:
- The distribution of male/female/other characters across gender and nationality of authors as well as genres.
- The distribution of awards and user ratings depending on demographics.
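Enriching the first dataset with the second requires matching books across the two sources. One possible approach, sketched below, is to join on a normalized (title, author) key so that differences in case, accents and punctuation do not prevent a match. The records, the field names `title`, `author`, `gender` and `rating`, and the helpers `normalize` and `enrich` are all hypothetical and only serve the illustration.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation so strings compare loosely."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(kept.lower().split())


def enrich(bestsellers, details):
    """Left-join bestseller rows with detail rows on (title, author)."""
    index = {(normalize(d["title"]), normalize(d["author"])): d for d in details}
    merged = []
    for book in bestsellers:
        key = (normalize(book["title"]), normalize(book["author"]))
        extra = index.get(key, {})
        # Fields from the bestseller row win when both sides define them.
        merged.append({**extra, **book, "matched": key in index})
    return merged


# Made-up records standing in for the two datasets.
bestsellers = [
    {"title": "The Blue Café", "author": "A. Example", "gender": "female"},
    {"title": "Unmatched Title", "author": "B. Sample", "gender": "male"},
]
details = [
    {"title": "the blue cafe", "author": "a example", "rating": 4.2},
]

merged = enrich(bestsellers, details)
```

Keeping a `matched` flag makes it easy to report how many bestsellers have no counterpart in the second dataset, which matters for judging how representative the enriched analysis is.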
Pre-processing of the data set you chose
- Show some basic statistics and get insights about the data
Please see the Exploratory Data Analysis Report for the International Bestsellers dataset and the notebook for the Best Books Ever dataset for our results.
- What others have already done with the data?
The lab that created the first dataset focused on answering the following questions:
- Who are the bestselling authors of the early twenty-first century?
- What are the bestselling titles?
- Which publishers have the greatest success selling in international markets?
- How do the economics of multinational publishers affect the success of an international bestseller?
- What are the routes by which bestsellers travel from one national market to another?
- And in what ways do the gender and/or nationality of an author correspond with bestseller status?
The second dataset was created for a class at the Universitat Oberta de Catalunya. As far as we could see, it has not been used for further analysis.
- Why is your approach original?
We focus on a book's genre and how it relates to the author's gender and nationality, as well as how these factors are reflected in the book's overall rating.
- What source of inspiration do you take? Visualizations that you found on other websites or magazines (might be unrelated to your data).
The visualizations on the Flourish website are usually creative and informative, so they are a good source of inspiration for our own. Here is an example in which they visualize specific books.
We also reviewed papers that may be relevant to the work we plan to pursue. Here are the most notable ones:
- How should cultural diversity be measured? An application using the French publishing industry
- The Limits of Diversity: How Publishing Industries Make Race
- Comparing gender discrimination and inequality in indie and traditional publishing
The two-page document describing our project can be found here.
The live skeleton is accessible at this link, which is associated with the website branch of this repository.
80% of the final grade
- < 24h: 80% of the grade for the milestone
- < 48h: 70% of the grade for the milestone