I apologize for the inconvenience; there seems to have been a technical issue with the code execution. Let's try loading the dataset again.
```python
# Load the new dataset
data_new = pd.read_csv('/mnt/data/papervcitation.csv')
# Display the first few rows of the dataset
data_new.head()
```
Apologies for the inconvenience. It seems the previous code failed because the pandas library had not been imported. Let's import it first and proceed.
```python
import pandas as pd

# Load the new dataset
data_new = pd.read_csv('/mnt/data/papervcitation.csv')
# Display the first few rows of the dataset
data_new.head()
```
This dataset contains three columns: 'year', 'citations', and 'no of papers'.
- 'year': This seems to represent the year in which the papers were published or the citations were received.
- 'citations': This likely represents the total number of citations received in each year.
- 'no of papers': This probably represents the number of papers published in each year.
Let's proceed with some basic data analysis. We'll start by checking the basic statistics of the dataset.
```python
# Check basic statistics of the data
data_new.describe()
```
Here are some basic statistics for the 'year', 'citations', and 'no of papers' columns:
- The data spans 25 years, from 1999 to 2023.
- The average number of citations per year is approximately 23.56, with a standard deviation of about 24.63.
- The number of citations received in a year ranges from 0 to 81.
- The average number of papers published per year is approximately 4.96, with a standard deviation of about 4.29.
- The number of papers published in a year ranges from 0 to 14.
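As a rough illustration of where the summary figures quoted in this section come from, the sketch below runs `describe()` on a tiny stand-in table and reads off the same kinds of values. The `toy` frame and its numbers are hypothetical and are not taken from papervcitation.csv; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical stand-in data; the real values live in papervcitation.csv.
toy = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "citations": [0, 4, 10, 18],
    "no of papers": [1, 2, 2, 3],
})

summary = toy.describe()

# Number of calendar years covered, counting both endpoints.
span_years = int(toy["year"].max() - toy["year"].min() + 1)

# describe() reports the sample standard deviation (ddof=1).
mean_citations = summary.loc["mean", "citations"]
std_citations = summary.loc["std", "citations"]
citation_range = (summary.loc["min", "citations"], summary.loc["max", "citations"])

print(f"span: {span_years} years")
print(f"citations: mean={mean_citations:.2f}, std={std_citations:.2f}, range={citation_range}")
```

Running the same lookups on `data_new.describe()` should give the mean, standard deviation, minimum, and maximum reported for each column.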
Next, let's visualize the trend of 'citations' and 'no of papers' over the years. We'll also fit a linear model and a degree-3 polynomial model to each trend.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Create a linear regression model
linear_regressor = LinearRegression()

# Fit the linear regression model to the data
X = data_new['year'].values.reshape(-1, 1)
y_citations = data_new['citations'].values
y_papers = data_new['no of papers'].values
linear_regressor.fit(X, y_citations)
y_citations_pred = linear_regressor.predict(X)
linear_regressor.fit(X, y_papers)
y_papers_pred = linear_regressor.predict(X)

# Fit a polynomial of degree 3 to the data
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
poly_regressor = LinearRegression()
poly_regressor.fit(X_poly, y_citations)
y_citations_poly_pred = poly_regressor.predict(X_poly)
poly_regressor.fit(X_poly, y_papers)
y_papers_poly_pred = poly_regressor.predict(X_poly)

# Plot the data and the fitted lines
plt.figure(figsize=(12, 6))

# Plot the original data
plt.scatter(data_new['year'], data_new['citations'], color='blue', label='Citations')
plt.scatter(data_new['year'], data_new['no of papers'], color='red', label='No of Papers')

# Plot the linear fits
plt.plot(data_new['year'], y_citations_pred, color='blue', linestyle='--', label='Citations Linear Fit')
plt.plot(data_new['year'], y_papers_pred, color='red', linestyle='--', label='No of Papers Linear Fit')

# Plot the polynomial fits
plt.plot(data_new['year'], y_citations_poly_pred, color='blue', linestyle=':', label='Citations Polynomial Fit')
plt.plot(data_new['year'], y_papers_poly_pred, color='red', linestyle=':', label='No of Papers Polynomial Fit')

plt.title('Citations and No of Papers over Years with Linear and Polynomial Fits')
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend()
plt.grid(True)
plt.show()
```
I apologize for the oversight. It seems I forgot to import the matplotlib library, which is used for creating plots. Let's correct that and proceed.
```python
import matplotlib.pyplot as plt

# Plot the data and the fitted lines
plt.figure(figsize=(12, 6))

# Plot the original data
plt.scatter(data_new['year'], data_new['citations'], color='blue', label='Citations')
plt.scatter(data_new['year'], data_new['no of papers'], color='red', label='No of Papers')

# Plot the linear fits
plt.plot(data_new['year'], y_citations_pred, color='blue', linestyle='--', label='Citations Linear Fit')
plt.plot(data_new['year'], y_papers_pred, color='red', linestyle='--', label='No of Papers Linear Fit')

# Plot the polynomial fits
plt.plot(data_new['year'], y_citations_poly_pred, color='blue', linestyle=':', label='Citations Polynomial Fit')
plt.plot(data_new['year'], y_papers_poly_pred, color='red', linestyle=':', label='No of Papers Polynomial Fit')

plt.title('Citations and No of Papers over Years with Linear and Polynomial Fits')
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend()
plt.grid(True)
plt.show()
```
The plot above shows the original data for 'Citations' and 'No of Papers' along with the fitted linear and polynomial curves for each.
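One practical caveat with a cubic fit on raw year values is that cubing numbers near 2000 produces very large features, which can make the least-squares problem poorly conditioned. The sketch below shows one possible way around this: standardizing the year before expanding it into polynomial terms. The `counts` series is synthetic and hypothetical, generated only for illustration, and the pipeline is an alternative arrangement rather than the code used for the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, hypothetical yearly counts (not the values in papervcitation.csv).
years = np.arange(1999, 2024).reshape(-1, 1)
rng = np.random.default_rng(1)
counts = 0.15 * (years.ravel() - 1999) ** 2 + rng.normal(0.0, 3.0, size=years.shape[0])

# Standardize the year first so the cubic terms stay on a modest scale.
cubic_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=3),
    LinearRegression(),
)
cubic_model.fit(years, counts)
cubic_pred = cubic_model.predict(years)

# Plain linear fit on the same data for comparison.
linear_model = LinearRegression().fit(years, counts)

# R^2 on the training data; the cubic model can never score lower here
# because the linear model is a special case of it.
r2_linear = linear_model.score(years, counts)
r2_cubic = cubic_model.score(years, counts)
print(f"R^2 linear: {r2_linear:.3f}, R^2 cubic: {r2_cubic:.3f}")
```

The same pipeline could be fitted to `X` and `y_citations` from the earlier cell, assuming those variables are still in scope.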
The linear fit provides a simple model of the overall trend, while the polynomial fit can capture more complex patterns such as curvature. However, a polynomial fit is also more sensitive to year-to-year noise, so it may overfit and generalize poorly to future data, particularly when extrapolated beyond the observed years.
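The generalization concern can be checked rather than assumed. A minimal sketch, assuming a simple hold-out of the last five years: fit both models on the earlier years and compare the error on the held-out years. The `counts` series is synthetic and hypothetical, and `numpy.polynomial.Polynomial.fit` is used here in place of the scikit-learn models from the earlier cell.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic, hypothetical yearly counts (not the values in papervcitation.csv).
years = np.arange(1999, 2024)
rng = np.random.default_rng(0)
counts = 0.15 * (years - 1999) ** 2 + rng.normal(0.0, 3.0, size=years.size)

# Hold out the most recent five years to mimic predicting "future" data.
train = years <= 2018
test = ~train


def rmse(model, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))


results = {}
for degree in (1, 3):
    # Polynomial.fit rescales x internally, which keeps the fit well conditioned.
    model = Polynomial.fit(years[train], counts[train], deg=degree)
    results[degree] = {
        "train_rmse": rmse(model, years[train], counts[train]),
        "test_rmse": rmse(model, years[test], counts[test]),
    }

for degree, scores in results.items():
    print(f"degree {degree}: train RMSE={scores['train_rmse']:.2f}, "
          f"test RMSE={scores['test_rmse']:.2f}")
```

The higher-degree fit always matches the training years at least as well, so the held-out error is the number to watch when judging which model extrapolates better.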
From the graph, both the number of papers and the number of citations show a general upward trend over the years. Citations appear to rise more sharply than papers, which suggests that each paper may be attracting more citations over time, although citations counted in a given year can also go to papers published in earlier years. The polynomial fit follows the fluctuations in the data more closely than the linear fit.
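One way to probe the per-paper reading of the trend is to compute a ratio directly. The sketch below derives citations per paper published that year and citations per cumulative paper, guarding against years with zero papers. The `toy` frame and the derived column names are hypothetical; only 'year', 'citations', and 'no of papers' come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real values live in papervcitation.csv.
toy = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "citations": [0, 4, 10, 18],
    "no of papers": [0, 2, 2, 3],
})

# Treat years with zero papers as undefined rather than dividing by zero.
papers = toy["no of papers"].replace(0, np.nan)
toy["citations_per_paper"] = toy["citations"] / papers

# Citations in a year may go to older papers, so also compare against
# the running total of papers published so far.
toy["cumulative_papers"] = toy["no of papers"].cumsum()
toy["citations_per_cumulative_paper"] = (
    toy["citations"] / toy["cumulative_papers"].replace(0, np.nan)
)

print(toy)
```

If both ratios rise over time on the real data, that would support the per-paper interpretation more strongly than the raw counts alone.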
If you have more specific questions or if there's another aspect of the data you'd like to explore, please let me know!