Wednesday, 26 July 2023

ChatGPT, Data Scientist - fitting it a bit

This is a second post about using ChatGPT to do some data analysis. The first post looked at using it to produce some basic statistics: https://robotsandphysicalcomputing.blogspot.com/2023/07/chatgpt-data-scientist.html

This time, the CSV file contained some publication data: the year of publication, the number of papers published in that year, and the number of citations that the papers from that year have received over time. The data was collected from Google Scholar for one person.



The prompt used (again with the Code Interpreter plug-in) and the resulting transcript are shown below.

Analyse this data as a Data Analyst including linear and polynominal fitring
papervcitation.csv

Interestingly it spotted some errors in its own code and corrected them. 

I apologize for the inconvenience, there seems to have been a technical issue with the code execution. Let's try to load the dataset again.

Finished working
Hide work
python
# Load the new dataset
data_new = pd.read_csv('/mnt/data/papervcitation.csv')

# Display the first few rows of the dataset
data_new.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 2
      1 # Load the new dataset
----> 2 data_new = pd.read_csv('/mnt/data/papervcitation.csv')
      4 # Display the first few rows of the dataset
      5 data_new.head()

NameError: name 'pd' is not defined

Apologies for the inconvenience. It seems there was a mistake in the code execution. We need to import the pandas library first. Let's correct that and proceed.

Finished working
Hide work
python
import pandas as pd

# Load the new dataset
data_new = pd.read_csv('/mnt/data/papervcitation.csv')

# Display the first few rows of the dataset
data_new.head()
RESULT
   year  citations  no of papers
0  1999          5             2
1  2000          0             0
2  2001         23             3
3  2002         76             2
4  2003         36             4

    This dataset contains three columns: 'year', 'citations', and 'no of papers'.

    • 'year': This seems to represent the year in which the papers were published or the citations were received.
    • 'citations': This likely represents the total number of citations received in each year.
    • 'no of papers': This probably represents the number of papers published in each year.

    Let's proceed with some basic data analysis. We'll start by checking the basic statistics of the dataset.

    Finished working
    Hide work
    python
    # Check basic statistics of the data
    data_new.describe()
    RESULT
                  year  citations  no of papers
    count    25.000000  25.000000     25.000000
    mean   2011.000000  23.560000      4.960000
    std       7.359801  24.627356      4.286413
    min    1999.000000   0.000000      0.000000
    25%    2005.000000   2.000000      2.000000
    50%    2011.000000  19.000000      4.000000
    75%    2017.000000  39.000000      7.000000
    max    2023.000000  81.000000     14.000000

    Here are some basic statistics for the 'year', 'citations', and 'no of papers' columns:

    • The data spans 25 years, from 1999 to 2023.
    • The average number of citations per year is approximately 23.56, with a standard deviation of about 24.63.
    • The number of citations received in a year ranges from 0 to 81.
    • The average number of papers published per year is approximately 4.96, with a standard deviation of about 4.29.
    • The number of papers published in a year ranges from 0 to 14.

    Next, let's visualize the trend of 'citations' and 'no of papers' over the years. We'll also perform linear and polynomial fitting to these trends.

    Finished working
    Hide work
    python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    # Create a linear regression model
    linear_regressor = LinearRegression()

    # Fit the linear regression model to the data
    X = data_new['year'].values.reshape(-1,1)
    y_citations = data_new['citations'].values
    y_papers = data_new['no of papers'].values

    linear_regressor.fit(X, y_citations)
    y_citations_pred = linear_regressor.predict(X)

    linear_regressor.fit(X, y_papers)
    y_papers_pred = linear_regressor.predict(X)

    # Fit a polynomial of degree 3 to the data
    poly = PolynomialFeatures(degree=3)
    X_poly = poly.fit_transform(X)
    poly_regressor = LinearRegression()

    poly_regressor.fit(X_poly, y_citations)
    y_citations_poly_pred = poly_regressor.predict(X_poly)

    poly_regressor.fit(X_poly, y_papers)
    y_papers_poly_pred = poly_regressor.predict(X_poly)

    # Plot the data and the fitted lines
    plt.figure(figsize=(12,6))

    # Plot the original data
    plt.scatter(data_new['year'], data_new['citations'], color = 'blue', label='Citations')
    plt.scatter(data_new['year'], data_new['no of papers'], color = 'red', label='No of Papers')

    # Plot the linear fits
    plt.plot(data_new['year'], y_citations_pred, color = 'blue', linestyle = '--', label='Citations Linear Fit')
    plt.plot(data_new['year'], y_papers_pred, color = 'red', linestyle = '--', label='No of Papers Linear Fit')

    # Plot the polynomial fits
    plt.plot(data_new['year'], y_citations_poly_pred, color = 'blue', linestyle = ':', label='Citations Polynomial Fit')
    plt.plot(data_new['year'], y_papers_poly_pred, color = 'red', linestyle = ':', label='No of Papers Polynomial Fit')

    plt.title('Citations and No of Papers over Years with Linear and Polynomial Fits')
    plt.xlabel('Year')
    plt.ylabel('Count')
    plt.legend()
    plt.grid(True)
    plt.show()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    Cell In[4], line 31
         28 y_papers_poly_pred = poly_regressor.predict(X_poly)
         30 # Plot the data and the fitted lines
    ---> 31 plt.figure(figsize=(12,6))
         33 # Plot the original data
         34 plt.scatter(data_new['year'], data_new['citations'], color = 'blue', label='Citations')
    
    NameError: name 'plt' is not defined
    

    I apologize for the oversight. It seems I forgot to import the matplotlib library which is used for creating plots. Let's correct that and proceed.

    Finished working
    Hide work
    python
    import matplotlib.pyplot as plt

    # Plot the data and the fitted lines
    plt.figure(figsize=(12,6))

    # Plot the original data
    plt.scatter(data_new['year'], data_new['citations'], color = 'blue', label='Citations')
    plt.scatter(data_new['year'], data_new['no of papers'], color = 'red', label='No of Papers')

    # Plot the linear fits
    plt.plot(data_new['year'], y_citations_pred, color = 'blue', linestyle = '--', label='Citations Linear Fit')
    plt.plot(data_new['year'], y_papers_pred, color = 'red', linestyle = '--', label='No of Papers Linear Fit')

    # Plot the polynomial fits
    plt.plot(data_new['year'], y_citations_poly_pred, color = 'blue', linestyle = ':', label='Citations Polynomial Fit')
    plt.plot(data_new['year'], y_papers_poly_pred, color = 'red', linestyle = ':', label='No of Papers Polynomial Fit')

    plt.title('Citations and No of Papers over Years with Linear and Polynomial Fits')
    plt.xlabel('Year')
    plt.ylabel('Count')
    plt.legend()
    plt.grid(True)
    plt.show()

    The plot above shows the original data for 'Citations' and 'No of Papers' along with the fitted linear and polynomial curves for each.

    The linear fit provides a simple model of the overall trend, while the polynomial fit can capture more complex patterns in the data. However, the polynomial fit can also be more susceptible to fluctuations in the data and may not generalize as well to future data.

    From the graph, it's evident that both the number of papers and citations have a general upward trend over the years. The number of citations seems to have a sharper increase compared to the number of papers, indicating that each paper might be getting more citations over the years. The polynomial fit captures the fluctuation in the data more accurately than the linear fit.

    If you have more specific questions or if there's another aspect of the data you'd like to explore, please let me know!
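One way to check the suggestion that each paper might be getting more citations over the years is to compute citations per paper directly. This is only a rough sketch, not something ChatGPT produced: it uses just the five rows shown by data_new.head() above, and the 'citations per paper' column name is my own.

```python
import pandas as pd

# First five rows of the dataset, as shown by data_new.head() in the transcript.
data = pd.DataFrame({
    "year": [1999, 2000, 2001, 2002, 2003],
    "citations": [5, 0, 23, 76, 36],
    "no of papers": [2, 0, 3, 2, 4],
})

# Years with no papers would mean dividing by zero, so mark them as missing (NaN).
papers = data["no of papers"].where(data["no of papers"] > 0)

# 'citations per paper' is a hypothetical column name used for illustration.
data["citations per paper"] = data["citations"] / papers

print(data)
```

With the full 25 years loaded, the same ratio could be plotted against the year to see whether it really does rise.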


Thoughts: Not bad. The linear fit is reasonable, and "indicating that each paper might be getting more citations over the years" in particular is a valid conclusion. The polynomial fitting worked, but it is not a great technique in this case.
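A simple way to judge whether a polynomial fit is worth having is to hold back the last few years, fit on the rest, and compare the errors. The sketch below does this with NumPy rather than scikit-learn. Everything in it is an assumption for illustration: the data is synthetic (a straight-line trend plus noise, not the Google Scholar data), and the five-year hold-out and the rmse helper are my own choices.

```python
import numpy as np

# Synthetic, made-up data: a linear trend plus noise over 25 years.
rng = np.random.default_rng(0)
years = np.arange(1999, 2024)
counts = 2.0 * (years - 1999) + rng.normal(0.0, 5.0, size=years.size)

# Centre the years so the cubic fit stays numerically well conditioned.
x = years - years.mean()

# Fit on the first 20 years and hold out the last 5 as "future data".
train, test = slice(0, 20), slice(20, 25)


def rmse(degree, fit_on, score_on):
    """Fit a polynomial of the given degree on one slice and score it on another."""
    coeffs = np.polyfit(x[fit_on], counts[fit_on], degree)
    residuals = np.polyval(coeffs, x[score_on]) - counts[score_on]
    return float(np.sqrt(np.mean(residuals ** 2)))


for degree in (1, 3):
    print(
        f"degree {degree}: "
        f"train RMSE = {rmse(degree, train, train):.2f}, "
        f"held-out RMSE = {rmse(degree, train, test):.2f}"
    )
```

The cubic will always match the training years at least as closely as the straight line, because it has more freedom. The held-out error is the number to watch when deciding whether that extra freedom helps.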


All opinions in this blog are the Author's and should not in any way be seen as reflecting the views of any organisation the Author has any association with. Twitter @scottturneruon

Thursday, 20 July 2023

ChatGPT, Data Scientist


ChatGPT (with a Plus account) has launched a beta feature to try: Code Interpreter, which can be used to produce the start of an analysis of data. It produces Python code for the analysis, based on Pandas.




Prompt 1

I loaded a file of COVID data from 2021 and typed in the prompt "Acts as a data scientist and analze the datashet amd make charts and graphs to show major trends". Even with the spelling mistakes, it did the following:

 

Acts as a data scientist and analze the datashet amd make charts and graphs to show major trends

covid_1.csv

ChatGPT

Great, let's start by loading the dataset and examining its structure.

Finished working

Show work

It seems like the dataset does not have any column names. The dataset consists of three columns. Could you please provide some context about what these columns represent? This will help me to analyze the data more accurately.



Prompt 2

Follow-up prompt "first column is days, the second is group under 35 years old and third colum is a group over 35 years old"

first column is days, the second is group under 35 years old and third colum is a group over 35 years old

ChatGPT

Thank you for the clarification. Let's assign appropriate column names and take a look at some basic statistics of the data.

Finished working

RESULT

              Day  Group_Under_35  Group_Over_35
count  158.000000      158.000000     158.000000
mean    80.500000        0.009439       0.012779
std     45.754781        0.005514       0.007603
min      2.000000        0.000696       0.000414
25%     41.250000        0.005688       0.007028
50%     80.500000        0.008524       0.012666
75%    119.750000        0.013676       0.018399
max    159.000000        0.019054       0.026756

The dataset contains 158 records, spanning from day 2 to day 159.

Here are some basic statistics for each column:

  • Group_Under_35: The average is around 0.0094, with a minimum value of 0.0007 and a maximum value of 0.0190. The data varies around this average by about 0.0055 (standard deviation).
  • Group_Over_35: The average is around 0.0128, with a minimum value of 0.0004 and a maximum value of 0.0268. The data varies around this average by about 0.0076 (standard deviation).

Now, let's create some plots to visualize these trends. We'll make line plots for each group over the given days. This can give us a sense of how the values change over time for each age group.

Finished working




The line graph provides a visual representation of the trend in the dataset for both groups. From the graph, we can observe that:

  • For both groups, the trend is generally increasing over time, suggesting that the value for both groups under 35 and over 35 grows as days pass.
  • The 'Group Over 35' consistently has higher values than the 'Group Under 35'. This could suggest that the factor being measured impacts the 'Group Over 35' more significantly, or it could mean that there are more people in the 'Group Over 35'.

Remember, the precise interpretation would depend on what these values represent. For a more in-depth analysis, one could also look into statistical tests, correlation analysis, and more.

Do you have any specific questions or areas you'd like to explore in this data?

 

Thoughts

I haven't included the code here, but it produced the code you would expect, using Pandas in Python to perform a basic analysis, all with two prompts. Not bad; I need to explore it further. A more detailed introduction can be found in “ChatGPT’s Code Interpreter Was Just Released. Here’s How It Will Change Data Science Forever” by The PyCoach:
link.medium.com/LSxLIoY8pBb
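For anyone curious what that kind of Pandas code looks like, here is a minimal sketch. It is not the code ChatGPT actually generated. The three rows are made up to stand in for covid_1.csv, and only the column names are taken from the transcript above.

```python
import io

import pandas as pd

# Stand-in for covid_1.csv: a few made-up rows with no header line.
raw = io.StringIO(
    "2,0.0007,0.0004\n"
    "3,0.0011,0.0009\n"
    "4,0.0015,0.0016\n"
)

# header=None stops pandas treating the first data row as column names;
# names= supplies the labels described in the follow-up prompt.
covid = pd.read_csv(
    raw,
    header=None,
    names=["Day", "Group_Under_35", "Group_Over_35"],
)

# Basic statistics, as in the RESULT table in the transcript.
print(covid.describe())

# Fraction of days on which the over-35 value is the higher of the two.
share_over_higher = (covid["Group_Over_35"] > covid["Group_Under_35"]).mean()
print(share_over_higher)
```

With matplotlib installed, covid.plot(x="Day") would draw a line for each group against the day number.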



Monday, 3 July 2023

The 5 most read blog posts June 2023




Wednesday, 31 May 2023

Popular posts May 2023

Top locations









